Timeplus: Streaming Analytics for Realtime Data | Jove Zhong
Cloud Commute | September 13, 2024
00:29:40 | 27.16 MB

Event-driven processing and real-time stream analytics are common design elements of modern, scalable architectures. Jove Zhong from Timeplus talks about how and why to use these architectural patterns, and what makes Timeplus stand out.

In this episode of Cloud Commute, Chris and Jove discuss:

  • What is Streaming Analytics?
  • Why do people prefer Timeplus?
  • What are the challenges with stream processing?
  • What does Jove believe the future holds?

Interested in learning more about the cloud infrastructure stack, like storage, security, and Kubernetes? Head to our website (www.simplyblock.io/cloud-commute-podcast) for more episodes, and follow us on LinkedIn (www.linkedin.com/company/simplyblock-io/mycompany/). You can also check out the detailed show notes on YouTube (www.youtube.com/watch?v=z6Anit845ZM).

You can find Jove Zhong (Co-Founder and Head of Product at Timeplus) on Linkedin: https://www.linkedin.com/in/jovezhong

About simplyblock:

Simplyblock is an intelligent database storage orchestrator for IO-intensive workloads in Kubernetes, including databases and analytics solutions. It uses smart NVMe caching to reduce read I/O latency and speed up queries. A single system connects local NVMe disks, GP3 volumes, and S3, making it easier to manage storage capacity and performance. With the benefits of thin provisioning, storage tiering, and volume pooling, your database workloads get better performance at lower cost without changes to existing AWS infrastructure.

👉 Get started with simplyblock: https://www.simplyblock.io/buy-now

🏪 simplyblock AWS Marketplace: https://aws.amazon.com/marketplace/seller-profile?id=seller-fzdtuccq3edzm

[00:00:00] You described it as a data platform. I think a lot of people would probably describe it as a stream engine or a processing engine. Is that correct? This category is still at the earliest stage. People all know what a database is, what a data warehouse is, and people talk about data lakes. The stream processor part has been there for a long time. If you're familiar with Apache Flink, for example, that's one of the best open source stream processors, and it works well in many cases. And I think there's a huge developer community in Germany and in many other parts of the world. However, Flink is designed in an elegant but complicated way, and it's not easy to get started with Flink. And if you set up your own Flink cluster, it's not cheap; it requires a lot of CPU, memory, and tuning, and the overall experience with Flink is not great enough. That's part of the reason why maybe Apache Spark is a little bit more popular. Hello everyone. Welcome back to this week's episode of Simplyblock's [00:01:00] Cloud Commute Podcast. This week with another incredible guest. And yes, I know I say that every time, and it's true every time. So, hello, Jove. Thank you for being here. I think we've never met before, so maybe just give me a quick introduction about you and your background. Hi, Chris, I'm so glad to join the show. My name is Jove, Jove Zhong, and I'm the co-founder at Timeplus. Timeplus is a streaming database, or streaming SQL, or streaming analytics, depending on which category you're coming from. We essentially provide a very unique capability for you to understand what's going on right now, and also what happened in the past; you can even do some machine learning, real-time training on the current data, to predict future points. We provide both the open source core engine as well as commercial software on the cloud, bring-your-own-cloud, or self-hosted.
And as a developer, you can connect Timeplus to your real-time data feeds. For example, data [00:02:00] in Apache Kafka, or in a Postgres database, where you can apply CDC. All that real-time data, and the historical data, can be put into Timeplus, and you can just leverage SQL to understand the patterns and do some real-time, low-latency aggregations. This can be quite useful for any kind of use case with really low-latency data points. For example, whether you are being attacked, on the cybersecurity side, or if you do any kind of trading in the traditional financial sector or in Web3 blockchain, you might leverage our system to understand whether you should buy or sell more of your portfolio, given the price and the momentum of the past few seconds. So we are more like a general-purpose data platform focusing on the real-time part, but it can be applied to many use cases. And I'm happy to be part of the show and share my story and some of the technical [00:03:00] details. Right. So you described it as a data platform. I think a lot of people would probably describe it as a stream engine or stream processing engine. Is that correct? Yeah. This category is still at the earliest stage, right? People all know what a database is, what a data warehouse is. Then people talk about data lakes. And the stream processor part has been there for a long time. If you are familiar with Apache Flink, for example, that's one of the best open source stream processors, and it works well in many cases. And I think there's a huge developer community in Germany and in many other parts of the world. However, Flink is designed in an elegant but complicated way, right? And it's not easy to get started with Flink. And if you set up your [00:04:00] own Flink cluster, it's not cheap; it requires a lot of CPU, memory, and tuning, and the overall experience with Flink is not great enough.
That's part of the reason why maybe Apache Spark is a little bit more popular in the data world. Maybe in terms of real-time or streaming processing, Spark is not as strong as Flink, but overall Spark has a much easier developer experience and nice integration with Python, for example. And it's also backed by Databricks. So if you are a Databricks customer, you have a better version of Spark, and you can do a bunch of other magical ML or data science stuff in the Databricks platform. However, we think we can do something better than Flink. At least this is our motivation as a startup company. And we're not just focusing on the processing part; we also have our own storage [00:05:00] engine. So you can just send the data to us, no matter if it's fresh data or historical data. And we can leverage a single engine to allow you to ask questions about what's happening right now, and also about the data trends or data patterns of the past two years. You don't have to send a query to different systems. Otherwise, you might have to send some data to, for example, Flink, and some other data to Snowflake, and when you ask questions, doing joins will be very difficult. We are a single platform that handles both real-time and historical data. That's something where we really want to make data engineering a little bit easier, especially when you care about low latency and you need to join both real-time data and historical data. Right. And as far as I know, Timeplus is all about SQL. So you basically extended the standard SQL syntax and added streaming capabilities to build all the queries from source to sink, [00:06:00] right? Yeah. At the very beginning, SQL was our primary, or even the only, interface. Later on, we added a few more things, like user-defined functions, UDFs, so we can support things like calling a remote service.
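The "single engine over fresh and historical data" idea Jove describes above can be sketched as a toy. This is purely illustrative Python, not Timeplus's actual architecture or API; all names (`HybridStore`, `query`, the two tiers) are made up for the example.

```python
# Toy sketch: one query path spanning a historical tier and the
# freshest in-memory events, so callers never fan out to two systems
# (e.g. one query to Flink and another to Snowflake).
from collections import deque

class HybridStore:
    def __init__(self, max_fresh=1000):
        self.historical = []                  # long-term tier (e.g. object storage)
        self.fresh = deque(maxlen=max_fresh)  # recent events, kept hot in memory

    def append(self, event):
        self.fresh.append(event)

    def flush(self):
        # periodically move fresh events into the historical tier
        self.historical.extend(self.fresh)
        self.fresh.clear()

    def query(self, predicate):
        # a single query runs over both tiers at once
        return [e for e in self.historical + list(self.fresh) if predicate(e)]

store = HybridStore()
store.append({"price": 101})
store.flush()                 # now in the historical tier
store.append({"price": 99})   # still fresh
print(store.query(lambda e: e["price"] > 100))  # -> [{'price': 101}]
```

A real engine would of course add indexing, tiered storage formats, and incremental computation; the point here is only the single query surface over both tiers.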
You can leverage JavaScript or Python to define your own logic. But all that customized logic is still defined as a function, a SQL function, and then you put it back into your overall SQL statement. So, for example, unlike Apache Flink, which is based on low-level APIs where SQL is kind of an abstraction layer: you write SQL, but it will eventually be compiled or translated to some low-level API. That's not the case for Timeplus. SQL is our own API. There's no [00:07:00] underlying DataFrame-type API. But we do realize SQL can be very powerful. You can write a bunch of, say, CTEs or subqueries. But sometimes you do need to write more complex logic, or you want to leverage your own JavaScript or Python libraries. So for that case, we enhanced our UDF framework; we can talk about that later. But that's really something in the middle, a more flexible way for advanced users to apply their own logic in their own SQL pipeline. So you can do many things, and it's very efficient. In many cases, the JavaScript engine we chose, which is Google V8, is almost as fast as native machine code. So you don't really sacrifice performance. But SQL remains the main interface. Right. So you already mentioned a few use cases in the beginning. But what [00:08:00] are the big things? I guess real-time dashboarding; financing, I think you mentioned, the stock market. What are the big ones?
And all those data you want to create a real time dashboard, or you want to create a real time alert, say, if my customer rate, average customer rate, today is lower than three, [00:09:00] then you need to notify the manager and do something, or if you, for example, if certain area you are offering some service and certain area they have not enough goods or inventory, you might want to send some alerts. So it is very natural to leverage the other system, not the original Postgres, to do all the analysis because you don't want to slow down your Postgres, right? So those are real time CDC. It is very common. And also, you want to do some real time alerts. You want to build your future store in terms of the machine learning, right? You want to grab all those original data source as many as possible, as fresh as possible and convert them into a bunch of numbers. And this act as a feature store so that you can apply more machine learning for your data scientists, data engineering team. So there are many things this industrial is focusing on. And there's [00:10:00] also open source solutions, commercial solutions. But what's really helped Timeplus stands out is we are very focusing on the performance and also it's a low footprint. So, I'll give you an example that is many systems can do per minute or maybe a few seconds data movement or transformation. But in terms of Timeplus, because we implement in a special way and we can easily achieve single digit millisecond end to end, meaning that is one, there's a data push to Timeplus, and with all those, streaming processing, or those secure logic, we can show the results to the downstream or maybe trigger alerts. And the entire end to end latency is maybe five or six milliseconds. So, this is quite useful in some [00:11:00] of the scenario that is you need a really low latency, for example, trading or risk analysis, right? 
I mean, if you are just looking at a dashboard or doing some other kind of data consolidation, such single-digit milliseconds are nice, but may not really be required. But for the case of trading, right? It's a very competitive space. Everyone wants to be faster. And you don't have to be super fast, but as long as you are faster than others, you have a better chance to win, or to lose less money, right? So everyone pushes very hard to have the best network cables and the lowest-level packet handling. They want to do everything they can to achieve better performance and lower latency. And it is true that we work with some of the best financial companies, and even some people in the blockchain or Web3 space. They leverage our technology to get to low latency, but [00:12:00] also because we have what we call streaming SQL, or materialized views, which is kind of a new concept too, if you are not familiar with it. Essentially it's a long-running SQL query. It keeps scanning all the new data and sends you the results. The SQL itself is a background job. It never stops and keeps working. So this way, you don't have to set intervals, like "I want to query my system every minute or every two milliseconds." You don't need to do that. You just define your logic in your SQL, and whenever data comes, in as low as five milliseconds you will get a result, whether it's a good signal or a bad signal. Yeah. Yeah, it's basically a table which is always up to date with the latest data that matched the query. Very, very useful stuff. I think in terms of trading with [00:13:00] high performance, it's probably mostly HFT. For people who don't know, that's high-frequency trading. Those guys are insane, like nuts. I mean, I think they buy and sell on the order of seconds. It's out of this world.
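The "long-running SQL" / materialized view idea described above can be sketched as an incrementally maintained aggregate: instead of polling on an interval, the result is updated the moment each event arrives. This is an illustrative Python toy, not a Timeplus API; the class name is invented for the example.

```python
# Toy sketch of a materialized view: a count + sum maintained
# incrementally, so the "query result" (the average) is always
# current without re-scanning old data or polling on a timer.
class StreamingAverage:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def on_event(self, value):
        # called once per incoming event; O(1) work, no rescan
        self.count += 1
        self.total += value
        return self.average()

    def average(self):
        return self.total / self.count if self.count else None

view = StreamingAverage()
for price in [10.0, 20.0, 30.0]:
    latest = view.on_event(price)  # result refreshes as each event lands
print(latest)  # -> 20.0
```

The "table that is always up to date" in the conversation is exactly this: the aggregate state, refreshed per event by a never-ending background job rather than by repeated queries.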
Yeah, they don't really care too much about the quarterly reports or what's going on on Twitter. They just look at the numbers themselves. Right. Exactly. I have a friend who works a lot in HFT. And just listening to him is like, yeah, whatever. Make your money. I'm glad. In terms of sources and sinks, you already mentioned Postgres and MySQL, and I guess Flink, Spark, the common ones. Is there anything very special [00:14:00] where you say, yeah, that does work, or that doesn't work? Yeah, for us, the most preferred data source is Apache Kafka, right? And right now there are many other Kafka-API-compatible services. They speak a similar or even the same protocol as Apache Kafka, so you don't have to literally use Kafka. But it's still very common to leverage Kafka to consolidate your data sources. I mean, the point is that we don't have to talk to individual databases or individual APIs. If there's already some integration to move that data to Apache Kafka, then we only talk to Kafka. That's easier. And Kafka also supports a very nice schema registry, and different data compression and retention policies. And today, [00:15:00] for example, if you operate your own Apache Kafka, it's not so easy. So some people choose a managed service, such as Confluent Cloud. And that brings up the complaint from some people that Confluent Cloud is not cheap. That's the reason why people like Redpanda or WarpStream come up with their own solutions, whether using C++ or using Golang, or whether to use object storage or not. So they can bring down the cost a little, but they may introduce a few other things you have to worry about. For example, whether the latency meets the requirement; WarpStream, for example, may not provide as low latency as others. But they do a very good job of leveraging object storage, so it's much cheaper. Right.
But that means, if Kafka is your preferred data source, you either have your own connectors, or you can basically connect anything that [00:16:00] Debezium supports as well. Yes. Yeah. Debezium is very popular for so-called CDC, right? Moving data from the original OLTP database and translating it to a JSON or similar format to capture what changed. Debezium can write data to, for example, a Kafka topic, and we can read it from there. It's actually a very lean and compact structure: what is my schema, which column changed, what's the before value, what's the current value. Then we translate that into our own format, whether it's SQL or an insert, so that we can capture, almost mirror in real time, what's happening in the original database. However, Debezium can also be leveraged without Kafka, I think, in some [00:17:00] cases. You can even leverage Debezium as a library embedded into our system. But today, we still focus a lot on Kafka data sources. Meanwhile, we also have our own REST API for you to push data to us without something in the middle. And this can be particularly useful if you have, for example, a few IoT devices, right? The IoT devices could maybe send data to Kafka, but you'd still need some Kafka library, while sending plain HTTP requests is easier. And Timeplus itself has its own buffering and other mechanisms so that you can send as much data as possible to us, and we make sure that data can be analyzed in real time, and you don't really have to set up Kafka. But if your organization already has Kafka, then perfect. Then it's really a [00:18:00] nice way to consolidate your data, and as a data engine, we don't have to worry too much about the consolidation part. Right, right. From a developer's or operations perspective, I know there's Timeplus Cloud.
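The Debezium change event Jove describes carries the before and after row images plus an op code ("c" create, "u" update, "d" delete) inside its payload. A minimal parser for that envelope might look like this; the `changed_columns` helper and the translate-to-insert step are simplified stand-ins for what a downstream engine would do, not Timeplus or Debezium APIs.

```python
# Toy parser for a Debezium-style change event: diff the before/after
# row images to recover which columns changed and how.
import json

def changed_columns(event):
    """Return {column: (old, new)} for a Debezium-style change event."""
    payload = event["payload"]
    before = payload.get("before") or {}  # None for inserts ("c")
    after = payload.get("after") or {}    # None for deletes ("d")
    return {
        col: (before.get(col), after.get(col))
        for col in after
        if before.get(col) != after.get(col)
    }

# an update ("u") event as it might arrive on a Kafka topic
raw = json.dumps({
    "payload": {
        "op": "u",
        "before": {"id": 7, "rating": 4},
        "after":  {"id": 7, "rating": 2},
    }
})
print(changed_columns(json.loads(raw)))  # -> {'rating': (4, 2)}
```

Real Debezium envelopes also carry a schema block and a source block (connector, table, transaction position); the before/after diff above is the part a mirroring engine uses to replay the change.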
And as far as I know, you can also deploy it. I think there is even a homebrew, like a macOS homebrew version, but there's Docker images. There's everything. I mean, the homebrew version is, great for developers who want to try it out, who develop on their own machine. There's literally like no easier way to get something on macOS Maybe just talk a little bit about the different options. And what are the pros and cons between all of those? Yeah, sure. So the motivation is simple. That is, we really want to make the developer happy. At least, [00:19:00] let them, less frustrated, right? I mean, being a developer is not easy and you have to read a lot of documents and you have to set up dependencies. And for example, on my machine, I may have 10 different JDK versions. And depends on which software I use, I may have to keep a switch, even sometimes have to switch different node version or the Maven versions. That's a lot of capacity in the developer world. And we want to make things as easy as possible. So we want to provide the different options. So we have this core engine open sourced on GitHub. We call it Plotem or Templates Plotem. So this is our core engine. And it's as simple as a single binary. I think when you get it, it depends on your Linux or Mac. You might have this 200 or 500 or 300 megabytes, a single binary, a single file. And you just ChangeMode, PassX, you make this executable [00:20:00] and you can run that. There's no other way that is, it's super easy. But we also provide things like, homebrew. So that you don't have to download by yourself just using Homebrew. I personally run, I guess, brew updates every other day on my Mac to make sure all my software is up to date. It's more like clean up. It's cleaner than my room, I guess, I want to get to the latest room. It's very easy for you to get to the latest version using Homebrew. And also we have, for example, Docker image for sure. And you can also install our software using CURL. It's a script. 
But more importantly that is, if you want to run this in production, and we offer A is that is, we have this fully managed cloud. So you just need to log in with your social accounts, no matter it's Google or Microsoft. You can just log in and you may get, again, you can see some of our live demo systems, [00:21:00] that's for sure. But if you want to try by yourself, you can jump from your demo system to our cloud system and you can get a free trial for 14 days. Then that's fully managed by us and upgrade by us and tuning by us. And it's easier for sure. But if you want to do the self hosting by yourself for different reasons, for example, like you have your Kafka in local. You don't want to expose your Kafka endpoint to the internet, or you have some other databases. You want to have some data who are not ready to be put on the cloud. So feel free to use our self hosting version. We do provide a Kubernetes helm chart for you to install easily on your Kubernetes. Or if you want to just you can set up three or five different virtual VM. And you can install the binary and configure [00:22:00] them to create a class by yourself. So all those option is available and it really depends on the use cases. For example, we have some- We have normal customers who use the communities, the helm chart. That's very easy, but we also found some users. They just need to run the software on a small server, or even kind of a single node and that they have a lot of that. So it's more like similar to the edge computing, right? Each server, they have access to certain files and that they want to react on it quite quickly. Even we do have some even more classic edge scenario, which is that is, we have some clients doing POC with us that is, they have a lot of train, I guess. Like on the train and you not always get good signals, right? For example, if you are in the [00:23:00] countryside, if you go through some tunnels, so you do not always get a good signal. 
So they prefer doing the computing on the train. Every single train has a bunch of small devices, and the devices only have limited resources, for example, 8 cores and maybe 16 gigabytes of memory. In the past, they wanted to put everything together: Kafka, Apache Flink, ClickHouse, and Redis, all on such devices. And that brings a lot of issues. Now they can just put Timeplus there. Timeplus is a very efficient engine, and they can do on-device monitoring, training, and alerting. Yeah. So there are a lot of deployment options, for sure. But feel free to choose the one you think works best for your scenario. All right. You said Timeplus Cloud. Are you using Kubernetes internally? Is it the same helm [00:24:00] chart? Not really the helm chart. I guess people might have different opinions, but in practice, Timeplus uses our own Kubernetes operator in the cloud. Again, it's Kubernetes, but literally it's EKS, right, AWS-managed Elastic Kubernetes Service. The reason is that we also want to minimize our operations effort. For example, the master node of Kubernetes is managed by AWS. They also have something like Fargate, right? So you can worry less about the infra, the VM stuff. The downside is that they don't really provide the most recent or latest Kubernetes version, but they do a lot of testing, and there are a bunch of plugins. So we're happy to be using EKS. But the reason why we're [00:25:00] not directly using a helm chart, but an operator, is that we also want to handle multi-tenancy and some other things better. Using an operator is essentially customized code, right? You can write your Golang code in the operator, and when someone signs up or upgrades, we can do extra things. And a helm chart is really designed for people who don't care too much about multi-tenancy.
They just want to set up a multi-node cluster, and the helm chart just has a bunch of templates and parameters for you to wire things together. Yeah, that makes sense. We're almost at the end of our time. One question I always ask people: what do you think is the next big thing? What do you see on the horizon? Be the visionary. It could be in Timeplus, in stream processing, in databases, AI, whatever you think is cool. [00:26:00] Yeah, I would say certainly AI is a big thing, right? Some people think it's a bubble; I have my own opinion. But I guess there's something we need to do together, which is making sure AI can get the latest, fresh, correct data. Right. I mean, usually they don't do this very well. They have a lot of historical patterns, but they don't really have good access to fresh data. That's how AI and data can get together. And even not talking about AI itself, there's a lot of movement today toward cloud native data warehouses and things like that. People can put more data in the cloud, which leads to lower cost. But step by step, people will realize: I don't just need a huge amount of data, I need to understand what's going on right now. So real-time data, streaming data, I [00:27:00] think, will have bigger room to grow. And with AI, we can even partner with the AI engine or AI model to come up with better context and provide better recommendations for human beings. All right. Yeah, I think that makes a lot of sense. I also feel like machine learning plus real-time analytics data is probably the next step, integrating those two things. And it happens a little bit already, I think; a lot of the, what's it called, fraud detection algorithms are basically using some kind of pre-trained AI or machine learning model and feeding the current data to it. All right. Cool. I think that is nice. Anything else you want to put out?
Anything you want to share with the world? Yeah. I'm wearing the Timeplus t-shirt, if [00:28:00] you are watching the video. But if not, just go to Timeplus.com. There are buttons to let you try either our open source version, the free trial on the cloud, or our self-hosted version. If you are a GitHub user, if you can give us a star on the project, that would be great. And give us some feedback, raise your requirements, report some issues, help us with the documentation, or just chat with us in our community Slack. You can meet people with different use cases and help each other build a better world together with fresh and correct data. That's awesome. I'm happy to put all of those links in the show notes. If you want to find anything, you'll find it there. Yeah, Jove, thank you very much for being here. It was a pleasure having you. I hope we see each other at a conference somewhere soon. Yeah, of course. I think it's the first time; I think we never met in person. But, for the audience, [00:29:00] you know the drill. Next time, next week, same place. I hope you come back and listen in again. Thank you very much for being here as well.