Christina Lin: Basically what we do is we'll install a small agent. This agent will take care of provisioning. So basically it will set up your network, it will spin up your EKS, it will spin up the storage that you need by default, and then get everything set up for you in Kubernetes. And then everything is within your VPC.
And this agent doesn't get any requests. The traffic doesn't go into the agent itself. The agent does the pull. So it pulls data, it pulls requests, it pulls instructions from our control center, from our control plane, so there are no incoming connections into the cluster. So basically that makes it a lot safer.
Chris Engelbert: We don't use an agent, but we basically said, okay, let's use the AWS Marketplace, and it kind of does the same thing. Say, I need a cluster, three nodes, whatever, and it will automatically deploy it all for you.
Chris Engelbert: Hello everyone, welcome back to this week's episode of Simplyblock's Cloud Commute Podcast. Again, like every week, another incredible guest. Actually, a guest that I think I haven't met before, in person at least, which I know is not the typical thing I'm saying. But welcome to the show, Christina. Glad to have you. Maybe just introduce yourself real quick: who are you, what are you doing, and why do you think I invited you?
Christina Lin: Actually, that's one of the questions I wanted to ask, Chris. So hi everybody, I'm Christina Lin, and I'm currently a developer advocate at Redpanda Data. Basically, what we do is we're a replacement for Kafka, to make it simple to understand what we do. Before that, I was working at Red Hat for 10 years straight. I started in JBoss, which is a Java application server, like in the very old days.
And then I switched to do integration work, ESBs, because I was doing that before Red Hat. And I did Apache Camel, so I did a lot with the community. I still love the Apache Camel community, by the way. And then I moved on to do content on Kubernetes, because at the time we were running a bunch of our integration applications on top of Kubernetes.
And of course Red Hat has its own version of Kubernetes, OpenShift. So I did some of that, and now I end up here, doing a lot of Kafka and streaming related content.
Chris Engelbert: That's interesting. I didn't even know you were at Red Hat. Seems like I actually had the right T-shirt, one of the spirit animal series, years and years ago. Cool. JBoss. Wow. That is very ancient. Any kind of application server is very ancient.
Well, I mean-
Christina Lin: I started my job when there wasn't even a web server, right? And I was using WebSphere, WebLogic. I don't know if you recall those names, but yeah, we were using all of those very early on.
Chris Engelbert: For the audience, in case you don't know, you haven't missed anything. No worries.
Christina Lin: I have to say, yeah. You don't want to know those.
Chris Engelbert: Glad we agree on that. All right. You already said you're with Redpanda right now, Redpanda being a re-implementation of the Kafka protocol, based on C++, with one of the ideas being to get rid of the JVM as the underlying technology, right?
Christina Lin: I think, you know, for people that are familiar with Java, like me, I started programming in Java and some of those. Well, I started with assembly code, and then Java and stuff like that.
But one of the hardest things with Java is its garbage collection, right? If you want an efficiently running Java service, you have to continuously tune your garbage collector, the heap sizes and all that, especially if it's a long-running service.
So if you tune it very well, if you know what you're doing, you can actually have a very efficient Java program. But the problem is, I think the people writing Java and the people hosting your Java services are two different types of people.
So you've got these SREs and operations people managing your services. And for the past 10 or 15 years that I've been in the industry, those people often don't know what Java is. They're the ones who help get your Linux running, who host your Java application. They don't really know how to tune your JVMs. So having that available and running efficiently is just a lot of work.
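The tuning Christina describes usually happens through JVM command-line flags. As an illustration only (the values and the jar name are placeholders, not recommendations; the "right" numbers depend entirely on the workload), a long-running service might be launched with something like:

```shell
# Illustrative only: heap sizing, collector choice, and GC logging
# are the knobs operators end up turning. service.jar is a placeholder.
java \
  -Xms8g -Xmx8g \
  -XX:+UseG1GC \
  -XX:MaxGCPauseMillis=200 \
  -XX:+HeapDumpOnOutOfMemoryError \
  -Xlog:gc:file=gc.log \
  -jar service.jar
```

Fixing -Xms and -Xmx to the same value avoids heap resizing, and the GC log is what you end up staring at while tuning. This is exactly the kind of ongoing work that a single native binary is meant to remove.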
And I think that's why I really like what Redpanda is doing, because it takes away all that complexity. It makes it simple and easy for the people managing the service. They don't have to think about memory management problems. All they have to do is run that single binary application, and it will just work. And that's what I like about Redpanda.
Chris Engelbert: That makes sense. I mean, I moved to developer advocacy years ago. I've done my good share of garbage collection optimization consulting. So go for it, take it away. I don't need to make money out of it these days.
Christina Lin: Exactly. When I was a JBoss consultant, most of my time was spent teaching people how to fine-tune their JVM, right?
Chris Engelbert: I always loved how Oracle WebLogic had the "perfect" garbage collection parameters in their documentation, because that was what people were asking for. And I was like-
Christina Lin: Yeah. It never worked out for us. We always had to figure it out ourselves. We had to get all these tools, monitor our garbage collection, and look at our memory to fine-tune it ourselves, right? There's no one-size-fits-all solution for that.
So I think, similar to that, when you're deploying your Kafka cluster, apart from the cluster itself, you have to set up Zookeeper to keep track of the distributed system, and then all that crazy stuff happens. When Redpanda came out, it didn't have any of that. That's why I really like it.
Chris Engelbert: I think Redpanda might be the one reason why KRaft mode actually exists, the Kafka Raft mode without Zookeeper. I think you actually made that happen, because it seemed like nobody cared before. I hate Zookeeper with a passion. I can't even explain how dirty I would feel if I had to touch Zookeeper.
Christina Lin: I mean, Zookeeper is fine until you need to change your IP addresses or your network settings, and then everything just goes berserk. Because at Red Hat we had our own version of Kafka, also based on open source Kafka, I had a chance to play with it, and I just don't like it so much. Yeah.
Chris Engelbert: I don't know, I wouldn't call Zookeeper simple. Even if it works, it just looks so unnecessary compared to something like etcd, right? All right, so Kafka, streaming data: why would you need that? I mean, I have an idea, but the audience may not.
Christina Lin: Right. I mean, when you're talking about communication, the behavior of an application, you always have two things, right? Synchronous and asynchronous. For synchronous, you want a confirmation, an acknowledgement, right back to your request.
But there's always another type of communication where you just want to tell people: hey, this is what I'm saying, and then you get a bunch of subscribers. People have been doing that for a very long time. Before Kafka started, we had messaging systems, right? Like MQs and all that. That's way before.
But I think what Kafka brings is the ability to scale, right? I don't know if you have ever installed or helped manage a messaging system, with the way it does delivery guarantees and backups. At the time, we were doing a lot of ActiveMQ, a lot of MQ kind of content.
They all store messages in a database or some other storage. But Kafka does that distribution in a much more native way: it basically does replication across all the different nodes in your network. The way it works allows you to scale infinitely, and because of that, you can ingest more data in a much more scalable way.
And it came at the right time for the cloud native world as well, because cloud native is all about scalability, about flexibility, the ability to scale your instances. So it took off, and people are now thinking: hey, instead of doing a request and a response and waiting on that very slow communication pattern, why don't we just do asynchronous and keep sending? Especially with IoT devices and edge devices, there's a huge volume of traffic just trying to get into the system.
There's no way they can do request and response anymore. So this asynchronous way of communicating makes sense. And with more data coming in, that's why we need streaming data, to be able to handle the load, the requests that we're getting today. So that's basically what people are using it for.
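The scaling Christina describes rests on partitioning a topic by record key, so one stream of data can be spread across many nodes while each key keeps its ordering. A minimal sketch of the idea (illustrative only, not Redpanda's or Kafka's actual code; crc32 stands in for Kafka's default murmur2 partitioner):

```python
# Sketch of key-based partitioning, the mechanism behind horizontal
# scaling in Kafka-style systems. Topic names and keys are made up.
import zlib

NUM_PARTITIONS = 3

def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    # A stable hash maps every record with the same key to the same
    # partition, preserving per-key ordering while different keys
    # spread the load across partitions (and thus across nodes).
    return zlib.crc32(key) % num_partitions

partitions = {p: [] for p in range(NUM_PARTITIONS)}
for order_id in [b"order-1", b"order-2", b"order-3", b"order-4"]:
    partitions[partition_for(order_id)].append(order_id)

# Same key always lands on the same partition:
assert partition_for(b"order-1") == partition_for(b"order-1")
```

Replication then copies each partition to several nodes for fault tolerance, which is the "native distribution" Christina contrasts with database-backed MQs.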
Other than that, we had been using messaging systems for the enterprise service bus before, but now we have microservices. And with microservices, you don't want everything to be request and response, because that becomes very sticky. You want it to be more event driven.
And to build that event driven backbone, you need an asynchronous way of sending your data around the entire system, and that's where you also find a use for the streaming platform. So that's where I see it happening as well.
Chris Engelbert: Yeah, I think that makes sense. An example I loved in the past: you order something from a big online shop, one with hundreds of thousands of orders at the same time, and you get a confirmation email, or a confirmation message on your phone, whatever.
But that normally takes anywhere from about a minute to maybe 10 minutes. Because it is exactly that: you put it onto a queue. Sending an email might only take a second, but during that second you can't do anything else. So it piles up a little, but it will eventually get around to you: okay, now it's your turn.
Christina Lin: Exactly. If you get an order, if you store it directly into a database, that makes sense, right? That's how people store their orders. But like in today's world, we have multiple different systems. They all want to react on that particular order. So this push notification, this pushing way of notifying the consumers makes a lot more sense.
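The order-confirmation flow in this exchange can be sketched with a plain in-process queue standing in for the broker (all names here are made up for illustration; a real system would publish to a topic instead of `queue.Queue`):

```python
# Sketch of the decoupling Chris describes: the order handler returns
# immediately, while a slower email worker drains the queue in the
# background, eventually catching up.
import queue
import threading

events: queue.Queue = queue.Queue()
emails_sent = []

def handle_order(order_id: str) -> None:
    # Fast path: enqueue an event and return right away.
    events.put(order_id)

def email_worker() -> None:
    # Slow path: consumed independently of the order handler.
    while True:
        order_id = events.get()
        if order_id is None:      # shutdown signal
            break
        emails_sent.append(f"confirmation for {order_id}")

worker = threading.Thread(target=email_worker)
worker.start()
for i in range(3):
    handle_order(f"order-{i}")    # each call returns immediately
events.put(None)
worker.join()
print(emails_sent)
```

The orders are accepted instantly; the confirmations arrive "eventually", which is exactly the minutes-long delay Chris mentions.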
Chris Engelbert: Yeah, exactly. You store the order immediately and, at the same time, send off a message: hey, there was a new order, whoever wants to do something with it, go ahead. So, Redpanda. I think you started as a standalone service, basically installed on an on-prem system, or whatever you want to call it these days. But there's also been Redpanda Cloud for two years, maybe a year and a half, maybe more?
Christina Lin: Around two years, yes. There's a lot going on in Redpanda, because we've been trying to evolve into the different types of services people actually want. So we started with self-hosted: you run your own Redpanda and it just works. Developers love it because they can run it on their own laptop; it's a very small, very tiny, simple binary. It all works.
But then we started to see people who wanted it managed for them, because, of course, managing a distributed data system is not that easy. So we started doing all that for people. And then we found another need from our customers, which we call BYOC. BYOC means Bring Your Own Cloud.
So basically what that means is that we'll deploy our cluster into your cloud environment, so the actual user, the enterprise, does not expose their data outside of their own realm. Everything is within their VPCs, everything is within their control.
And the way we did it is very smart. We'll install a small agent, and this agent will take care of provisioning. It will set up your network; if you're on Amazon, it will spin up your EKS cluster, it will spin up the storage you need by default, and then get everything set up for you in Kubernetes.
And everything is within your VPC. This agent doesn't receive requests; the traffic doesn't go into the agent itself. The agent does the pull: it pulls data, it pulls requests, it pulls instructions from our control center, our control plane. So there are no incoming connections into the cluster, and basically that makes it a lot safer.
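The pull model Christina describes can be sketched like this. The control plane is stubbed with a plain function here; a real agent would poll an HTTPS endpoint from inside the VPC, which is the point: every connection is outbound and agent-initiated.

```python
# Sketch of a pull-based provisioning agent. Instruction names and the
# stubbed control plane are illustrative, not Redpanda's actual API.
def control_plane_stub():
    # Instructions the vendor's control plane wants executed next.
    pending = [{"op": "provision", "what": "eks"},
               {"op": "provision", "what": "storage"}]
    def fetch():
        # Served only when the agent asks; nothing is pushed inbound.
        return pending.pop(0) if pending else None
    return fetch

def run_agent(fetch, max_polls=5):
    applied = []
    for _ in range(max_polls):
        instruction = fetch()    # outbound request, agent-initiated
        if instruction is None:  # nothing to do; a real agent would sleep
            break
        applied.append(instruction["what"])
    return applied

print(run_agent(control_plane_stub()))
```

Because the cluster never accepts inbound connections from the vendor, no firewall hole or ingress needs to exist for the managed service to work.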
Chris Engelbert: Okay, interesting. We don't use an agent, but we basically said, okay, let's use the AWS Marketplace, and it kind of does the same thing, right? You say, I need a cluster, three nodes, whatever, and it will automatically deploy it all for you.
Christina Lin: The difference is that we manage the cluster within your cloud. When you spin it up through the Marketplace, the patching and everything like that isn't managed for you; the customer has to come and do a bit of that on their own.
Chris Engelbert: Yeah, yeah.
Christina Lin: But for BYOC, everything is managed through our control plane. Whenever there's patching that needs to be done, whenever something needs to be updated, our agent will be notified. But we don't do that right away. We check with the customer and say: hey, do you want to do your updates today?
Chris Engelbert: Right, right.
Christina Lin: Is anything crazy happening? And then we'll do it for them.
Chris Engelbert: Oh, that's interesting. Yeah. We do connect to our control plane, but we do not do patching of the virtual machines, I think.
Christina Lin: Right.
Chris Engelbert: That's a good point though. I never thought about that.
Christina Lin: Yeah, it's basically a managed service, but in your own cloud environment. And then we also have this new service that came out about a year ago, called Serverless. It's basically getting your own topic running: a managed service, but on our side, and we give you a virtual cluster instead of an actual cluster. Some people like virtual clusters because they can divide their different functions, different departments, different things into their own clusters. They're independent, they're isolated, so you're not seeing other people's work and other people's topics.
Chris Engelbert: That makes a lot of sense. The thing, and we already talked about this, that I loved most about Redpanda was how easy it was to set up. I think one of the first times we talked was actually because I was building this CDC tool, specifically for Timescale.
And obviously, because it's basically Debezium compatible, I used Kafka. But I didn't feel like the "fun" of setting that up. So, Redpanda. I mean, eventually it was all Testcontainers, and Confluent, the Apache Kafka team, had just come around with KRaft mode. But Redpanda was so much easier, and it actually starts up so much faster as well. It's insane.
Christina Lin: People notice the difference when they start up.
Chris Engelbert: Oh yeah, I can very much say that. Yes.
Christina Lin: Yep, exactly.
Chris Engelbert: So for me, in the end, the Testcontainers setup was like: okay, I'll try with Apache Kafka, but most of the tests run with Redpanda because it's just so much faster.
Christina Lin: Right. Well, even with Testcontainers, I don't know if you noticed, but the first time I saw Testcontainers, or a similar idea, was when I was at Red Hat. It was way, way back.
There was this new Java framework, Quarkus. Quarkus has this idea of spinning up a test container, a testing environment, for running your unit tests and stuff like that, right?
So when I was doing one of the Kafka examples, I saw they spun up a Redpanda service. And I was like, what is that? Why don't you use Kafka? And that's when I learned what Redpanda is.
So we work closely with the Testcontainers folks as well. And we support not only Java but other languages, like Golang and so on, for them as well.
Chris Engelbert: By the way, for the audience, we're talking about Testcontainers here. We had Oleg on just a few weeks ago; you'll find the link in the show notes. For Redpanda Cloud itself, let me guess, there's Kubernetes underneath? Is that how the magic works? Or do you also use the agent to set up, well, you said it's EKS, so it probably still is Kubernetes.
Christina Lin: Yes. For the managed services, the BYOC and the serverless, everything is running on top of Kubernetes. We support three flavors now: EKS, GKE, and AKS on Azure. They all run on top of Kubernetes. And then some of our customers run on bare metal, if they want extreme performance. Of course, you can't really do that with virtualization technology, right? If they want to go all out, they'll just run it on bare metal machines.
Chris Engelbert: I think that is not a hundred percent true anymore. In the past, when the hyperscalers didn't have dedicated-resource virtual machines, that was very much true. But at least in my experience, if you go for dedicated network bandwidth, complicated word, and dedicated CPUs, or virtual CPUs, you normally get pretty predictable, very consistent performance. But you pay for it. I mean, for that price you could probably get your own hardware. So it's all fine.
Christina Lin: Well, the thing about bare metal is squeezing all the juice out of the hardware. I'm just saying that if you're running Kubernetes on top, you need dedicated resources. The way Redpanda runs, because we write everything in C++ on Seastar, is that we pre-allocate a fixed amount of memory, we do the memory management for you, and we hold onto our threads and CPU cores.
So if you're running on a machine and you don't dedicate everything to Redpanda, because you're running Kubernetes and other services on top, you still need to set aside a few more cores and some memory for those. That's what I meant. But in Kubernetes it's easier to scale out, so it's a trade-off.
Chris Engelbert: Yeah, okay, that's fair. It's kind of the same thing I keep telling people about databases. If you have a small database running on Kubernetes, it's fine, and it's totally fine to have it on a shared system. If you have a database under load, it's still fine to run it in Kubernetes. Don't get me wrong, the orchestration is still awesome. But you really want the node tainted so that nothing except the bare minimum of Kubernetes services and the database runs on it. You're basically dedicating a whole node to your database, but you would do that either way.
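Dedicating a node the way Chris describes is usually expressed with a taint plus a matching toleration and node selector. A sketch, where the node name, the `dedicated=database` label/taint, and the image are all illustrative:

```yaml
# First taint and label the node, e.g.:
#   kubectl taint nodes db-node-1 dedicated=database:NoSchedule
#   kubectl label nodes db-node-1 dedicated=database
apiVersion: v1
kind: Pod
metadata:
  name: postgres
spec:
  nodeSelector:
    dedicated: database        # only schedule onto the labeled node
  tolerations:
    - key: dedicated
      operator: Equal
      value: database
      effect: NoSchedule       # tolerate the taint that repels other pods
  containers:
    - name: postgres
      image: postgres:16
```

The taint keeps everything else off the node; the toleration plus selector keeps the database on it, which is the "whole node dedicated to the database" setup Chris is after.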
Christina Lin: And then you have tens of thousands of services and things to look after. Of course I want to go cloud native, because then I have all my code under control with GitOps and all that, right? So it's different trade-offs.
Chris Engelbert: Right. I think we already answered this question slightly before, but why would a customer typically choose Redpanda over something like Apache Kafka, or Confluent Kafka, or any other Kafka distribution?
Christina Lin: Well, first of all, I think it's healthy to have options in the market. I don't think it's good to have one dominant technology, because then it never moves forward. It's good to have some competition, right? That's the first thing.
And then, why would people choose Redpanda over Kafka? It's all about choices, and for me the choice is all about simplicity. How fast can I get a service running, how fast can I get that thing working, and how easy is it to operate in my environment? That's the first thing I'd think about, and that's why I'd choose Redpanda over Kafka.
And of course you can get the same kind of performance using different technologies, but the effort you need to put into running a Kafka cluster compared to a Redpanda cluster is totally different. And that's one of the best things about it.
And one of the things we're currently working on: we acquired an open source project called Benthos a couple of months ago. It's a similar kind of technology, for integration. By that I mean we can do Kafka Connect style operations: we can stream data from databases, we can stream data to S3, from all these different data sources into Redpanda or other Kafka streaming systems.
And the difference is that it's written in Golang, and it's also a single binary. So it's all about simplicity: you don't have to stand up a connector cluster. If you want to start up a connector, and I'm pretty sure you know this because you said you did something with Kafka Connect, you have to add the libraries into your Kafka Connect container, if you're running a container. Then you have to configure it with JSON, and you need to know all the Java packages for all the different endpoints. Then you configure that and it works.
But Redpanda Connect is basically just a declarative YAML file. It just says: okay, I'm connecting from Kafka to S3, and these are all the things I need to put in, and it will do that itself. And because it's written in Go, it's a lot more lightweight, and because Go has goroutines, it's easier to scale out. So that's what we're working on right now.
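A pipeline of the kind Christina describes looks roughly like this in the Benthos-style input/pipeline/output layout that Redpanda Connect uses. Treat this as a sketch: the broker address, topic, consumer group, and bucket names are placeholders, and field names should be checked against the current Redpanda Connect reference:

```yaml
input:
  kafka:
    addresses: ["localhost:9092"]
    topics: ["orders"]
    consumer_group: "orders-to-s3"

pipeline:
  processors:
    - mapping: |
        root = this            # pass records through unchanged

output:
  aws_s3:
    bucket: "my-archive-bucket"
    path: 'orders/${! timestamp_unix() }.json'
```

One YAML file replaces the connector-cluster-plus-JSON-plus-Java-packages setup she contrasts it with.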
Chris Engelbert: Okay. Because we're across the 20-minute mark already: how would I get started? What's the easiest way? Docker? Kubernetes?
Christina Lin: The easiest way is to just come to our Getting Started page, and you'll find how to get started there. And if you're a developer rather than an operations person, and you don't want to know about spinning up or scaling Redpanda, there's the Redpanda Serverless service. Sign up and you get a free tier with a topic running right away, and we have a lot of sample code for it. Then there's Redpanda University, where I constantly update the content; I manage all the university courses. If you have any questions, reach out to me, and definitely check out our Slack community. It's full of our engineers, they're all happy to answer your questions, and it's totally free. I'll definitely meet you there as well.
Chris Engelbert: All right, cool. For the sake of time: what do you think is the next big thing? It's basically the last question I always ask. Could be in streaming, could be in databases, could be in AI, could be in anything.
Christina Lin: I'm glad you brought up AI, because everybody's all about AI today. How do you not do anything with AI today?
Chris Engelbert: I have my own opinion about that.
Christina Lin: Me too. I need some AI detox. But anyway, for the connectors I just mentioned, we are adding a few AI connectors, so that seamlessly, declaratively, without writing any LangChain code (LangChain has a lot of code, but you won't), you can stream your unstructured data, break it down into embeddings, and put it into a vector database, all in a declarative way. You don't have to do anything. We're preparing you, enabling you, to get your data into an AI-ready state much more easily.
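The chunk-embed-store flow Christina outlines can be sketched in a few lines. The embedding function and the in-memory "vector store" below are stand-ins for a real embedding model and a real vector database; only the shape of the pipeline is the point:

```python
# Sketch of an unstructured-data -> chunks -> embeddings -> vector
# store pipeline. fake_embed is a toy stand-in for an embedding model.
def chunk(text: str, size: int = 20) -> list[str]:
    # Naive fixed-size chunking; real pipelines split more carefully.
    return [text[i:i + size] for i in range(0, len(text), size)]

def fake_embed(chunk_text: str) -> list[float]:
    # Stand-in for a real model: a deterministic 2-dimensional "vector".
    return [len(chunk_text) / 100.0,
            (sum(map(ord, chunk_text)) % 97) / 97.0]

# Stand-in for a vector database: (embedding, original chunk) pairs.
vector_store: list[tuple[list[float], str]] = []

document = "Streaming systems move unstructured data into AI-ready form."
for piece in chunk(document):
    vector_store.append((fake_embed(piece), piece))

print(len(vector_store))  # number of stored chunks
```

In the declarative version Christina describes, each of these steps becomes a configured processor rather than hand-written code.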
Chris Engelbert: All right, interesting. That is cool. So you basically put the whole pipeline together, the transformation pipeline, in a decorator- declarative way. Oh God, I'm not good at talking today.
Christina Lin: I mean, for unstructured data you can do some chunking, then you send it to embedding models and store the result in a vector database. For streaming data it doesn't have to be a hundred percent chunking; it depends on what you want to do with the data, of course. For us, it's just about getting your data into embeddings. You basically configure: hey, I'm calling this embedding model. It turns the data into embeddings and stores them in a vector database. It now supports Pinecone, it supports OpenSearch, it supports MongoDB Atlas and all that. It just supports all-
Chris Engelbert: All of the vector databases. Everyone that has vector database support these days. So basically everyone.
Christina Lin: Most of them are like Postgres. Like pgvector, right?
Chris Engelbert: I'm a Postgres guy. I'm okay with pgvector. Which is great, right?
Christina Lin: 'Cause you don't need to learn too much; everything is already there, which is awesome.
Chris Engelbert: That's one of the things I personally find very attractive about the Postgres ecosystem: because Postgres itself is so extensible, you have everything.
I mean, pgvector is not the only thing. You have geographic information system support. You have a graph database implemented as a Postgres extension. You have a key-value store, you have a JSON document store, you have, I don't know. In the worst case, and you don't want to do this in production, you could put everything in a single database and have a single query across all of those multi-model tables. If we're talking about a multi-model database, Postgres is probably what you're looking for.
Christina Lin: Exactly, right?
Chris Engelbert: I don't think it was meant that way, but it kind of developed like that, and it makes sense. Sometimes you just have small pieces of specific data, and it makes sense to throw them in as well. And if you figure out you need a bigger graph database, hey, there are options. As you said, it's always good to have options.
Christina Lin: Yeah, exactly.
Chris Engelbert: All right. Cool, yeah. Thank you for being here. Is there anything else you want to let the audience know?
Christina Lin: Oh no, it's all good. Thanks for having me here. I hope to see you again, and don't forget to check out our Slack community. I'm there.
Chris Engelbert: Okay, perfect. That's a great one.
Oh, and don't forget to go through the Getting Started pages, right? And the university and what-
And everything.
Christina Lin: Everything I just said.
Chris Engelbert: Right.
And everything you said. Okay. Yeah, cool. As I said, thank you very much for being here. It was a pleasure having you. And for the audience you know where to find us. Same time, same place next week. And I hope you listen in again.
Christina Lin: And subscribe.
Chris Engelbert: And subscribe. There you go. All right. Thank you.