Change Data Capture and Stream Processing in the Cloud - Gunnar Morling from Decodable
Cloud Commute · April 26, 2024
00:23:24 · 21.43 MB


Gunnar Morling, from Decodable, a company simplifying the use of Apache Flink, talks about how to capture changes from a database using change data capture tools such as Debezium, and how to use Apache Flink or Decodable to run your real-time analytics or queries in a stream processing fashion.

If you have questions for Gunnar, you can reach him here:

If you are interested in Decodable, you can find them here:

Additional show notes:

The Cloud Commute Podcast is presented by simplyblock (https://www.simplyblock.io)


01:00:01
Let me just put it out into the community,

01:00:03
and let's make a challenge out of

01:00:05
it and essentially ask people, so

01:00:07
how fast can you be

01:00:08
with Java to process one billion

01:00:11
rows of a CSV file?

01:00:16
You're listening to simplyblock's Cloud Commute Podcast,

01:00:18
your weekly 20-minute

01:00:19
podcast about cloud technologies,

01:00:21
Kubernetes, security,

01:00:22
sustainability, and more.

01:00:26
Hello everyone. Welcome back to

01:00:27
the next episode of simplyblock's

01:00:29
Cloud Commute podcast.

01:00:31
Today I have a really good guest,

01:00:34
and a really good friend with me. We

01:00:37
have known each other for quite a

01:00:39
while. I don't know, many, many,

01:00:41
many years. Another fellow German.

01:00:45
And I guess a lot of,

01:00:46
at least when you're in the Java

01:00:47
world, you must have heard of him.

01:00:50
You must have heard him.

01:00:51
Gunnar, welcome. Happy to have you.

01:00:54
Chris, hello,

01:00:55
everybody. Thank you so much.

01:00:57
Super excited. Yes, I

01:00:58
don't know, to be honest, for how

01:00:59
long we have known each other.

01:01:01
Yes, definitely quite a few years,

01:01:02
you know, always running into each

01:01:04
other in the Java community.

01:01:06
Right. I think the German Java

01:01:07
community is very encapsulated.

01:01:10
There's a good chance, you know,

01:01:13
a good chunk of them.

01:01:15
I mean, you would actively have to

01:01:16
try and avoid each other,

01:01:18
I guess, if you really don't want

01:01:20
to meet somebody.

01:01:21
That is very, very true. So, well,

01:01:24
we already heard who you are, but

01:01:26
maybe you can give a little bit of

01:01:28
a deeper introduction of yourself.

01:01:30
Sure. So, I'm Gunnar. I

01:01:32
work as a software engineer right

01:01:34
now at a company called Decodable.

01:01:36
We are a small startup in the data

01:01:38
streaming space, essentially

01:01:40
moving and

01:01:41
processing your data. And I

01:01:42
think we will talk more about what

01:01:44
that means. So, that's my current

01:01:46
role. And I have, you know,

01:01:48
a bit of a mixed role between

01:01:49
engineering and then also doing

01:01:51
outreach work,

01:01:52
like doing blog posts,

01:01:54
podcasts, maybe sometimes, going

01:01:56
to conferences, talking about

01:01:57
things. So, that's what I'm

01:01:59
currently doing. Before that, I was

01:02:01
at Red Hat for exactly 10 years,

01:02:03
up to the day,

01:02:06
where I worked on several

01:02:07
projects. So, I started working

01:02:08
on, you know,

01:02:09
different projects from the

01:02:10
Hibernate umbrella. Yes, it's

01:02:12
still a thing. I still like it.

01:02:14
So, I was doing that for

01:02:15
roughly five years working on Bean

01:02:16
Validation. I was the spec lead

01:02:18
for Bean Validation 2.0,

01:02:19
for instance, which I think is

01:02:21
also how we met or I believe we

01:02:22
interacted somehow

01:02:24
in the context of Bean Validation.

01:02:25
I remember something there.

01:02:27
And then, well, I worked on

01:02:30
a project which is called

01:02:31
Debezium. It's a

01:02:32
tool and a platform

01:02:33
for change data capture. And

01:02:34
again, we will dive into that. But

01:02:37
I guess that's what people might

01:02:38
know me for. I'm also a Java

01:02:40
champion as you are, Chris. And I

01:02:42
did this challenge. I need to

01:02:43
mention it. I did this kind of

01:02:45
viral challenge in the Java space.

01:02:47
Some people might also have come

01:02:49
across my name in that context.

01:02:51
All right. Let's get back to the

01:02:52
challenge in a moment. Maybe say

01:02:54
a couple of words about Decodable.

01:02:56
I think everyone knows Red Hat and

01:02:57
everything Red Hat does. So,

01:02:59
talk about Decodable.

01:03:00
Yes. So,

01:03:02
essentially, we built a SaaS, a

01:03:04
software as a service for

01:03:06
stream processing. This means,

01:03:08
essentially, it connects to all

01:03:10
kinds of data systems,

01:03:11
let's say databases like Postgres

01:03:12
or MySQL, streaming platforms like

01:03:15
Kafka, Apache Pulsar.

01:03:17
It takes data from those kinds of

01:03:19
systems. And in the simplest case,

01:03:21
it just takes this data and

01:03:22
puts it into something like

01:03:23
Snowflake, like a search index,

01:03:25
maybe another database, maybe S3,

01:03:28
maybe something like Apache Pinot

01:03:29
or ClickHouse. So, it's about data

01:03:31
movement in the simplest case,

01:03:32
taking data from one place to

01:03:33
another. And very importantly, all

01:03:35
this happens in real time. So,

01:03:37
it's not batch driven, like, you

01:03:38
know, running once per hour, once

01:03:39
per day or whatever. But this

01:03:40
happens in near real time. So, not

01:03:43
in the hard, you know, computer

01:03:45
science sense of the word,

01:03:47
with a fixed SLA, but with a very

01:03:49
low latency, like seconds,

01:03:51
typically. But then, you know,

01:03:54
going beyond data movement,

01:03:55
there's also what we would call

01:03:56
data processing. So, it's about

01:03:59
filtering your data, transforming

01:04:00
it, routing it, joining multiple

01:04:02
of those real time data streams,

01:04:05
doing things like groupings, real

01:04:07
time analytics of this data, so

01:04:08
you could gain insight

01:04:10
into your data. So, this is what

01:04:11
we do. It's based on Apache Flink

01:04:13
as a stream processing engine.

01:04:14
It's based on Debezium as a CDC

01:04:16
tool. So, this gives you a source

01:04:18
connectivity with all kinds

01:04:19
of databases. And yeah, people use

01:04:21
it for, as I mentioned, for taking

01:04:22
data from one place to

01:04:24
another, but then also for, I

01:04:26
don't know, doing fraud detection,

01:04:27
gaining insight into their

01:04:29
purchase orders or customers, you

01:04:32
know, all those

01:04:33
kinds of things, really.

01:04:35
All right, cool. Let's talk about

01:04:38
your challenge real quick, because

01:04:39
you already mentioned

01:04:40
stream processing. Before we go on

01:04:42
with, like, the other stuff, like,

01:04:44
let's talk about the challenge.

01:04:46
What was that about?

01:04:47
What was that

01:04:48
about? Yes, this was, to be

01:04:49
honest, it was kind of a random

01:04:50
thing, which I started over the

01:04:52
holidays between, you know,

01:04:55
Christmas and New

01:04:55
Year's Eve. So, this

01:04:56
had been on my mind for quite some

01:04:58
time, doing something like with

01:04:59
processing one billion rows,

01:05:01
because that's what it was, the one

01:05:02
billion row challenge. And this

01:05:04
had been on my mind for a

01:05:05
while. And then somehow, I

01:05:06
had this idea, okay, let me just

01:05:08
put it out into the community,

01:05:10
and let's make a challenge out of

01:05:11
it and essentially ask people, so

01:05:13
how fast can you be

01:05:15
with Java to process one billion

01:05:17
rows of a CSV file, essentially?

01:05:20
And the task was, you know,

01:05:22
to take temperature measurements,

01:05:23
which were given in that file, and

01:05:25
aggregate them per weather

01:05:28
station. So, the measurements or

01:05:29
the rows in this file were

01:05:30
essentially always

01:05:31
like, you know, a

01:05:32
weather station name and then a

01:05:33
temperature value. And you had to

01:05:35
aggregate them per station, which

01:05:37
means you had to get the minimum,

01:05:38
the maximum and the mean value per

01:05:40
station. So, this was the task.
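
As a point of reference, here is a deliberately naive baseline for that task in plain Java, nothing like the optimized entries; the semicolon-separated "station;value" line format is an assumption for illustration:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Map;
    import java.util.TreeMap;

    public class BaselineAggregator {

        // Running aggregate for one weather station.
        static final class Stats {
            double min = Double.POSITIVE_INFINITY;
            double max = Double.NEGATIVE_INFINITY;
            double sum;
            long count;

            void add(double value) {
                min = Math.min(min, value);
                max = Math.max(max, value);
                sum += value;
                count++;
            }
        }

        public static void main(String[] args) throws IOException {
            Map<String, Stats> stats = new TreeMap<>();
            // Stream the file line by line; each line is assumed to be "station;temperature".
            try (var lines = Files.lines(Path.of(args[0]))) {
                lines.forEach(line -> {
                    int sep = line.indexOf(';');
                    String station = line.substring(0, sep);
                    double value = Double.parseDouble(line.substring(sep + 1));
                    stats.computeIfAbsent(station, k -> new Stats()).add(value);
                });
            }
            // Emit min/mean/max per station.
            stats.forEach((name, s) -> System.out.printf(
                "%s=%.1f/%.1f/%.1f%n", name, s.min, s.sum / s.count, s.max));
        }
    }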

01:05:42
And then it kind of took off. So,

01:05:43
like, you know, many

01:05:44
people from the community

01:05:47
entered this challenge and also

01:05:48
like really big names like Aleksey Shipilëv,

01:05:50
Cliff Click,

01:05:53
Thomas Wuerthinger, the lead of GraalVM at

01:05:55
Oracle and many, many others, they

01:05:57
started to work on this and they

01:06:00
kept working on it for the entire

01:06:02
month of January. And like really

01:06:04
bringing down those

01:06:05
execution times, essentially, in

01:06:07
the end, it was like less than two

01:06:09
seconds for processing this

01:06:11
file, which was 13

01:06:13
gigabytes in size, on an eight-core

01:06:15
CPU configuration.
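
For scale: a 13 gigabyte file processed in under two seconds works out to roughly 13 GB / 2 s ≈ 6.5 GB/s of sustained parsing and aggregation throughput.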

01:06:18
I think the important thing is you

01:06:19
said less than a second, which is

01:06:21
already impressive because a

01:06:22
lot of people think Java is slow

01:06:23
and everything. Right. We know

01:06:26
those terms and those claims.

01:06:29
By the way, I should clarify. So,

01:06:32
you know, I mean, this is highly

01:06:34
parallelizable, right? So,

01:06:35
it depends on the number of CPU

01:06:37
cores. So, the less than a second

01:06:39
number, I think like 350

01:06:41
milliseconds or so. This was on

01:06:42
all 32 cores I had in this machine

01:06:44
with hyperthreading,

01:06:46
with turbo boost. So, this was the

01:06:48
best I could get.

01:06:49
But it also included reading

01:06:51
those, like 13 gigs,

01:06:53
right? And I think

01:06:54
that is impressive.

01:06:55
Yes. But again, then reading from

01:06:57
memory. So, essentially, I wanted

01:06:59
to make sure that disk IO

01:07:01
is not part of the equation

01:07:02
because it would be super hard to

01:07:04
measure for me anyway. So,

01:07:06
that's why I said, okay, I will

01:07:07
have everything in a RAM disk.

01:07:09
And, you know, so everything comes

01:07:11
or came out of memory for that context.

01:07:13
Okay. Got it. Got it. But

01:07:15
still, it got pretty viral.

01:07:16
I've seen it from the start and I

01:07:19
was kind of blown away by who

01:07:21
joined that discussion. It was

01:07:23
really cool to watch and to

01:07:27
just follow along. I didn't have time

01:07:29
to jump into that myself,

01:07:31
but by the numbers and the results

01:07:33
I've seen, I would

01:07:35
have not won anyway.

01:07:36
Oh, yeah.

01:07:37
That saved me from wasting time.

01:07:39
Absolutely. I mean, people pulled

01:07:41
off like really crazy tricks to

01:07:43
get there. And by the way,

01:07:45
if you're at JavaLand in a few

01:07:46
weeks, I will do a talk about some

01:07:48
of those things there.

01:07:50
I think by the time this comes

01:07:52
out, it was a few

01:07:53
weeks ago. But we'll see.

01:07:55
Okay. I made the mistake I make in every

01:07:58
recording. I made

01:07:58
the temporal reference.

01:07:59
That's totally fine. I think a lot

01:08:02
of the JavaLand talks are now

01:08:04
recorded these days

01:08:06
and they will show up on YouTube.

01:08:08
So when this comes out and the

01:08:10
talks are already available,

01:08:12
I'll just put it in the show notes.

01:08:13
Perfect.

01:08:14
All right. So that

01:08:15
was the challenge. Let's

01:08:16
get back to Decodable. You

01:08:18
mentioned Apache Flink being like

01:08:20
the underlying technology

01:08:22
you build on. So how does that work?

01:08:26
So Apache Flink, essentially,

01:08:28
that's an open source project

01:08:30
which concerns

01:08:31
itself with real-time data

01:08:34
processing. So it's essentially an

01:08:36
engine for processing either

01:08:39
bounded or unbounded streams

01:08:41
of events. So there's also a way

01:08:43
where you could use it in a batch

01:08:44
mode. But this is not what we

01:08:47
are too interested in so far. It's

01:08:48
always about unbounded data

01:08:49
streams coming from a Kafka topic,

01:08:52
so it takes those event streams,

01:08:55
it defines semantics on those

01:08:57
event streams. Like what's

01:08:58
an event time? What does it mean

01:09:00
if an event arrives late or out of

01:09:02
order? So you have the

01:09:03
building blocks for all those

01:09:04
kinds of things. Then you have a

01:09:06
stack, a layer of

01:09:08
APIs, which allow you

01:09:09
to implement stream processing

01:09:12
applications. So there's a more

01:09:15
imperative API,

01:09:17
which in particular

01:09:17
is called the DataStream API.

01:09:19
So there you really program in

01:09:21
Java, typically,

01:09:23
or Scala, I guess,

01:09:24
your flow in an imperative way.
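
As a toy illustration of that imperative style with Flink's DataStream API (the element values and job name are made up):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class ImperativeFlow {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            // Filter and transform a stream step by step, as explicit operations.
            env.fromElements("debezium", "flink", "kafka")
               .filter(word -> word.length() > 5)
               .map(String::toUpperCase)
               .print();

            env.execute("imperative-flow");
        }
    }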

01:09:27
Yeah, Scala, I don't know who

01:09:28
does it, but there

01:09:29
may be some people.

01:09:31
And then there's more and more

01:09:33
abstract APIs. So there's a table

01:09:34
API, which essentially gives you

01:09:36
like a relational programming

01:09:38
paradigm. And finally, there's

01:09:40
Flink SQL, which also is what

01:09:42
Decodable employs heavily in the

01:09:44
product. So there you reason about

01:09:46
your data streams in terms

01:09:48
of SQL. So let's say, you know,

01:09:49
you want to take the data from an

01:09:52
external system, you would express

01:09:53
this as a create table statement,

01:09:55
and then this table would be

01:09:56
backed by a Kafka topic. And you

01:09:58
can do a select then from such a

01:10:00
table. And then of course you can

01:10:01
do, you know, projections by

01:10:03
massaging your select clause. You

01:10:06
can do filtering by adding where

01:10:07
clauses, you can join multiple

01:10:10
streams by well using the join

01:10:12
operator and you can do windowed

01:10:13
aggregations. So I would feel

01:10:16
that's the most accessible way of

01:10:18
doing stream processing, because

01:10:19
there's of course, a large

01:10:20
number of people who can write

01:10:22
SQL, right? Right. And I just

01:10:24
wanted to say, it's a SQL dialect,

01:10:25
and it's pretty close,

01:10:28
as far as I've seen, to

01:10:30
standard SQL.

01:10:32
Yes, exactly. And then there's a few

01:10:33
extensions, you know, because you

01:10:35
need to have this notion of event

01:10:36
time or what does it mean? How do

01:10:38
you express how much lateness you

01:10:40
would be willing to accept

01:10:41
for an aggregation? So there's a

01:10:43
few extensions like that. But

01:10:44
overall, it's SQL. For my demos,

01:10:46
oftentimes, I can start working on

01:10:48
Postgres, developing some

01:10:50
queries on Postgres,

01:10:50
and then I just take them, paste

01:10:52
them into like the Flink SQL

01:10:53
client, and they might just run as

01:10:55
is, or they may need a little bit

01:10:56
of adjustment, but it's pretty

01:10:58
much standard SQL.
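
To make that concrete, here is a minimal sketch of such a Kafka-backed table and a query over it, using Flink's Table API from Java; the topic, fields, and connector options are made up, and the Kafka SQL connector dependency is assumed to be on the classpath:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class OrdersPipeline {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            // A table backed by a Kafka topic; reading from it yields an unbounded stream.
            tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id STRING," +
                "  category STRING," +
                "  amount   DECIMAL(10, 2)" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'orders'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'scan.startup.mode' = 'earliest-offset'," +
                "  'format' = 'json'" +
                ")");

            // Projection and filtering, expressed as plain SQL over the stream.
            tEnv.executeSql(
                "SELECT order_id, amount FROM orders WHERE category = 'books'")
                .print();
        }
    }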

01:10:59
All right, cool.

01:11:00
Cool. The other thing you

01:11:01
mentioned was Debezium. And I

01:11:03
know you, I think

01:11:04
you originally started

01:11:05
Debezium. Is that true?

01:11:07
It's not true. No, I did not start it. It

01:11:10
was somebody else at Red Hat,

01:11:12
Randall Hauch, he's now at

01:11:13
Confluent. But I took over the

01:11:17
project quite early on. So

01:11:18
Randall started it. And I

01:11:20
came in after a few months, I

01:11:22
believe. And yeah, I think this is

01:11:24
when it like really took off,

01:11:26
right? So, you know, I went to

01:11:27
many conferences, I

01:11:29
spoke about it. And

01:11:30
of course, others as well. The

01:11:32
team grew at Red Hat. So yeah, I

01:11:34
was the lead for

01:11:35
quite a few years.

01:11:37
So for the people that don't know,

01:11:39
maybe just give a few words about

01:11:41
what Debezium is,

01:11:42
what it does, and why it is so cool.

01:11:43
Right. Yes. Oh,

01:11:44
man, where should I start?

01:11:47
In a nutshell, it's a tool for

01:11:50
what's called change data capture.

01:11:51
So this means it taps into

01:11:53
the transaction log of your

01:11:55
database. And then whenever

01:11:57
there's an insert or

01:11:58
an update or delete,

01:11:59
it will capture this event, and it

01:12:01
will propagate it to consumers. So

01:12:04
essentially, you could think

01:12:05
about it like the observer pattern

01:12:07
for your database. So whenever

01:12:09
there's a data change,

01:12:10
like a new customer record gets

01:12:12
created, or a purchase order gets

01:12:13
updated, those kinds of things,

01:12:15
you can, you know, react and

01:12:17
extract this change event from the

01:12:18
database, push it to consumers,

01:12:21
either via Kafka or via callbacks

01:12:24
in an API way, or via, you know,

01:12:27
Google Cloud Pub/Sub,

01:12:28
Kinesis, all those kinds of

01:12:30
things. And then well, you can

01:12:31
take those events and it enables

01:12:33
a ton of use cases. So you know,

01:12:36
in the simplest case, it's just

01:12:38
about replication. So taking data

01:12:40
from your operational database to

01:12:41
your cloud data warehouse, or to

01:12:43
your search index, or maybe to

01:12:45
a cache. But then also people use

01:12:47
change data capture for doing

01:12:49
things like microservices,

01:12:52
data exchange, because I mean,

01:12:53
microservices, you want to

01:12:55
have them self-contained,

01:12:56
but still, they need to exchange

01:12:58
data, right? So they don't exist

01:12:59
in isolation, and change data

01:13:01
capture can help with that in

01:13:03
particular, with what's called the

01:13:04
outbox pattern. Just as a

01:13:05
side note, people use it for

01:13:07
splitting up monolithic systems

01:13:09
into microservices,

01:13:11
you can use this change

01:13:12
event stream as an audit log. I

01:13:13
mean, if you kind of think about

01:13:15
it, it's, you

01:13:15
know, if you just keep

01:13:16
those events, all the updates to

01:13:18
a purchase order, and put them into a

01:13:21
database, it's kind of like an

01:13:22
audit log, right? Maybe you

01:13:23
want to enrich it with a bit of

01:13:24
metadata. You can do streaming

01:13:26
queries. So, I don't know, maybe you

01:13:28
want to spot specific patterns in

01:13:30
your data as it changes,

01:13:31
and then trigger some sort of

01:13:33
alert. That's one use case, and

01:13:35
many, many more, but really,

01:13:36
it's a super versatile tool, I

01:13:38
would say.
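
As one hedged sketch of the callback-style consumption mentioned above, Debezium also ships an embedded engine that delivers change events to an in-process handler without going through Kafka Connect; the connection details below are placeholders, and exact config keys vary by Debezium version:

    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import io.debezium.engine.ChangeEvent;
    import io.debezium.engine.DebeziumEngine;
    import io.debezium.engine.format.Json;

    public class CdcListener {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.setProperty("name", "inventory-listener");
            props.setProperty("connector.class", "io.debezium.connector.postgresql.PostgresConnector");
            props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
            props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
            props.setProperty("database.hostname", "localhost");
            props.setProperty("database.port", "5432");
            props.setProperty("database.user", "postgres");
            props.setProperty("database.password", "secret");
            props.setProperty("database.dbname", "inventory");
            props.setProperty("topic.prefix", "inventory");

            DebeziumEngine<ChangeEvent<String, String>> engine =
                DebeziumEngine.create(Json.class)
                    .using(props)
                    .notifying(event -> {
                        // One event per insert/update/delete captured from the log.
                        System.out.println(event.value());
                    })
                    .build();

            // The engine is a Runnable; run it on its own thread.
            ExecutorService executor = Executors.newSingleThreadExecutor();
            executor.execute(engine);
        }
    }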

01:13:39
Yeah, and I also

01:13:42
have a couple of

01:13:43
talks in that area.

01:13:44
And I think my favorite example,

01:13:46
that's something that everyone

01:13:47
understands is that you have some

01:13:50
order coming in, and now you want

01:13:52
to send out invoices. Invoices

01:13:54
don't need to be sent like,

01:13:56
in the same operation, but you

01:13:59
want to make sure that you only

01:14:00
send out the invoice if the

01:14:02
invoice was, or if the order was

01:14:04
actually generated in the

01:14:06
database. So that is where the

01:14:07
outbox pattern comes in, or just

01:14:09
looking at the order table in

01:14:11
general, and filtering out all the

01:14:12
new orders.

01:14:14
Yes.
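
A minimal sketch of that idea with plain JDBC: the order row and the outbox event are written in one transaction, so a downstream consumer only ever sees events for orders that actually committed; the table layout, JDBC URL, and credentials are made up:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class OrderService {
        public void placeOrder(String orderId, String payloadJson) throws SQLException {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/shop", "shop", "secret")) {
                conn.setAutoCommit(false);
                try (PreparedStatement insertOrder = conn.prepareStatement(
                         "INSERT INTO orders (id, payload) VALUES (?, ?::jsonb)");
                     PreparedStatement insertEvent = conn.prepareStatement(
                         "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?::jsonb)")) {

                    // The order itself.
                    insertOrder.setString(1, orderId);
                    insertOrder.setString(2, payloadJson);
                    insertOrder.executeUpdate();

                    // The outbox event, in the same transaction; Debezium picks it
                    // up from the outbox table's change stream.
                    insertEvent.setString(1, orderId);
                    insertEvent.setString(2, "OrderCreated");
                    insertEvent.setString(3, payloadJson);
                    insertEvent.executeUpdate();

                    conn.commit();
                } catch (SQLException e) {
                    // Roll back both writes together: no order, no event.
                    conn.rollback();
                    throw e;
                }
            }
        }
    }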

01:14:14
So yeah,

01:14:15
absolutely great tool. Love it. It

01:14:18
supports many, many databases. Any

01:14:20
idea how many so far?

01:14:22
It keeps growing.

01:14:23
I don't know, certainly 10 or

01:14:26
so or more. The interesting thing

01:14:28
there is, well, you know, there is

01:14:31
not a standardized way you could

01:14:33
implement something like Debezium.

01:14:35
So each of the databases, they

01:14:36
have their own APIs, formats, their

01:14:38
own ways for extracting

01:14:41
those change events, which means

01:14:42
there needs to be a dedicated

01:14:44
Debezium connector for each

01:14:45
database, which we want to

01:14:46
support. And then the core team,

01:14:49
you know, added support for MySQL,

01:14:51
Postgres, SQL Server, Oracle, Cassandra,

01:14:53
MongoDB, and so on. But then what

01:14:56
happened is that also other

01:14:57
companies and other organizations

01:14:59
picked up the Debezium framework.

01:15:01
So for instance, now something

01:15:02
like Google Cloud Spanner, it's

01:15:04
also supported via Debezium,

01:15:06
because the team at

01:15:07
Google, they decided,

01:15:08
okay, they want to expose change

01:15:10
events based on the Debezium event

01:15:12
format and infrastructure. Or

01:15:15
ScyllaDB, where they maintain their

01:15:16
own CDC connector, but it's based

01:15:18
on Debezium. And the nice thing

01:15:20
about that is that it gives you as

01:15:23
a user, one unified change event

01:15:25
format, right? So you don't

01:15:26
have to care which

01:15:27
particular source database it is, does

01:15:29
it come from Cloud Spanner,

01:15:30
or does it come from Postgres? You

01:15:31
can process those events in a

01:15:32
unified way, which I think is

01:15:35
just great to see that it

01:15:36
establishes itself as a sort of a

01:15:38
de facto standard, I would say.

01:15:39
Yeah, I think that is important.

01:15:41
That is a very, very good point.

01:15:44
Debezium basically defined a JSON

01:15:46
and I think Avro standard.

01:15:50
Right. So I mean, you know, it

01:15:51
defines the, let's say, the

01:15:54
semantic

01:15:54
structure, like, you know,

01:15:56
what are the fields, what are the

01:15:57
types, how are they organized, and

01:15:59
then how you serialize it as

01:16:01
Avro, JSON, or Protocol Buffers.

01:16:04
That's essentially like a

01:16:06
pluggable concern.
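
For illustration, a heavily abridged change event for an update, rendered as JSON, might look like the following; the exact field set varies by connector and serialization format:

    {
      "before": { "id": 1001, "status": "OPEN" },
      "after":  { "id": 1001, "status": "PAID" },
      "source": { "connector": "postgresql", "db": "shop", "table": "orders" },
      "op": "u",
      "ts_ms": 1714130000000
    }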

01:16:08
Right. So we said earlier,

01:16:10
Decodable is a cloud platform. So

01:16:12
basically,

01:16:14
to put it a bit loosely,

01:16:15
you have Apache Flink on steroids,

01:16:18
ready to use, plus a couple

01:16:20
of things on top of that. So maybe

01:16:22
talk a little bit about that.

01:16:24
Right. So yes, that's the

01:16:26
underlying tech, I would say. And

01:16:28
then of course, if you want to

01:16:31
put those things into production,

01:16:34
there's so many things you need to

01:16:35
consider. Right. So

01:16:36
how do you just go about

01:16:38
developing and versioning those

01:16:39
SQL statements? If you

01:16:40
iterate on a statement,

01:16:42
you want to have maybe like a

01:16:43
preview and get a feeling or maybe

01:16:46
just validation of this. So we

01:16:47
have all this editing experience,

01:16:49
preview. Then maybe you don't want

01:16:53
all of your users in

01:16:55
your organization to access all

01:16:57
those streaming pipelines, which

01:16:58
you have. Right. So you want to

01:16:59
have something like role-based

01:17:01
access control. You want to have

01:17:03
managed connectors. You want to

01:17:08
have automatic provisioning and

01:17:11
sizing of your infrastructure. So

01:17:13
you don't want to think too

01:17:15
much, "hey, do I need to keep like

01:17:17
five machines for this dataflow

01:17:19
sitting around?" And what happens

01:17:20
if I don't need them? Do I need to

01:17:22
remove them and then scale them

01:17:23
back up again? So all this

01:17:26
auto scaling, auto provisioning,

01:17:27
this is something which we do.

01:17:29
Then we will

01:17:30
primarily allow you to

01:17:32
use SQL to define your queries,

01:17:35
but then also we actually let you

01:17:36
run your own custom Flink jobs.

01:17:38
If that's something which you want

01:17:39
to do, you can do this. We are

01:17:41
very close. And again,

01:17:42
by the time this is released,

01:17:44
it should be live

01:17:44
already. We will have Python,

01:17:46
PyFlink support, and yeah, many,

01:17:50
many more things. Right. So really

01:17:52
it's a managed experience for

01:17:54
those dataflows.

01:17:56
Right. That makes

01:17:57
a lot of sense. So let me see.

01:18:02
From a user's perspective,

01:18:04
I'm mostly working with SQL. I'm

01:18:06
writing my jobs. I'm deploying

01:18:07
those. Those jobs are

01:18:11
everything from simple ETL to

01:18:14
extract, transform, ...

01:18:18
What's the L again?

01:18:22
Load. Load. There you go. Nobody

01:18:24
needs to load data. They just

01:18:26
magically appear. But you can

01:18:28
also do data enrichment. You said

01:18:29
that earlier. You can do joins.

01:18:31
Right. So is there anything I

01:18:34
have to be aware of that is very

01:18:36
complicated compared to just using

01:18:38
a standard database?

01:18:41
Mm. Yeah. I mean, I think this

01:18:44
entire notion of event time, this

01:18:47
definitely is something which

01:18:48
can be challenging. So let's say

01:18:51
you want to do some sort of

01:18:53
windowed analysis, like, you know,

01:18:55
how many purchase orders do I have

01:18:57
per category and hour, you know,

01:19:00
this kind of thing. And now,

01:19:01
depending on what's the source of

01:19:03
your data, those events might

01:19:05
arrive out of order. Right. So

01:19:07
it might be that your hour has

01:19:10
closed. But then, like, five

01:19:12
minutes later,

01:19:13
because some event was

01:19:14
stuck in some queue, you still get

01:19:16
an event for that past hour.

01:19:19
Right. And of course, now the

01:19:20
question is, there's this tradeoff

01:19:22
between, okay, how accurate do you

01:19:24
want your data to be? Essentially,

01:19:26
how long do you want to wait for

01:19:28
those late events versus, well,

01:19:30
what is your

01:19:31
latency? Right. Do you

01:19:31
want to get out this updated count

01:19:33
at the top of the hour? Or can you

01:19:35
afford to wait for those five

01:19:36
minutes? So there's a bit of a

01:19:38
tradeoff. I think, you know, this

01:19:41
entire topic of

01:19:42
event time, I think

01:19:42
that's certainly something where

01:19:43
people often need at least some

01:19:46
time to learn and

01:19:47
grasp the concepts.
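
As a sketch of that tradeoff in Flink SQL, again via the Table API: the watermark declaration below tells the engine to wait up to five minutes for stragglers before an hourly window is considered complete; all names and connector options are illustrative:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class HourlyOrderCounts {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

            tEnv.executeSql(
                "CREATE TABLE purchase_orders (" +
                "  category   STRING," +
                "  order_time TIMESTAMP(3)," +
                // Accept events arriving up to 5 minutes late: a smaller interval
                // means lower latency, a larger one means more accurate counts.
                "  WATERMARK FOR order_time AS order_time - INTERVAL '5' MINUTE" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'purchase-orders'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'format' = 'json'" +
                ")");

            // Orders per category and hour, emitted once each window closes.
            tEnv.executeSql(
                "SELECT window_start, category, COUNT(*) AS order_count " +
                "FROM TABLE(TUMBLE(TABLE purchase_orders, DESCRIPTOR(order_time), INTERVAL '1' HOUR)) " +
                "GROUP BY window_start, window_end, category")
                .print();
        }
    }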

01:19:49
Yeah, that's a very good one. In a

01:19:52
previous episode, we had the

01:19:54
discussion about connected

01:19:55
cars. And connected cars may or

01:19:57
may not have an internet

01:20:00
connection all the time. So you

01:20:01
get super, super late events

01:20:03
sometimes. All right.

01:20:05
Because we're almost

01:20:06
running out of time.

01:20:08
Wow. Okay.

01:20:09
Yeah. 20 minutes is

01:20:10
like nothing. What is the biggest

01:20:14
trend you see

01:20:14
right now in terms of

01:20:15
database, in terms of cloud, in

01:20:17
terms of whatever you like?

01:20:19
Right.

01:20:20
I mean, that's a tough one. Well,

01:20:22
I guess there can only be one

01:20:23
answer, right? It has to be AI. I

01:20:25
feel it's like, I

01:20:26
mean, I know it's

01:20:26
boring. But well, the trend is not

01:20:29
boring. But saying it is kind of

01:20:30
boring. But I mean, that's

01:20:31
what I would see. The way I could

01:20:35
see this impact things like we do,

01:20:37
I mean, it could help you just

01:20:38
with like, scaling, of course,

01:20:41
like, you know, we could make

01:20:42
intelligent

01:20:43
predictions about what's

01:20:47
your workload like, maybe we can

01:20:48
take a look at the data and we can

01:20:50
sense, okay, you know, it might

01:20:52
make sense to scale out some more

01:20:53
compute load already, because we

01:20:55
will know with a certain

01:20:56
likelihood that it may be needed

01:20:57
very shortly. I could see that

01:21:00
then, of course, I mean, it could

01:21:01
just help you with authoring those

01:21:02
flows, right? I mean, with all

01:21:05
those LLMs, it might be doable to

01:21:08
give you some sort of guided

01:21:10
experience there. So that's a big

01:21:12
trend for sure.

01:21:13
Then I guess another

01:21:14
one, I would see more technical,

01:21:15
I feel like there's a

01:21:17
unification

01:21:18
happening, right, of systems

01:21:20
and categories of systems. So

01:21:22
right now we have, you know,

01:21:23
databases here,

01:21:25
stream processing engines

01:21:26
there. And I feel those things

01:21:27
might come more closely together.

01:21:29
And you would have real time

01:21:31
streaming capabilities also in

01:21:32
something like Postgres itself.

01:21:34
And, I don't know, maybe you

01:21:35
would expose Postgres

01:21:36
as a Kafka broker, in a sense. So

01:21:39
I could see also some more, you

01:21:41
know, some closer integration

01:21:43
of those different kinds of tools.

01:21:46
That is interesting,

01:21:47
because I also think that there is

01:21:49
a general like movement to, I

01:21:52
mean, in the past we had the

01:21:55
idea of moving to

01:21:57
different databases,

01:21:58
because all of them were very

01:21:59
specific. And now all of the big

01:22:02
databases, Oracle, Postgres,

01:22:05
well, even MySQL, they all start

01:22:07
to integrate all of those like

01:22:08
multi-model

01:22:09
features. And Postgres,

01:22:11
being at the forefront, having

01:22:13
this like super extensibility.

01:22:16
So yeah, that would be interesting.

01:22:18
Right. I mean,

01:22:19
it's always going in cycles, I

01:22:21
feel right. And even having this

01:22:23
trend toward decomposition, like it

01:22:25
gives you all those good building

01:22:27
blocks, which you then can

01:22:28
put together and, you know, create a

01:22:29
more cohesive, integrated

01:22:31
experience,

01:22:31
right. And then I guess

01:22:32
in five years, we want to tear it

01:22:34
apart again, and like, let people

01:22:35
integrate everything themselves.

01:22:37
In 5 to 10 years, we have the

01:22:39
next iteration of microservices.

01:22:41
We called it SOAP we called it

01:22:43
whatever. Now we call it

01:22:45
microservices. Who knows what we

01:22:46
call it in the future.

01:22:48
All right. Thank you very much.

01:22:50
That was a good chat.

01:22:52
Like always, I love talking.

01:22:55
Yeah, thank you so much for having

01:22:57
me. This was great. Enjoyed the

01:22:59
conversation. And

01:23:00
let's talk soon.

01:23:01
Absolutely. And for everyone else,

01:23:03
come back next week.

01:23:04
A new episode, a new guest. And

01:23:07
thank you very much.

01:23:09
See you.

01:23:11
The Cloud Commute Podcast is sponsored by

01:23:13
simplyblock, your own elastic

01:23:15
block storage engine for the cloud.

01:23:17
Get higher IOPS and low, predictable

01:23:18
latency while bringing down your

01:23:20
total cost of ownership.

01:23:21
www.simplyblock.io