Gunnar Morling, from Decodable, a company simplifying the use of Apache Flink, talks about how to extract changes from a database using change data capture (CDC) tools such as Debezium, and how to use Apache Flink or Decodable to run real-time analytics and queries in a stream processing fashion.
If you have questions for Gunnar, you can reach him here:
- Blog: https://www.morling.dev/
- LinkedIn: https://www.linkedin.com/in/gunnar-morling-2b44b7229
- X/Twitter: https://twitter.com/gunnarmorling
- Mastodon: https://mastodon.online/@gunnarmorling
If you are interested in Decodable, you can find them here:
- Website: https://www.decodable.co/
- X/Twitter: https://twitter.com/Decodableco
Additional show notes:
- Apache Flink: https://flink.apache.org/
- Debezium: https://debezium.io/
- PostgreSQL: https://www.postgresql.org/
- MySQL: https://www.mysql.com/
The Cloud Commute Podcast is presented by simplyblock (https://www.simplyblock.io)
01:00:01
Let me just put it out into the community,
01:00:03
and let's make a challenge out of
01:00:05
it and essentially ask people, so
01:00:07
how fast can you be
01:00:08
with Java to process one billion
01:00:11
rows of a CSV file?
01:00:16
You're listening to simplyblock's Cloud Commute Podcast,
01:00:18
your weekly 20-minute
01:00:19
podcast about cloud technologies,
01:00:21
Kubernetes, security,
01:00:22
sustainability, and more.
01:00:26
Hello everyone. Welcome back to
01:00:27
the next episode of simplyblock's
01:00:29
Cloud Commute podcast.
01:00:31
Today I have a really good guest,
01:00:34
and a really good friend with me. We
01:00:37
have known each other for quite a
01:00:39
while. I don't know, many, many,
01:00:41
many years. Another fellow German.
01:00:45
And I guess,
01:00:46
at least when you're in the Java
01:00:47
world, you must have heard of him.
01:00:50
You must have heard him.
01:00:51
Gunnar, welcome. Happy to have you.
01:00:54
Chris, hello,
01:00:55
everybody. Thank you so much.
01:00:57
Super excited. Yes, I
01:00:58
don't know, to be honest, for how
01:00:59
long we have known each other.
01:01:01
Yes, definitely quite a few years,
01:01:02
you know, always running into each
01:01:04
other in the Java community.
01:01:06
Right. I think the German Java
01:01:07
community is very encapsulated.
01:01:10
There's a good chance you know
01:01:13
a good chunk of them.
01:01:15
I mean, you would actively have to
01:01:16
try and avoid each other,
01:01:18
I guess, if you really don't want
01:01:20
to meet somebody.
01:01:21
That is very, very true. So, well,
01:01:24
we already heard who you are, but
01:01:26
maybe you can give a little bit of
01:01:28
a deeper introduction of yourself.
01:01:30
Sure. So, I'm Gunnar. I
01:01:32
work as a software engineer right
01:01:34
now at a company called Decodable.
01:01:36
We are a small startup in the data
01:01:38
streaming space, essentially
01:01:40
moving and
01:01:41
processing your data. And I
01:01:42
think we will talk more about what
01:01:44
that means. So, that's my current
01:01:46
role. And I have, you know,
01:01:48
a bit of a mixed role between
01:01:49
engineering and then also doing
01:01:51
outreach work,
01:01:52
like doing blog posts,
01:01:54
podcasts, maybe sometimes, going
01:01:56
to conferences, talking about
01:01:57
things. So, that's what I'm
01:01:59
currently doing. Before that, I
01:02:01
had been, up to the day,
01:02:03
exactly 10 years at Red Hat,
01:02:06
where I worked on several
01:02:07
projects. So, I started working
01:02:08
on, you know,
01:02:09
different projects from the
01:02:10
Hibernate umbrella. Yes, it's
01:02:12
still a thing. I still like it.
01:02:14
So, I was doing that for
01:02:15
roughly five years working on Bean
01:02:16
Validation. I was the spec lead
01:02:18
for Bean Validation 2.0,
01:02:19
for instance, which I think is
01:02:21
also how we met, or at least how we
01:02:22
first interacted,
01:02:24
in the context of Bean Validation.
01:02:25
I remember something there.
01:02:27
And then, well, I worked on
01:02:30
a project which is called
01:02:31
Debezium. It's a
01:02:32
tool and a platform
01:02:33
for change data capture. And
01:02:34
again, we will dive into that. But
01:02:37
I guess that's what people might
01:02:38
know me for. I'm also a Java
01:02:40
champion as you are, Chris. And I
01:02:42
did this challenge. I need to
01:02:43
mention it. I did this kind of
01:02:45
viral challenge in the Java space.
01:02:47
Some people might also have come
01:02:49
across my name in that context.
01:02:51
All right. Let's get back to the
01:02:52
challenge in a moment. Maybe say
01:02:54
a couple of words about Decodable.
01:02:56
I think everyone knows Red Hat and
01:02:57
everything Red Hat does. So,
01:02:59
talk about Decodable.
01:03:00
Yes. So,
01:03:02
essentially, we built a SaaS, a
01:03:04
software as a service for
01:03:06
stream processing. This means,
01:03:08
essentially, it connects to all
01:03:10
kinds of data systems,
01:03:11
let's say databases like Postgres
01:03:12
or MySQL, streaming platforms like
01:03:15
Kafka, Apache Pulsar.
01:03:17
It takes data from those kinds of
01:03:19
systems. And in the simplest case,
01:03:21
it just takes this data and
01:03:22
puts it into something like
01:03:23
Snowflake, like a search index,
01:03:25
maybe another database, maybe S3,
01:03:28
maybe something like Apache Pinot
01:03:29
or ClickHouse. So, it's about data
01:03:31
movement in the simplest case,
01:03:32
taking data from one place to
01:03:33
another. And very importantly, all
01:03:35
this happens in real time. So,
01:03:37
it's not batch driven, like, you
01:03:38
know, running once per hour, once
01:03:39
per day or whatever. But this
01:03:40
happens in near real time. So, not
01:03:43
in the hard, you know, computer
01:03:45
science sense of the word,
01:03:47
with a fixed SLA, but with a very
01:03:49
low latency, like seconds,
01:03:51
typically. But then, you know,
01:03:54
going beyond data movement,
01:03:55
there's also what we would call
01:03:56
data processing. So, it's about
01:03:59
filtering your data, transforming
01:04:00
it, routing it, joining multiple
01:04:02
of those real time data streams,
01:04:05
doing things like groupings, real
01:04:07
time analytics of this data, so
01:04:08
you could gain insight
01:04:10
into your data. So, this is what
01:04:11
we do. It's based on Apache Flink
01:04:13
as a stream processing engine.
01:04:14
It's based on Debezium as a CDC
01:04:16
tool. So, this gives you a source
01:04:18
connectivity with all kinds
01:04:19
of databases. And yeah, people use
01:04:21
it for, as I mentioned, for taking
01:04:22
data from one place to
01:04:24
another, but then also for, I
01:04:26
don't know, doing fraud detection,
01:04:27
gaining insight into their
01:04:29
purchase orders or customers, you
01:04:32
know, all those
01:04:33
kinds of things, really.
01:04:35
All right, cool. Let's talk about
01:04:38
your challenge real quick, because
01:04:39
you already mentioned
01:04:40
stream processing. Before we go on
01:04:42
with, like, the other stuff, like,
01:04:44
let's talk about the challenge.
01:04:46
What was that about?
01:04:47
What was that
01:04:48
about? Yes, this was, to be
01:04:49
honest, it was kind of a random
01:04:50
thing, which I started over the
01:04:52
holidays between, you know,
01:04:55
Christmas and New
01:04:55
Year's Eve. So, this
01:04:56
had been on my mind for quite some
01:04:58
time, doing something like with
01:04:59
processing one billion rows,
01:05:01
because that's what it was, the One
01:05:02
Billion Row Challenge. And this
01:05:04
had been on my mind for a
01:05:05
while. And, I don't know, somehow then I
01:05:06
had this idea, okay, let me just
01:05:08
put it out into the community,
01:05:10
and let's make a challenge out of
01:05:11
it and essentially ask people, so
01:05:13
how fast can you be
01:05:15
with Java to process one billion
01:05:17
rows of a CSV file, essentially?
01:05:20
And the task was, you know,
01:05:22
to take temperature measurements,
01:05:23
which were given in that file, and
01:05:25
aggregate them per weather
01:05:28
station. So, the measurements or
01:05:29
the rows in this file were
01:05:30
essentially always
01:05:31
like, you know, a
01:05:32
weather station name and then a
01:05:33
temperature value. And you had to
01:05:35
aggregate them per station, which
01:05:37
means you had to get the minimum,
01:05:38
the maximum and the mean value per
01:05:40
station. So, this was the task.
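[Editor's note: the task Gunnar describes can be sketched in a few lines of straightforward Java. This is a naive baseline for illustration only — the class and method names are made up, and it is not any contestant's actual solution.]

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

// Naive baseline for the One Billion Row Challenge task: each line is
// "<station>;<temperature>", aggregated to min/mean/max per station.
public class RowAggregator {

    // Running statistics for one weather station.
    static final class Stats {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0;
        long count = 0;

        void add(double value) {
            min = Math.min(min, value);
            max = Math.max(max, value);
            sum += value;
            count++;
        }

        double mean() {
            return sum / count;
        }
    }

    static Map<String, Stats> aggregate(Stream<String> lines) {
        Map<String, Stats> byStation = new TreeMap<>();
        lines.forEach(line -> {
            int sep = line.indexOf(';');
            String station = line.substring(0, sep);
            double value = Double.parseDouble(line.substring(sep + 1));
            byStation.computeIfAbsent(station, k -> new Stats()).add(value);
        });
        return byStation;
    }

    public static void main(String[] args) {
        Map<String, Stats> result = aggregate(Stream.of(
                "Hamburg;12.0", "Hamburg;8.0", "Bremen;3.5"));
        Stats hamburg = result.get("Hamburg");
        System.out.println("Hamburg -> min " + hamburg.min
                + ", mean " + hamburg.mean() + ", max " + hamburg.max);
    }
}
```

The fast entries got far beyond this kind of line-by-line parsing, with tricks such as memory-mapping the input and fanning work out across all cores.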
01:05:42
And then it kind of took off. So,
01:05:43
like, you know, many
01:05:44
people from the community
01:05:47
entered this challenge and also
01:05:48
like really big names like Aleksey Shipilëv,
01:05:50
Cliff Click,
01:05:53
Thomas Wuerthinger, the lead of GraalVM at
01:05:55
Oracle and many, many others, they
01:05:57
started to work on this and they
01:06:00
kept working on it for the entire
01:06:02
month of January. And like really
01:06:04
bringing down those
01:06:05
execution times, essentially, in
01:06:07
the end, it was like less than two
01:06:09
seconds for processing this
01:06:11
file, which was 13
01:06:13
gigabytes in size, on an eight-core
01:06:15
CPU configuration.
01:06:18
I think the important thing is you
01:06:19
said less than a second, which is
01:06:21
already impressive because a
01:06:22
lot of people think Java is slow
01:06:23
and everything. Right. We know
01:06:26
those terms and those claims.
01:06:29
By the way, I should clarify. So,
01:06:32
you know, I mean, this is highly
01:06:34
parallelizable, right? So,
01:06:35
it depends on the number of CPU
01:06:37
cores. So, the less than a second
01:06:39
number, I think like 350
01:06:41
milliseconds or so. This was on
01:06:42
the 32 cores I had in this machine
01:06:44
with hyperthreading,
01:06:46
with turbo boost. So, this was the
01:06:48
best I could get.
01:06:49
But it also included reading
01:06:51
those, like 13 gigs,
01:06:53
right? And I think
01:06:54
that is impressive.
01:06:55
Yes. But again, that was reading from
01:06:57
memory. So, essentially, I wanted
01:06:59
to make sure that disk IO
01:07:01
is not part of the equation
01:07:02
because it would be super hard to
01:07:04
measure for me anyway. So,
01:07:06
that's why I said, okay, I will
01:07:07
have everything in a RAM disk.
01:07:09
And, you know, so everything comes
01:07:11
or came out of memory for that context.
01:07:13
Okay. Got it. Got it. But
01:07:15
still, it got pretty viral.
01:07:16
I've seen it from the start and I
01:07:19
was kind of blown away by who
01:07:21
joined that discussion. It was
01:07:23
really cool to look after and to
01:07:27
just follow up. I didn't have time
01:07:29
to jump into that myself,
01:07:31
but by the numbers and the results
01:07:33
I've seen, I would
01:07:35
have not won anyway.
01:07:36
Oh, yeah.
01:07:37
So that was me not wasting time.
01:07:39
Absolutely. I mean, people pulled
01:07:41
off like really crazy tricks to
01:07:43
get there. And by the way,
01:07:45
if you're at JavaLand in a few
01:07:46
weeks, I will do a talk about some
01:07:48
of those things.
01:07:50
I think by the time this comes
01:07:52
out, it was a few
01:07:53
weeks ago. But we'll see.
01:07:55
Okay. I make this mistake in every
01:07:58
recording: I make
01:07:58
the temporal reference.
01:07:59
That's totally fine. I think a lot
01:08:02
of the JavaLand talks are now
01:08:04
recorded these days
01:08:06
and they will show up on YouTube.
01:08:08
So when this comes out and the
01:08:10
talks are already available,
01:08:12
I'll just put it in the show notes.
01:08:13
Perfect.
01:08:14
All right. So that
01:08:15
was the challenge. Let's
01:08:16
get back to Decodable. You
01:08:18
mentioned Apache Flink being like
01:08:20
the underlying technology
01:08:22
you build on. So how does that work?
01:08:26
So Apache Flink, essentially,
01:08:28
that's an open source project
01:08:30
which concerns
01:08:31
itself with real-time data
01:08:34
processing. So it's essentially an
01:08:36
engine for processing either
01:08:39
bounded or unbounded streams
01:08:41
of events. So there's also a way
01:08:43
where you could use it in a batch
01:08:44
mode. But this is not what we
01:08:47
are too interested in so far. It's
01:08:48
always about unbounded data
01:08:49
streams coming from a Kafka topic,
01:08:52
so it takes those event streams,
01:08:55
it defines semantics on those
01:08:57
event streams. Like what's
01:08:58
an event time? What does it mean
01:09:00
if an event arrives late or out of
01:09:02
order? So you have the
01:09:03
building blocks for all those
01:09:04
kinds of things. Then you have a
01:09:06
stack, a layer of
01:09:08
APIs, which allow you
01:09:09
to implement stream processing
01:09:12
applications. So there's a more
01:09:15
imperative API,
01:09:17
which in particular
01:09:17
is called the DataStream API.
01:09:19
So there you really program in
01:09:21
Java, typically,
01:09:23
or Scala, I guess,
01:09:24
your flow in an imperative way.
01:09:27
Yeah, Scala, I don't know who
01:09:28
does that, but there
01:09:29
may be some people.
01:09:31
And then there's more and more
01:09:33
abstract APIs. So there's a table
01:09:34
API, which essentially gives you
01:09:36
like a relational programming
01:09:38
paradigm. And finally, there's
01:09:40
Flink SQL, which also is what
01:09:42
Decodable employs heavily in the
01:09:44
product. So there you reason about
01:09:46
your data streams in terms
01:09:48
of SQL. So let's say, you know,
01:09:49
you want to take the data from an
01:09:52
external system, you would express
01:09:53
this as a create table statement,
01:09:55
and then this table would be
01:09:56
backed by a Kafka topic. And you
01:09:58
can do a select then from such a
01:10:00
table. And then of course you can
01:10:01
do, you know, projections by
01:10:03
massaging your select clause. You
01:10:06
can do filterings by adding where
01:10:07
clauses, you can join multiple
01:10:10
streams by, well, using the join
01:10:12
operator and you can do windowed
01:10:13
aggregations. So I would feel
01:10:16
that's the most accessible way for
01:10:18
doing stream processing, because
01:10:19
there's of course, a large
01:10:20
number of people who can write
01:10:24
SQL, right? Right. And I just
01:10:24
wanted to say, and it's all like
01:10:25
a SQL dialect, it's pretty close
01:10:28
as far as I've seen to the
01:10:30
original like standard SQL.
01:10:32
Yes, exactly. And then there's a few
01:10:33
extensions, you know, because you
01:10:35
need to have this notion of event
01:10:36
time or what does it mean? How do
01:10:38
you express how much lateness you
01:10:40
would be willing to accept
01:10:41
for an aggregation? So there's a
01:10:43
few extensions like that. But
01:10:44
overall, it's SQL. For my demos,
01:10:46
oftentimes, I can start working on
01:10:48
Postgres, develop some
01:10:50
queries on Postgres,
01:10:50
and then I just take them, paste
01:10:52
them into like the Flink SQL
01:10:53
client, and they might just run as
01:10:55
is, or they may need a little bit
01:10:56
of adjustment, but it's pretty
01:10:58
much standard SQL.
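[Editor's note: as an illustration of the workflow described here, a Flink-SQL-style pipeline could look like the following sketch. The table, topic, and field names are invented, not from the episode.]

```sql
-- Hypothetical table backed by a Kafka topic; the WATERMARK clause is
-- one of the Flink-specific extensions around event time.
CREATE TABLE orders (
  order_id   BIGINT,
  category   STRING,
  amount     DECIMAL(10, 2),
  order_time TIMESTAMP(3),
  WATERMARK FOR order_time AS order_time - INTERVAL '5' MINUTE
) WITH (
  'connector' = 'kafka',
  'topic' = 'orders',
  'properties.bootstrap.servers' = 'localhost:9092',
  'format' = 'json'
);

-- Projection, filtering, and a one-hour tumbling window aggregation.
SELECT category,
       TUMBLE_START(order_time, INTERVAL '1' HOUR) AS window_start,
       COUNT(*) AS order_count
FROM orders
WHERE amount > 0
GROUP BY category, TUMBLE(order_time, INTERVAL '1' HOUR);
```

Apart from the watermark declaration and the window function, this is the plain SQL most developers already know.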
01:10:59
All right, cool.
01:11:00
Cool. The other thing you
01:11:01
mentioned was Debezium. And I
01:11:03
know you, I think
01:11:04
you originally started
01:11:05
Debezium. Is that true?
01:11:07
It's not true. No, I did not start it. It
01:11:10
was somebody else at Red Hat,
01:11:12
Randall Hauch, he's now at
01:11:13
Confluent. But I took over the
01:11:17
project quite early on. So
01:11:18
Randall started it. And I
01:11:20
came in after a few months, I
01:11:22
believe. And yeah, I think this is
01:11:24
when it like really took off,
01:11:26
right? So, you know, I went to
01:11:27
many conferences, I
01:11:29
spoke about it. And
01:11:30
of course, others as well. The
01:11:32
team grew at Red Hat. So yeah, I
01:11:34
was the lead for
01:11:35
quite a few years.
01:11:37
So for the people that don't know,
01:11:39
maybe just give a few words about
01:11:41
what Debezium is,
01:11:42
what it does, and why it is so cool.
01:11:43
Right. Yes. Oh,
01:11:44
man, where should I start?
01:11:47
In a nutshell, it's a tool for
01:11:50
what's called change data capture.
01:11:51
So this means it taps into
01:11:53
the transaction log of your
01:11:55
database. And then whenever
01:11:57
there's an insert or
01:11:58
an update or delete,
01:11:59
it will capture this event, and it
01:12:01
will propagate it to consumers. So
01:12:04
essentially, you could think
01:12:05
about it like the observer pattern
01:12:07
for your database. So whenever
01:12:09
there's a data change,
01:12:10
like a new customer record gets
01:12:12
created, or purchase order gets
01:12:13
updated, those kinds of things,
01:12:15
you can, you know, react and
01:12:17
extract this change event from the
01:12:18
database, push it to consumers,
01:12:21
either via Kafka or via callbacks
01:12:24
in an API way, or via, you know,
01:12:27
Google Cloud PubSub,
01:12:28
Kinesis, all those kinds of
01:12:30
things. And then well, you can
01:12:31
take those events and it enables
01:12:33
a ton of use cases. So you know,
01:12:36
in the simplest case, it's just
01:12:38
about replication. So taking data
01:12:40
from your operational database to
01:12:41
your cloud data warehouse, or to
01:12:43
your search index, or maybe to
01:12:45
cache. But then also people use
01:12:47
change data capture for doing
01:12:49
things like microservices,
01:12:52
data exchange, because I mean,
01:12:53
microservices, they, you want to
01:12:55
have them independent,
01:12:56
but still, they need to exchange
01:12:58
data, right? So they don't exist
01:12:59
in isolation, and change data
01:13:01
capture can help with that in
01:13:03
particular, with what's called the
01:13:04
outbox pattern, just as a
01:13:05
side note. People use it for
01:13:07
splitting up monolithic systems
01:13:09
into microservices,
01:13:11
you can use this change
01:13:12
event stream as an audit log. I
01:13:13
mean, if you kind of think about
01:13:15
it, it's, you
01:13:15
know, if you just keep
01:13:16
those events, all the updates to a
01:13:18
purchase order, and put them into a
01:13:21
database, it's kind of like an
01:13:22
audit log, right? Maybe you
01:13:23
want to enrich it with a bit of
01:13:24
metadata. You can do streaming
01:13:26
queries. So maybe you
01:13:28
want to spot specific patterns in
01:13:30
your data as it changes,
01:13:31
and then trigger some sort of
01:13:33
alert. That's one use case, and
01:13:35
many, many more, but really,
01:13:36
it's a super versatile tool, I
01:13:38
would say.
01:13:39
Yeah, and I also
01:13:42
have a couple of
01:13:43
talks on that area.
01:13:44
And I think my favorite example,
01:13:46
that's something that everyone
01:13:47
understands is that you have some
01:13:50
order coming in, and now you want
01:13:52
to send out invoices. Invoices
01:13:54
don't need to be sent like,
01:13:56
in the same operation, but you
01:13:59
want to make sure that you only
01:14:00
send out the invoice if the
01:14:02
invoice was, or if the order was
01:14:04
actually generated in the
01:14:06
database. So that is where the
01:14:07
outbox pattern comes in, or just
01:14:09
looking at the order table in
01:14:11
general, and filtering out all the
01:14:12
new orders.
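[Editor's note: the outbox pattern described here can be sketched as a single database transaction. The table and column names are hypothetical, invented for illustration — the point is that the order and the event announcing it either both commit or neither does, and Debezium then relays the outbox row, e.g. to the invoicing service.]

```sql
BEGIN;

-- The business write itself.
INSERT INTO purchase_orders (id, customer_id, total)
VALUES (1001, 42, 199.90);

-- The event row Debezium will capture from the transaction log;
-- it only becomes visible if the order insert commits with it.
INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
VALUES ('purchase_order', '1001', 'OrderCreated',
        '{"orderId": 1001, "customerId": 42, "total": 199.90}');

COMMIT;
```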
01:14:14
Yes.
01:14:14
So yeah,
01:14:15
absolutely great tool. Love it. It
01:14:18
supports many, many databases. Any
01:14:20
idea how many so far?
01:14:22
It keeps growing.
01:14:23
I don't know, certainly 10 or
01:14:26
so, or more. The interesting thing
01:14:28
there is, well, you know, there is
01:14:31
not a standardized way you could
01:14:33
implement something like Debezium.
01:14:35
So each of the databases, they
01:14:36
have their own APIs, formats, their
01:14:38
own ways for extracting
01:14:41
those change events, which means
01:14:42
there needs to be a dedicated
01:14:44
Debezium connector for each
01:14:45
database, which we want to
01:14:46
support. And then the core team,
01:14:49
you know, added support for MySQL,
01:14:51
Postgres, SQL Server, Oracle, Cassandra,
01:14:53
MongoDB, and so on. But then what
01:14:56
happened is that also other
01:14:57
companies and other organizations
01:14:59
picked up the Debezium framework.
01:15:01
So for instance, now something
01:15:02
like Google Cloud Spanner, it's
01:15:04
also supported via Debezium,
01:15:06
because the team at
01:15:07
Google, they decided,
01:15:08
okay, they want to expose change
01:15:10
events based on the Debezium event
01:15:12
format and infrastructure. Or
01:15:15
ScyllaDB: they maintain their
01:15:16
own CDC connector, but it's based
01:15:18
on Debezium. And the nice thing
01:15:20
about that is that it gives you as
01:15:23
a user, one unified change event
01:15:25
format, right? So you don't
01:15:26
have to care, which is the
01:15:27
particular source database, does
01:15:29
it come from Cloud Spanner,
01:15:30
or does it come from Postgres? You
01:15:31
can process those events in a
01:15:32
unified way, which I think is
01:15:35
just great to see that it
01:15:36
establishes itself as a sort of a
01:15:38
de facto standard, I would say.
01:15:39
Yeah, I think that is important.
01:15:41
That is a very, very good point.
01:15:44
Debezium basically defined a JSON
01:15:46
and I think Avro standard.
01:15:50
Right. So I mean, you know, it
01:15:51
defines the, let's say, the
01:15:54
semantic
01:15:54
structure, like, you know,
01:15:56
what are the fields, what are the
01:15:57
types, how are they organized, and
01:15:59
then how you serialize it as
01:16:01
Avro, JSON, or protocol buffers.
01:16:04
That's essentially like a
01:16:06
pluggable concern.
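[Editor's note: for illustration, here is a heavily abridged sketch of what such a change event looks like in its JSON serialization. The field values are invented; real Debezium events carry considerably more source metadata and, optionally, an inline schema.]

```json
{
  "before": { "id": 1001, "status": "PENDING" },
  "after":  { "id": 1001, "status": "SHIPPED" },
  "source": {
    "connector": "postgresql",
    "db": "inventory",
    "table": "purchase_orders"
  },
  "op": "u",
  "ts_ms": 1712345678901
}
```

`op` marks the kind of change (`c` create, `u` update, `d` delete, `r` snapshot read), and `before`/`after` show the row on either side of it — the same envelope regardless of which database produced the event.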
01:16:08
Right. So we said earlier,
01:16:10
Decodable is a cloud platform. So
01:16:12
you basically have,
01:16:14
to put it a little colloquially,
01:16:15
you have Apache Flink on steroids,
01:16:18
ready to use, plus a couple
01:16:20
of stuff on top of that. So maybe
01:16:22
talk a little bit about that.
01:16:24
Right. So yes, that's the
01:16:26
underlying tech, I would say. And
01:16:28
then of course, if you want to
01:16:31
put those things into production,
01:16:34
there's so many things you need to
01:16:35
consider. Right. So
01:16:36
how do you just go about
01:16:38
developing and versioning those
01:16:39
SQL statements? If you
01:16:40
iterate on a statement,
01:16:42
you want to have maybe like a
01:16:43
preview and get a feeling or maybe
01:16:46
just validation of this. So we
01:16:47
have all this editing experience,
01:16:49
preview. Then maybe you don't want
01:16:53
that all of your users in
01:16:55
your organization can access all
01:16:57
those streaming pipelines, which
01:16:58
you have. Right. So you want to
01:16:59
have something like role-based
01:17:01
access control. You want to have
01:17:03
managed connectors. You want to
01:17:08
have automatic provisioning and
01:17:11
sizing of your infrastructure. So
01:17:13
you don't want to think too
01:17:15
much, "hey, do I need to keep like
01:17:17
five machines for this dataflow
01:17:19
sitting around?" And what happens
01:17:20
if I don't need them? Do I need to
01:17:22
remove them and then scale them
01:17:23
back up again? So all this
01:17:26
auto scaling, auto provisioning,
01:17:27
this is something which we do.
01:17:29
Then we will
01:17:30
primarily allow you to
01:17:32
use SQL to define your queries,
01:17:35
but then also we actually let you
01:17:36
run your own custom Flink jobs.
01:17:38
If that's something which you want
01:17:39
to do, you can do this. We are
01:17:41
very close. And again,
01:17:42
by the time this is released,
01:17:44
it should be live
01:17:44
already. We will have Python,
01:17:46
PyFlink support, and yeah, many,
01:17:50
many more things. Right. So really
01:17:52
it's a managed experience for
01:17:54
those dataflows.
01:17:56
Right. That makes
01:17:57
a lot of sense. So let me see.
01:18:02
From a user's perspective,
01:18:04
I'm mostly working with SQL. I'm
01:18:06
writing my jobs. I'm deploying
01:18:07
those. Those jobs are
01:18:11
everything from simple ETL:
01:18:14
extract, transform, ...
01:18:18
What's the L again?
01:18:22
Load. Load. There you go. Nobody
01:18:24
needs to load data. It just
01:18:26
magically appears. But you can
01:18:28
also do data enrichment. You said
01:18:29
that earlier. You can do joins.
01:18:31
Right. So is there anything I
01:18:34
have to be aware of that is very
01:18:36
complicated compared to just using
01:18:38
a standard database?
01:18:41
Mm. Yeah. I mean, I think this
01:18:44
entire notion of event time, this
01:18:47
definitely is something which
01:18:48
can be challenging. So let's say
01:18:51
you want to do some sort of
01:18:53
windowed analysis, like, you know,
01:18:55
how many purchase orders do I have
01:18:57
per category and hour, you know,
01:19:00
this kind of thing. And now,
01:19:01
depending on what's the source of
01:19:03
your data, those events might
01:19:05
arrive out of order. Right. So
01:19:07
it might be that your hour has
01:19:10
closed. But then, like, five
01:19:12
minutes later,
01:19:13
because some event was
01:19:14
stuck in some queue, you still get
01:19:16
an event for that past hour.
01:19:19
Right. And of course, now the
01:19:20
question is, there's this tradeoff
01:19:22
between, okay, how accurate do you
01:19:24
want your data to be? Essentially,
01:19:26
how long do you want to wait for
01:19:28
those late events versus, well,
01:19:30
what is your
01:19:31
latency? Right. Do you
01:19:31
want to get out this updated count
01:19:33
at the top of the hour? Or can you
01:19:35
afford to wait for those five
01:19:36
minutes? So there's a bit of a
01:19:38
tradeoff. I think, you know, this
01:19:41
entire complex of
01:19:42
event time, I think
01:19:42
that's certainly something where
01:19:43
people often have at least some
01:19:46
time to learn and
01:19:47
grasp the concepts.
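[Editor's note: the tradeoff Gunnar describes can be illustrated with a deliberately simplified plain-Java sketch — no Flink involved, names invented. Events carry an event time but may arrive out of order, and a window only accepts an event as long as the watermark (here simply the highest event time seen, minus the allowed lateness) has not passed the window's end.]

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simplified event-time windowing: count events per one-hour tumbling
// window, dropping events whose window the watermark already closed.
public class WindowedCounts {

    static final long HOUR_MS = 3_600_000L;

    // eventTimes are given in *arrival* order; the event times themselves
    // may be out of order. The watermark trails the highest event time
    // seen so far by the allowed lateness.
    static Map<Long, Long> countPerHour(List<Long> eventTimes, long allowedLatenessMs) {
        Map<Long, Long> counts = new TreeMap<>(); // window start -> count
        long watermark = Long.MIN_VALUE;
        for (long t : eventTimes) {
            watermark = Math.max(watermark, t - allowedLatenessMs);
            long windowStart = Math.floorDiv(t, HOUR_MS) * HOUR_MS;
            if (windowStart + HOUR_MS <= watermark) {
                continue; // window already finalized: the late event is lost
            }
            counts.merge(windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Long> arrivals = List.of(0L, 3_700_000L, 3_500_000L);
        // With zero lateness the third event (hour 0, arriving after an
        // hour-1 event) is dropped; with 5 minutes of lateness it is kept.
        System.out.println(countPerHour(arrivals, 0L));
        System.out.println(countPerHour(arrivals, 300_000L));
    }
}
```

Granting more lateness keeps counts accurate for stragglers, at the cost of the result only becoming final that much later — exactly the accuracy-versus-latency tension from the conversation.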
01:19:49
Yeah, that's a very good one. In a
01:19:52
previous episode, we had the
01:19:54
discussion about connected
01:19:55
cars. And connected cars may or
01:19:57
may not have an internet
01:20:00
connection all the time. So you
01:20:01
get super, super late events
01:20:03
sometimes. All right.
01:20:05
Because we're almost
01:20:06
running out of time.
01:20:08
Wow. Okay.
01:20:09
Yeah. 20 minutes is
01:20:10
like nothing. What is the biggest
01:20:14
trend you see
01:20:14
right now in terms of
01:20:15
database, in terms of cloud, in
01:20:17
terms of whatever you like?
01:20:19
Right.
01:20:20
I mean, that's a tough one. Well,
01:20:22
I guess there can only be one
01:20:23
answer, right? It has to be AI. I
01:20:25
feel it's like, I
01:20:26
mean, I know it's
01:20:26
boring. But well, the trend is not
01:20:29
boring. But saying it is kind of
01:20:30
boring. But I mean, that's
01:20:31
what I would see. The way I could
01:20:35
see this impacting things like what we do,
01:20:37
I mean, it could help you just
01:20:38
with like, scaling, of course,
01:20:41
like, you know, we could make
01:20:42
intelligent
01:20:43
predictions about what's
01:20:47
your workload like, maybe we can
01:20:48
take a look at the data and we can
01:20:50
sense, okay, you know, it might
01:20:52
make sense to scale out some more
01:20:53
compute load already, because we
01:20:55
will know with a certain
01:20:56
likelihood that it may be needed
01:20:57
very shortly. I could see that
01:21:00
then, of course, I mean, it could
01:21:01
just help you with authoring those
01:21:02
flows, right? I mean, with all
01:21:05
those LLMs, it might be doable to
01:21:08
give you some sort of guided
01:21:10
experience there. So that's a big
01:21:12
trend for sure.
01:21:13
Then I guess another
01:21:14
one, I would see more technical,
01:21:15
I feel like there's a
01:21:17
unification
01:21:18
happening, right, of systems
01:21:20
and categories of systems. So
01:21:22
right now we have, you know,
01:21:23
databases here,
01:21:25
stream processing engines
01:21:26
there. And I feel those things
01:21:27
might come more closely together.
01:21:29
And you would have real time
01:21:31
streaming capabilities also in
01:21:32
something like Postgres itself.
01:21:34
And, I don't know, maybe it
01:21:35
would expose Postgres
01:21:36
as a Kafka broker, in a sense. So
01:21:39
I could see also some more, you
01:21:41
know, some closer integration
01:21:43
of those different kinds of tools.
01:21:46
That is interesting,
01:21:47
because I also think that there is
01:21:49
a general like movement to, I
01:21:52
mean, in the past we had the
01:21:55
idea of moving to
01:21:57
different databases,
01:21:58
because all of them were very
01:21:59
specific. And now all of the big
01:22:02
databases, Oracle, Postgres,
01:22:05
well, even MySQL, they all start
01:22:07
to integrate all of those like
01:22:08
multi-model
01:22:09
features. And Postgres,
01:22:11
being at the forefront, having
01:22:13
this like super extensibility.
01:22:16
So yeah, that would be interesting.
01:22:18
Right. I mean, it's
01:22:19
always going in cycles, I
01:22:21
feel, right. And even with this
01:22:23
trend toward decomposition, it
01:22:25
gives you all those good building
01:22:27
blocks, which you then can
01:22:28
put together and, I don't know, create a
01:22:29
more cohesive integrated
01:22:31
experience,
01:22:31
right. And then I guess
01:22:32
in five years, we want to tear it
01:22:34
apart again, and like, let people
01:22:35
integrate everything themselves.
01:22:37
In 5 to 10 years, we have the
01:22:39
next iteration of microservices.
01:22:41
We called it SOAP, we called it
01:22:43
whatever. Now we call it
01:22:45
microservices. Who knows what we
01:22:46
call it in the future.
01:22:48
All right. Thank you very much.
01:22:50
That was a good chat.
01:22:52
As always, I love talking.
01:22:55
Yeah, thank you so much for having
01:22:57
me. This was great. Enjoyed the
01:22:59
conversation. And
01:23:00
let's talk soon.
01:23:01
Absolutely. And for everyone else,
01:23:03
come back next week.
01:23:04
A new episode, a new guest. And
01:23:07
thank you very much.
01:23:09
See you.
01:23:11
The Cloud Commute Podcast is sponsored by
01:23:13
simplyblock, your own elastic
01:23:15
block storage engine for the cloud.
01:23:17
Get higher IOPS and low, predictable
01:23:18
latency while bringing down your
01:23:20
total cost of ownership.
01:23:21
www.simplyblock.io

