Chris Engelbert: You're basically describing what a product manager should be.
Michele Mancioppi: Not all product managers, by far, are technical enough to understand what the engineers are saying. And that actually leads to a lot of interesting conversations and people saying, oh, it's impossible to do that.
Chris Engelbert: I agree. And I've been there many times. That's why I'm saying it is basically exactly what a product manager should be: at least having a basic understanding of what is happening. And if people say we have an issue, you should at least get the feeling of, okay, I get a rough idea of what is happening.
Michele Mancioppi: The complex bits are actually the model in terms of entities, or what you would call resources in OpenTelemetry, and how you organize the telemetry around them. And it takes a lot of work to effectively bridge a set of semantic conventions designed in OpenTelemetry, plus the merging of the Elastic Common Schema, on top of what you have been doing proprietarily for years.
Chris Engelbert: Welcome back, everyone. Welcome to this week's episode of Simplyblock's Cloud Commute Podcast. I know I say it every time, but this week I have another incredible guest. We've actually known each other for quite a while. Again, there is some kind of a pattern: we've been colleagues before, we worked for the same company.
Michele, welcome, and happy to have you. Thank you for being here. So maybe just quickly introduce yourself: who you are, where you're from, what you've done in the past. And why do you think I invited you?
Michele Mancioppi: Well, it's a lot of questions. So my name is Michele Mancioppi. I am a head of product at Dash0, an observability startup that is redefining what OpenTelemetry native observability means.
My career path has been strange. I started out as a full-stack engineer at SAP, then went into performance engineering with the JVM. Then it turned out I had a bad moment having to figure out a very nasty bug, and the observability at SAP was not excellent. I spent a few years fixing that. And then I decided it was more fun to build observability than to use observability, transformed myself into a product manager, and joined Instana, where I met you. And I have been effectively a product person by day and an engineer by night ever since.
Chris Engelbert: All right, cool. How do you define engineer by night?
Michele Mancioppi: I code more than I should, and I let technical knowledge drive a lot of the decisions that I make. So I talk engineering with engineers, and business with product. And that kind of helps.
Chris Engelbert: So you're basically describing what a product manager should be.
Michele Mancioppi: Not all product managers, by far, are technical enough to understand what the engineers are saying. And that actually leads to a lot of interesting conversations and people saying, Oh, it's impossible to do that.
Chris Engelbert: I agree. And I've been there many times. That's why I'm saying it is basically exactly what a product manager should be.
At least having the basic understanding of what is happening. And if people say, we have an issue, you should at least get the feeling of, okay, I get a rough idea of what is happening.
Michele Mancioppi: Yeah, I agree on that. I don't think the industry does. I sometimes see people going directly from university into product management, and I'm wondering, how is that going to work?
Chris Engelbert: Yeah, it's exactly like the second group, the ones who go to university and study IT because you're supposed to earn a lot. I'm sorry if I'm insulting anyone; that's just my personal experience.
So they study informatics because it's so cool. And then they figure out that software engineering in itself is actually really boring, because we rarely have, I guess, the interesting problems to solve, I don't know.
But yeah. Then they either work as an engineer for a year or two, really hate it, and go straight for management or product management positions, or, as you said, right out of university: okay, let's fix that now.
Michele Mancioppi: Yeah, product management. I mean, I don't want to gatekeep, but I don't feel that product management should be a way to escape the sometimes unnecessary complexity of the technology. It's actually a way to increase your impact by effectively programming through Outlook.
Chris Engelbert: Let's not go deeper into the Outlook stuff. Anyway.
Michele Mancioppi: Let's not carbon-date ourselves.
Chris Engelbert: Yeah, exactly. So, tell us a little bit about Dash0. You said you guys are working on observability, specifically towards OpenTelemetry, which is slightly different from what we did at Instana, I guess.
Michele Mancioppi: Yes. We live in interesting times in terms of observability. Look back, for example, at how the state of the art was from around 2015 up to effectively 2020: if you wanted to create an observability solution, a very large amount of the effort had to be devoted to collecting the data that you would then process to, you know, provide insights.
OpenTelemetry has upended that entirely. There is now a pretty professional, very high-quality set of implementations, well thought through, for collecting all types of telemetry. The quality of the metadata varies a bit, but the protocol, OTLP for example, is pretty good. And that effectively means that the competition between tools is less about which data you can collect and more about what you do with that data.
Which is great, because it's a part of observability that was completely underserved for a very long while. The user experience, both in terms of the way you interact with the UI and the ease with which you get your questions answered, I thought was due for a refresh. And something else is that OpenTelemetry has a very rich set of correlations between the various signals.
Signals being, you know, traces, metrics, logs, soon profiles. They are intercorrelated in many different ways. And there is an incredible amount of value that you can provide to the user of the solution by actually tapping into those correlations in ways that, for example, we didn't do at Instana.
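The cross-signal correlation Michele describes can be sketched with plain data structures. The records below are simplified, hypothetical stand-ins for OTLP payloads, not a real SDK API: the point is that logs carry the trace_id of the request being served, and all signals share resource attributes identifying the emitting service.

```python
# Hypothetical, simplified records illustrating how OpenTelemetry signals
# correlate: the names and values here are made up for illustration.
resource = {"service.name": "checkout", "service.version": "1.4.2"}

spans = [
    {"trace_id": "abc123", "span_id": "s1", "name": "POST /checkout", "resource": resource},
]

logs = [
    {"trace_id": "abc123", "severity": "ERROR", "body": "payment declined", "resource": resource},
    {"trace_id": "zzz999", "severity": "INFO", "body": "cache warmed", "resource": resource},
]

def logs_for_trace(trace_id, log_records):
    """Jump from a trace to the log lines emitted while serving that request."""
    return [log for log in log_records if log["trace_id"] == trace_id]

correlated = logs_for_trace("abc123", logs)
print(correlated[0]["body"])  # the log that belongs to the failing request
```

A real backend does this jump the other way round too: from a log line, the stored trace_id lets you pull up the whole distributed trace.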
Chris Engelbert: Okay, interesting. Just to make sure that everyone's on the same page and understands what observability is: it's basically giving you insight not just into a database or an application specifically, but all across the stack, basically from the user clicking somewhere to the response that makes stuff happen in the browser again.
Michele Mancioppi: Yes. What you need in terms of observability depends on which systems you are taking care of. If your application is a monolith with a database behind it, the kind of telemetry that you're going to collect is going to be different, and probably simpler, than the kind of telemetry you need to collect from an aggressively distributed microservice application.
There are trade-offs. For example, if you have more monoliths and less distributed systems, you lean much more on the metrics and logging you do than on traces. If your application doesn't have a front end, you do not need real user monitoring.
If your application doesn't have heavy load, you probably don't need profiling. So the kind of data you need depends very much on what your application does. And it can go from something relatively tame, like logs and a few metrics, to a lot of different things that answer different aspects of your questions.
And the more complicated the data is, the richer it is, the more important it is that the tool works with you to show you what's important to look at, and that it allows you to switch between the signals very ergonomically, very quickly, so that you can actually look at the problem from different perspectives.
I'll give you a very concrete example. Logs are usually very good at telling you that one particular thing happened. But it's relatively hard to find out what a log is correlated with: which requests were being served as the event described by that log occurred? The answer to that is tracing. Tracing allows you to find out what happened before, while serving the same end-user request, and what happened afterwards.
So tracing effectively creates a path across all your applications along which you can correlate events and different things. But tracing is also quite expensive if you want to store everything.
So you start collecting metrics instead, which are a very cheap representation but carry less information. So you really need to be able to balance all these different aspects to reach a trade-off, in terms of which information you have, how easy it is for you to query, and how much it's going to cost you, that you're going to be happy with.
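The trade-off Michele outlines, full-fidelity traces versus cheap aggregate metrics, can be illustrated with a toy aggregation. The record shapes are hypothetical simplifications, not real OTLP structures:

```python
# Hypothetical spans: each one is a per-request record, expensive to store
# in bulk but carrying full detail about an individual request.
spans = [
    {"name": "GET /cart", "duration_ms": 12, "error": False},
    {"name": "GET /cart", "duration_ms": 340, "error": True},
    {"name": "GET /cart", "duration_ms": 18, "error": False},
]

# Aggregating them into a metric keeps a cheap summary, but drops the
# ability to inspect any individual request afterwards.
metric = {
    "name": "http.server.request.count",  # name loosely imitates the conventions
    "value": len(spans),
    "attributes": {"error.rate": sum(s["error"] for s in spans) / len(spans)},
}

print(metric["value"], round(metric["attributes"]["error.rate"], 2))
```

Three spans collapse into two numbers: the storage cost drops by orders of magnitude at scale, and in exchange you can no longer ask which specific request took 340 ms.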
Chris Engelbert: And I think you hinted at something very important there, something we also always put front and center at Instana, and I see Dash0 is doing the same thing: you need context. You don't just need a single signal or a single log message; you actually need the context in which it happened, how it happened, and in what order it happened. That is basically, I guess, what you were hinting at.
Michele Mancioppi: I am known to say that telemetry without context is just data. If you see a metric called errors with a value of 0.8, is it good or is it bad? Should you wake up in the middle of the night? The answer is: who knows? What type of errors, on which system? What is the impact? Is it happening in QA? Can it wait until tomorrow? Is it happening in production? It depends which service it affects. If it's the checkout service, it's probably all hands on deck. If it's happening in some batch processing for the product analytics, yeah, it can wait; they're not going to notice until tomorrow. Right?
So, to some extent, it's better to have less telemetry, but very well annotated with context, so that you understand what it means. And not only semantically what it means: a metric called errors with a value of 0.8, is it a percentage? Is it a counter? What does it mean? Right.
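Michele's errors-with-value-0.8 example can be made concrete in a few lines. This is a hypothetical sketch, not a real SDK call: the attribute names loosely imitate OpenTelemetry resource conventions, and the triage rule is invented for illustration.

```python
# Without context, this datapoint is just data: good or bad? Who knows.
bare = {"name": "errors", "value": 0.8}

# The same datapoint annotated with context. Attribute names loosely follow
# OpenTelemetry conventions; the values are made up for illustration.
annotated = {
    "name": "errors",
    "value": 0.8,
    "unit": "1",  # declared as a ratio, not a counter, so 0.8 is interpretable
    "attributes": {
        "service.name": "checkout",
        "deployment.environment": "production",
        "error.type": "timeout",
    },
}

def should_page(metric):
    """Toy triage rule: only page for production errors on a user-facing service."""
    attrs = metric.get("attributes", {})
    return (
        attrs.get("deployment.environment") == "production"
        and attrs.get("service.name") == "checkout"
        and metric["value"] > 0.5
    )

print(should_page(annotated))  # True: production checkout errors, hands on deck
print(should_page(bare))       # False: without context, no decision is possible
```

The point is not the rule itself but that the rule is only expressible at all once the telemetry carries its context.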
Chris Engelbert: And is that a standard value, or is there a sudden increase, a sudden drop, anything like that? Yes, it's always in context, in relation to other values, previous and following. So, going with OpenTelemetry as basically an official standard: I think it's not an ISO standard or anything, but there's a big group that went for it and said, okay, let's standardize observability.
Before that, there was OpenTracing, and the other one was OpenCensus, whose name I always forget because nobody really seemed to care for it. They initially came together and said, okay, let's standardize. And I think even at Instana, towards the end, when we were acquired by IBM, we had already moved towards OpenTelemetry and also got engaged in the group, right?
So how would you see that? I mean, now probably all observability vendors go for OpenTelemetry, so it's probably not a competitive advantage, but how would you describe that? How would you say: why do we do this?
Michele Mancioppi: There is a bit to unpack here. OpenTelemetry is a very large project; it is the second-largest project in the Cloud Native Computing Foundation in terms of contributions. And it effectively brings together everybody that is in the observability industry and a whole bunch of adopters. It is an admittedly complex solution, because the kind of data structures that you need for observability are complex.
Some aspects, some people say, are complicated. I think it serves a lot of use cases, and it requires some mastery to make good use of it. And that is actually both a product challenge and an opportunity: in order to make good use of the very complex, very juicy amount of data and correlations that you have inside OpenTelemetry, you need to design a product around it.
We see different ways for companies to go about it. Some embrace it, some pay lip service to OpenTelemetry, and some rage on Twitter. I'm not going to name names, but...
Chris Engelbert: X, please.
Michele Mancioppi: No, no, no. At Instana, for example, I was the PM for all things tracing and agent. And I estimated at 18 months the amount of work it would take us to effectively map the features of OpenTelemetry on top of the similar data structures and concepts that we had from our own proprietary tracing technology. It's not just about collecting spans and showing them in a trace view like Jaeger.
The complex bits are actually the model in terms of entities, what you would call resources in OpenTelemetry, and how you organize the telemetry around them. And it takes a lot of work to effectively bridge a set of semantic conventions designed in OpenTelemetry, plus the merging of the Elastic Common Schema later on, on top of what you have been doing proprietarily for years. It's very difficult.
Chris Engelbert: So I think the other part in that is, well, kind of related to the question before: at Instana, as you said, we built the whole tracing thing ourselves. We made, well, not agents, but collectors for all kinds of different technologies. And I think that is where an open source project of the size of OpenTelemetry comes in very handy, because basically everyone is part of it. Everyone implements collectors; oftentimes the vendors themselves, like database vendors, build them and contribute them, or keep them up to date themselves.
I think that makes a lot of stuff much easier, and probably even better, because a lot of the vendors might know much better what kind of data is actually necessary to operate the database. How do you see that?
Michele Mancioppi: I would not use the word collector; the collector is something different in OpenTelemetry. We could think of them as the exporters in Prometheus. At Instana, we called them sensors. So, effectively, interfaces that would provide you with telemetry. The fact that we see more and more instrumentation, more and more metric endpoints, built into libraries and frameworks upstream is, in general, a very good trend.
It's not always the case that these instrumentations do an exceedingly good job of following semantic conventions because there is a large amount of lore that one needs to master in order to know which attributes to use in which situations. And sometimes there are categories of metadata that you may want to export that do not have a corresponding entry in the OpenTelemetry Semantic Conventions.
And then you have the choice to engage the community or not; some just go ahead and do their own thing. But it's definitely a very welcome trend.
What I think has happened across the industry, and for example at Instana, is that the cost of actually collecting telemetry was immense.
As the product manager for all things tracing and agent, I had more than half of engineering working in teams for which I was the PM. And it was effectively trying to empty an ocean with a spoon to support all types of libraries in all types of versions, because often the work you need to do is bespoke, without being able to contribute to the library upstream.
There were some very bad offenders. Like at some point I counted how many different HTTP server and client libraries we had to support in Java. It was upwards of 50.
That's insanity. The fact that a lot of libraries are actually starting to integrate the OpenTelemetry API to define how, for example, the spans are going to look, that's very, very welcome. In particular for technologies where the metadata is well understood, it's amazing.
For things like metrics, we don't need to reinvent everything from scratch, because the Prometheus community has done an amazing job of creating all types of exporters. They were way lighter in terms of semantic conventions than OpenTelemetry is, but that's a gap that is being bridged by the two communities working together.
Recently, Prometheus released version 3.0. There are things that are going to happen over the lifetime of the 3.x range of Prometheus versions that are going to make interoperation with OpenTelemetry much better, and I'm really looking forward to that. And of the main signals that everybody uses, logs are the ones that need the least standardized structure.
For example, when you look at the amount of semantic conventions for spans and metrics, it's like 10 or 20 times more than what OpenTelemetry has so far had to define for logs. Because a lot of what you want to define for logs is more like: where is the log coming from? And then a bunch of metadata about what this specific log means in your system. That metadata is, in large part, effectively specific to your application.
And there, the challenge with logs is less the quality of the metadata, and more the fact that a lot of logging libraries, and a lot of the ways you collect logs, are not structured.
For example, in containerized environments like Kubernetes, your application could directly emit logs towards an OpenTelemetry backend using the OTLP protocol, or you could let the container runtime collect standard out and standard error, put it in a file, and then whatever reads that file has to parse the log again to find out what the severity is, and, if it's a JSON log, which fields count as metadata. And there are a bajillion different log formats. So, yeah.
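As a rough illustration of the re-parsing problem Michele describes: the line below imitates what a container runtime writes to disk (timestamp, stream, flag, then the raw payload), but it is a simplified, hypothetical example, and the parser is a sketch rather than production code.

```python
import json

# A hypothetical line as a container runtime might write it to a node-local
# file. Whatever reads this file has to parse the payload again to recover
# structure the application already had when it emitted the log.
cri_line = '2024-10-01T12:00:00.000000000Z stderr F {"level":"error","msg":"payment declined","order_id":42}'

def parse_cri_line(line):
    timestamp, stream, _flag, payload = line.split(" ", 3)
    record = {"timestamp": timestamp, "stream": stream}
    try:
        # If the application logged JSON, severity and metadata are recoverable.
        body = json.loads(payload)
        record["severity"] = body.get("level", "unknown")
        record["body"] = body
    except json.JSONDecodeError:
        # Plain-text log: structure is lost, keep the raw line as the body.
        record["severity"] = "unknown"
        record["body"] = payload
    return record

parsed = parse_cri_line(cri_line)
print(parsed["severity"], parsed["body"]["order_id"])  # error 42
```

Emitting OTLP directly from the application skips this round-trip entirely: severity and attributes travel as structured fields instead of being serialized into text and guessed back out.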
Chris Engelbert: I think one thing you haven't mentioned: sure, we had that many HTTP implementations at Instana, but with every single release some of those broke, and you then had to build yet another implementation, which was just ever so slightly different.
I remember that from Hazelcast as well, when we used to instrument Hibernate, and that was such a beauty as well. We've already crossed the 20-minute mark, but I certainly have three more questions. First of all, how would I get started with Dash0? You already said you went into beta, and as a mostly cloud and Kubernetes podcast, I think people would love to understand how to get started on Kubernetes.
Michele Mancioppi: So we launched a beta a couple of weeks ago. And there is going to be something big happening around KubeCon; if you are in Salt Lake City, come by and say hi. For Kubernetes, there are a bunch of options available. We have, of course, a first-party operator, which makes the collection of all types of telemetry from your cluster very easy.
For example: tracing your applications out of the box, collecting the logs from the nodes, collecting metrics about the pods and the Kube API, and there's going to be more. So, trying to get a very healthy baseline of telemetry for your Kubernetes cluster with absolutely no toil on your hands.
If you're already using the upstream OpenTelemetry operator, great: point it to Dash0, it works. If you're using OpenTelemetry SDKs in your applications, great: point them to Dash0, it works. So there is a whole range of possibilities to get data into Dash0, because effectively the contract with the user is the OTLP protocol.
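As a sketch of what "point it to Dash0" could look like with a plain OpenTelemetry Collector: the configuration below uses the standard otlphttp exporter, but the endpoint URL and authorization token are placeholders, not real Dash0 values (consult the Dash0 documentation for the actual ingress endpoint).

```yaml
# Hypothetical OpenTelemetry Collector configuration. The endpoint and
# the authorization token are placeholders, not real Dash0 values.
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}

exporters:
  otlphttp:
    endpoint: https://ingress.example.dash0.invalid  # placeholder endpoint
    headers:
      Authorization: "Bearer <your-dash0-auth-token>"  # placeholder token

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlphttp]
```

Because the contract is OTLP, the same exporter stanza would in principle work for any OTLP-native backend, which is exactly the portability point being made here.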
Chris Engelbert: Okay, just to make sure: Dash0, like Instana, is a hosted platform, software as a service. All right. Then let's go to the last two questions, my favorite ones actually. What do you think is the next big thing? What do you see in the future? It could be observability, cloud, databases, whatever. AI?
Michele Mancioppi: That's a very good question. I am pretty excited about AI technology and LLMs reaching the plateau of productivity. The amount of hype that we have seen on social media has undermined the actual value these technologies can bring. I am looking forward to a certain category of individuals moving on to the next big thing.
And then there can be very serious work done with those technologies. I love some of the interactivity that LLMs bring. For example, I hate chatbots, but using LLMs to explain complex data, that is very cool.
There's also a whole bunch of maybe less fashionable but very powerful machine learning capabilities that one can use and integrate, for example to parse and structure logs. In terms of the general industry trend, I'm not quite sure. I mean, we had crypto, then we have AI; I don't know what people are going to get excited about next.
I personally get excited by important stuff becoming boring, because then it means that you can really be productive with it and put it in your back pocket. Kubernetes is slowly getting boring. That is good. I don't know what's going to happen with cloud repatriation, or whether the cloud providers are going to keep growing at a steady clip like we've seen.
I don't know how many people are going to go back to more monolithic architectures. My stance has always been that distributed systems are great if you have problems that distributed systems can solve, which tend to be more organizational problems than technical ones, such as decoupling the output of different development teams.
I would welcome applications in general becoming less complicated and easier to operate. Because at the end of the day, providing a good quality service to your users is what matters the most.
Chris Engelbert: I think you mentioned two very important things. You want to see the boring stuff basically taken away from you. And, unfortunately I don't remember who said it, but a while ago there was this meme going around: I want AI to do my laundry so I can do the interesting stuff; it's not that I want AI to do the interesting stuff so I can do my laundry. And I agree. LLMs, from my perspective, are unfortunately still very much misunderstood by a lot of people in terms of how to use them.
One amazing use case came up when talking with a previous guest, and I'll link it; he is researching cybersecurity. They were working on CacheWarp, which was basically a CPU bug involving cache clearing when you use multithreading.
And the interesting thing is, I don't remember exactly, I think it was a bug in AMD CPUs: the AMD documentation said the behavior is undefined, while the Intel documentation for the same combination actually said it's a security hole, don't do this. And he brought up the idea that with an LLM you could actually feed both documentations in and say, okay, please find potential misalignments between the documentations and tell me about them.
I think that is a really good use case, because nobody wants to read this stuff. I don't know how long the x86-64, the AMD64, documentation is by now, but it's probably multiple thousand pages. Nobody wants to read that.
Michele Mancioppi: I don't know how reliable LLMs can be in terms of summarizing data. There were recent studies about that, where the LLMs tend to miss important stuff and hallucinate other things. And it seems to be something built into the way the technology works.
I guess there is going to be the need for a lot of error correction on the human side. And it is easier to correct false positives, you know, the LLM telling you that something is broken, than false negatives, where the LLM does not tell you at all.
Chris Engelbert: But I think the same is true for humans, right? If you ask one person to compare those two things, the person will miss a lot, or just misremember or misunderstand. So you have the same problems.
Michele Mancioppi: I think the incentives may differ. Somebody that goes and compares the two specifications to find security issues has some very specific drive and motivation that will help them.
Chris Engelbert: Okay, that is very much fair. I'll give you that. Yeah. All right. Last question. Is there anything else you want to tell the audience about?
Michele Mancioppi: Keep your software boring and provide an excellent UX.
Chris Engelbert: And sign up for the Dash0 beta.
Michele Mancioppi: Try it out. We're trying to make it as delightful, and as boring in terms of the surprises it will give you, as possible. But it should hopefully redefine what you expect from your monitoring tool.
Chris Engelbert: Go for it.
All right. Cool. Michele, thank you very much for being here. For the audience, thank you for listening in again. You know the spiel, same place, same time next week. And I hope you're coming back. Thank you.
Michele Mancioppi: Are you going to say the YouTube pleasantries?
Chris Engelbert: Oh yeah, sure. Like, subscribe, whatever. Just do the general YouTube thing.
Michele Mancioppi: Thank you, Chris. Bye everyone.