Our first repeat guest ever Austin Parker, Principal Developer Advocate at Lightstep, talks about OpenTelemetry: an observability framework for cloud-native software that is a collection of tools, APIs, and SDKs. You use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) for analysis to understand your software's performance and behavior.
The OpenTelemetry specification is now in 1.0! What does this mean?
It means we've achieved a major milestone in defining the observability framework for cloud-native applications, and that production-ready 1.0 releases of OpenTelemetry libraries will start being released over the next weeks and months! 🎉
Should you find a burning need to share your thoughts or rants about the show, please spray them at firstname.lastname@example.org. While you’re going to all the trouble of shipping us some bytes, please consider taking a moment to let us know what you’d like to hear on the show in the future. Despite the all-caps flaming you will receive in response, please know that we are sincerely interested in your feedback; we aim to appease. Follow us on the Twitters: @ObservyMcObserv.
Jonan Scheffler: Hello and welcome back to Observy McObservface, proudly brought to you by New Relic's developer relations team, The Relicans. Observy is about observability in something a bit more than the traditional sense. It's often about technology and tools that we use to gain visibility into our systems. But it is also about people because, fundamentally, software is about people. You can think of Observy as something of an observability variety show where we will apply systems thinking and think critically about challenges across our entire industry. And we very much look forward to having you join us. You can find the show notes for this episode along with all of The Relicans podcasts on developer.newrelic.com/podcasts. We're so pleased to have you here this week. Enjoy the show.
Welcome back to Observy McObservface. This is a very big day for us. We have our very first returning guest.
Austin Parker: Ta-da!
Jonan: Welcome back. How are you, Austin?
Austin: I am great. The best thing about appearing on a podcast is appearing on the same one twice because it means they liked you enough.
Jonan: That you get to come back. And the best part about a returning guest from our perspective is someone liked it enough that they wanted to come back. [Chuckles] It works out.
Austin: It does work out. I love podcasting. I feel like, of the ways we can communicate, podcasts are a really good one.
Jonan: I actually think it's getting a lot better lately. Well, I don't know. I don't know whether it means podcasting is getting better. I think it's arguably worse as more people get into it who are rank amateurs like myself. But I think that it's becoming more popular unexpectedly. I would have thought that the largest portion of podcasters or people listening to podcasts were commuters, and it's not the case at all because it's spiked hard after the pandemic hit.
Austin: Yeah. I wonder how much of that is because we have...maybe not a little more time but now that it's kind of like, oh, you've gotten back to something approaching like a schedule. Maybe you're going into an office, maybe you're not going into an office, but you probably have more downtime or dead air, or you get to the point where it's like I want to hear people that aren't my immediate family or the crushing silence of being alone. I don't know.
Jonan: [Laughs] It's all those things. It is all of those things. The social part, I absolutely think, applies in this case. The social piece is definitely part of it. I think it is also, in a huge way, a function of time. We are now able to own our days in a way that we haven't been able to in the past, right?
Austin: Yeah, it's very different, and I think it's a good different for most people.
Jonan: Yeah. I feel like, certainly at New Relic and across the industry, a lot of companies fell into the trap of, oh, now I can make video meetings every minute of every day. We can do all of these meetings that used to be inconvenient because we had to walk to rooms and schedule around each other. And now it's like unlocking meeting hell, but that's calming down too. My life has certainly gotten better since things have smoothed out.
Austin: It's all about figuring out the difference between remote work and distributed asynchronous work, right?
Jonan: Yeah, exactly. I have been really excited to see the world generally level up there. Actually, it's going to make the world a better place longer-term that we are not bound by butts-in-seats, nine-to-five culture that, frankly, was always broken in my opinion. But it's going to change, I think. That's great.
Austin: Yeah, I agree.
Jonan: We are not here to talk about butts or seats. We are here to talk about something way better. Tell us what we're going to talk about today.
Austin: So today, we're going to talk about a little project called OpenTelemetry. And I guess we'll start off with the big news.
Austin: Certainly, by the time you're hearing this, the OpenTelemetry specification will have reached 1.0. We have released the thing.
Jonan: I am thrilled to hear that. OpenTelemetry is an awesome project. I've been in love since I first heard of this thing. I want to give my layman's explanation real fast, and then you can actually give us the detail of what this thing is. So companies like New Relic for a long time have been providing instrumentation via these agents where you install a little library, bundle add the New Relic gem to your application, and then it starts reporting to the mothership, and we get all the data and put it on dashboards for you. But what has changed with OpenTelemetry is the way that the mothership communicates with the agents is now being standardized, which means that to a large degree, we're able to swap and replace parts of these systems that we're setting up to build our observability frameworks, and it's only better for everyone. All of us get to interoperate with other platforms as we come to have a standard that everyone can use. So OpenTelemetry has just reached 1.0 for the specification of what OpenTelemetry support means and released a portion of the actual implementation, in traces.
Austin: Yes. So you're broadly right. When I talk about this, I like to go back, and it's really appropriate to talk about this on a New Relic podcast. When I got into observability as an industry, I got my start, like a lot of people did, writing instrumentation. And what you saw when you started looking at the instrumentation code for all of these different companies, if they open-sourced their agents or their APM libraries, what you would notice is there were a lot of similarities between the way that a MySQL client gets instrumented and the way that Rails gets instrumented. You're doing the same thing. You're taking the same sort of data, and you're putting it into a slightly different format, and you're reporting it to a different endpoint. And so there was quite a bit of duplication of effort across the spectrum of vendors, between New Relic and the Datadogs and whoever else.
From an open-source organizational point of view, there's always this question of who's going to do the work? Who's going to actually put in the time and effort to do the implementation work for something like this? And I think some people were pleasantly surprised to see how quickly a lot of the vendors came on board and decided to say, "Hey, we support this. We're going to devote engineering time to it. We're going to take our existing agents and our existing instrumentation, and we're going to open-source this and donate it to the project." It's not only good for end users; like you said, now there's a lingua franca, now there's a standard. Before, you could be using New Relic, and maybe another team wants to use something different. And now you have a problem because these two things can't communicate with each other.
Jonan: But for some portion, at least they overlap in functionality.
Jonan: And I can't get mine in theirs.
Austin: Yeah. You can't get your stuff in there. They can't get their stuff in yours. So maybe you're duplicating functionality, and you have two or even more different tracing systems running through shared components. It's a lot of additional stuff. OpenTelemetry, like you said, flips this on its head. And now we have a single standard for creating these traces, creating metrics. We have standards for attributes and tags and things like that, so there are semantic conventions. One of the big things in the specification is a set of semantic conventions: what label should we give a hostname? What label should you give a cloud function, like a Lambda or whatnot? How should you represent an IP address? So a big part of this spec is saying, okay, if you're writing instrumentation and you record the IP address of the host or the incoming request or whatever, here's the tag for it, and everyone will support that. It will take some time for vendors to catch up and for the analysis side of this to catch up, but now there's a specification we can point to and say, look, here's how it's going to be. And kind of by acclamation, everyone's going to agree that yes, that's how it is going to be. That allows us, the people writing these tools, the vendors as it were, to write better tools. It lets us unlock more analysis capability and actually provide more value, I think, to the people that are relying on this to solve problems in production.
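A rough sketch of what those semantic conventions buy you: every vendor agrees on the same attribute keys, so analysis tools can rely on them. The keys below follow the general shape of OpenTelemetry's HTTP and network conventions; consult the spec for the authoritative list.

```python
# Semantic conventions in miniature: a span attribute dict whose keys
# follow the convention-style shape ("http.*", "net.*"), so any backend
# that speaks OpenTelemetry knows what each field means.

def http_server_span_attributes(method, status_code, peer_ip, host_name):
    """Build span attributes using convention-style keys (illustrative)."""
    return {
        "http.method": method,            # e.g. "GET"
        "http.status_code": status_code,  # numeric HTTP status
        "net.peer.ip": peer_ip,           # IP of the incoming request
        "net.host.name": host_name,       # the host serving the request
    }

attrs = http_server_span_attributes("GET", 200, "10.1.2.3", "checkout-7f9")
```

The point is not the helper function (which is made up here) but that every instrumentation library records the same key for the same concept.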
Jonan: And it allows us to future-proof our products, and it allows us to coordinate our efforts and guarantee interoperability. I think it's going to proceed lightning-fast, much like the adoption of OpenTelemetry itself did across the industry. To think back to any standard, any specification that's ever been written in software, [Chuckles] I've never seen such momentum behind a thing. Everyone agreed because we've all been in walled-garden mode forever, and suddenly, it makes perfect sense to companies that this is the way we want to go. Honestly, I think if you're in the observability space and you aren't thinking about OpenTelemetry and how it impacts your business every day, you're going to get left behind real fast. I mean, not years; months, maybe.
Austin: I think there's actually also a really good point here for people that maybe are in a situation where it's you hear all this, and you don't trust the messengers on some level, and I don't blame you because we're both -- I work for LightStep.
Jonan: I work for New Relic, but neither of us has a vested interest in the success of OpenTelemetry; that's not true. We both have exactly -- [Chuckles]
Austin: Yeah. But I think that's actually part of it, I would say. One of the things that I think you can look at and say this is actually pretty good for end-users is because there is a lot of cooperation between vendors. And I can guarantee you there is no secret conspiracy here because the salespeople at least all hate each other. No, I'm kidding.
Jonan: [Laughs] Yeah, as long as the salespeople -- yeah.
Austin: Yeah, as long as the salespeople want to make sure that they have something unique to talk about. But I think what we've realized is you can look at this in a charitable or an uncharitable way. The charitable way is, yes, this is better for end users; that's the way that I would look at it. The uncharitable way is that a lot of vendors are getting tired of effectively duplicating effort on these agents. And I think that's also true to an extent, right?
Jonan: There's a balance, right?
Jonan: You've got to make the claim to someone -- Corporations are designed for a very specific purpose, and that is to make money; that's how it works. So you've got to be able to explain the case for open source, which is good for the world and good for end users, from a financial perspective. And OpenTelemetry, I think, is very fortunate as a project to meet those perfectly, in maybe a more successful way than any project before it, with the possible exception of Linux.
Austin: Yeah, definitely this makes a lot of people happy for a lot of reasons, and that's arguably why it's gotten the adoption it has already. I do think, though, if you are listening to this and you are skeptical for whatever reason, the real advantage to you, as someone that maybe doesn't want to use a vendor solution, is that you can take OpenTelemetry-format data and you can put it wherever. That format is also part of the specification; it's a protobuf, and you can send it as JSON. You can put this into whatever data lake, whatever NoSQL data studio.
Jonan: Data oceans.
Austin: Data oceans.
Jonan: You can have a data abyss.
Austin: Yeah. Data Titanic. I don't know what you got.
Austin: But you can put all this raw data in there, and then you can query it. You can build your own dashboards. You can use completely open-source tooling for all of this, and that's fine. I hope to see more people building these really cool open-source uses for this, because the nice thing, from the vendor's perspective, is if you find it valuable, if you find it useful, then that's stuff that everyone can contribute to, right?
Austin: Everyone is benefiting, and maybe you have teams that want a feature that a vendor offers, great. You can do just that little part. You can take that little chunk, and you can put it over there. And then you can have your big data abyss full of traces. We don't care, at least from a project perspective, what you do with the data. We just want to make it easy and standard for you to create that data and then move it somewhere.
Jonan: Exactly. The value for observability platforms like New Relic in the future is in the platform; it's not in the data coming in. Getting the data over the fence was, for a while, conceivably a competitive advantage. I used to work on the Ruby agent team. Don't go grepping for my name in there. I hope they've removed most of my code.
Jonan: But getting the data, the transport mechanism, was never really the important part of what we did. And what has always been important for us is what you get once you put all of your data in one place, inside one house, and you can have this world of observability unlocked for you with a single vendor like that. That's still there for you, and you're not bound to a single solution. You can add OpenTelemetry into your existing Prometheus and Grafana setup. And those teams that are using those tools can interact with the New Relic folks and all of the other vendors in the space. It just makes so much sense to have this common language. OpenTelemetry is the Esperanto of the telemetry business, but unlike Esperanto, it will be used by more than linguists.
Austin: This is true. One of the ways that we do that with OpenTelemetry is we've been very -- a big goal of the project from the jump is this idea of making sure that the API and the SDK are composable. That was one of the flaws in prior work in this space. OpenCensus is a good example: they had this really good idea of, hey, here's the library, install it, and go. But if you needed something different, if you wanted to reimplement a certain part of it, that was really challenging. So with OpenTelemetry, the API and the SDK are decoupled. That means you, as a library author, don't need to take a dependency on the SDK. You can take a dependency on the API, and then someone can download your library and use a different version of the SDK.
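The API/SDK split can be sketched as a toy in a few lines. These are not the real OpenTelemetry interfaces; the names are illustrative. The idea is that the API ships with a harmless no-op default, so a library can instrument itself without dragging in an SDK, and the application swaps in a real implementation at startup.

```python
# A toy API/SDK split: libraries only call the API; the default
# implementation does nothing until an application installs an SDK.

class NoOpTracer:
    """Default implementation: instrumentation calls are no-ops."""
    def start_span(self, name):
        return None

_tracer = NoOpTracer()

def get_tracer():
    """The API surface a library author depends on."""
    return _tracer

def set_tracer(tracer):
    """What an SDK (or the application) calls at startup."""
    global _tracer
    _tracer = tracer

# --- library code: depends only on the API ---
def handle_request():
    return get_tracer().start_span("handle_request")

# --- application code: wires in a concrete SDK ---
class RecordingTracer:
    def __init__(self):
        self.spans = []
    def start_span(self, name):
        self.spans.append(name)
        return name

sdk = RecordingTracer()
set_tracer(sdk)
handle_request()
```

Without `set_tracer`, `handle_request` still runs safely; with it, the same instrumentation suddenly produces data. That is the composability Austin is describing.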
Maybe you already have a telemetry system, a lot of people do custom metrics pipelines, a lot of people have custom tracing using correlation IDs, and things like that. So you can write extensions to OpenTelemetry that will allow OpenTelemetry and your existing tools to interoperate and actually upgrade that existing telemetry into OpenTelemetry format and then send it off somewhere else, send it to New Relic, send it to LightStep, send it to something new like Tempo or whatever is going to come out in the future.
Jonan: Jonan's and Austin's awesome observability platform launching soon.
Jonan: The value that we have had with our agents, I think, is now contributed back to the world across the board through things like OpenTelemetry. Most of us have open-sourced our agents anyway. I mean, they're still there for you; you can still go use my Mongo instrumentation that I wrote into the Ruby agent ages ago, and you can use it alongside everything else. It's only better for everyone. You mentioned OpenCensus briefly, and I want to talk about that because I think it's a great example of how these projects come to exist. There was OpenCensus and one other…OpenTracing? I always get the names wrong.
Austin: Yeah, OpenTracing.
Jonan: OpenCensus and OpenTracing had different corporate sponsors effectively. They had different heavyweights behind them.
Austin: So I think there was maybe -- to a certain extent, yes. But it was really a philosophical difference in how they were designed. And some of it is you can trace it back to -- OpenTracing came around first.
Jonan: Trace it back. [Laughs]
Austin: Yeah, you can trace it.
Jonan: Nailed it.
Austin: See what I did there?
Austin: One of the challenges, and I get this from talking to -- I'm a former maintainer on OpenTracing, and I have talked to a lot of people that were there earlier than I was. And one of the things that they actually said, talking about what we could have done better, was there was this idea in a lot of people's heads when they were creating it that they weren't going to get adoption if they actually released an implementation. So the basic idea with OpenTracing was, well, let's have a vendor-neutral API so that you can write your instrumentation once, and then as long as your tracing SDK supports OpenTracing bindings, then cool, you're good. You could send it to Jaeger, or you could send it to Zipkin, or you could send it to New Relic, or you could send it to LightStep or whatever.
And I think one thing they learned over time was that the fear of, oh, if we have our own implementation, people aren't going to actually use it, or it's going to create a lot of pushback from vendors, was actually maybe not the case; maybe it would have been a better idea to have a reference implementation. And OpenCensus was developed not exactly in parallel, but it came out of this idea at Google -- a lot of the original OpenTracing developers worked at Google, so there's a through-line. Their idea was the best way to handle instrumentation is to make it free for the end user, since this was actually originally a package that was distributed as part of gRPC at Google. And if you have a service at Google, it uses gRPC, so you would throw Census in and ta-da! You have tracing; you have metrics.
Jonan: This is why we're seeing the success of things like Pixie right now, this like eBPF instrumentation. I am excited about that stuff.
Austin: Yeah. That stuff is super cool.
Jonan: Right. Go on.
Austin: So I think what Census did was it really learned there's a ton of value in having as much stuff done for you as possible. You do need a reference implementation. One of the biggest questions I got when I was doing OpenTracing stuff all the time was people would email or tweet or ask me, "Well, how do I install this?" They would go to the website and be like, okay, I just want to download something, I want to click install, I want to gem install OpenTracing. And you couldn't do that because that would just give you an API that, without bindings, just kind of sat there. With Census, you could gem install OpenCensus, and suddenly you get traces, and suddenly you get metrics. The merger of these two projects -- a lot of it was figuring out exactly where those lines needed to be, because it is really important, I think, to the broader open-source community, the audience of people that are writing libraries, right?
Austin: Or they're writing things that they want to run on someone else's machine. Those people need a lightweight target. They need just the API bindings. They don't want to actually distribute this whole heavy SDK that has protobuf dependencies and gRPC dependencies and this and that and the other; they just want one thing, a couple of kilobytes. But integrators and implementers and people that are actually running applications and services, they do want that SDK. They don't want to hear, okay, first you install OpenTracing, and then you need to go install the Jaeger SDK or the LightStep tracer or Zipkin or whatever else. They want that click, download, install, boom, done.
Jonan: But they still want the extensibility.
Austin: Right. So when we combined these projects, we said, okay, let's keep the idea of decoupled API and SDK, and then let's add in this idea of tooling. Before, Jaeger had a collector, an agent sort of thing, to proxy and forward traces. Various other tools have had similar things. The idea of the OpenTelemetry collector came from the OpenCensus service. And the idea there is, what if there was a binary that could act as an endpoint for your telemetry and process it, do some sort of lightweight processing? Because one of the things that's pretty common is the people that are instrumenting the code maybe are not the people that are responsible for the care and feeding of the data coming out of their code. It's like the SRE/engineering divide.
Jonan: Yeah, which we're hopefully seeing break down. More and more, we're seeing it break down where people are [inaudible 21:20] at their own phases, but it's still a thing. And if I'm an application developer and you're the SRE, I'm not going to think, am I sending Austin too much data, or am I sending the right data, right?
Jonan: This gives you some control there.
Austin: Right. So the idea with the collector is that this is just a binary that you can deploy. You can run it as a sidecar. You can run a pool of them. You can do a lot of things. And it takes all this telemetry data that's coming from all your services and then lets you apply processors. Basically, you can apply processors, and those processors can be anything from applying a regular expression when you see a certain label or a certain attribute -- so you can do PII scrubbing -- to getting a little wonky with it. One cool thing we did internally at a hackathon last year was we had someone write a plugin for the collector that would look at the library version of telemetry that was coming into it and then call out to a CVE checker service.
Austin: And then if it's like okay, Foobar 2.1.7 has a CVE, then I'm going to add an attribute to these spans so that when they see them in the thing, it gets flagged like, hey, you need to update this.
Austin: Hey, there's a security vulnerability here.
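Those two processor examples, PII scrubbing and CVE flagging, can be sketched as a toy pipeline. The real collector is written in Go with its own processor interfaces; the attribute names, the CVE lookup table, and the `CVE-XXXX-YYYY` identifier below are all made up for illustration (the hackathon version called an external checker service).

```python
import re

# A toy collector pipeline: each processor sees every span's attributes
# and may rewrite them before the data is exported anywhere.

KNOWN_CVES = {("foobar", "2.1.7"): "CVE-XXXX-YYYY"}  # hypothetical lookup

def scrub_pii(span):
    # Redact anything that looks like an email address.
    for key, value in span.items():
        if isinstance(value, str):
            span[key] = re.sub(r"[\w.]+@[\w.]+", "[REDACTED]", value)
    return span

def flag_cves(span):
    # If the instrumented library version has a known CVE, tag the span.
    cve = KNOWN_CVES.get((span.get("library.name"), span.get("library.version")))
    if cve:
        span["security.cve"] = cve
    return span

def run_pipeline(span, processors=(scrub_pii, flag_cves)):
    for process in processors:
        span = process(span)
    return span

span = run_pipeline({"user": "jonan@example.com",
                     "library.name": "foobar",
                     "library.version": "2.1.7"})
```

The design point is that this logic lives in the collector, not the application, so it can change without redeploying the service.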
Jonan: This is brilliant.
Austin: Yeah. And that's the kind of thing that you can do with this. Since you see everything that's going through, you do whatever you want, like modify the telemetry. And best of all, now you're not locked behind a redeploy of the app. This simplifies it a lot from the app dev perspective, I think, the service developer, because all you have to do is say, okay, I install this thing, and maybe there's some config that I need to get for it. But for the most part, you install it, and you're on your way. And then the people that are interested in this telemetry -- where it goes, what happens to it -- can add stuff in. And if they want to change where it goes, or they want to adjust the sampling rate, or they want to do whatever, they just redeploy this service. They just say, okay, we're going to redeploy these sidecars, boom, boom, boom. And now it's going somewhere else, or now we've changed it in some way.
Jonan: It's going to two places now. It's going wherever we want it.
Austin: Right. And you can use it that way. You can say I want to try this new open-source thing. I want to try this new tool. So I'm going to add a new destination, and then now 10% of my telemetry is also going to this new tool that we want to try out. It's a more cloud-native way of doing things, I think.
Jonan: That's exactly right. I don't think it's escaped anyone's attention that the Cloud Native Computing Foundation, since its inception, has become a dominant force for decisions like this across the industry. They are driving, by their existence, a lot of change in observability. I think the question of which projects are going to make it out the other side is still to be determined. There's a lot on that map of the CNCF projects.
Austin: And there's new stuff every year, every few months.
Jonan: So much happening that it's hard to track, but that's actually a better thing I think in the long run for the world to be able to see it all in one place and develop standards and specifications like OpenTelemetry together, means that we get to choose our tools. There may be a day someday where you and I are using the entire OpenTelemetry stack down to our applications in exactly the same way, but I'm using Jonan's platform, and you're using Austin's platform, and it just comes down to a personal preference of how I want to see the thing displayed.
Austin: And I think the other good thing about the foundation, at least in this case, is that there can be a definite virtuous cycle in the way that a lot of this adoption goes. So, for example, I believe it's going to be in Kubernetes 1.21 now, but API server tracing is going in. I think it's going to be in beta, and it might've gotten pushed to 1.22; I'd have to go look through the KEPs. But the idea there is whenever you do anything in Kubernetes, it's going through the API server: Do you want to scale up a Pod? Do you want to add or remove a node? Do you want to apply a CRD? Whatever. It's calling the API server. So now all of those requests will be traceable, and they're using OpenTelemetry to do it. And they're using the OpenTelemetry collector. You deploy that onto there, and then you can send those traces to anyone that speaks OpenTelemetry format. You can start to trace your Kubernetes cluster.
Jonan: Help me explain this piece. Someone told me the other day about the one thing that wouldn't be possible with eBPF as far as instrumentation goes -- because it's kernel-layer instrumentation. We're in the kernel, and eBPF doesn't have a complete set of functionality yet, but they're adding to it. And over time, it's conceivable that this could be a way to instrument almost everything in your code except for tracing. And I think that's because tracing is fundamentally following a request from one application to another, or one call across an API to another, with a header, putting a header in there. And you couldn't do that with your eBPF code because you're not actually modifying; you're only reading.
Austin: My layman's understanding of eVP...EPP..eBPF
Jonan: It's a lovely name, isn't it?
Austin: Yeah. My layman's understanding of eBPF is it's similar to syscall tracing in a lot of ways. So you're just looking at syscalls, and you're looking at what the kernel is being asked to do through some layer of abstraction. So there are two ways to describe this. And we've both been really fast and loose with the terminology. But when I've been saying tracing, I've been referring to distributed tracing and, more specifically, to Dapper-style or span-based distributed tracing as popularized by Google and written about in the Dapper paper. And that was a planet-scale distributed tracing system that worked in the way you said, and it propagated context. So it would let you know what request you're part of by sending headers across the wire between RPCs, and then the receiving service would pull that context out and create a new event or span or whatever you want to call it that encapsulated the work being done there. And then all of those spans independently are collected and displayed and graphed and analyzed and dah, dah, dah. But then there's also tracing in the sense of eBPF or syscall tracing or function-level tracing through the JVM or the CLR or through the Python interpreter or whatever, where you actually --
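The header-propagation mechanics Austin describes can be sketched using the W3C `traceparent` header format (`version-traceid-spanid-flags`). Real SDKs handle versioning, validation, sampling flags, and vendor state; this toy only shows a trace crossing a service boundary.

```python
import secrets

# Minimal span-based context propagation using the W3C traceparent
# header shape: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>.

def make_traceparent(trace_id=None, span_id=None):
    """Sending service: mint a header carrying the current context."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 hex chars
    return f"00-{trace_id}-{span_id}-01"

def continue_trace(incoming_header):
    """Receiving service: keep the trace ID, mint a new span ID."""
    _version, trace_id, parent_span_id, _flags = incoming_header.split("-")
    return {"trace_id": trace_id,
            "parent_span_id": parent_span_id,
            "span_id": secrets.token_hex(8)}

header = make_traceparent()   # what goes over the wire between RPCs
ctx = continue_trace(header)  # what the downstream service reconstructs
```

The downstream span shares the trace ID but gets its own span ID, which is what lets a backend stitch all the spans of one request back together.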
Jonan: Like you get with the stack trace. When you're doing application-level tracing, it's a stack trace we're following.
Austin: Yeah, exactly. I don't know if distributed tracing is the best name for what distributed tracing is. This is a little like, I don't want to say out there, but it's a little sidebar.
Jonan: Sure. They should have called it eBPFF [inaudible 28:58] if they wanted.
Austin: Yeah, obviously. A lot of people know this by the term APM because that's what it's marketed as in most companies. And the rationale for that is whatever the rationale is. There's probably actually a very interesting conversation to be had with a lot of people about why did APM get called APM.
Jonan: Marketers be branding is my impression.
Austin: That is true. We do be branding. But I don't actually have a good name for distributed tracing that isn't distributed tracing because I don't think APM is really a good one because APM is actually overloaded between application performance monitoring, application performance management, whatever. And you're actually not interested in the application performance, right?
Austin: When you get down to brass tacks when we talk about tracing as a part of observability, what are we actually interested in? Well, we're interested in end-user experience. We're interested in --
Austin: Yeah, SLOs, SLIs. We're talking about, I have two million services running, acting in concert in some fashion, and because I have this huge level of abstraction and complexity, I need to be able to get pretty discrete information about my services' performance, but not as discrete as just metrics or logs would give me. Metrics are too big; you can't get the resolution you need due to cardinality. Logs are too fine-grained and hard to go through, even if you're using structured logging. Tracing kind of fills in this middle part. The way I prefer to think of it is a trace is really just a log statement. It's a structured logging statement when you get down to brass tacks. And it's a structured logging statement that has a context mechanism attached to it. And if you look at OpenTelemetry and you look at how OpenTelemetry is designed, the bottom layer on an OpenTelemetry architecture diagram is actually a component called context.
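That "a trace is just a structured log statement with a context mechanism attached" framing can be sketched in a few lines. This is a toy illustration, not the OpenTelemetry data model; the field names are made up.

```python
import json
import secrets
import time

# Each "span" here is just a structured log record that also carries
# trace_id / span_id / parent_span_id fields -- the context mechanism
# that ties it to every other record from the same request.

def span_log(name, trace_id, parent_span_id=None, **fields):
    record = {
        "name": name,
        "trace_id": trace_id,            # shared across the whole request
        "span_id": secrets.token_hex(8), # unique to this unit of work
        "parent_span_id": parent_span_id,
        "timestamp": time.time(),
        **fields,                        # ordinary structured-log fields
    }
    print(json.dumps(record))            # emit it like any log line
    return record

trace_id = secrets.token_hex(16)
root = span_log("checkout", trace_id, http_method="POST")
child = span_log("charge_card", trace_id, parent_span_id=root["span_id"])
```

Strip out the three context fields and each record is an ordinary structured log line; with them, a backend can reassemble the tree of work done for one request.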
Austin: And it is this idea of distributed context for services. And it's important because you can -- Let's talk about the future a little bit. I know we're getting on in minutes here. So I want to kind of --
Jonan: Oh no, I'm good. If I was listening to this podcast, would you turn it off? This is great.
Austin: That's true. I wouldn't. No, this is great stuff. But I do want to talk a little bit about the future too because it is really important I think.
Jonan: All right. I'm going to teleport us to the future with my soundboard. Are you ready?
Jonan: I had a sound for the future world. We just came in.
Austin: Ooh, I like it. It does feel very futuristic. So let's talk about the future, because the struggle that I always hear about from people in the present is, "Austin, this all sounds great, but gosh, it's so much work. It's a lot of effort to integrate this. Even with OpenTelemetry being what it is, it's a lot of effort, and time is our most precious resource." So I believe the future of OpenTelemetry really revolves around two things, and one piece is that context I was just talking about. The idea of a broadly supported distributed context library that happens to be polyglot and uses actual W3C-standardized headers is super useful, because you don't have to use context just for observability. You can use context for a lot of things. You can throw whatever you want into a header and send it down the line. So that gives you a lot of interesting opportunities. For example, right now, let's say you use feature flags a lot. Well, every one of your services that is using a flag or has flagged behavior probably has to independently communicate with the flag server. It has to fetch the variations and say, this is what I should be doing right now based on this other thing. But what if there just happened to be a single well-supported distributed context system that allowed you to add specific information to it, like the value of a feature flag? Then you could have a single layer in your architecture, maybe sitting in front of the back end, sitting wherever, that was just like, okay, I know this request ID, and I know some other stuff, because the front end told me some other stuff about it and put it in this context. So my feature flag layer is now going to do one check and come back with some opaque value that I can throw into this context, and then services that are aware of this flag will see it and modify their behavior. And I don't have to have this big overhead of --
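The feature-flag-in-context idea can be sketched with the W3C baggage header shape (comma-separated `key=value` pairs). The real spec adds percent-encoding, limits, and metadata, and the key names here are invented; this only shows the mechanics of one flag check riding along to every downstream service.

```python
# Feature flags riding in distributed context, W3C-baggage style:
# one layer evaluates the flag once; downstream services just read
# the header instead of calling the flag server themselves.

def encode_baggage(entries):
    """Serialize context entries into a baggage-style header value."""
    return ",".join(f"{key}={value}" for key, value in entries.items())

def decode_baggage(header):
    """Parse a baggage-style header back into a dict."""
    return dict(pair.split("=", 1) for pair in header.split(",") if pair)

# Flag layer: a single check, result stuffed into the context.
outgoing_header = encode_baggage({
    "request.id": "abc123",
    "flags.new_checkout": "on",   # opaque value from one flag lookup
})

# Downstream service: no flag-server call, just read the context.
ctx = decode_baggage(outgoing_header)
use_new_checkout = ctx.get("flags.new_checkout") == "on"
```

Every service that understands the header sees the same flag decision without its own round trip, which is exactly the overhead Austin is describing away.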
Jonan: Contacting a server...
Austin: Contacting a server, making other external API calls, dah dah dah.
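The pattern Austin describes here can be sketched in a few lines of Python. This is a minimal, illustrative sketch of propagating a feature-flag decision in a W3C-style `baggage` header rather than having every service call the flag server; the function names and flag name are hypothetical, not real OpenTelemetry APIs.

```python
# Sketch: a feature-flag layer does ONE flag lookup per request and
# serializes the result into a baggage-style header. Downstream services
# read the header instead of calling the flag server themselves.
# All names here are illustrative.

def inject_baggage(headers, entries):
    """Serialize key/value pairs into a single baggage header."""
    headers["baggage"] = ",".join(f"{k}={v}" for k, v in entries.items())
    return headers

def extract_baggage(headers):
    """Parse the baggage header back into a dict."""
    raw = headers.get("baggage", "")
    return dict(pair.split("=", 1) for pair in raw.split(",") if "=" in pair)

# The edge layer checks the flag once and stamps the request...
headers = inject_baggage({}, {"checkout-v2": "enabled"})

# ...and any service down the line just reads the header it was handed.
flags = extract_baggage(headers)
if flags.get("checkout-v2") == "enabled":
    print("serving new checkout flow")
```

In a real system the flag value would ride along on every outgoing HTTP call, which is exactly the overhead-free propagation Austin is pointing at.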
Jonan: This is brilliant. I can embed it, and it's distributed. I don't have to think about this other than -- I'm no longer calling out to another system. But that's actually just the tip of what is possible. The context, in some ways, is just an anything bucket, really.
Austin: Yeah, it's an anything bucket that we use in OpenTelemetry for observability context, and observability context is primarily, obviously, per-request ID use. So you have a trace, and that trace encapsulates all the work being done by a single request. But then, if you start combining these ideas, it's like, oh, okay, I know every single request has a unique ID. I can throw whatever I want into the big bucket of distributed context. So now I can start to build correlations between all of my different things. So not only do I have a trace that shows me what services participated in this request, how long did they take, was there an error, dah dah dah. But I can also start associating individual metric measurements, because when a measurement gets created, it can just be tagged with that context. Now, yes, that's a high-cardinality, a super-high-cardinality value, but it's possible.
Jonan: And we're building more and more systems to support higher cardinality in our data right now.
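The metric-to-trace correlation Austin describes can be sketched like this. It is an illustrative toy, assuming a made-up `record` helper and in-memory storage; a real implementation would pull the trace ID from OpenTelemetry's context APIs rather than generating one locally.

```python
# Sketch: tag every metric measurement with the current request's trace
# ID so it can later be joined back to that exact trace. The trace ID is
# a super-high-cardinality attribute, which is the trade-off mentioned
# above. Helper names are hypothetical.
import uuid
from collections import defaultdict

measurements = defaultdict(list)

def record(metric_name, value, trace_id):
    """Store one data point along with the trace it belongs to."""
    measurements[metric_name].append({"value": value, "trace_id": trace_id})

trace_id = uuid.uuid4().hex  # stand-in for the request's real trace ID

# Everything measured while handling this request shares the same ID...
record("checkout.latency_ms", 412, trace_id)
record("cart.items", 3, trace_id)

# ...so latency, cart size, and the trace itself are all correlated.
assert all(m["trace_id"] == trace_id
           for points in measurements.values() for m in points)
```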
Austin: Right. So you could imagine some sort of metrics dashboard that understands context. Maybe we do this through exemplars; maybe we do this through a lot of fancy things you could do here. But you can imagine something where it's like, okay, I'm hovering over my dashboard, I see my spike, I click on it, and it just automatically knows what the trace exemplars are for that, because it actually has those traces in memory somewhere or in storage somewhere, and it's looking at that same context. You can then imagine this also applying to my logging output. But then you take it a step further and say, well, my feature flag system obviously knows about this because I'm passing feature flag status in this distributed context. So those feature flags are also implicitly tied to this idea of a single distributed context. My business analytics are also tied to this because when those are generated, they also have that context ID in there. So now I'm actually pulling in all sorts of stuff. It's not just rate, error, duration. It's not just my three golden signals and my SLIs. It's everything about marketing campaigns that are targeting this person. It's status based on feature flags; it's timing metrics and other things that the businessy people want to know. So if someone's having a bad user experience, I can integrate all this data together and form a relationship between it as an inherent part of the specification.
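The exemplar idea can be sketched as follows: each aggregated metric point keeps one sample trace ID, so "click the spike, jump to a trace" is just a lookup. The data shapes and field names here are illustrative, not taken from any real dashboard's API.

```python
# Sketch: aggregated latency points, each carrying an exemplar trace ID
# captured from a request that fell into that bucket. Field names are
# hypothetical.

points = [
    {"minute": 0, "p99_ms": 120,  "exemplar_trace": "trace-aaa"},
    {"minute": 1, "p99_ms": 135,  "exemplar_trace": "trace-bbb"},
    {"minute": 2, "p99_ms": 2400, "exemplar_trace": "trace-ccc"},  # the spike
]

def trace_for_spike(points):
    """Find the worst data point and return its exemplar trace ID."""
    worst = max(points, key=lambda p: p["p99_ms"])
    return worst["exemplar_trace"]

print(trace_for_spike(points))  # -> trace-ccc
```

Clicking the spike on the dashboard amounts to calling `trace_for_spike` and opening the returned trace.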
Jonan: You get to include any detail you can imagine about what that specific user experienced for this request path as it traverses your entire system.
Austin: Yeah. And then you can preserve that. You can throw all this in your data abyss and then, at the end of the quarter, start to mine it for interesting insights. If you're listening to this and you've done data analysis or business data analysis, the questions are always like, well, why didn't we hit our conversion rate, or why did we have drop-off in this funnel? Not only would you be able to pretty easily say, well, the average latency for people that dropped out of the funnel up to that point was blah; you could even narrow it down to: because they were on iPhone 12s and they were in Lyon. It's about giving you a wealth of data, and we don't tell you how to use it. We don't actually do a lot of stuff for you today, but we give you the building blocks. Maybe there are things we're not going to get to in OpenTelemetry. Maybe we're going to get to tracing, metrics, and logs and be happy for a while. But we have this spec; we have these standards. People can come after us and build on our shoulders. And so I can see a future where you as a developer, as an app dev, as an SRE, whatever, don't even think about this because it's already there. You fire up Kubernetes, and you have traces for everything just because of traffic passing between your pods. Or you install Express, and it's traced automatically because it integrates OpenTelemetry. Or you have your Rails app, and ta-da, everything is traced. And then you throw in your analytics library, and that just picks up this context.
Jonan: And it's all the best that can be offered because it's implemented by the maintainers of those frameworks. The Rails team will build that. They'll decide how to get this instrumentation in there in a native way. So it's not me writing a Ruby agent and hacking our way into the system, overriding classes and injecting methods into places.
Austin: And it should all be better performant or more performant.
Jonan: It should all be better performant and more organized. The context allows you to find -- Some of the most difficult-to-debug things that I've ever encountered in my career came from exactly the kinds of things that a context solves. Like when I had someone who had forked Rack, for example, but used the exact same versioning scheme as Rack itself -- Don't do that, by the way.
Austin: Yeah, I agree.
Jonan: Don't create a Rack [inaudible 39:08] .7 because it makes it really hard for someone trying to instrument your code. But those kinds of problems are all going to be fixed up by this in ways that are built into the systems we're deploying. Yeah, Kubernetes gets a lot of criticism today, but I think it's pretty clear that it's taking over the world real fast, and there are reasons, and they go beyond just this. It has momentum, and that's a huge part of it. And that momentum is going to continue building around projects like OpenTelemetry and Kubernetes to a place where we get this out of the box, and then it just becomes a question of, what do we do with it then? What are we going to be able to do once we get all of this data together in the future, where we are now? I forgot I used my soundboard to teleport us.
Austin: Yeah, we're in the future.
Jonan: Here in the future where computers are not terrible anymore. We actually solved it. Anyone there in the past -- Don't worry.
Austin: We fixed it. Come to the future. Computers are good now, we swear. I think another cool application of this is in newer languages. What you're seeing with Erlang, I think, as an example -- and Rust is also a cool example of this -- is that they're being designed and built and iterated on in a world where request tracing, APM, distributed tracing, whatever, is actually both highly available and usable and paramount. So one of the big things a lot of those languages provide is a better debugging experience and a better way of handling stack traces and what happens when something breaks. So now that these things are being designed with the idea that, hey, yeah, distributed tracing is going to be a part of this, those contexts can be linked so that when you are looking at your stack trace, it knows if it's part of a trace, or you can actually jump from a trace to a core dump or whatever. That's kind of how you eventually get that eBPF/Pixie-to-trace relationship: by lower-level systems and lower-level abstractions becoming aware of what's going on at the higher level. I think it's really important to -- This is one of those things where it's going to interoperate; it's going to exist together.
But I don't think we're going to get to a place where it's like, oh, nobody uses distributed tracing or whatever because they're using eBPF. I think these are different levels of -- There's a saying that you want to instrument one layer below the thing you want to observe. So if you're an app dev, maybe you don't necessarily care about all of the fiddly bits down here, but you really care about the hot path of this code going through all this, like, what's going on? Well, the best place to instrument that is going to be at your RPC layer because you have a one-to-one relationship between RPC and whatever. Now, eventually, if there's something that actually takes that, like, oh, okay, I see this RPC is happening, I see this context, I'm going to grab a handle on that and then have this much lower-level stuff that can be associated -- cool, that's even better. But you don't have to worry about one of these things completely subsuming the other. It's more that they all work together in the future.
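"Instrument one layer below the thing you want to observe" can be sketched as a wrapper at the RPC layer: app code stays uninstrumented, and every call through the layer gets timed automatically. This is an illustrative toy with made-up names; a real system would hook an actual RPC framework and emit OpenTelemetry spans.

```python
# Sketch: a tracing wrapper at the RPC layer. Each call records a span
# (name + duration), so the app dev gets observability for the hot path
# without touching the functions themselves. Names are hypothetical.
import time

spans = []

def traced_rpc(name, func, *args, **kwargs):
    """Invoke func as an 'RPC' and record how long it took."""
    start = time.monotonic()
    try:
        return func(*args, **kwargs)
    finally:
        spans.append({"name": name,
                      "duration_s": time.monotonic() - start})

def get_user(user_id):
    # Ordinary application code; it knows nothing about tracing.
    return {"id": user_id, "name": "Ada"}

user = traced_rpc("user-service.get_user", get_user, 42)
```

The one-to-one relationship between RPC calls and spans is what makes this layer the natural place to instrument.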
Jonan: It is magical in the future, and it's going to be a really interesting time to be working in software generally but especially in observability. I'm so glad that you came on the show today to talk to me about these things.
Austin: Thanks for having me.
Jonan: We're going to rename this Austin McAustinface, and it's just going to be you and I chatting about nerd stuff all day. It's really fun.
Austin: All right. I'm always down to podcast, man.
Jonan: Yeah, I really enjoyed it. Thank you again for coming on the show. The big news today, OpenTelemetry specification just hit 1.0 along with the tracing implementation of that.
Austin: And that's available; by the time you hear this, you should have release candidates for Java, .NET, hopefully Go, and Python as well, with some others coming soon, like Ruby, Rust, Node.js. Did I say Node already?
Jonan: Node is already in the bucket. You're going to have -- It's all coming soon. I can't wait. Happy OpenTelemetry Day, y’all! I hope you enjoyed the episode. We'll talk to you again soon.
Jonan: Thank you so much for joining us. We really appreciate it. You can find the show notes for this episode along with all of the rest of The Relicans podcasts on therelicans.com. In fact, most anything The Relicans get up to online will be on that site. Right now, we're running a hackathon in partnership with dev.to called Hack the Planet, where we're giving away $20,000 in cash prizes along with many other fabulous gifts simply for participating. We would love to have you join us. You'll also find news there shortly of FutureStack, our upcoming conference here at New Relic. The call for papers for FutureStack is still open until February 19th. I encourage you to stop by and submit a proposal. We would love to have you join us. We'll see you next week.