Jonan Scheffler interviews Observability Economist, SLOgician, and Zendesk Site Reliability Engineer Fred Moyer about being a “SLOgician,” which he explains is someone who works on SLOs from a statistics perspective.
Additionally, a lot of his work is being what he describes as an “observability economist.” Fred works a lot with Zendesk’s observability systems. And because they’ve been growing so rapidly as a company, they've been scaling rapidly as well. Therefore, he works to ensure that they’re using their observability economically.
Should you find a burning need to share your thoughts or rants about the show, please spray them at firstname.lastname@example.org. While you're going to all the trouble of shipping us some bytes, please consider taking a moment to let us know what you'd like to hear on the show in the future. Despite the all-caps flaming you will receive in response, please know that we are sincerely interested in your feedback; we aim to appease. Follow us on the Twitters: @ObservyMcObserv.
Jonan Scheffler: Hello and welcome back to Observy McObservface, proudly brought to you by New Relic's Developer Relations team, The Relicans. Observy is about observability in something a bit more than the traditional sense. It's often about technology and tools that we use to gain visibility into our systems. But it is also about people because, fundamentally, software is about people. You can think of Observy as something of an observability variety show where we will apply systems thinking and think critically about challenges across our entire industry, and we very much look forward to having you join us. You can find the show notes for this episode along with all of The Relicans podcasts on developer.newrelic.com/podcasts. We're so pleased to have you here this week. Enjoy the show.
I am Jonan. I'm on the Developer Relations team here at New Relic. And I am excited to tell you that we have a conference coming up where The Relicans have an entire track all to themselves, and we will be getting up to all sorts of nonsense. You really should be there. We're using a new platform, a new virtual event platform called Skittish that is created by Andy Baio, the founder of the XOXO conference, a very well-known conference up in Portland. It is a brilliant platform. Andy has a tremendous amount of experience in creating engaging events, and we're really excited to use it. You get to be cute animals wandering about in a virtual arena of some variety. We're using an island because our track is called Nerd Island and uses proximal audio. So you walk up to each other, and you can hear people as though you were in a hallway track. And you can watch the talks right there on these floating virtual screens. You get to select your very own animal from a wide array of virtual avatars. So I encourage you to stop by. Registration for the event is free. You can read more about it at therelicans.com/futurestack, and I would very much like to see you all there. So with that, let's kick off today's episode. I'm very pleased to welcome my guest, Fred Moyer. How are you, Fred?
Fred Moyer: Hi, Jonan. I'm excited to be here on Observy McObservface.
Jonan: It's a really exciting podcast to join despite the very silly title. I wonder someday if I'm going to reach out to someone who is a very serious business person, and they'll look at the title and just decline. Do you think that I have that to worry about?
Fred: Not unless you suggest hosting the actual podcast on the Boaty McBoatface boat, and perhaps they might --
Jonan: [laughs] That would actually probably limit my chances, I think. Boaty McBoatface is probably not an ideal acoustic environment for creating a podcast. So, Fred, tell me about you. I'm sure our listeners are curious to know who it is that you are.
Fred: So I'm an SRE at Zendesk, a company that makes awesome customer support software. And my role there is primarily what I call a SLOgician, which is someone who works on SLOs from a statistics perspective. And also, a lot of my work has been what I describe as an observability economist. I work a lot with our observability systems. And since Zendesk has been growing rapidly, we've been scaling rapidly, so I work to ensure that we're using our observability economically.
Jonan: Observability economist and SLOgician. And you make sure that you're using your observability tooling economically, that you are selecting wisely the data that you choose to include.
Fred: That's right because we've got upwards of, I think, 1200 engineers, 150 teams, hundreds of services. And as you can imagine, that generates quite a bit of telemetry. And you can't just take an endless amount of logs and throw them at a monitoring provider because it's going to really hurt your wallet. So I work to ensure that we're sending the right telemetry, that we're optimizing our use of telemetry, and giving our engineers what they need.
Jonan: Without necessarily just letting them record everything they can possibly think of, which is exactly what I would do. I would just put a little bit of telemetry in on every other line, trace every method, and log all the time so that I have all of the data upfront.
Fred: Not a surprise that most engineers do that as well.
Jonan: Well, the goal for me, which is I think the objective of observability, is to make sure that I have the data before I have the problem so that when it actually comes along -- But you have a very interesting challenge from that perspective in that you need to maintain observability, the ideal of observability, and actually limit the amount of data that you're putting into the systems. What kinds of things do you think people commonly track that maybe are less necessary? I guess that's a hard question.
Fred: Obviously, with logging, there are different levels of logs. Debug logs can be very useful in some situations when you have to do deep forensics. But those aren't always things that we necessarily want to index for everyday search. So we'll put those off at our cluster layer and archive those in S3, and we'll send the more heavily used logs to our monitoring provider for analysis.
Jonan: So you actually are able to split the logging out, and then you can have the debug logs in case you need them.
Fred: In case we need them, yes.
Jonan: They're just more difficult to retrieve and to search.
Fred: Yeah. So it's just a little bit more of cold storage for those.
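The split Fred describes, archiving debug logs cheaply while shipping only the heavily used severities to the monitoring provider for indexing, can be sketched with Python's standard logging module. This is just an illustrative sketch, not Zendesk's actual pipeline; the two StreamHandlers are hypothetical stand-ins for an S3 archiver and a vendor log shipper:

```python
import logging

class MaxLevelFilter(logging.Filter):
    """Pass only records at or below a given level (e.g. DEBUG only)."""
    def __init__(self, max_level):
        super().__init__()
        self.max_level = max_level

    def filter(self, record):
        return record.levelno <= self.max_level

logger = logging.getLogger("app")
logger.setLevel(logging.DEBUG)

# Stand-in for an archiver that batches debug records off to S3 cold storage.
archive = logging.StreamHandler()                  # hypothetical S3 archive sink
archive.setLevel(logging.DEBUG)
archive.addFilter(MaxLevelFilter(logging.DEBUG))   # debug records only

# Stand-in for the shipper that forwards logs to the monitoring provider.
shipper = logging.StreamHandler()                  # hypothetical vendor log sink
shipper.setLevel(logging.INFO)                     # INFO and above get indexed

logger.addHandler(archive)
logger.addHandler(shipper)

logger.debug("deep forensic detail")    # routed only to the archive handler
logger.info("request served in 120ms")  # routed only to the shipper handler
```

The point of the split is that the expensive, searchable index only ever sees INFO and above, while DEBUG detail stays retrievable in cheaper storage.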
Jonan: Do you actually use Glacier or use S3?
Fred: We use S3.
Jonan: The cold storage reminded me that Glacier is a product that I've never actually used because I am very put off by the idea that I have to pay you more to get my data back.
Fred: That's a little odd. Yes.
Jonan: Yeah, but I'm sure there are a lot of people who find great use for that product. So at Zendesk, which I agree is fantastic and every company that I've worked for has used -- I'm sure there are other providers of support systems, but I don't want to find out about them. I actually own a Zendesk mug. It has an adorable monk sitting on it.
Fred: I was a customer of Zendesk for a startup I did in 2009. So now I've come full circle working for them.
Jonan: That's awesome. What was the startup about? I'm curious.
Fred: It was a startup called Silver Lining Networks where I developed some technology to advertise on Wi-Fi networks. So I don't know if you've ever on a plane flight used the Internet where they displayed an ad bar above the web pages.
Jonan: Oh yeah.
Fred: I developed some really cool technology to do that, but it turned out that the market just wasn't there to support it. The network operators loved it. The users did not love it as much.
Jonan: I can imagine users being skeptical about it, but given the option to have free Wi-Fi, I'm sure many people would opt for that. On the plane it's expensive. I want to say it's like $20 a flight or something.
Fred: Yeah. So we had a little X box in the ad that you could click to close the ad, and that turned out to generate probably 90% of our clicks.
Jonan: Oh, really?
Jonan: And so if you make the X smaller, then you get some actual ad clicks out of the deal until it's just a single pixel one. That's cool. That's an interesting technology. And so you have a lot of experience with networking, then I imagine being able to achieve that.
Fred: Yeah, my experience there is very deep in some areas but not wide like a network engineer would have.
Jonan: And then other than that, you've mostly focused on observability. I did some research on you, and I hope you don't mind, but I was on your Twitter profile and saw that you are into time-series databases generally. Tell me about that.
Fred: I am. So I was lucky to work at a company called Circonus for a couple of years that developed a really awesome time-series database called IRONdb. And before then, I was a user of their product and other monitoring products. But that really exposed me to the insides of time-series databases and all the cool things that you could do with them.
Jonan: I guess it's a really interesting problem set around storing time-series data. You have very unique challenges as opposed to the data that you are typically storing with your application. How did IRONdb work? It's still a product, I'm assuming.
Fred: It is, yes. It was written in highly optimized C, so it was extremely fast and could scale very far. It used data structures called CRDTs, conflict-free replicated data types. Unlike a traditional distributed system, where you need a consensus protocol like Raft or Paxos, IRONdb didn't need one because it used CRDTs, and the consensus protocol is often the limiting factor in scaling a database or distributed system.
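The appeal of CRDTs is that replicas can accept writes independently and merge state in any order without coordinating. A minimal illustrative example, not IRONdb's actual data structure, is the grow-only counter, whose merge takes the per-node maximum:

```python
class GCounter:
    """Grow-only counter CRDT: each node increments only its own slot,
    and merge takes the element-wise maximum. Merges are commutative,
    associative, and idempotent, so no consensus protocol is needed."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> count observed from that node

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        # Take the max per node; replaying a merge changes nothing.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

# Two replicas accept writes independently, then converge on merge.
a, b = GCounter("a"), GCounter("b")
a.increment(3)
b.increment(2)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 5
```

Because merge is safe to run at any time, in any order, and any number of times, the replicas converge without ever electing a leader, which is the hard piece Fred says IRONdb eliminates.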
Jonan: I want to make sure that I am clarifying things both for myself and for our listeners. The consensus protocol, in this case, would be obligated with making sure that you have consistency across your data so that you're trading away some of the ACID pieces with the time series database very often, right?
Fred: Yeah. You're trading away some availability for consistency.
Jonan: For consistency. And so you have a consensus protocol in some cases that is responsible for what in some competitors of IRONdb?
Fred: For ensuring that all the nodes agree on some data that they've ingested. One good example is Apache ZooKeeper, which is used to obtain consensus among Kafka cluster nodes.
Jonan: And so when I am a Kafka cluster node, and I take a message off of the queue and finish my work on it, I'm letting someone know, and other people also know.
Fred: Well, I'm not an expert in Kafka, but essentially the consensus protocol allows the distributed system to elect leaders, which ensures that there's consensus among those members.
Jonan: Okay. This election step is the interesting part of a consensus protocol, which is entirely unnecessary with IRONdb because of the way these data structures work.
Fred: That's right. It doesn't have that piece, which at scale turns out to be one of the more difficult pieces to make reliable. So you eliminate that hard problem completely.
Jonan: And this is still a relational database?
Fred: It's a time-series database.
Jonan: It's a time-series database. But is it structured data? Is there a schema?
Fred: There's not really a schema per se. But IRONdb, like all other time-series databases, ingests metrics. So you have a key, the metric name, a value, and some metric tags associated with it.
Jonan: And then you have a single value for any given metric you record. Have you heard of TimescaleDB, the Postgres-based one?
Fred: I have, yes. Very cool stuff.
Jonan: Which is one of few examples that is using a relational database to achieve this thing. I know that there were some advantages to their product that had to do with the cardinality of data as opposed to some other solutions but that you didn't maybe want to store so much breadth in your individual metrics. But IRONdb would presumably account for that with these data structures.
Fred: It's going to store the same type of metric structures. But Timescale, I believe (I'm not an expert in that either), is an extension to PostgreSQL that allows you to store time-series data within a Postgres database using hypertables.
Fred: I met the founder, Ajay, and that's actually very cool stuff they have going there.
Jonan: I actually had an opportunity to play with Timescale at length and quite enjoyed the product. But time-series databases are just generally fascinating to me. I was here at New Relic during the period of time when they were developing the time-series database that backs some of our products, what used to be known as Insights and is now part of the New Relic One platform generally. There are some very complex problems to solve. Most of the time, if you're working at a software company and someone comes to you and is like, "What we're going to do is we're going to write a database," you're like, "Wrong answer." No matter what the question was: wrong answer. But that's uniquely not true in observability spaces, I feel like.
Fred: Yeah, it's a question that if you find yourself asking it, you already tried a number of different approaches and really have a problem that's a little bit different, significantly different enough from the options that are out there to warrant doing it.
Jonan: Yeah, absolutely. So you have been focused on the SLO work at there at Zendesk.
Fred: I have.
Jonan: And this is a service-level objective, and that concludes the entirety of my knowledge. Well, not really. But I'm sure, as someone who describes themselves as a SLOgician, you have much more depth there. Maybe explain to our audience what it is we're talking about.
Fred: Sure. So an SLO is a service-level objective. Everyone's probably heard of service-level agreements. SLAs are legally binding contracts with customers about the availability of a given service that you're providing, whereas an SLO is more of an internal target for your engineering organization that is less stringent than your SLA. You want to make sure that you're hitting your SLO because if you breach your SLO, then you are at risk of breaching your SLA. So it creates this buffer zone between the SLO and the SLA, like a DMZ, that allows you to provide service that's great for your customers without breaching the legal obligations.
Jonan: And so you're not legally bound to anyone. They don't make your engineers sign contracts.
Fred: But it's an internal target for your systems.
Jonan: For your system. So you internally commit to five nines. Is that an appropriate number of nines?
Fred: I don't know many folks besides telecoms that go that far.
Fred: But we have our own internal targets, and we set those up for each of our services for specific endpoints. And it's really product-driven. Product managers have a lot of input on the targets that we need to meet there.
Jonan: And so as a SLOgician then, what is your role in that process? Do you get to beat the engineers about the head when you miss a target?
Fred: That's not my role now. [chuckles]
Jonan: No? Okay.
Fred: So when I came aboard Zendesk about two years ago, they were in the process of trying to roll out SLOs and corresponding error budgets across the entire organization of, at the time, about 1,000 engineers, which is more than a few. And so, I'd been focusing heavily on understanding SLOs and error budgets deeply at that time. And so my function was essentially to democratize SLOs and error budgets, to figure out a way to explain these concepts to that wide set of engineers so they could be easily understood. So essentially, I wanted to translate those into formulas that they could use to take certain metrics, plug those into a formula for an SLA, create a structured SLO from that, and from there develop an error budget.
Jonan: And then their error budget would be used to target their actual engineering work.
Fred: That's right. So say your SLO is that you want a certain service to hit 99.9% availability for seven days, and the seven days part is important and one that a lot of folks leave out with SLOs. Then you're allowing yourself 0.1% errors over those seven days. As you accumulate errors over that seven-day period, you're going to consume parts of that error budget. And when you hit, say, 50% to 80% of that 0.1% overall error budget, that's where you want to shift your focus to reliability work because that consumption of your error budget to some degree has used up customers' goodwill. And rather than exhausting your error budget, you want to stop focusing on features for a little bit and focus on reliability. So error budgets are really used to prioritize reliability work.
Jonan: And you recommended 50% to 70% as being the time when you look up, and you think okay, we need to pause whatever new feature development work we're doing right now and really focus on solving this issue in front of us. That 0.1% that we're exceeding we're putting ourselves at risk over this period that we've agreed upon for our SLO. So this is a built-in trigger to set off some refactoring or some reliability work.
Fred: Yeah. So we have monitors with our monitoring provider Datadog that will track this error budget consumption. Fifty percent to 70% or 80% is the rough number. Yours might vary depending on how fast you consume that error budget. If you're consuming it very quickly, you might say, "Hey, we're going to run out of our error budget in two days. We need to stop now," versus if you're consuming it fairly linearly over that seven-day period, you might have a little bit more time to shift your focus.
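The arithmetic Fred describes can be sketched in a few lines: a 99.9% target over the window leaves a 0.1% error budget, and you flag reliability work once a chosen fraction of it is consumed. This is a simplified sketch of the idea, not Zendesk's actual formula, and the 50% threshold is just the rough number from the conversation:

```python
def error_budget_status(total_requests, failed_requests,
                        slo_target=0.999, alert_fraction=0.5):
    """Report how much of the error budget a service has burned.

    slo_target: availability objective over the SLO window (e.g. seven days).
    alert_fraction: fraction of budget consumed that triggers a shift
                    toward reliability work.
    """
    budget = (1.0 - slo_target) * total_requests        # allowed failures
    consumed = failed_requests / budget if budget else float("inf")
    return {
        "allowed_failures": budget,
        "budget_consumed": consumed,                    # 1.0 == fully spent
        "shift_to_reliability": consumed >= alert_fraction,
    }

# 10M requests this window at a 99.9% target -> 10,000 allowed failures.
# 6,000 failures so far burns roughly 60% of the budget, past the threshold.
status = error_budget_status(10_000_000, 6_000)
assert status["shift_to_reliability"]
```

The burn-rate refinement Fred mentions, stopping earlier if you are on pace to exhaust the budget in two days, would compare `consumed` against elapsed window time rather than against a fixed fraction.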
Jonan: I want to also ask you about the piece that you said a lot of people leave out, the timing. That very often someone will make an SLO, and they will neglect to put per 7 days on there, so 99.9% uptime per 7 days. The timing is important, of course. But what is the alternative? What would people do instead?
Fred: They'll focus on the two or three or four nines piece and forget to specify what time period. If I say, I want my service to be 99.9% available, over what time period is that? Is that over an hour? Is that over a month or over a year? Because the dynamics of that really drive how you approach reliability.
Jonan: And some people could conceivably make this a yearly goal.
Fred: They could.
Jonan: Or even quarterly. And then, somehow, the fiscal year is tied to the reliability of your services, which should be entirely unrelated concepts.
Fred: And often, folks will have -- I suggest to our teams to have multiple SLOs. You could have three nines and have that for a week, have that for a month, and have that for a quarter, and you could adjust the percentage. I call that the service objective, whatever it is, two or three nines. And you can adjust that differently for a week or for a month or for a year.
Jonan: That in any year you want to have three nines. But in any given week, you'll settle for one.
Fred: That could be, yeah. You can spread out the reliability. You don't want to consume all of your customers' goodwill in one quarter and then be perfect for the rest of the year.
Jonan: Right. It's not necessarily a fair thing to spread that -- You don't get to amortize quite the same way when you're talking about goodwill, right?
Fred: No. And a lot of this is very subjective, and it's a little bit of an art form and comes with feedback from customers. And I call that calibrating your SLO or your error budget because if you're hitting, say, three, three and a half nines and your customers are unhappy, either you're measuring something incorrectly, or you have customers with, in some cases, unrealistic expectations, or there's some part of your process that's off. So you have to calibrate that subjective part, which is customer happiness, with the objective part, which is the measurement of the SLO and then the error budget.
Jonan: And so in some cases, you have teams that -- I mean, I imagine a scenario where the business leadership has certain ideas about what is a reasonable expectation, and the engineering teams have a different idea about what is a reasonable expectation. Reconciling those, I think, is maybe one of the more complicated problems to solve in software.
Jonan: Do you have any advice for people about how to help?
Fred: So what I suggest there is SLIs, which are service-level indicators that drive SLOs. Those are really what engineers have control over. And an example SLI might be: I want my home page request to return in under 500 milliseconds. That's something that an engineer can test out in development, in staging, and then they can say, "That's a reasonable technical goal." And then, I can take that SLI and attach a 99.9% availability and a seven-day time window to that. And at that point, it becomes an operational target as opposed to more of a development/technical target. And that's where a product manager is going to decide if this is reasonable. And so the product manager is stuck in the middle there between the engineers trying to say, "How fast can we make it?" and the exec team saying, "We need it to be this reliable." And so they are the ones brokering those trade-offs.
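Fred's example, an SLI (requests under 500 ms) lifted into an SLO by attaching an objective and a window, can be expressed as a small check. The latency numbers here are made-up sample data, purely for illustration:

```python
def sli_good(latency_ms, threshold_ms=500):
    """SLI: a request is 'good' if it completes under the threshold."""
    return latency_ms < threshold_ms

def slo_attainment(latencies_ms, threshold_ms=500):
    """Fraction of requests meeting the SLI over the SLO window."""
    good = sum(1 for l in latencies_ms if sli_good(l, threshold_ms))
    return good / len(latencies_ms)

# Hypothetical latencies sampled over the seven-day window.
window = [120, 340, 480, 510, 95, 240, 610, 180, 330, 450]
attained = slo_attainment(window)           # 8 of 10 requests under 500 ms
meets_slo = attained >= 0.999               # the 99.9% objective
```

The engineer owns the 500 ms threshold (the technical goal); the product manager owns the 99.9% and the seven-day window (the operational target), which is exactly the split Fred describes.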
Jonan: And the SLIs driving the process from what I'm understanding, you have the engineers coming in saying, "This is a reasonable technical target," that 500 milliseconds to load the page then rolling those up into an SLO and then negotiating in between leadership and the engineering teams through a product manager.
Fred: You have product managers there, and they're the ones who are telling the executive folks, "Well, if you want more reliability, a higher success objective to meet, we have to spend more time on reliability work and less time on feature work."
Jonan: And being able to negotiate that trade-off in real terms, being able to say, "Look, the home page is not fast enough right now. We have exceeded this SLI that we set, and that's why we end up violating this SLO, so we can focus on that work. I've talked to the project team; with two solid sprints, they feel like they could knock this out and get that number down. It involves some pretty intense work. Are we willing to commit to making that change right now, or would you like to focus somewhere else?"
Fred: That's right. Do you want to focus on features to sell more software, or do you want to focus on reliability to ensure existing customers are happy? That's always been the big trade-off with software or SaaS companies.
Jonan: I often find myself making the case to new engineers. I talk to a lot of people coming through code schools and bootcamps, and I talk to them about a lot of things. Having come from a non-traditional background myself, I have a lot of thoughts there. But one of the things I often advise them to do is to build in their refactoring and their reliability work to their estimates. I consider that the expectation of a professional software engineer. If you are going to tell me how long it takes to build a feature, tell me instead how long it takes to build that feature in a reliable and sustainable way that doesn't slow us down. And I think that introducing technical debt because you are trying to move rapidly and develop a feature to get the approval of business leadership, while it may lead to short-term gains, it's a net loss for the company and the product. From that perspective, these SLIs and maintaining them -- I can imagine a world where I have to keep the home page loading time under 500 milliseconds. And I can look at that and make sure that I don't violate that. That's a pretty easy thing to test for. But there are more complex SLIs that are more difficult to test, and maybe you don't know that you're violating them until they're in production.
Fred: Yeah, that's really where the success objectives come in. You can test something out in development, and it works fine, but then you put it under a production workload, and you actually get a much richer view of how that's behaving. Whether it goes into an account with one user or it goes into an enterprise-level account with 10,000 users, that's going to behave very differently because you've got databases driving that which have indexes that behave differently at scale. So there are a number of factors once that hits production.
Jonan: It's bold of you to assume that I use database indexes.
Jonan: I appreciate that credit you're giving me. So I often like to ask people to try and make a prediction for the industry. I think, given your role, you are uniquely suited to talk a little bit about the direction that these sorts of things are headed. I think that there's pretty wide variability in terms of implementation of these strategies of observability and DevOps methodologies across the industry. What sort of prediction might you make that if we were to revisit this conversation in a year, you feel pretty confident you would have won the point?
Fred: Well, that's a tough question. I'm going to have to go with observability and monitoring vendors providing SLOs and error budget-specific features. I've seen that emerge with a couple of vendors. Now, one of the challenges there is when I started this journey at Zendesk, implementing SLOs and error budgets, I read every single book. I read the Google SRE book. I looked through their workbook; I watched videos. And while they do a great job of giving an overview, the implementation details vary a little bit between each of those sources. And really, what I'd like to see there is something standardized because when you're implementing these SLOs and error budgets, these are contracts that you have essentially with your company and with your users. And when you apply them at scale, that's really where you want these to be as accurate as possible. Because if you're going to shift from feature work to reliability work, that's an expensive shift. Engineers are expensive resources. And so you want to make sure that your measurements and what you're looking at there in terms of analysis are on point and as accurate as possible. And that's where I do see these vendors providing these features in their products. But the implementation details around those are slightly different. So I'd like to see those coalesce. I think they will at some point. We may not see that in a year, but I do think we'll see the proliferation of these features.
Jonan: And you're looking for almost something like OpenTelemetry standard for --
Fred: I was thinking about that.
Jonan: So if we're making a feature request to New Relic's product team right now live on the air, what is that feature specifically? You could tell them to build one thing, and they'll actually do it because I'm going to send them this episode.
Fred: Challenge accepted. At SREcon in December, I gave a talk about the exact formulas that we use to structure SLOs and error budgets. Take those formulas, some of which build on the Google SRE book's implementation suggestions, implement them as SLO and error budget features in your product, and make them available so that it's as easy as possible for engineers and developers, or as we say in the industry, practitioners, to get highly accurate SLOs and error budgets for their systems.
Jonan: I'm in. You sold me on it. I'm going to go to the product team and just -- I actually have very little sway in this equation. But as soon as they make me Chief Product Officer, I'm on it. Actually, one of the primary roles of DevRel is exactly this thing that we are able to surface these sorts of feature suggestions and take them directly to the people who may actually end up implementing them. So I'm very hopeful that a year from now we have this. And failing that, I'll just build it myself because I'm definitely qualified to get into the production code and just muck about until it works.
Fred: Well, yeah, that's what I'd like to see because as a SLOgician, my job is to go out and make that stuff simple. And I'd like companies out there to put me out of a job so I can focus on other fun things.
Jonan: I very rarely plug our own product here, but I will say that I'm quite pleased to say that I have an idea of how I could build this thing myself because I can just use a Nerdlet to do it because we have this capacity to build custom widgets into the platform all the time. It's an open development platform. So I could totally go and write some React to do this thing. I might actually try it if I can't get them to build your feature. So with that, we have another question that I very often ask, which is that there are likely people listening to this episode today who aspire to be in your shoes someday, the up-and-coming SLOgicians of the future, and I wonder what advice you might have for them.
Fred: I'd say start by reading the Google SRE book, read the workbook, read the service-level objectives book. Go watch the SLIs, SLOs, SLAs, oh my! videos by Liz Fong-Jones and Seth Vargo. Go watch any SLO-related SREcon conference talk. Disclaimer: I've given a few of those. Attend the upcoming SLO conference; I think that's in a few weeks. There's a paper out there, Nines are Not Enough, whose author coined the term SLOgician. Go read that. Read any paper that has to do with SLOs. Educate yourself. And while you are educating yourself, start practicing. Write out SLIs, SLOs, and error budgets for your services. Teach other folks how to implement those for their services because really the goal there is to use these to make your own service more reliable and to make your customers happy.
Jonan: And that's what it all comes down to in the end, right?
Jonan: I really appreciate that about the observability industry and the direction that we're headed generally that we are better able to focus on the customer, which should be in the forefront and maybe has not always been there.
Fred: Yes. As these services get larger and more mission-critical, reliability and customer happiness is really at the front of what we're trying to deliver.
Jonan: I absolutely agree. Well, thank you so much for coming on the show today. I very much appreciate your time. If people wanted to find you on the Internet, where would they go looking?
Fred: You can find me on Twitter. My Twitter handle is @phredmoyer, P-H-R-E-D-M-O-Y-E-R.
Jonan: Well, thank you so much, Fred, with a P-H-R-E-D. Again, I really appreciate your time. I look forward to having you back here.
Fred: Thank you, Jonan.
Jonan: We're going to prove you wrong about that product feature. This is the one prediction that I'm hoping comes true. When I have people make predictions on the show, I talk about zinging them a year from now, but I hope a year from now we are both right and that thing gets developed. That would be really useful.
Fred: Yeah, I think it's going across all vendors in the industry.
Jonan: Absolutely. Okay. In closing, I want to remind you all that you should be coming to FutureStack to the Nerd Island track to hang out with The Relicans because it's going to be awesome, and it's free. And you get to be a cute animal and have proximal audio conversations with your friends on our virtual island. So if you are not yet registered, go to therelicans.com/futurestack. And I will see you there. I hope you all have a wonderful day. Thank you.
Thank you so much for joining us. We really appreciate it. You can find the show notes for this episode along with all of the rest of The Relicans podcasts on therelicans.com. In fact, most anything The Relicans get up to online will be on that site. We'll see you next week. Take care.