The Relicans

Mandy Moore

Kibbles & Bytes – Taming Production with Charity Majors

Jonan Scheffler interviews Co-Founder and CTO of Honeycomb, Charity Majors, about successfully shipping code, Chaos Engineering, and why it’s important to embrace feature flags.

Should you find a burning need to share your thoughts or rants about the show, please spray them at While you're going to all the trouble of shipping us some bytes, please consider taking a moment to let us know what you'd like to hear on the show in the future. Despite the all-caps flaming you will receive in response, please know that we are sincerely interested in your feedback; we aim to appease. Follow us on the Twitters: @ObservyMcObserv.


Jonan Scheffler: Hello and welcome back to Observy McObservface, proudly brought to you by New Relic's Developer Relations team, The Relicans. Observy is about observability in something a bit more than the traditional sense. It's often about technology and tools that we use to gain visibility into our systems. But it is also about people because, fundamentally, software is about people. You can think of Observy as something of an observability variety show where we will apply systems thinking and think critically about challenges across our entire industry, and we very much look forward to having you join us. You can find the show notes for this episode along with all of The Relicans podcasts on We're so pleased to have you here this week. Enjoy the show.

Hello and welcome back. I'm Jonan. And I'm joined today by Charity Majors. How are you, Charity?

Charity: Hey, I'm doing all right. It's Friday. I'm on PTO. It could be worse.

Jonan: You're on PTO today, and you're doing a podcast with me anyway?

Charity: Well, I stood you up last week. [chuckles]

Jonan: I really appreciate it. Thank you so much. I am myself very much looking forward to the weekend although I had a little bit of PTO last week, a couple of half days. So I need to do a little makeup work, maybe. So what are you going to do with your vacation?

Charity: I'm going to New York City.

Jonan: You're going to New York?

Charity: Yeah.

Jonan: Nice. I haven't been to New York in a long time. I've missed travel so much. I've done a little bit recently.

Charity: I'm going to meet Jessitron and Sarah Novotny out there for Airbnb for girls' week out in the town. We're just going to eat and drink and sleep, basically.

Jonan: This sounds amazing. I'm jealous. [laughs] So you work at this one company that does this one thing. What is that?

Charity: We do observability or o11y as I like to call it.

Jonan: o11y.

Charity: Yeah. It's so cute. How can you not?

Jonan: I know, right? I love it. O11y observability or o11y at a company called Honeycomb?

Charity: Yeah,

Jonan: And if you haven't had a chance to check it out, I absolutely encourage it. It's a pretty cool product. I got to try it for the first time just a couple of weeks ago, actually, and I was quite impressed. I like what you are building over there.

Charity: We have a really generous free tier. Lots of people seem to not know about it. But I feel like as an engineer, I personally wouldn't want to sink any engineering cycles into playing around with a product if I know that their free trial is just going to evaporate. So we have a pretty generous free tier that people can just play around with.

Jonan: Yeah, I am a big fan of that strategy. Actually, just today, I tried out a new product. There was a product planning. I was trying to get a map of some event strategy and stuff we're doing on the team. And I tried it out, and the free tier is just totally crippled. There is no useful feature in the free tier.

Charity: What?

Jonan: Yeah, it expires. They said no credit card required, which is great. But also let me use the product if you're going to do that.

Charity: That's no fun.

Jonan: So what sorts of things have you been up to lately over there, aside from a potential New York trip?

Charity: Well, we just had our o11ycon, our Observability Conference, a couple of weeks ago. They're super fun, and shipping code, doing stuff.

Jonan: Shipping code and doing stuff. I wonder if we could talk about how exactly that's accomplished because I'm very interested to hear about different companies' workflows and how they actually go about shipping.

Charity: So o11ycon was sort of...I don't know if you knew this or not. There are not many ops founders out there, [chuckles], which I think explains a lot of the, let's say f*cked up situations that I've gotten dragged into over and over throughout my career, just like, what are we even doing? The first question I always ask on day one is, are you backing up your database? And to this date, the answer has always been no. [chuckles] So day one, you’re an infrastructure person. Pro-tip, databases not backed up. So Honeycomb was lucky enough Patty was often back over here. We actually had an ops co-founder for better or for worse. And so one of the first things I did was set up CI/CD to automatically deploy every 10 minutes from Cron basically, just if any changes have been committed in the past 10 minutes, it kicks off CI/CD, creates an artifact, and deploys it everywhere. And it's great.

People often get scared when I talk about this, but the thing is, it can be really challenging to dig yourself out of a pit when you've had this long and agonizing CI/CD process. But if you start off right and you just hold the line, it's the easiest way to write and build and ship software. I think of it like the myth about Alexander the Great and how he would pick up his horse every morning for breakfast when he was a little boy so that when he grew up, he was able to lift a full-grown horse. He just lifted it up every day once a day. It's so easy when you just start that way and just stay that way. Never fall into the pit of despair.
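The scheduled auto-deploy Charity describes can be sketched in a few lines of Python. This is a minimal illustration and not Honeycomb's actual tooling: the 10-minute interval comes from the interview, while `should_deploy`, `run_scheduled_deploy`, and the `build` and `deploy` callables are hypothetical names standing in for a real CI/CD system.

```python
from datetime import datetime, timedelta

# Cadence mentioned in the interview: check for new commits every 10 minutes.
DEPLOY_INTERVAL = timedelta(minutes=10)


def should_deploy(last_commit_at, now, interval=DEPLOY_INTERVAL):
    """Return True if the most recent commit landed within the last `interval`."""
    return timedelta(0) <= now - last_commit_at <= interval


def run_scheduled_deploy(last_commit_at, now, build, deploy):
    """One scheduler tick: build and deploy only when there are fresh commits.

    `build` and `deploy` are hypothetical callables; a real pipeline would
    also run tests before producing the artifact.
    """
    if not should_deploy(last_commit_at, now):
        return None  # nothing new was committed, so this tick is a no-op
    artifact = build()
    deploy(artifact)
    return artifact
```

A scheduler such as cron would call `run_scheduled_deploy` once per interval. Keeping the decision in a pure function (`should_deploy`) makes the cadence easy to test without touching a real build system.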

Jonan: And backing it out the other way is really hard. If you've worked out a world where you are deploying in a six-week cycle or something, or we deploy once a quarter, it's really hard to get to 10 minutes. So you ship code from your main branch automatically every 10 minutes. If I'm about to merge something, there's no other step that is actually “ship this to production.” It just happens.

Charity: Well, I'm simplifying a bit. It's matured a bit as we've grown, of course. So it does run tests. We run a lot of tests, et cetera. You can put a hold on whatever, and now we have production. Then we have the dogfood cluster, which watches production, and then we have the kibble cluster, which watches dogfood.

Jonan: [chuckles]

Charity: And so it actually auto deploys to kibble first. And then I think it waits like an hour and then it promotes to dogfood, and then it waits like an hour, and then it promotes to production. And we do use feature flags. So we're not crazy, but it's super fast. It's great because you're writing code. You write some instrumentation while you're writing it. And then you merge, and you know within 10 minutes, or so you can go, and you can look, and changes will be there. And you can look at it through the lens of your instrumentation and say, "Is it doing what I expected it to do, and does anything else look weird?" So super fresh in your mind.
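The kibble, dogfood, production promotion order can be sketched as a small state function. This is an illustrative sketch under stated assumptions: the stage order and the roughly one-hour wait come from the interview, while the `Release` type, the `healthy` signal, and the exact soak time are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Order described in the interview: kibble first, then dogfood, then production.
STAGES = ["kibble", "dogfood", "production"]
SOAK_TIME = timedelta(hours=1)  # "like an hour"; the exact value is an assumption


@dataclass
class Release:
    artifact: str
    stage: str
    entered_stage_at: datetime
    on_hold: bool = False  # models the manual hold mentioned in the interview


def next_stage(release, now, healthy):
    """Return the stage the release should be in at `now`.

    `healthy` is a hypothetical signal, e.g. derived from error rates
    observed in the current stage.
    """
    index = STAGES.index(release.stage)
    is_last = index == len(STAGES) - 1
    soaked = now - release.entered_stage_at >= SOAK_TIME
    if is_last or release.on_hold or not healthy or not soaked:
        return release.stage  # stay put
    return STAGES[index + 1]  # promote one stage forward
```

A release only ever moves one stage at a time, and an unhealthy or held release simply stays where it is, which matches the "roll forward with a fix" approach discussed next.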

Jonan: And so you don't see your error rates spike on kibble, and then you know that you're on a good path. But at that moment, if you see something going terribly wrong in kibble, you could roll it back.

Charity: Yeah, you can pull it back, or we mostly just roll forward. Because you know what you've done, you're using small diffs. You know what you've done. You see the error. You have the context. You're like, oh, okay, I go fix it, check in another diff. There's very little rolling back.

Jonan: Yeah. In an ideal circumstance, there's never any rolling back. You're able to just patch and keep going, right?

Charity: Exactly.

Jonan: So then these different environments that you've set up, which are quite cleverly named, I love the names, dogfood...kibbles. Are they pretty close replicas of your actual production environment except in horsepower, presumably?

Charity: Basically, we Terraform everything. So there's no such thing as a staging environment that matches prod. The intention is not for it to even match prod, but why not? It's the same way of running systems. Yeah, they're pretty similar.

Jonan: This is something I actually wanted to ask you about because I know I've seen you talk about this a little bit about mirroring production just being kind of a fallacy, like, that's not real.

Charity: Yeah. If you have a system of any respectable size whatsoever, it is a fool's game. You cannot do it. You shouldn't even try. [chuckles]

Jonan: Because when you have a large production system, you're allowing yourself to forgo certain variables. If you were actually cloning your entire production infrastructure hypothetically, that is a possibility, but people, of course, aren't doing that because it's tremendously expensive and effectively a waste of a lot of resources.

Charity: You couldn't clone the traffic the same, like, concurrency. And even if you could, so the gold standard for testing in production and everything is (and I know this because I've done this. I've been the person who writes the software three times) to capture 24 hours’ worth of traffic, store it offline, take a snapshot of your database at the beginning and store it off, store it 24 hours, and then replay it using a bunch of workers. You can tune the concurrency and everything. That's the gold standard. And you still can't guarantee that you'll get the same weird things happening at the same time and pile-ons, and backoffs, and client behaviorism and whatnot. And even if you could, you still get the Michael Jackson problem, which is Michael Jackson can only die once, and you cannot predict when it's going to be. [laughter]
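The replay step of the capture-and-replay approach can be sketched as a pool of workers with a tunable concurrency setting. This is a simplified illustration, not the software Charity wrote: `replay`, `captured_requests`, and `send` are hypothetical names, and capture, storage, and database snapshot restore are out of scope here.

```python
from concurrent.futures import ThreadPoolExecutor


def replay(captured_requests, send, concurrency=4):
    """Replay captured requests through `send` using a pool of workers.

    `captured_requests` is a list of recorded requests (any shape) and
    `send` is a hypothetical callable that issues one request against a
    test system restored from the matching database snapshot.

    Results are returned in capture order, but the order in which workers
    actually execute requests is not deterministic. That is one reason a
    replay cannot reproduce production timing exactly.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send, captured_requests))
```

Raising `concurrency` pushes more simultaneous load at the target, but as the interview points out, even a faithful replay cannot guarantee the same pile-ons and backoffs that happened in production.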

So you really should just give up and embrace the uncertainty and lean into the idea that your systems are broken right now. They're broken all the time. And it's just a question of managing the chaos and learning to love it and keep it. This is why SLOs are so wonderful. You set a level at which your users are happy. That's the level that you aspire to, not perfection. Anytime you think that your system is running correctly, I have news for you; it is not. [chuckles] You just can't see the problems. You just can't see the bugs that are there.
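The idea of aiming for an agreed level rather than perfection is usually expressed as an error budget. The arithmetic below is a minimal sketch; the 99.9% target and the request counts are assumptions chosen for illustration, not figures from the interview.

```python
def error_budget(slo_target, total_requests):
    """Number of requests allowed to fail while still meeting the SLO."""
    return round((1 - slo_target) * total_requests)


def budget_remaining(slo_target, total_requests, failed_requests):
    """Fraction of the error budget still unspent.

    A negative value means the SLO has been missed for this window.
    """
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        return 0.0  # no failures are permitted, so nothing is left to spend
    return 1 - failed_requests / budget
```

With an assumed 99.9% target over 1,000,000 requests, `error_budget(0.999, 1_000_000)` allows 1,000 failures. If 250 requests have failed, three quarters of the budget is still available, so the system is "broken" in small ways yet still meeting the level users are happy with.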

Jonan: You're just lacking that observability piece. Your o11y is not what you think of it.

Charity: Observability. This is a funny thing that happens; true story, when we're rolling out Honeycomb with customers is we'll be rolling it out, and they go, "Oh my God, there's a bug. There's a bug. These errors we just found them." And we're like, "Yes, grasshopper, there are many bugs. They've been there since time immemorial. Can we get on with the deployment now?"

Jonan: [laughs]

Charity: "God, there's another bug." And we're just like, "Oh my God." [laughter]

Jonan: Yes, there are. That's how it works.

Charity: Bugs everywhere. [chuckles]

Jonan: And you just have to learn to live with the chaos and push through it.

Charity: Yeah, which is not to say you shouldn't fix the bugs. It's great to have a tool that rewards your curiosity. We're engineers. If we see something that's broken and we're empowered to fix it, we want to fix it, and we get the dopamine hit. Oh yeah, fix the thing, make customers happy. And that's great, but it's not the same as entering into the illusion that our systems are perfect. [chuckles]

Jonan: Right. I guess in that vein, then, you brought up chaos. How do you feel about chaos engineering?

Charity: I think that chaos engineering without observability is just chaos. I see people self-inflicting problems. And if they don't have sufficient telemetry to be able to clean up after themselves, and if they're finding ten days later, two weeks later, oh, this is still running an old version from when we broke it, then you're not tall enough to ride this ride. [laughter] There's no point. You should probably focus on fixing the things that you already know about that are biting you in the ass every day. And when you run out of those things, then start inflicting some more on yourself. But I do think that generally, the trend of the past five years towards the center of gravity swinging away from pre-production towards production tooling was 100% overdue and 1000% good.

Jonan: Because you're testing in production, whether you know it or not.

Charity: Because production is the only environment that matters at all. Your code, if it's not in prod, it's dead code. It doesn't matter. And all the tooling, all the elaborate everybody-gets-their-own-individual-staging-container, and blah, blah, blah, it's all effectively wasted effort to a point. It's not always said, but I feel like up until recently, it was like the bulk of our energy went towards pre-production, and what was left over went towards production, which is backwards. [chuckles]

Jonan: Yeah, because you're spending all of your time as a developer working in this developer environment, which does not replicate production.

Charity: Developers love to live in the shiny, little castle on the hill where they're like, "I'm writing code. I merged. My job is done, woohoo." And no, you live in production too. [chuckles]

Jonan: Yeah, I like this approach. So if someone were listening, at the beginning of the episode, you said, "I often ask the question is your database backed up? And the answer, unfortunately, is usually no." Let's imagine then a hypothetical someone listening to this podcast today who answered no. And where do they start? We've talked about a few different things here. We've talked about testing in production and developer environments, CI/CD, backup your database. Assuming they have accomplished backup their database and none of the other steps, what's the next one?

Charity: Look at CI/CD. Get auto-deploy working. If at all humanly possible, swallow your fear; it will make your life a better place. Embrace feature flags, too, because decoupling deploys from releases is really liberating. There are some things that are cornerstone habits where if you get these things right, a lot of good things just automatically flow from it. And if you get them wrong, a lot of f*cked up things just automatically flow from it. And things like having CI/CD, auto-deploy, that's a really good habit. It's like starting your day with exercise. Things like using feature flags, it's like starting your day with a good breakfast. They're just really good habits that make the right things tend to happen. And what I see is a lot of teams who are just chasing their tails, or they're chasing after these symptoms and these pathologies. And like, oh, the diffs are too big, or code review takes too long. Oh, deploys are tricky, oh, we roll back a lot, oh, they're flaky, and they should fix the problems at the source. [chuckles]

Jonan: Because all of those other ones are just symptoms of the actual issue.

Charity: Yeah, they're downstream from it, yeah.

Jonan: So we got CI/CD set up and...

Charity: And we're auto-deploying.

Jonan: And we're auto-deploying. But is observability before or after?

Charity: Oh, I should hope that it's...

Jonan: Always.

Charity: Always. You should be instrumenting your code. It's the lens through which you see your work. Your job is never not easier with observability. The way people are doing it now, they're driving blind, and that's the hard way. Adding observability is not some extra challenging step that you take. It's like putting on your glasses before you drive down the highway if you're blind like me. It's giving yourself the gift of the information that you need to do your job well.

Jonan: And getting that information out should be part of the entire process. So we are now auto-deploying, and we have our feature flags. You mentioned that's probably around here, too, because that's a necessary step to achieve successful CI/CD, do you think?

Charity: Feature flags what they do is they sever the link between deploying and releasing. Deploying should be like your heartbeat, just like you do it all the time. It's not eventful. It just happens. It's not a big deal because that's how you deliver value to users. But product managers have all kinds of valid reasons to want to release things on a timer at a certain time of day, on a certain day of the week. And so, by deploying your code in a way that doesn't immediately release it to users, you know you…both ways.

Jonan: And you can do things like coordinate a marketing campaign around a new feature that's coming out.

Charity: Yeah, and you can do really cool things like use the toggle or a velvet rope to let 1% of traffic try out your change or only ship it to internal users first, or only ship it to your beta users who have signed up for the chaos. There are all these cool things. Deploys shouldn't be on or off. They shouldn't be binary. It's just part of the process of gaining confidence in your code. And it's a process.
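A percentage rollout with an allowlist, as described above, can be sketched with a stable hash. This is a minimal illustration rather than any particular feature-flag product's API: `bucket`, `is_enabled`, the flag name, and the user IDs are all hypothetical.

```python
import hashlib


def bucket(flag, user_id):
    """Map a (flag, user) pair to a stable bucket in the range 0 to 99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, user_id, percent=0, allowlist=frozenset()):
    """Decide whether `flag` is on for `user_id`.

    `allowlist` models internal or beta users who always see the change;
    `percent` is the share of everyone else who gets it.
    """
    if user_id in allowlist:
        return True
    return bucket(flag, user_id) < percent
```

Calling `is_enabled("new-ui", user_id, percent=1)` turns the change on for roughly 1% of users. Because the bucket is derived from a hash rather than a random draw, a given user gets the same answer on every request, and raising `percent` only ever adds users to the rollout.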

Jonan: I like this term very much. I would like to work for companies that you have guided down this path.

Charity: You know what? This is one big hammer to bring out if you're getting resistance from your org, which is that this makes it so much easier to recruit and retain great engineers because so much less of it is sh*t work and so much more of your brainpower is spent solving real, interesting problems that move the business forward every day.

Jonan: And you don't have to sell people on the dream in the first place. It's not even just the implementation of all of this. It's also convincing people that it's necessary and important.

Charity: Yeah, it's true.

Jonan: So I ask people a couple of common questions on this show; one of them is to make a prediction about what is coming, I guess, in this case for the observability industry. As I see you as someone quite well informed in this area, what do you predict is going to be a trend we see over the next year or so? So we can come back on this show, and we can discuss it for better or worse. What do you think the next year is going to look like?

Charity: So next year in observability?

Jonan: Like, I think that CI/CD advice that I give will finally be listened to, and the industry will improve.

Charity: I think that the OpenTelemetry project is going to get a lot more buy-in, and I foresee the end of people having to...we are so far behind where we should be just in terms of common knowledge about how to instrument your code. And there are so many lineages and philosophies and printf to worry about. I think that OTel is coming to bring some real sanity to the market. And it's really good for users because if you've instrumented your code with OTel, OpenTelemetry, you can switch from vendor to vendor without having to re-instrument your code. So everyone should be pretty excited about this.

Jonan: I am certainly excited about it. I have been waiting a long time to see that happen.

Charity: Amen.

Jonan: I get frustrated with proprietary file formats similarly. So as OpenTelemetry gains traction and people are able to move between vendors more easily, what do you think prevents the commoditization of those platforms?

Charity: I think vendors are going to have to compete on user experience, and I don't think we're anywhere near commoditization. I take it back. Metrics are being commoditized, and logs are being commoditized. That is not the same as observability, and observability does not have any goddamn pillars. [laughs] The experience of understanding your code in production is nowhere near commoditization.

Jonan: Yeah, I totally agree with you. So we have another question we like to ask, which is what advice you might give to someone. There are certainly people listening today who aspire to be in your position someday. You've got a very successful career in this industry. What advice would you give to them or to an earlier version of yourself if they're looking to follow a similar path?

Charity: Oh, goodness. My career has been a series of accidents. I don't know. I feel like most of my quote-unquote "success" is attributable to the fact that I've always had an overdeveloped sense of responsibility for the things that are around me. And that's not a bad path, I guess, feeling responsible for your code, and your services, and the people around you. I don't know.

I think that maximizing your opportunities. I think that remembering that you only get one. And it's your greatest asset. It's a multi-million dollar asset that you've got. I think people stay at bad jobs for way too long. For me, I had two jobs where I stayed for a year and a day, and I knew within the first week that it wasn't for me. I should have just left, but I felt like I owed it to them, but I didn't. I should have left. And in general, people stay at bad jobs for way too long. And we had this idea that great engineers make great teams, but it's exactly the opposite. Great teams are what make great engineers. So if you're not in a great team and you're in the first half of your career, make it your top priority to go find one and join it.

Jonan: That's some of the best advice I've heard all week. I feel like in my career that the times that I remember having grown the most were entirely dependent on the team dynamic that I was a part of, that strong mentorship piece, people who, like you, care about the things around them.

Charity: This is an apprenticeship industry. We learn from each other.

Jonan: Yeah, excellent advice. Well, with that, Charity, I suggest we call it an episode unless you have any parting thoughts for our listeners.

Charity: Happy weekend!

Jonan: Happy weekend. Enjoy your trip to New York. I really appreciate you coming on the show, and hopefully, we get to do it again soon.

Thank you so much for joining us. We really appreciate it. You can find the show notes for this episode along with all of the rest of The Relicans podcasts on In fact, most anything The Relicans get up to online will be on that site. We'll see you next week. Take care.
