Every joke is usually deeply rooted in the human need to tell some facts and a story the way the teller sees fit. And good jokes are good not for a few, but for many. That said, let me tell you an old joke that originated from some historical facts about the Soviet Union, outlived the years and the Soviet Union itself, and is somehow still alive and relevant today.
At an international exhibition, the Soviet Union announced a first-of-its-kind mechanized, robotic barber machine. Curious spectators visited the venue and saw a huge grey box with a head-sized hole in the middle. Most asked the same question: "How does this work?" The man in a white lab coat standing by the machine explained that this was a new, breakthrough invention that removed the need for human barbers to cut anyone's hair. The machine could be installed anywhere and even dropped by parachute into a combat zone. All you needed to do was put your head into the hole, and the machine would do the rest. At the end, your hair would be properly trimmed. "But doesn't everyone have a differently shaped head? How does the machine deal with that?" the spectators asked. "Oh, no problem," the presenter told them. "That difference lasts only until the first cutting session. After that, there is no difference."
Like any joke, this one has a hidden message, and of course the real message is not about cutting hair. It is about the unification of views and ways of thinking in what you could call a "totalitarian state".
How could this be related to the Art of Monitoring and Observability, you may ask. In 2016, Beyer, B., Jones, C., Murphy, N. & Petoff, J. came out with the book Site Reliability Engineering: How Google Runs Production Systems, in which they introduced the idea of the "Four Golden Signals".
In general, this book is very useful for a practicing SRE. It has some very decent tips on risk management, toil elimination, tracking, troubleshooting, and many other subjects. But it is not an unquestionable "Bible". Every SRE should read it and apply what is applicable to their own practice, but that is about it.
One of the most questionable statements in this book is the idea of "golden signals". They are latency, traffic, errors, and saturation.
And this is where the problems begin. You may ask: why do I see this selection of signal categories as "problematic"? Well, as in the opening joke, there are "different heads", and that difference must be maintained even after the "first cut". The problem is that many people, including many observability practitioners, trust Google's opinion on the matter without any questions or a shred of doubt. So every time you try to apply those four generally useful categories to all IT practices and monitoring needs, you sometimes have to "hammer in" those categories by force. Why?
First, there are multiple personas in the observability business, and those personas serve different needs by providing different services. The "golden signals" as defined by Google are good for a very narrow slice of IT professionals, mostly those employed in the subset of "Site Reliability Engineer" roles who are responsible for "reliability". Those signals are not "golden" even for various tasks that a traditional SRE is often responsible for. Here are some tasks that require access to observability data and are not covered by the "golden signals" categories:
- Performance monitoring.
- Capacity planning.
- External resources monitoring.
- Root cause analysis.
This non-exhaustive list of tasks does not even touch on the needs of:
- DevOps "personas", whose tasks mainly revolve around monitoring processes and pipelines.
- SecOps "personas", who do care about traffic, but who are also involved in signal and pattern analysis in very broad terms.
- Data warehouse manager and administrator "personas", who are responsible for capacity and resource management and monitoring, as well as for the management and observability of data and application pipelines.
And this only scratches the surface. Numerous "personas" in the IT business have very different ideas about what a "golden signal" is in a particular context and for specific, sometimes not well-defined, purposes. Those ideas are rooted in the fact that different metrics, and sometimes calculated and compound metrics, make more sense for specific tasks.
So what are the classic Google "golden signals" good for? As mentioned before, latency, traffic, errors, and saturation are good for measuring reliability, and pretty much reliability only.
Latency is a measure of the time some operation or request takes. When a "persona" is responsible for "reliability", latency is one of the key KPIs. But for other purposes, "capacity planning" for example, latency is a secondary KPI and capacity-based metrics become more important.
Traffic is a non-descriptive measure of the amount of either bytes or requests sent to some endpoints. When you are taking care of "reliability", traffic is usually directly related to latency and gives you an idea of how well your endpoints handle the load. The interests of a DevOps "persona" usually have no direct relation to measuring load, so this is secondary telemetry for that "persona".
Errors are critical for operational "reliability" tasks, but they are secondary at best for capacity planning and resource management. They are also secondary for a SecOps "persona", because for this class of IT personnel the analysis of signals and patterns is more crucial.
Saturation, however, is the class of KPI that we can call the "most universal" across different personas. Most "personas" need to observe some kind of exhaustible resource as a primary task. So saturation is, more or less, a universal "golden signal".
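To make the four categories concrete, here is a minimal Python sketch that derives them from a window of request records and then hides the signals this article calls secondary for a given "persona". Everything in it is an illustrative assumption rather than anything prescribed by the SRE book: the `Request` record, the function names, the nearest-rank 95th percentile used for latency, the "used / capacity" ratio used for saturation, and the `SECONDARY` table, which simply encodes one reading of the paragraph above.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    duration_ms: float
    failed: bool


def golden_signals(requests, window_s, used, capacity, pct=95):
    """Compute the four categories for one observation window.

    `used` and `capacity` describe some exhaustible resource
    (memory, connections, disk) in the same unit.
    """
    if not requests:
        raise ValueError("need at least one request in the window")
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: smallest value covering pct% of the samples.
    rank = max(1, math.ceil(pct / 100 * len(durations)))
    return {
        "latency_ms": durations[rank - 1],
        "traffic_rps": len(requests) / window_s,
        "error_rate": sum(r.failed for r in requests) / len(requests),
        "saturation": used / capacity,
    }


# Signals described above as secondary for each persona or task.
# This table is an assumption drawn from the text, not a standard.
SECONDARY = {
    "reliability": set(),
    "capacity planning": {"latency_ms", "error_rate"},
    "devops": {"traffic_rps"},
    "secops": {"error_rate"},
}


def remaining_signals(persona, signals):
    """Drop the signals that are secondary for the given persona."""
    secondary = SECONDARY[persona]
    return {name: value for name, value in signals.items()
            if name not in secondary}


# Ten requests of 10..100 ms in a 5 second window, one of them failed.
window = [Request(duration_ms=10.0 * i, failed=(i == 3)) for i in range(1, 11)]
signals = golden_signals(window, window_s=5, used=6, capacity=8)
print(signals)
print(remaining_signals("capacity planning", signals))
```

What remains after filtering is not automatically "primary" for that persona. A capacity planner, for example, would still add capacity-based metrics that none of the four categories provide, which is exactly the gap discussed above.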
So, what is the verdict? What conclusions can you draw from this short article? First and foremost, that there are no "golden signals", as defined in the Google "SRE Book", that are universal across the board. Different environments and different "personas" define different subsets of metrics and categories that, "here and now", provide an adequate and accurate view of a specific problem or series of problems through the metrics or categories that fit best. And yes, just as there are no "golden metrics", we can say with certainty that building a computerized barber machine that takes into account the different facts about different human beings is a task that is not only hard, but on the brink of feasibility and practicality.