There was a time when there was no Google. I remember it. Computers were bigger, and so was the network equipment. Databases were smaller, but database servers were quite sophisticated too. Communication circuits were slower, but the equipment that powered those circuits generated a huge amount of telemetry. Computer systems were simpler, but still, every node, every application, every process, every pipeline was producing a vast amount of telemetry. Outside of computer centers, any industrial application, from power generation stations and submarines to processing plants and others, was a source of telemetry. And this telemetry was vital for the engineers and technicians who were responsible for controlling this equipment, these processes and appliances.
And the processing of this telemetry was managed, somehow, to the best of the abilities of the personnel and equipment. But what I do remember is that the methodology of observation was quite polished and usually came from our colleagues managing industrial applications. And the life of these engineers was really tough and governed by the book.
There were paper-tape recorders, light and analog indicators, digital indicators, printers printing status messages and so on. For example, if you were an engineer controlling a nuclear reactor providing steam for a power generation plant in the 80s, you were expected to observe and control about 2000 near real-time parameters of the reactor, while modeling the configuration of the neutron field inside the containment in your head (the computers of the era took too long to do the proper calculations in time). The task seems nearly impossible. But inside this chaos, there was a method.
Engineers responsible for industrial applications developed and polished ways of grouping signals while at the same time establishing firm relationships between the different groups. When you are monitoring a complex real-time application, this was (and in many cases still is) the way to situational awareness and analysis. And I would like to mention that the first advanced expert systems were created in the 80s and 90s, providing significant help in creating instrumentation that aids decision making. And in the IT industry, we followed suit with our fellow engineers in industry: separate and group signals, establish dependencies and relationships, visualize telemetry using maps, charts and other mnemonic methods.
Then came this. The "Four Golden Signals". This principle, defined by engineers from Google, became the default view and, for a great many monitoring practitioners, replaced any thoughts of their own about the best practices of "what we shall pay attention to". Why bother? Google gave you the answer already: Latency, Traffic, Errors and Saturation.
- Latency - how long it takes for a request to be processed
- Traffic - a measure of the demand placed on the system
- Errors - the rate of errors within the system, measured and recorded
- Saturation - a measure of the utilization of the system, used to estimate how much more load it can handle
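To make the four definitions above concrete, here is a minimal sketch of how one observation window could be reduced to these four numbers. Everything in it is an assumption made for illustration: the `Request` record, the `golden_signals` function, the choice of the median as the latency figure, and the use of in-flight requests against a capacity as the saturation figure are not taken from Google's definition or from any monitoring product.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """Hypothetical record of one handled request."""
    duration_ms: float  # how long the request took to be processed
    ok: bool            # False if the request ended in an error


def golden_signals(requests, window_s, in_flight, capacity):
    """Reduce one observation window to the four golden signals.

    window_s  - length of the observation window in seconds
    in_flight - requests being processed right now (assumed saturation input)
    capacity  - how many concurrent requests the system can hold (assumed)
    """
    total = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_ms": median(r.duration_ms for r in requests) if total else 0.0,
        "traffic_rps": total / window_s,
        "error_rate": errors / total if total else 0.0,
        "saturation": in_flight / capacity,
    }


# Four requests observed during a two-second window.
window = [
    Request(100.0, True),
    Request(200.0, True),
    Request(300.0, False),
    Request(400.0, True),
]
signals = golden_signals(window, window_s=2.0, in_flight=30, capacity=40)
print(signals)
```

Note how much is thrown away in the process: four requests with their own context collapse into four numbers with no relationship to anything else in the system.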
So, what is not to like about this definition of "golden signals", i.e. the signals that you "have to monitor"?
If you can only measure four metrics of your user-facing system, focus on these four.
Everything. While the suggestion of the most important metrics for a user-facing system is right when it is taken as the note "if you can measure only four metrics, take these", I cannot stay silent when noting that this approach narrows the vision of the monitoring and observability practitioner. The whole assumption "if you can measure only four metrics" leads someone who is responsible for the health and well-being of the equipment and applications into a very tight spot. A spot from which the practitioner will have a very limited view of the scope and complexity of the system and applications he is responsible for. And I must say that this self-limitation has nothing in common with the problem of telemetry aggregation; it actually reduces the number of telemetry signals available to a human being. We will talk about this particular problem later, in another article.
I can see why Google came up with this suggestion. And the answer is simplification. If you have to deal with hundreds, thousands or even millions of hosts and you want to cut expenses, then you adopt a very strict approach of treating the equipment as "cattle", not as "pets". In this case, the idea that there are "golden signals" begins to make sense. I still disagree with this approach, but those are the conditions under which this idea does make sense and takes off.
But ... do you want that simplification? Can we afford that simplification in a particular environment? First of all, any simplification carries a cost. And in the field of monitoring, this cost may be lost causality and an impaired understanding of the relationships between different parts of the system.
And so, without ever answering the above-mentioned question, a great many monitoring practitioners joined the "Holy Crusade for the golden signals". For years, I have been observing people discussing and implementing particular ways of collecting the "golden signals", which telemetry counts as "golden signals", and how to interpret them. The search for "magical telemetry" has become an obsession for some. And many others employ the reverse, but still related, approach to monitoring and observability: they begin to collect all the signals that they think they need and put them in one flat pile. Then they begin their attempts to make sense of it. For example, two different monitoring platforms, Zabbix and Prometheus, are very prominent examples of these two different approaches: the "find that magical telemetry" approach in Zabbix and the "gimme all and I will figure it out" approach in Prometheus.
Needless to say, neither of those approaches is without flaws, and we must be aware of those flaws.
- The search for golden signals, as I already mentioned, puts us in a position where we voluntarily restrict our field of view as a reaction to the depth and variety of the types of telemetry generated by our system. As a result, we will have an incomplete view of the situation without the ability to understand how incomplete this view is.
- When we treat all our telemetry as "golden signals", we are literally swimming in an ocean of signals and generated telemetry, and while we are picking the ones we deem relevant, the scope and diversity of the telemetry make it very difficult to bring "a picture to life", i.e. to create a model which adequately portrays reality.
So, two sides of the same approach give us the same outcome: a limited view of the problem. And I do want to bring up the point that the visualization of telemetry has become less ... er ... visual. Charts and single-value displays rule the landscape.
What is a possible solution to this complex problem, you might ask. I propose to look at this problem from three angles.
- Search for a change instead of a threshold. I have already covered some of the ideas in this article, and I will return to and revisit this approach in more detail later in another article.
- Search for the pattern of changes between different telemetry. Again, some of the ideas have been reviewed there, and I will get back to this approach to how one shall "do the monitoring" in future articles.
- Telemetry grouping, tagging and aggregation. Here, I will focus on telemetry grouping and tagging.
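Before moving on to grouping and tagging, the first angle in the list above can be illustrated with a short sketch that compares a fixed threshold with a check for change against a recent baseline. The function names, the window size and the 50% tolerance are assumptions picked only for this illustration; they are not a recommendation and not the method from the articles mentioned above.

```python
def threshold_alerts(samples, limit):
    """Indexes of the samples that exceed a fixed limit."""
    return [i for i, value in enumerate(samples) if value > limit]


def change_alerts(samples, window, max_relative_change):
    """Indexes of the samples that deviate from the mean of the previous
    `window` samples by more than `max_relative_change` (0.5 means 50%)."""
    alerts = []
    for i in range(window, len(samples)):
        baseline = sum(samples[i - window:i]) / window
        if baseline == 0:
            continue  # a relative change is undefined against a zero baseline
        if abs(samples[i] - baseline) / abs(baseline) > max_relative_change:
            alerts.append(i)
    return alerts


# CPU utilization in percent: the load more than doubles but stays under 80%.
cpu = [20, 21, 19, 20, 45, 46, 44, 45]

print(threshold_alerts(cpu, limit=80))                        # → []
print(change_alerts(cpu, window=3, max_relative_change=0.5))  # → [4, 5]
```

The fixed threshold stays silent because 45% is still "healthy", while the change check flags the moment the behavior of the system shifted and then goes quiet again once the new level becomes the baseline.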
So, how do grouping and tagging help you to "open your horizon"? Before answering the question, let's see the difference between the ungrouped and untagged telemetry approach and the grouped and tagged one.
Here, we see that the user chooses the flat "golden signal" approach to telemetry and signals. In reality, this user will be either misinformed, if he chooses to minimize the number of signals to regain control over the incoming stream of data, or he will completely lose the horizon due to disorientation in the vast "forest" of telemetry data. This situation is also called "not seeing the forest for the trees".
In this example, all telemetry items are grouped and tagged, and you can use this as the simplest, but very effective, way of telemetry aggregation. And yes, this very simple technique will allow you to see your metrics as groups organized by group tags and as functions organized by function tags. And rather than ignoring metrics or being exposed to too many metrics, you will have logically ordered metrics.
And this is exactly what our colleagues from the world of industrial applications are doing. They separate telemetry by groups and functions, and when working on a problem, they focus on a prearranged set of telemetry relevant to that problem.
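As a rough sketch of what grouping and tagging can look like in practice, the snippet below attaches a group tag and a function tag to every metric and then pulls out the prearranged set relevant to a problem. The `Metric` class, the tag keys `group` and `function`, and all metric names are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Metric:
    """Hypothetical telemetry item carrying free-form key-value tags."""
    name: str
    value: float
    tags: dict = field(default_factory=dict)


def group_by_tag(metrics, tag_key):
    """Map every value of `tag_key` to the names of the metrics carrying it."""
    groups = defaultdict(list)
    for metric in metrics:
        groups[metric.tags.get(tag_key, "untagged")].append(metric.name)
    return dict(groups)


def select(metrics, **wanted_tags):
    """Return only the metrics whose tags match every requested key-value pair."""
    return [
        metric for metric in metrics
        if all(metric.tags.get(key) == value for key, value in wanted_tags.items())
    ]


telemetry = [
    Metric("db01.cpu.util", 35.0, {"group": "database", "function": "compute"}),
    Metric("db01.disk.iops", 1200.0, {"group": "database", "function": "storage"}),
    Metric("web01.cpu.util", 60.0, {"group": "frontend", "function": "compute"}),
    Metric("web01.http.rps", 250.0, {"group": "frontend", "function": "traffic"}),
]

# The same flat pile, now seen as smaller piles.
print(group_by_tag(telemetry, "group"))
print(group_by_tag(telemetry, "function"))

# Working on a database problem: focus on the prearranged, relevant set.
print([m.name for m in select(telemetry, group="database")])
```

Nothing is dropped here. Every metric stays available, but it is reached through its group or its function instead of being searched for in one flat pile.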
And now, you may say: hey, this is trivial, separating and tagging metrics ain't rocket science and I am doing that already. Congratulations! You are on the right track to total observability for your infrastructure and applications. But I am addressing this article to practitioners who are trying to see telemetry as a flat pile of data. With a message: "Start to separate a large pile of data into smaller piles".
And conveniently so, New Relic supports metric dimensionality by allowing you to attach your own key-value associations to a metric. You can use that key-value data later on in NRQL and other New Relic instruments. In New Relic, this technique is called "tagging", and you can read more about it here. So, help yourself and add tags to your metrics. Do not limit yourself to some ill-defined "golden signals". Use your tags everywhere, where appropriate: in the UI, alerts, data filtering and so on. Reduce data and alert fatigue by using this simple and effective way to aggregate and separate telemetry data. Do not forget to build dashboards and alerts based on your tagging. And I think that we shall limit the use of the Chart widget to where it is appropriate. I will also talk about data visualization in later articles.
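As a hedged sketch of what attaching such key-value data to a metric can look like, the snippet below only builds a JSON document and sends nothing. The payload shape (a list holding a `metrics` array whose items carry `name`, `type`, `value`, `timestamp` and `attributes`) reflects my understanding of the New Relic Metric API and should be checked against the official documentation before use. The function name `build_metric_payload`, the metric name and the tag keys are made up for this example.

```python
import json
import time


def build_metric_payload(name, value, tags, timestamp_ms=None):
    """Build one gauge metric with key-value tags attached as attributes.

    Assumption: the field names below follow the New Relic Metric API
    payload format; verify them against the official documentation.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    return [{
        "metrics": [{
            "name": name,
            "type": "gauge",
            "value": value,
            "timestamp": timestamp_ms,
            "attributes": dict(tags),  # the tags travel with the metric
        }]
    }]


payload = build_metric_payload(
    "queue.depth",  # hypothetical metric name
    42,
    {"group": "messaging", "function": "backlog"},  # hypothetical tags
    timestamp_ms=1700000000000,
)
print(json.dumps(payload, indent=2))
```

Once the tags are stored with the metric, the same keys can be used to filter and group the data later, which is the whole point of tagging: the grouping is decided when the telemetry is produced, not reconstructed afterwards from a flat pile.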