This article is not a tutorial, but rather a set of philosophical reflections on a question that many professionals involved in creating or using monitoring systems are asking: "Why are we doing what we are doing, and how are we doing it?"
What I will try to say in this blog post may sound strange or incomplete. It is simply the way I see this problem. I do not know the ultimate answer to "what is monitoring, and what is the proper way to do it". I doubt any single human being knows an answer to this question that crosses all the t's and dots all the i's. This post is rather a silly attempt to bring some clarity for myself, and if some of my thoughts are useful to you as well, I will be glad that I spent the time to bring all my thoughts on the matter together. Of course, the side effect of this article, as I see it, is to raise more questions and dive deeper into this very interesting problem.
And now, without any further ado ...
Every time IT monitoring and observability come up, many IT professionals make the same mistake. They give the impression that monitoring and observability are something that:
- Was created recently, as a part of the IT revolution.
- Exists separately from other forms of monitoring (industrial and others).
- Is unique in its approaches and owes nothing to the experience gained by the engineers of the past.
Need I say that none of those statements is true? Monitoring has been a part of human activity for centuries. Whenever there is some process, someone is usually observing that process, making sure that it goes through. And IT monitoring and observability, when it emerged as an IT topic, was not initially treated differently from any other form of monitoring and observability. And it shouldn't be today, because there is nothing new in the world in how a human being connects with the surroundings. We can create better tools, but we cannot (yet) change either, for example, the mechanics of the eye or the way the human brain processes input data about those surroundings. With this idea in mind, let us try to answer one simple question: why do we monitor at all?
There are many answers to that question in the IT crowd, but let's think for a second:
- Why, historically, did people observe and watch the fire?
- Why did captains watch the weather, tides, and directions?
- Why did train engineers watch the boiler?
- Why do pilots watch the airplane controls?
This is a short list of examples, but it will do to make the point. The purpose of all those activities is to keep control over some process: the fire, the movement of the ship, the safety of the engine, the control of the airplane. So every time we think of "monitoring" or "observability", we need to think "this is all for the control of some process". Not for personal curiosity. Not to satisfy some external requirement without questioning "why". Not just to establish some fact without putting that fact into a proper context. So the first and foremost task of any "monitoring" and "observability" is "Control". Everything else is secondary. Even if we are involved in the monitoring and instrumentation of a scientific experiment, the primary task is to keep the experiment under control to the maximum extent, and only then to gain scientific data. So, having answered the first and probably most important question, we must ask ourselves the next one: how do we do it?
At this point there will be no shortage of answers. We will hear about how we collect and compute the data, about thresholds and aggregations, about statistical computation and visualization. But let us step back and think for a moment: "What matters for control? How can we be sure that a process is controllable? How shall we organize our observability so that each element in this effort matters?"
And again, I am stepping back from proposing multiple-choice solutions and will try to dig to the root of the whole idea of "observability". What are we observing while seeking control over something? Whenever we build a fire, we watch that the fire does not die or get out of control. The whole purpose of seamanship is to deliver people and cargo across the water, and while doing that, you watch for the unfavorable conditions that could prevent you from completing the delivery. Whenever we control some process or mechanism, we are in fact making sure that what we control does the job it was intended to do, by removing the obstacles. So we can say that the method of control is detecting and preventing the obstacles that stand in the way of a process and may block it from fulfilling its purpose. So, in order to keep something under control, we have to detect not the problems (if the train boiler has blown to smithereens, we can safely say it is beyond any control), but the traits that lead to the problems. But what are those traits? What are the indicators that a problem is about to happen? In a dynamic environment, which is characteristic of any process we are looking to gain control over, those traits are a collection of dynamic patterns that you can observe through various methods.
So, if we look at the bottom of the problem, we have to say that the method of monitoring and observability is the constant search for and detection of sets of patterns in the data. For the user who is trying to keep some process under control, an observability platform helps by catching the patterns that may lead to problems, as well as the patterns that lead to the restoration of normality.
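To make the difference between a momentary check and a pattern more tangible, here is a minimal sketch, not a recipe. It contrasts a static threshold with one very simple "dynamic pattern": a sustained upward slope over the last few samples. The function names, the boiler pressure numbers, the window size and the slope limit are all my assumptions, made up for illustration.

```python
from statistics import mean


def slope(values):
    """Least-squares slope of the values against their index (units per sample)."""
    n = len(values)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def threshold_alert(values, limit):
    """Momentary check: fires only once the latest sample reaches the limit."""
    return values[-1] >= limit


def rising_pattern(values, window, min_slope):
    """Pattern check: fires when the last `window` samples climb steadily."""
    if len(values) < window:
        return False
    return slope(values[-window:]) >= min_slope


# Hypothetical boiler pressure readings (illustrative numbers only).
boiler_pressure = [10.0, 10.1, 10.0, 10.4, 10.9, 11.5, 12.2, 13.0]

print(threshold_alert(boiler_pressure, limit=15.0))              # False
print(rising_pattern(boiler_pressure, window=5, min_slope=0.5))  # True
```

In this toy data the threshold stays silent, because the limit has not been reached yet, while the slope check already reports the trait that leads to the problem. A real platform would of course look for much richer patterns than a straight line.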
So, while patterns are the primary tool for controlling a process, having just patterns is not enough. But why? What's wrong with just the patterns? By detecting potential problems through the search for known (or barely known) patterns in telemetry, you remove a great burden from the observer, but neither the observer nor we know everything. So, as a second line of observation, we may establish the relationships between a known pattern and the other patterns and events that happen in the system. Because "all good" things, like "all bad" things, are related to each other, and one pattern usually leads to another. Otherwise, you may miss something that you do not know how to detect.
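One possible way to look for such a relationship is to check whether the behavior of one telemetry item tends to follow the behavior of another with some delay. The sketch below uses a plain Pearson correlation over shifted series. This is only one of many techniques, chosen here for its simplicity, and the series names and numbers are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_lag(leader, follower, max_lag):
    """Return (lag, r): the delay in samples at which `follower` tracks `leader` best."""
    if len(leader) != len(follower) or len(leader) < 2:
        raise ValueError("series must have the same length, at least 2 samples")
    max_lag = min(max_lag, len(leader) - 2)  # keep at least 2 overlapping samples
    best = (0, pearson(leader, follower))
    for lag in range(1, max_lag + 1):
        r = pearson(leader[:-lag], follower[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Hypothetical telemetry: latency repeats the shape of queue depth two samples later.
queue_depth = [1, 1, 2, 4, 7, 9, 6, 3, 1, 1]
latency_ms = [10, 10, 10, 10, 20, 40, 70, 90, 60, 30]

lag, r = best_lag(queue_depth, latency_ms, max_lag=4)
print(lag, round(r, 3))
```

If a known pattern in one item is regularly followed by a change in another, the second item becomes a candidate for a new pattern worth watching, even though nobody knew how to detect it beforehand. Correlation alone does not prove that one causes the other, so treat the result as a hint for the observer, not as a verdict.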
Let me recap my thoughts and come to the conclusions:
- Monitoring and observability are for control. If the outcomes of your monitoring do not give you control over your process, you are not monitoring, you are just busy.
- The purpose of control is not to let an undesirable outcome progress beyond the point where you can no longer control and contain the situation. If your monitoring informs you that the bad thing has already happened, or worse, you learned about it from your users or the morning newspapers, you have neither monitoring nor observability.
- The way to detect problems is through the observation of telemetry patterns, in dynamics, that is, as they change over time.
- The way to detect normality is through the observation of telemetry patterns, in dynamics as well.
- The way to find new patterns is through observing the relationships between known patterns and the behavioral patterns of other telemetry items.
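The point about detecting normality can be sketched in the same spirit. Below, "normal" is assumed to be a band around the mean of a known-good history, and normality counts as restored only when several consecutive recent samples stay inside that band. The band width `k`, the window size and the sample values are illustrative assumptions, and a real system may need a far better model of what normal looks like.

```python
from statistics import mean, stdev


def baseline_band(history, k=3.0):
    """Band of mean +/- k standard deviations, learned from known-good samples."""
    m = mean(history)
    s = stdev(history)
    return (m - k * s, m + k * s)


def normality_restored(recent, band, window):
    """True when the last `window` samples all sit inside the baseline band."""
    low, high = band
    if len(recent) < window:
        return False
    return all(low <= v <= high for v in recent[-window:])


# Hypothetical response times: a known-good period, then recovery after an incident.
good_history = [100, 102, 98, 101, 99, 100]
after_incident = [130, 121, 110, 103, 101, 99]

band = baseline_band(good_history)
print(normality_restored(after_incident, band, window=3))  # True
print(normality_restored(after_incident, band, window=4))  # False
```

Requiring a run of samples, not a single one, is what makes this a pattern and not just another momentary check.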
And now, let me give some counter-examples of how we can get the core idea of monitoring wrong:
- We treat monitoring and observability as an instrument for secondary-level problems, such as inventory control, capacity management and planning, and the like.
- We are "detecting problems" instead of detecting the conditions that may lead to the problem.
- We are overly obsessed with golden signals and thresholds for those golden signals.
- We try to automate "momentary analysis" based on thresholds, and at the same time try to apply dynamic analysis without automation. This means that we measure against thresholds and create our alerts based on them, while at the same time staring at individual telemetry charts for mental pattern recognition.
- We are overly obsessed with problems and ignore the detection of normality.
- We do not look for relationships between the behavior of different telemetry items.
So, this is all I can say about this very intricate engineering problem. I am not claiming that I am 100% right, but I do claim that these thoughts and reflections are the result of years of observing monitoring and observability as an IT and industrial discipline, and how engineers perceive it.