Establishing causality is one of the most important tasks in monitoring and observability. Let's look at what causality is, why it matters, and how you can establish it by observing telemetry behavior.
Before we venture into the more advanced topic of how to establish causality, let's define some basic terms that we will be using later in this talk.
Causality is a true-or-false statement that one event influences another. Put differently, causality is a logical measure of the fact that event "A" influences event "B" to the point where we can state that "B" happens because of "A". In a way, causality is an abstraction with which we can explain why a complex system changes states, and why it ends up in a specific set of states at any given moment. Speaking in mathematical terms, the systems we observe behave like non-deterministic automata, where, by definition, an event can lead to one, many, or no transitions from the current state of the finite state machine we are observing.
What is the challenge in the relationship between observability and causality? On a daily basis, operations personnel use monitoring and observability instrumentation to control the systems they are responsible for. One of the goals of that control is to understand the behavior of the system at each and every moment. And this requires a deep understanding of the causality behind most, and ideally every, state of the system. So imagine that you are trying to decipher a complex finite state machine while knowing only a limited set of its states and transitions. By studying the behavior of the available telemetry data, you are trying to fill the gaps: to comprehend not only which state your system is currently in, but also how it got there. For large and complex systems, this is a very hard task.
As I mentioned above, as far as control goes, the systems you are controlling are nothing but large and complex finite state machines. Mathematically speaking, we can define a finite state machine as an abstract machine that can be in one of a finite set of states at any given time. A state of the finite state machine can be defined as the set of conditions describing the machine at a specific moment in time. Moving between states is called a transition. And without an understanding of the causality, which includes the original state, the transition, and the current state, it is very difficult to control the system. Unfortunately, many monitoring practitioners opt for simpler threshold monitoring, where they let the observability platform detect when some threshold is crossed and define a reaction to that threshold. While this approach solves some simple practical tasks, it does not bring its adepts to where they truly want to be: to control, instead of reacting.

Let's look at an example. One day for our SRE starts with a message that the main database is down. Without any causality context, he finds that in addition to the database-down condition, he has several others. Trying to bring a quick resolution, he tries to restart the database. He tries and fails. Next, he finds that the free disk space on the database server is zero. Relieved, he ssh-es into the DB server, finds a lot of temp files on that partition, clears them up, restarts the database, and now: success. He closes the ticket. In 15 minutes, the database is down again, and free space is, again, zero.
To cut a long story short, he later found that the temp files were legitimate, produced by a data ingestion procedure that took its source data from another server, where that data was collected from a third-party provider. Due to a bug in an Ansible recipe, the permissions on the folder where the source files were stored were set incorrectly, so processed files were not removed by the ingestion procedure. Ingestion re-imported the old data alongside the new, creating too many temp files and crashing the database. And what was the causality? A change in the Ansible recipe, wrong permissions, excessive files in the ingestion IN queue, an excessive number of temp files in ingestion, leading to overuse of disk space, which caused the database crash. So, if you do not let your observability platform help you establish the relationship between the current state and previous states, you are solving puzzles like this every time as if they were new. Which leads to more downtime, decreased SLA, and all kinds of losses. Establishing causality is vital.
There are a number of ways to establish causality. I will review two of them: one could be called "User defined", the other "Searching for causality through pattern observation". But you should not limit yourself in finding the best way to detect the root cause for your particular case. Also, bear in mind that sometimes the best way is a combination of ways.
IT has been using expert systems for quite a while. The idea behind an expert system is that knowledge is represented in the form of "if-then" clauses called rules, and "assign" clauses called facts.
(fact MyTemperatureCelsius 38.2)
(rule (> MyTemperatureCelsius 36.6) => (print "You have a fever" MyTemperatureCelsius))
The combination of statically defined rules and facts is called a "Knowledge base", and in many cases the statically defined "Knowledge base" is prepared by a human being who is an expert in that specific field. The other part of the expert system is called the "Inference Engine". The idea behind this piece of software is to produce new facts and rules by applying the "Inference Engine" rules to the rules and facts of the "Knowledge base". The combination of the "Knowledge base" and the "Inference Engine" gives you an expert-induced, extendable system that you can use to detect causality in your systems.
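To make the division of labor concrete, here is a minimal sketch of these two parts in Python. The fact names, the rules, and the fever example are illustrative assumptions carried over from the pseudocode above; a real inference engine would be far more capable.

```python
# Minimal expert-system sketch: facts are name -> value assignments,
# rules are functions that may assert new facts, and the inference
# engine keeps firing rules until no new fact appears.

facts = {"temperature_celsius": 38.2}

def fever_rule(facts):
    # "if-then" clause: above-normal temperature asserts a new fact.
    if facts.get("temperature_celsius", 0) > 36.6 and "fever" not in facts:
        facts["fever"] = True
        return True
    return False

def alert_rule(facts):
    # A rule that builds on a fact produced by another rule.
    if facts.get("fever") and "alert" not in facts:
        facts["alert"] = "You have a fever: %.1f" % facts["temperature_celsius"]
        return True
    return False

def infer(facts, rules):
    """Naive forward-chaining inference engine: apply rules repeatedly
    until no rule produces a new fact (a fixed point is reached)."""
    changed = True
    while changed:
        changed = any(rule(facts) for rule in rules)
    return facts

infer(facts, [fever_rule, alert_rule])
print(facts["alert"])  # → You have a fever: 38.2
```

Note how `alert_rule` fires only because `fever_rule` produced a new fact first; that chaining of derived facts is exactly what the "Inference Engine" contributes on top of the static "Knowledge base".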
What are the benefits and disadvantages of expert systems? First, they are only as good as their Knowledge Base. If the expert who made the rules and facts for you falls short of being a real expert, then the results produced by the system will be internally flawed. But if the Knowledge Base is good, you will get good, predictable results, bringing you well-detected causality. Second, the cost of maintaining an expert system is high. You have to constantly maintain the freshness and accuracy of existing rules and put serious effort into bringing up new rules. Otherwise, an expert system is a great tool.
Nothing in complex systems happens "in a vacuum". Spikes in cluster CPU utilization may be tied to load spikes; abnormalities in network telemetry could result from security-related events that you can detect through patterns in related telemetry. But how can we bring together patterns in all observed telemetry and use the detected patterns to establish causality?
The first step is to create a Telemetry Observation Matrix. This matrix is different from the Telemetry Matrix discussed in other chapters. While the Telemetry Matrix is a 2-dimensional structure, where the columns are the sources and the rows are the telemetry types, so that the matrix represents a momentary state of the telemetry produced by the system, the Telemetry Observation Matrix is a 3-dimensional matrix, where the x-axis (or columns) holds the sources, the y-axis (or rows) holds the telemetry types, and the z-axis is a vector of telemetry data samples. The size of the x-axis is equal to the number of sources that you have, the size of the y-axis is equal to the number of telemetry types that your system has, and the size of the z-axis is equal to the size of the "Observability Horizon".
The Observability Horizon is the number of telemetry data samples that you will use to observe and detect patterns in the data.
The more data samples you have for review, the further back in time you can study behavior. But more data does not necessarily guarantee that you will reliably detect a pattern. So choosing the "Observability Horizon" is an empirical, sometimes try-and-try-again process. I recommend that you set the "Observability Horizon" as large as reasonably possible within your computing capabilities, and then subsample smaller "Observability Horizons" from the larger dataset.
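A Telemetry Observation Matrix can be sketched directly as a 3-dimensional NumPy array. All sizes and the `push_sample` helper below are illustrative assumptions, not part of any particular platform:

```python
# Telemetry Observation Matrix as a 3-D array:
# axis 0 = sources, axis 1 = telemetry types,
# axis 2 = the Observability Horizon (oldest to newest sample).
import numpy as np

n_sources = 4          # e.g. four hosts (assumed)
n_telemetry_types = 3  # e.g. cpu, memory, disk utilization (assumed)
horizon = 60           # Observability Horizon: 60 samples per series

tom = np.zeros((n_sources, n_telemetry_types, horizon))

def push_sample(tom, source, telemetry_type, value):
    """Shift one series a step back in time and store the newest sample last."""
    series = np.roll(tom[source, telemetry_type], -1)
    series[-1] = value
    tom[source, telemetry_type] = series

push_sample(tom, 0, 0, 0.73)   # newest CPU sample for source 0
print(tom[0, 0, -1])           # → 0.73
```

With this layout, `tom[s, t]` gives you the full Observability Horizon vector for one source and one telemetry type, which is exactly the unit we will feed into the pattern search later.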
Now, in the "Telemetry Observation Matrix" we have not only a very extensive set of telemetry data associated with various sources, but also the historical values of this data placed on a timeline. We have all the data we need for a pattern search. And for the pattern search, we are going to use a very simple, forward-propagating neural network. The feed-forward network is one of the simplest neural networks: a network where the data travels in one direction and the connections between nodes do not form cycles, so, unlike in recurrent networks, data never loops back through the graph. The variant we will use is a perceptron with a single hidden layer, where the data received on the input layer of nodes is carried to the output layer through one hidden layer. The number of "neurons" in the input layer must match the number of data items in the Observability Horizon. So, if you are collecting telemetry data for a large Observability Horizon and planning to use segments of that data for pattern search, you have to prepare a perceptron whose configuration matches the size of your data.
Next, you have to train your perceptron. What is perceptron training? You feed the perceptron samples of data of the same arity as your Observability Horizon and specify how close each sample is to one of the patterns you are looking for. For example, let's say that we are looking for three types of patterns in the data: upswing, where each next element value in the timeline is greater than the previous one; downswing, where each next element value is smaller; and stable, where the element value stays about the same. In our training data, we will be using normalized data; I will explain the purpose of that later on. Now, let's look at a sample of training data:
[0.0, 0.0, 0.0, 0.0, 0.0] [0.0, 0.0, 1.0]
[0.0, 0.2, 0.4, 0.6, 0.8] [1.0, 0.0, 0.0]
[0.8, 0.6, 0.4, 0.2, 0.0] [0.0, 1.0, 0.0]
What can we see when looking at this training data? On the left side, there is an array with data samples of length (or arity) 5. We can conclude that five is the size of the Observability Horizon in this sample. On the right side, we see a pattern classifier, indicating how close this sample is to each pattern category. Three numbers tell us that there are three categories: the first is "upswing", the second "downswing", and the third "stable". Of course, just three samples are not enough to train our perceptron, so please prepare at least a few dozen normalized data samples for each category. And for each sample of the training data, you have to specify how close this sample is to the category that you are looking for. One thing before we continue: your sample can match more than one category.
[0.4, 0.4, 0.4, 0.5, 0.5] [0.5, 0.0, 0.5]
In this training sample, it is difficult to say whether it represents a slight deviation of the "stable" category or some kind of "upswing". So we can mark it as a sample belonging to two different categories with lower "certainty".
After you train your perceptron with the prepared training dataset for each category, your artificial neural network will be ready to detect patterns in your data.
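The training loop above can be sketched end to end in plain NumPy. This is a minimal illustration, not a production implementation: the synthetic training data, layer sizes, and gradient-descent settings are all assumptions, and a real system would use an established ML library instead.

```python
# A feed-forward network with one hidden layer, trained on synthetic
# upswing / downswing / stable windows of arity 5 (the Observability
# Horizon). All hyperparameters here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(42)

def make_dataset():
    """Generate normalized 5-sample windows with [up, down, stable] labels."""
    X, Y = [], []
    for _ in range(40):
        start, step = rng.uniform(0.0, 0.3), rng.uniform(0.1, 0.15)
        up = start + step * np.arange(5)
        X.append(up);       Y.append([1.0, 0.0, 0.0])  # upswing
        X.append(up[::-1]); Y.append([0.0, 1.0, 0.0])  # downswing
        flat = np.full(5, rng.uniform(0.0, 1.0)) + rng.normal(0, 0.01, 5)
        X.append(flat);     Y.append([0.0, 0.0, 1.0])  # stable
    return np.array(X), np.array(Y)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input arity 5, 8 hidden neurons (assumed), 3 output categories.
W1 = rng.normal(0, 0.5, (5, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 3)); b2 = np.zeros(3)

def forward(X):
    H = sigmoid(X @ W1 + b1)        # hidden layer activations
    return H, sigmoid(H @ W2 + b2)  # category proximity scores

X, Y = make_dataset()
for _ in range(5000):               # batch gradient descent on squared error
    H, out = forward(X)
    g2 = (out - Y) * out * (1 - out)
    g1 = (g2 @ W2.T) * H * (1 - H)
    W2 -= H.T @ g2 / len(X); b2 -= g2.mean(axis=0)
    W1 -= X.T @ g1 / len(X); b1 -= g1.mean(axis=0)

_, scores = forward(np.array([[0.0, 0.2, 0.4, 0.6, 0.8]]))
print(scores.round(2))  # the first ("upswing") score should dominate
```

Note that the network itself is feed-forward (no cycles in the graph); gradients flowing backwards is just the training procedure, not data flowing backwards at inference time.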
So, once you have a Telemetry Observation Matrix with telemetry data samples, and a prepared and trained perceptron, you are ready to feed your data to the perceptron to see which categories it detects, right? Not yet. First, let me remind you that you trained your perceptron with normalized data. We will discuss what that means in a second. But first, let's talk about data smoothing. Data smoothing is a very important step, which takes a vector of numbers as input and produces a new vector of the same size but with smoothed values. This procedure allows us to reduce the variability of the telemetry values, which in turn helps to determine the true pattern in the data. But what is variability reduction? The idea is very simple: using a Smoothed Moving Average (SMMA), you calculate a running arithmetic mean for each element of your original data vector, the one containing the telemetry from the Observability Horizon. SMMA is an extension of the Simple Moving Average, as it uses a sliding window in which the mean is calculated. One of the key benefits of SMMA is that the algorithm is very effective at reducing noise in the data. So, reducing data variability with the SMMA algorithm helps to eliminate values from your Observability Horizon that do not represent your pattern.
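Here is a short sketch of SMMA, using the common recurrence where the previous smoothed value carries most of the weight. The window size `n` is an assumption; tune it to your data.

```python
# Smoothed Moving Average (SMMA): each output value is a running mean
# in which the previous smoothed value has weight (n - 1) / n, so short
# spikes are damped while the output keeps the input's length.
def smma(values, n=5):
    out = []
    prev = values[0]                    # seed with the first raw value
    for v in values:
        prev = (prev * (n - 1) + v) / n # recurrence: mostly history, a bit of now
        out.append(prev)
    return out                          # same size as the input vector

noisy = [0.0, 0.0, 10.0, 0.0, 0.0, 0.0]  # a single spike
print(smma(noisy))  # → [0.0, 0.0, 2.0, 1.6, 1.28, 1.024]
```

Notice how the 10.0 spike is flattened to 2.0 and then decays: exactly the "elimination of values that do not represent your pattern" described above.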
The next step, before you feed a sample of telemetry data to the perceptron, is called "Data Normalization". Min-Max data normalization is the process of producing a new vector whose values are derived from the original vector using Min-Max feature scaling. The practical outcome of the scaling is that all data from the sample fits between 0 and 1.
Why scale the values? When you want to see the shape or pattern of a sample, or if you want to compare or match different samples with different scales, you have to normalize the data to make it suitable for pattern matching.
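Min-Max feature scaling is one formula: subtract the minimum, divide by the range. A small sketch (the flat-series fallback to zeros is my assumption; any constant in [0, 1] would do):

```python
def min_max_normalize(values):
    """Min-Max feature scaling: map every value into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                # perfectly flat series: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Requests-per-second and CPU fractions land on the same 0..1 shape:
print(min_max_normalize([200, 400, 600, 800, 1000]))  # → [0.0, 0.25, 0.5, 0.75, 1.0]
```

After this step, an upswing in requests per second and an upswing in CPU utilization look identical to the perceptron, which is precisely why normalization makes cross-telemetry pattern matching possible.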
And now we have an array, or vector, of values derived from the original Observability Horizon, but smoothed and normalized. At this point we also have a perceptron trained to recognize pattern categories. Everything is ready to produce the Telemetry Pattern Matrix.
The Telemetry Pattern Matrix is a 2-dimensional matrix where, on the x-axis (the columns), we arrange the sources and, on the y-axis (the rows), we store the telemetry types. Each data element of the Telemetry Pattern Matrix is a tuple, where each element indicates the proximity of the telemetry sample to the category with the same index in the tuple as in the perceptron training data.
The Telemetry Pattern Matrix is produced from the Telemetry Observation Matrix by feeding the telemetry data sample from each Observability Horizon vector to the trained perceptron. The outcome, a tuple with proximity information to the known patterns, is stored in each "cell" defined by the source and telemetry type.
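That transformation is a single pass over the Observation Matrix. In the sketch below, `classify` is a hypothetical placeholder standing in for the trained perceptron (plus the smoothing and normalization steps); it uses a crude slope heuristic only so the example is self-contained.

```python
# Build a Telemetry Pattern Matrix from a Telemetry Observation Matrix.
import numpy as np

def classify(window):
    # Placeholder for the trained perceptron: score the overall slope of
    # an already-normalized window into (upswing, downswing, stable).
    slope = window[-1] - window[0]
    if slope > 0.1:
        return (1.0, 0.0, 0.0)
    if slope < -0.1:
        return (0.0, 1.0, 0.0)
    return (0.0, 0.0, 1.0)

def pattern_matrix(tom):
    """tom: sources x telemetry_types x horizon array of normalized data;
    returns one proximity tuple per (source, telemetry type) cell."""
    n_src, n_types, _ = tom.shape
    return [[classify(tom[s, t]) for t in range(n_types)]
            for s in range(n_src)]

tom = np.zeros((2, 2, 5))
tom[0, 0] = [0.0, 0.25, 0.5, 0.75, 1.0]   # upswing on source 0, type 0
tpm = pattern_matrix(tom)
print(tpm[0][0])  # → (1.0, 0.0, 0.0)
```

To use the real perceptron instead, replace the body of `classify` with smoothing, normalization, and a forward pass through the trained network.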
And we can use the Telemetry Pattern Matrix to help us establish causality in our system, as we now have full behavioral data for each telemetry item, based on the chosen Observability Horizon. With the help of the Telemetry Pattern Matrix, we can match different types of patterns against each other to see which behaviors happen at the same time.
The next step is to match known telemetry behaviors. We can programmatically mix and match different types of telemetry behavior, trying to match behaviors to groups of telemetry sources and telemetry data. To help us visualize causality, we can build different types of heat maps using different pattern matches to detect whether we have clusters of behaviors that happen at the same time.
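One simple way to look for such clusters is to count, per pattern category, how many (source, telemetry type) cells the Telemetry Pattern Matrix assigns to it at the same moment: the resulting table is what you would render as a heat map. The toy matrix and category names below are assumptions for illustration.

```python
# Count co-occurring behaviors across a toy Telemetry Pattern Matrix:
# 2 sources x 2 telemetry types, each cell a proximity tuple
# ordered (upswing, downswing, stable) as in the training data.
categories = ["upswing", "downswing", "stable"]

tpm = [
    [(0.9, 0.0, 0.1), (0.8, 0.1, 0.1)],   # source 0: e.g. cpu, network
    [(0.1, 0.0, 0.9), (0.7, 0.2, 0.1)],   # source 1: e.g. cpu, network
]

def dominant(cell):
    """Index of the category this cell is closest to."""
    return max(range(len(cell)), key=lambda i: cell[i])

counts = {name: 0 for name in categories}
for row in tpm:
    for cell in row:
        counts[categories[dominant(cell)]] += 1

print(counts)  # three simultaneous "upswing" cells hint at a behavior cluster
```

Several telemetry streams dominated by the same pattern in the same Observability Horizon is exactly the kind of co-occurrence that, investigated further, turns into a causality hypothesis.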