IT, network, and security infrastructures and applications are sources of a huge amount of telemetry data. Even the number of unique telemetry item types is so large that looking at a single individual telemetry item is often useless, especially when you are dealing with huge numbers of similar items. For example, when you are observing telemetry data coming from a cluster of 50,000 nodes, you are not likely to drill down to the observability metrics of an individual node. Not at first, and likely not at the second or third step either.
When I bring up an example like this one, I often tell the story of the operator at the Chernobyl Nuclear Power Plant at the time of the catastrophe in 1986. Setting all other factors aside, let's focus on the fact that this engineer was expected to watch more than 2,000 real-time telemetry items generated by the nuclear reactor. This task is very close to "hugging infinity": the infrastructure generates more telemetry than you can comprehend and process at a single moment in time. That engineer was swimming in an ocean of telemetry data. On the one hand, he did not have access to some critical telemetry data; on the other hand, he had too much other data that was not critical to the decision he had to make. That was one of the factors leading to the disaster.
This example teaches us that when you have way too many telemetry items, you have to artificially decrease the number, scope, and granularity of the observed telemetry. One of the methods of bringing the number of observed telemetry types down is data aggregation.
Data aggregation is a method of deriving new telemetry data by performing mathematical and statistical computations over sets of raw, calculated, or even already aggregated telemetry data. But wait, didn't I say that you have to decrease the number of observed telemetry items? Of course I did. The point of aggregation is that you take blocks of original telemetry, generate a new telemetry item from them, and use that aggregated telemetry item in lieu of the original data.
For example, what would be easier (and more meaningful): observing 50,000 CPU utilization metrics from your application cluster, or calculating the average CPU utilization across all nodes of your cluster and observing that one metric?
As has been said, aggregation is the process of producing new telemetry data by applying some computation to other telemetry data, so the first question is: "Which data are we going to choose?" To answer this question, we will introduce the "Telemetry Matrix". But what is the "Telemetry Matrix"?
Imagine that you organize your telemetry data, associated with a certain point on your timeline, as a two-dimensional matrix, i.e. a data representation organized in rows and columns. Like this:
The sources of the telemetry will be in columns, and the types of telemetry data will be in rows. So, essentially, we will have a table where all telemetry items belonging to the same source are in the same column, and each specific telemetry type is in its own row, arranged according to the telemetry source. Just like this:
In this example, the columns hold the telemetry data for the hosts Host0 through Host3. The rows hold specific telemetry items, each obtained from a specific host and arranged according to the order of the hosts in the columns. As we can see, the x with indexes 0 and 2 represents the free disk space for disk number 2, obtained from host number 0.
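To make the layout concrete, here is a minimal sketch of a Telemetry Matrix snapshot in Python. The telemetry type names, the values, and the `item` helper are all made up for illustration; the only things taken from the text are that hosts are columns, telemetry types are rows, and that x with indexes 0 and 2 is the free space on disk 2 of host 0. Reading the first index as the host and the second as the row is my assumption.

```python
# Hypothetical hosts and telemetry types; names and values are illustrative only.
HOSTS = ["Host0", "Host1", "Host2", "Host3"]               # columns
TYPES = ["cpu_util_pct", "mem_free_gb", "disk2_free_gb"]   # rows

# One snapshot of the matrix for a single point on the timeline.
# matrix[row][column] is the value of telemetry type `row` on host `column`.
matrix = [
    [35.0, 40.0, 38.0, 90.0],      # cpu_util_pct
    [12.0, 10.5, 11.0, 2.0],       # mem_free_gb
    [120.0, 80.0, 200.0, 150.0],   # disk2_free_gb
]


def item(host_index: int, type_index: int) -> float:
    """Return x(host, type): the value of one telemetry type on one host."""
    return matrix[type_index][host_index]


# x with indexes 0 and 2: free space on disk 2, obtained from host 0.
x_0_2 = item(0, 2)
print(f"{TYPES[2]} on {HOSTS[0]}: {x_0_2}")
```

A plain list of lists is enough for a sketch; a real pipeline would more likely keep one such snapshot per timestamp in a time-series store.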
Before we perform a calculation over the data in the Telemetry Matrix, we have to choose an "aggregation orientation". What is that? There are two types of aggregation orientation: "horizontal" and "vertical".
Horizontal aggregation (or, as it is properly called, "row aggregation") is a calculation performed over a row of data in the Telemetry Matrix. As we can clearly deduce, row aggregation aggregates the same type of telemetry across multiple hosts.
Vertical aggregation, or "column aggregation", is an aggregation calculation performed over a column of telemetry data in the Telemetry Matrix. This type of calculation is performed over different types of telemetry data obtained from the same source.
Data samples used in either row or column aggregations do not necessarily have to be neighbors in the Telemetry Matrix.
Before we move on to the aggregation functions, let's talk a little about the use cases of row and column aggregations. If you are looking to aggregate the same telemetry type across multiple hosts, then your choice is row aggregation. If you are looking to aggregate different types of telemetry within the same host, then go with column aggregation.
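The two orientations can be sketched as a pair of small Python helpers. This is only an illustration under my own assumptions: the function names `row_aggregate` and `column_aggregate`, the sample matrix, and its values are hypothetical. The optional index lists show that the selected samples do not have to be neighbors in the matrix.

```python
from typing import Callable, Optional, Sequence

Matrix = Sequence[Sequence[float]]
Aggregator = Callable[[Sequence[float]], float]


def row_aggregate(matrix: Matrix, row: int, func: Aggregator,
                  columns: Optional[Sequence[int]] = None) -> float:
    """Aggregate one telemetry type (a row) across all or selected hosts."""
    picked = range(len(matrix[row])) if columns is None else columns
    return func([matrix[row][c] for c in picked])


def column_aggregate(matrix: Matrix, column: int, func: Aggregator,
                     rows: Optional[Sequence[int]] = None) -> float:
    """Aggregate all or selected telemetry types (rows) of one host (a column)."""
    picked = range(len(matrix)) if rows is None else rows
    return func([matrix[r][column] for r in picked])


# Hypothetical snapshot: rows are telemetry types, columns are Host0..Host3.
matrix = [
    [35.0, 40.0, 38.0, 90.0],      # row 0: CPU utilization, %
    [100.0, 80.0, 200.0, 150.0],   # row 1: free space on disk 1, GB
    [120.0, 60.0, 90.0, 30.0],     # row 2: free space on disk 2, GB
]

# Row aggregation: the same telemetry type across all hosts.
peak_cpu = row_aggregate(matrix, 0, max)

# Column aggregation: different telemetry items of one host (Host0).
free_disk_host0 = column_aggregate(matrix, 0, sum, rows=[1, 2])

# The sample does not have to be contiguous: only Host0 and Host2 here.
peak_cpu_subset = row_aggregate(matrix, 0, max, columns=[0, 2])

print(peak_cpu, free_disk_host0, peak_cpu_subset)
```

Passing the aggregation function as an argument keeps the choice of orientation separate from the choice of computation, which mirrors the order in which the article introduces them.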
There is no "hard and fast" list of functions that you can or shall use for telemetry aggregation. Different aggregation functions produce different new telemetry data; which data you want to create, and how you plan to use it, depends on your particular case and needs. But let me review a few of the aggregation functions I use the most.
The sum is one of the most basic aggregation functions.
This calculation gives you the mathematical sum of all elements of your selected data sample. What are the use cases for sum aggregation? Every time you calculate a capacity of some sort, you cannot do it without a sum. For example, the capacity of a database partition consists of the sum of the capacities of the individual drives. Another example is the capacity of a trunked network circuit, which is calculated as the sum of the capacities of all network circuits combined in a single trunk. While capacities are the most common use case for sum aggregation, there are other applications for this type of calculation. Total throughputs and total sizes of something (partition size, log or data file sizes) are also common use cases.
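As a small illustration of sum aggregation, the following Python sketch adds up capacities for the two examples just mentioned. The capacity values and variable names are invented for the example; `math.fsum` is used only because it is a standard-library sum that reduces floating-point rounding error.

```python
import math

# Hypothetical capacities of the circuits combined into one trunk, in Gbit/s.
circuit_capacity_gbps = [10.0, 10.0, 25.0, 25.0]

# Hypothetical capacities of the drives backing one database partition, in TB.
drive_capacity_tb = [4.0, 4.0, 8.0]

# Sum aggregation: the aggregated item replaces the individual values.
trunk_capacity_gbps = math.fsum(circuit_capacity_gbps)
partition_capacity_tb = math.fsum(drive_capacity_tb)

print(f"trunk capacity: {trunk_capacity_gbps} Gbit/s")
print(f"partition capacity: {partition_capacity_tb} TB")
```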
Calculating the average is probably the second most popular aggregation function. While there are quite a few ways to calculate an average, and different types of average calculations give different results, I will not go into those specifics; that will be a topic for another article. I will focus on the most popular average computation: the arithmetic mean. The idea is very simple: first you compute the sum of all elements of your data sample, then you divide that sum by the number of elements in the sample. The outcome of this division is the arithmetic mean.
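The two steps described above (sum the sample, then divide by the number of elements) can be sketched in Python as follows. The sample values and the `mean` helper are hypothetical; `statistics.fmean` from the standard library is shown only as a cross-check of the hand-written version.

```python
from statistics import fmean
from typing import Sequence

# Hypothetical CPU utilization, in percent, for five nodes of a cluster.
cpu_util_pct = [35.0, 40.0, 38.0, 90.0, 42.0]


def mean(sample: Sequence[float]) -> float:
    """Arithmetic mean: the sum of the elements divided by their count."""
    if not sample:
        raise ValueError("cannot average an empty sample")
    return sum(sample) / len(sample)


cluster_cpu_pct = mean(cpu_util_pct)

# The standard library gives the same result for this sample.
assert abs(cluster_cpu_pct - fmean(cpu_util_pct)) < 1e-9

print(f"average CPU utilization: {cluster_cpu_pct}%")
```

Note that the one node at 90% disappears inside the average, which is exactly the "hiding" effect of this function and the reason it is worth pairing with a spread measure.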
The use cases for average calculations are numerous. Where the sum calculates capacity, one of the major uses of the average is to calculate average usage across the telemetry data samples organized in the Telemetry Matrix. For example, when you are looking for the average CPU utilization of all nodes in your computing cluster, this calls for an average calculation over a row aggregation in the Telemetry Matrix. Another example would be estimating the average utilization of some resource on a host, like the average free disk space across data partitions. This calls for an average computation over a column aggregation. These two examples give away the idea that the average is commonly used as a way to "hide" the values of individual telemetry items by producing a single calculated value that represents the average across a group. So, if you are to observe groups of similar telemetry items, calculating an average of their values is a smart thing to do.
The concept of the Minmax delta is very simple. It is the mathematical difference between the maximum and minimum elements in a dataset.
What types of use cases can be attributed to the Minmax delta? The most important one is uniformity. Every time you want to ensure that the minimum and maximum in a sequence of telemetry data are close to each other, so that the telemetry across the sample has no clearly identifiable spikes or dips, a Minmax delta over a row or column aggregation will give you that type of assessment. Where is this important? Every time you observe a cluster of some entities (applications, network devices, EC2 instances, Kubernetes pods, you get the idea), you want to be sure that your cluster is loaded more or less evenly: no spikes on some nodes, which can create capacity or usage problems, and no dips, which usually indicate either underload or other underlying problems. So, every time your Minmax delta rises, you have less uniformity across your entities.
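Here is a minimal Python sketch of the Minmax delta used as a uniformity check. The `minmax_delta` helper name and both sample clusters are hypothetical values chosen for illustration.

```python
from typing import Sequence


def minmax_delta(sample: Sequence[float]) -> float:
    """Difference between the maximum and the minimum element of a sample."""
    if not sample:
        raise ValueError("cannot compute a delta of an empty sample")
    return max(sample) - min(sample)


# Hypothetical CPU utilization, in percent, across the nodes of two clusters.
evenly_loaded = [41.0, 43.0, 40.0, 44.0]
one_hot_node = [41.0, 43.0, 40.0, 95.0]

even_delta = minmax_delta(evenly_loaded)   # small delta: uniform load
hot_delta = minmax_delta(one_hot_node)     # large delta: a spike on one node

print(f"evenly loaded cluster delta: {even_delta}")
print(f"cluster with a hot node delta: {hot_delta}")
```

The delta tells you that the load is uneven but not which node is responsible, so a rising value is a signal to drill down rather than an answer in itself.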
This is not the full list of aggregation functions, not by a long shot. I did not cover variability, the different types of mean calculations, and other useful mathematical and statistical computations over data in the Telemetry Matrix. But a review of all those functions and their applicable uses would take us way outside the topic discussed in this article: aggregation.
At this point, I think you have a more or less firm understanding of what telemetry aggregation is and why it is important to use it whenever applicable. I will boldly state that non-aggregated data is almost useless and will not help you determine the root cause of a problem or the state of your infrastructure. Remember: the more analysis and computation you perform in your head while working with monitoring and observability data, the less effective you are. So give yourself a hand, and do not end up in the same situation as the operator at the Chernobyl Nuclear Power Plant, who at the moment of crisis did everything by the book but was presented with too much data to properly assess reality. Aggregate your metrics.