The Relicans

A filter for data that you have not seen before

Vladimir Ulogov
30+ years of coding and taming of Unix daemons

What are we looking for?

In the business of observability, we try to understand the processes running on the target of our observation through telemetry data. Different types and classes of telemetry data tell different stories, and you want to pay attention to all of them to get a complete picture. But the first problem you will run into head-on is that it is almost impossible to predefine every variation and every pattern in the data, and therefore to recognize and understand every story your target is telling you. Sometimes you are not ready even to recognize a story hidden in the data: you do not know what a given pattern means, so you cannot see it and recognize it for what it is.

So, what can you do? What can you use to help recognize the patterns and the stories associated with them? Where do you start?

What is an anomaly?

You may already hear the whisper in the air: "search for anomalies in the data". And while this suggestion partially covers the issue, it is not a ready-to-use solution, because before you can treat it as one, you must understand what an "anomaly" is.

In essence, an anomaly in telemetry data is a combination of telemetry values that has been neither observed nor categorized before. If you have observed this pattern in the telemetry data before but did not categorize it, the combination is not an anomaly; it is what you can call an "uncategorized pattern". So, while observing telemetry data, you are dealing with one of two cases:

  • The data we are observing matches a pattern that we have observed before, but we have not assigned a category to this pattern.
  • We are truly observing this pattern for the first time. This is a new type of telemetry behavior, and we need to either categorize this pattern or push it to the "uncategorized" category.

In this article, we will talk about the importance of the second case: identifying "previously unobserved patterns".

Previously unobserved pattern

While I've touched on the importance of finding "unknown" data in a telemetry stream, I did not focus on the practical side of searching for such data. Why should we look for it?

In another article, I put some emphasis on the fact that you must develop a habit of looking for changes in the values of a telemetry data stream, and on the fact that there is no such thing as predefined "golden signals". But let's talk about patterns first. What do you do if you do not have any yet? How do you start to detect and define them? The answer is this: by observing "unknown" values in the telemetry stream. "Unknown" values define a previously unobserved pattern. Every time you observe a value that is "out of line", it might indicate a new pattern. Or sometimes it is just a glitch in the data; I will talk about detecting glitches in another article. But how can you detect new and unseen data when you have multiple sources and thousands of telemetry items? There are quite a few mathematical instruments you can use, and I will present a very simple one you can start with. I call it a "Coefficient-based filter". Maybe it already has a fancier name, but let's use mine in this article.

Filtering your data

Imagine you are prospecting for gold. You have a river, sand, and a pan. You put sand in the pan and start to "wash" it in flowing water. The lightweight sand is carried away by the stream, and the heavier gold collects at the bottom of the pan. That is exactly what we will do with our telemetry data: we will let the "sand telemetry" flow away and make the "gold telemetry" known to us.

So, we will build a filter in which telemetry values similar to ones we've seen before signal TRUE, meaning "yes, you've seen something like this before", and unseen data triggers FALSE, indicating that you potentially have something new. The idea behind such a filter is very simple. First, we take a sample of telemetry data and raise an alert if a value that is not in this set arrives. But such a filter would not be very useful, as numeric data does not always match 100%. So we need a filter that detects "that and similar values". Again, implementing this filter for a single value is trivial: take a coefficient of variation, apply it to the value, and get an interval with a start and an end. For a value v and a coefficient c, the interval runs from v * (1 - c) to v * (1 + c). If a new value is within this interval, greater than its start and less than its end, we consider the value "known". If the new value does not fall within the interval, it is "unseen data". The next step is to define such intervals for all values in our sample and then merge the overlapping intervals.

So, the values we are looking for are the ones that do not fall into any of the interval windows.
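The single-value step described above could be sketched in Python roughly as follows. This is an illustration, not the BUND implementation: the names interval_for and is_within are made up for this example, and using the absolute value so that negative readings still produce a start below the end is my own assumption.

```python
def interval_for(value, coeff):
    """Return the (start, end) window around one known value."""
    # Assumption: the half-width is proportional to the magnitude of the
    # value, so a negative value still gets start < end.
    half_width = abs(value) * coeff
    return (value - half_width, value + half_width)


def is_within(interval, x):
    """True when x lies strictly inside the window (seen before)."""
    start, end = interval
    return start < x < end


window = interval_for(5.0, 0.2)    # window around the known value 5.0
print(is_within(window, 4.5))      # close to a known value
print(is_within(window, 45.0))     # far from it: candidate "unseen data"
```

Note that a known value of exactly zero produces a zero-width window under this scheme, so a real filter would probably need a minimum width.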

Show me the code

To demonstrate how this approach works, I will use an Interval module that I've created as part of the standard library for the stack-based programming language BUND. This module will be officially released with release 1.2, but you can look at the code, written in Golang, now. I do not expect you to be fully familiar with RPN notation or stack-based languages, and that is not required. The generator "I" creates an empty Interval Set and places it on the stack. The operator "I/Coeff" takes an Interval Set and data from the stack and creates a configured Interval Set. println does what you expect and prints to the console. The data set is the set of elements between "(*" and ")"; the first number is not data but the coefficient used to build the intervals.

(* 0.2 1.0 2.0 3.0 5.0 1.0 100.0 150.0 10.0)  I I/Coeff println

So, we are building intervals for the values ( 1.0 2.0 3.0 5.0 1.0 100.0 150.0 10.0 ) with a variability coefficient of 0.2.

The outcome of this operation is our configured Interval Set

[ (0.8,1.2) (1.6,3.6) (4,6) (8,12) (80,180)  ]

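consisting of 5 intervals. Neighboring windows have been merged: (1.6,2.4) and (2.4,3.6), built around 2.0 and 3.0, became (1.6,3.6), and (80,120) and (120,180), built around 100.0 and 150.0, became (80,180). If a number falls within one of these intervals, we consider it "existing data"; if not, it is "unseen data" and worthy of an alarm. That is the gold we are looking for.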

4.5 I/Test

will return True: the value 4.5 falls within (4,6), so it is not what we are looking for. But

45 I/Test

will return False, because 45 lies between the windows (8,12) and (80,180). This is an example of "unseen data".
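If you do not have BUND at hand, the same example can be approximated in Python. Treat this as a sketch under a few assumptions, not as the code of the Interval module: the names build_intervals and is_known are made up for this illustration, and merging windows that merely touch at an endpoint is inferred from the printed Interval Set above. Exact fractions are used instead of floats because, in binary floating point, 2.0 * 1.2 and 3.0 * 0.8 do not compare as equal, and the touching windows would then fail to merge.

```python
from fractions import Fraction


def build_intervals(values, coeff):
    """Build one window per value, then merge overlapping or touching ones."""
    c = Fraction(str(coeff))
    spans = []
    for v in values:
        x = Fraction(str(v))
        half_width = abs(x) * c
        spans.append((x - half_width, x + half_width))
    spans.sort()

    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Assumption: touching endpoints count as overlapping.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_known(intervals, x):
    """True when x lies strictly inside any window (seen before)."""
    return any(start < x < end for start, end in intervals)


sample = [1.0, 2.0, 3.0, 5.0, 1.0, 100.0, 150.0, 10.0]
intervals = build_intervals(sample, 0.2)

print([(float(a), float(b)) for a, b in intervals])
# [(0.8, 1.2), (1.6, 3.6), (4.0, 6.0), (8.0, 12.0), (80.0, 180.0)]
print(is_known(intervals, 4.5))   # True
print(is_known(intervals, 45))    # False
```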

Conclusion

This Coefficient-based Interval Filter could be an important instrument for identifying new patterns, or simply for observing anomalies without concern for patterns (if that is what you are looking for). Coefficients are not the only mathematical method available to you. Instead of a hardcoded coefficient, you can dynamically calculate the standard deviation of the sample and use it as the coefficient. Try it and see what kind of results you get. Try other methods of creating filters that detect "unseen data". I consider this article a thought-provoking exercise rather than a production solution.
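One possible reading of the standard deviation idea is sketched below in Python. The interpretation is my assumption: each window is the value plus or minus k population standard deviations of the sample, so all windows share the same absolute width instead of scaling with the value. The names build_sigma_intervals and is_known, the parameter k, and the readings are all hypothetical.

```python
import statistics


def build_sigma_intervals(values, k=1.0):
    """Windows of value +/- k * sigma, merged when they overlap or touch."""
    # Assumption: sigma is the population standard deviation of the sample.
    half_width = k * statistics.pstdev(values)
    spans = sorted((v - half_width, v + half_width) for v in values)

    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_known(intervals, x):
    """True when x lies strictly inside any window (seen before)."""
    return any(start < x < end for start, end in intervals)


readings = [10.0, 10.2, 9.9, 10.1, 9.8]    # hypothetical, tightly clustered
windows = build_sigma_intervals(readings)

print(len(windows))                # the five windows merge into one
print(is_known(windows, 10.05))    # inside the cluster
print(is_known(windows, 12.0))     # far outside: candidate "unseen data"
```

This variant is sensitive to the spread of the sample. On the sample used earlier in this article, the population standard deviation is about 54, so every window would merge into a single one and 45 would be reported as known.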
