Anomaly detection is a strategy used when solving problems where the likelihood of the good/common/frequent case is higher than the likelihood of the bad/anomalous/rare case. In other words, we can think of an anomaly detection system as a 2-class classification problem where one of the classes is known whereas the other is not. In a very specific scenario, a rule-based system can be considered a degenerate version of an anomaly detection system, because all the cases that do not match the rule fall under the "other class". However, in technical literature and commonly accepted understanding, a rule-based system is not considered an anomaly detection system, because an anomaly detection system is expected to automatically discover the rules from data.
In common practice there are two broad categories of anomaly detection systems: those that discover the rules for the good/common/frequent case and capture them in a mathematical model (model-driven), and those that discover the rules and capture them in a more human-readable form such as a regex or a database query (grammar-driven). The latter requires an additional step of post-processing the mathematical model. This step is usually difficult because a mathematical computation cannot always be translated into a grammar, especially when there are complex interactions between high-dimensional features. In fact, model-driven anomaly detection has become popular because of these complex interactions. However, in applications where human intervention is necessary, grammar-driven approaches are preferred.
Another aspect to note is that false positives and false negatives are two sides of the same coin. Many anomaly detection implementations are built with only one of these metrics in mind and therefore fare poorly on the other. The interesting part is that it is very easy to observe false positives, but one never knows whether there were false negatives unless explicit tests are run to quantify the anomaly detector's performance. The rule of thumb is to use as much context as possible when detecting an anomaly so that false positives and false negatives are handled together within the mathematical model.
Region bounding approaches: In this approach the algorithm creates an n-dimensional bounding box or an n-sphere. A popular algorithm in this category is the kd-tree. This algorithm is easy to implement and assumes that the data can be separated using linear partitions. The adjoining figure illustrates how a kd-tree partitions datapoints represented by 2D feature vectors. The datapoints are partitioned into regions recursively, using a heuristic optimization criterion such as variance, uncertainty, or density.
There are two modes of using this approach. In streaming-mode detection, once the model is built with historical data, an input datapoint is traversed down the tree to find the bounding region in which it falls. If it falls in a dense region we classify that point as not anomalous. If it falls in a very low-density region we use a distance measure to qualify the datapoint as an anomaly (unknown-unknown) or as uncertain (known-unknown). Usually the datapoint is classified as uncertain if the distance between the datapoint being evaluated and the datapoints in that bounded region is within an acceptable threshold and that region has a low number of datapoints compared to other high-density regions. The datapoint is classified as anomalous if the distance is outside the acceptable threshold.
In batch mode, the kd-tree is built from the data and low-density regions are identified. Datapoints in these low-density regions are classified as outliers/anomalies. Here, the notion of a known-unknown is not apparent, but the idea of qualifying a datapoint as an unknown-unknown (anomaly) is apparent. A minimal code sketch of the streaming mode follows the list below.
High-dimensional region partitioning
Search is, on average, logarithmic in the number of datapoints (though it degrades as the number of dimensions grows)
Can be used to determine the known-unknown as well as the unknown-unknown by defining a threshold over the region
A datapoint can be classified as an unknown-unknown only when the baseline is fixed
Baselines include known-knowns as well as known-unknowns
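To make the streaming-mode description concrete, here is a minimal Python sketch of the kd-tree approach, assuming scipy's cKDTree. The thresholds NEAR and FAR, the neighbourhood size K, and the synthetic data are illustrative assumptions; a real deployment would calibrate them against the density of the bounded regions.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
history = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))   # historical datapoints
tree = cKDTree(history)                                     # model built once, offline

NEAR, FAR, K = 0.5, 1.5, 10   # assumed distance thresholds and neighbourhood size

def classify(point):
    # Classify a new datapoint as normal, known-unknown, or unknown-unknown
    dists, _ = tree.query(point, k=K)        # distances to the K nearest stored points
    mean_dist = float(np.mean(dists))
    if mean_dist <= NEAR:                    # falls in a dense region
        return "normal"
    if mean_dist <= FAR:                     # sparse, but within the acceptable threshold
        return "uncertain (known-unknown)"
    return "anomaly (unknown-unknown)"       # far outside every bounded region

print(classify(np.array([0.1, -0.2])))   # likely "normal"
print(classify(np.array([4.0, 4.0])))    # likely "anomaly (unknown-unknown)"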
Clustering approaches: These approaches are a generalization of the region bounding approaches. Data is partitioned into clusters using one of many well-known clustering algorithms, for example DBSCAN or OPTICS. Once clusters are identified, operations on the clusters are similar to those described for the region bounding approaches.
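As a quick illustration, a hedged sketch of the clustering route using scikit-learn's DBSCAN is shown below; the eps and min_samples values and the synthetic data are assumptions, not tuned settings.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.3, size=(500, 2))          # the good/common/frequent case
stray = np.array([[3.0, 3.0], [-2.5, 4.0]])          # a few rare datapoints
data = np.vstack([dense, stray])

labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(data)   # assumed parameters
anomalies = data[labels == -1]                              # -1 marks low-density points
print(len(anomalies), "datapoints flagged as outliers/anomalies")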
Summarization by converting to a trained model: One can observe that the methods mentioned above require one to store the datapoints used for building the models. On these stored datapoints one eventually performs some form of distance computation. However, the number of points can become very large and the number of partitions can also increase manyfold. One common trick is to summarize the datapoints into a 2-class classifier as shown in the adjoining figure. Once the classifier is trained, the anomaly detection problem becomes a classification problem. The unknown-unknowns are identified based on the extent to which the test datapoint deviates from the detected known-unknowns.
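One hedged way to realize this summarization is sketched below: the low-density points found by an earlier step are treated as the known-unknown class, a plain 2-class classifier is trained, and only its parameters are kept. The synthetic data, the choice of logistic regression, and the use of the decision score as a deviation measure are assumptions for illustration.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.3, size=(500, 2))          # datapoints from dense regions
known_unknown = rng.normal(2.5, 0.3, size=(30, 2))    # datapoints from low-density regions

X = np.vstack([normal, known_unknown])
y = np.array([0] * len(normal) + [1] * len(known_unknown))
clf = LogisticRegression().fit(X, y)                  # the stored datapoints can now be dropped

def deviation(point):
    # Signed distance from the decision boundary, towards the known-unknown class
    return float(clf.decision_function(point.reshape(1, -1))[0])

print(deviation(np.array([2.5, 2.5])))   # close to the detected known-unknowns
print(deviation(np.array([8.0, 8.0])))   # deviates far beyond them: candidate unknown-unknown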
One-class models: These techniques model the anomaly detection problem as a 1-class classification problem where all the observed datapoints are assumed to belong to the known class. The one-class SVM is a popular algorithm in this category. Non-linear kernels help create separating regions such that any datapoint outside the bounded region is an outlier or an anomaly. The outcome is similar to the two-class classifier described above, but the separating hyperplane is created via an optimization framework.
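A minimal sketch with scikit-learn's OneClassSVM follows; the RBF kernel, nu, and gamma values are illustrative assumptions.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
observed = rng.normal(0.0, 1.0, size=(1000, 2))    # all observed datapoints form the known class

# A non-linear (RBF) kernel builds the bounded separating region via the optimization framework
model = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5).fit(observed)

test = np.array([[0.2, -0.1], [5.0, 5.0]])
print(model.predict(test))   # +1 = inside the bounded region, -1 = outlier/anomaly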
Open-set learning: Classical supervised learning is a closed-set learning framework. This means the trained model can recognize only those classes of data that were observed during training. For example, imagine that all the datapoints in the examples above clustered tightly into a single cluster. Then there is no way a 2-class classifier can be built. This is where a one-class SVM comes into play. But imagine there is a need to identify the different non-anomalous classes and yet we want to identify datapoints that do not belong to any of these classes. In this case a one-class SVM would not work and we need open-set learning. One approach is to relax the decision boundaries, assuming a Gaussian distribution, by factoring it into the optimization framework.
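To give the open-set idea some shape, here is a rough sketch that fits a Gaussian to each known class and rejects datapoints that are improbable under all of them. The per-class Gaussian fit, the Mahalanobis distance, and the chi-squared rejection threshold are assumptions standing in for the full optimization-based formulation.

import numpy as np
from scipy.stats import chi2

def fit_class(points):
    # Mean and inverse covariance of one known (non-anomalous) class
    return points.mean(axis=0), np.linalg.inv(np.cov(points, rowvar=False))

def mahalanobis_sq(x, mean, cov_inv):
    d = x - mean
    return float(d @ cov_inv @ d)

rng = np.random.default_rng(0)
classes = {
    "interactive": fit_class(rng.normal([0.0, 0.0], 0.5, size=(300, 2))),
    "batch":       fit_class(rng.normal([4.0, 4.0], 0.5, size=(300, 2))),
}
threshold = chi2.ppf(0.99, df=2)   # relaxed Gaussian boundary around each class

def classify(x):
    scores = {name: mahalanobis_sq(x, m, ci) for name, (m, ci) in classes.items()}
    best = min(scores, key=scores.get)
    return best if scores[best] <= threshold else "unknown (open set)"

print(classify(np.array([0.2, -0.3])))    # likely "interactive"
print(classify(np.array([10.0, -8.0])))   # likely "unknown (open set)"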
Among all these techniques some are not suitable for categorical data whereas some are quite versatile across data types. However, there is one persisting problem: all these algorithms lack information about the conditions under which the anomaly detection must be performed. We refer to these conditions as the context.
Here is where it gets tough. It is hard to describe the meaning of context and even harder to incorporate context into a mathematical framework. One school of practitioners believes that context is nothing but additional features describing the datapoint, whereas another school prefers to separate context from the models by creating as many models as there are contexts.
Let us look at context in a specific setting where we need to detect the anomalous behaviour of a compute machine. For this discussion let the machine be characterized by properties shown in the JSON objects below.
property: {
  "system_parameters": {
    "memory_usage": "10GB",
    "cpu_usage": "82%"
  },
  "access_load": {
    "number_of_users": 15,
    "number_of_programs": 4
  },
  "program": ["top", "ps", "safari", "pycharm"],
  "when": {
    "day": "Monday",
    "period": "forenoon"
  }
}
property: {
  "system_parameters": {
    "memory_usage": "3GB",
    "cpu_usage": "70%"
  },
  "access_load": {
    "number_of_users": 1,
    "number_of_programs": 2
  },
  "program": ["backup", "transfer"],
  "when": {
    "day": "Friday",
    "period": "evening"
  }
}
These properties are a mix of numerical and categorical values. Given these properties, how do we proceed with building an anomaly detector? Keep in mind that once the anomaly has been detected, we need to explain why it is an anomaly by citing the dimensions responsible for it! Attribution is the most challenging part of any machine-learning-based implementation, and the problem is aggravated as the number of dimensions grows. One should not confuse attribution with feature importance: attribution happens after the anomaly is discovered, whereas feature importance is discovered during modeling.
Detect an anomaly in each property, and if an anomaly is detected flag the behaviour of the compute engine as anomalous and cite that property as the cause for the anomaly.
The adjoining images describe a weak detector that is oblivious to the other properties of the compute engine. This weak detector models the anomaly as "any value of cpu_usage greater than the max value we have ever seen". This works perfectly well on most days and at most times of day, when the cpu_usage is close to this max value. However, imagine that on a Friday during the afternoon session we get a cpu_usage just below the max value. This is an anomaly which goes unnoticed (a false negative).
The cpu_usage varies over the day. On Monday it peaks during the forenoon (FN) and for a short while during the afternoon (AN)
The cpu_usage tapers off during afternoon (AN) on a Friday and has a sudden spike in the evening (E) when a batch job runs
What is the issue: For the moment let us forget that this is a weak detector. The main issue is that the model needs to be built for each property.
When does this approach work: It works when properties are independent. For example, if the "cpu_usage" is independent of the "number_of_programs" or the "number_of_users" is independent of the "day" and "period" and so on.
How can the weak detector be improved with context?
The adjoining figure shows how the period and the day were used to create branching conditions. Essentially, tuples of properties were created: the day and the period were used to segment the cpu_usage, and for each segment a different weak model was created. Now the same weak detector has the additional context of what the max value would be in that specific segment, and this solves for the false negative mentioned earlier.
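A minimal sketch of this contextual weak detector is shown below: one max-value model per (day, period) segment. The historical records and field names are assumptions that mirror the JSON properties above.

from collections import defaultdict

history = [
    {"day": "Monday", "period": "forenoon", "cpu_usage": 82},
    {"day": "Monday", "period": "afternoon", "cpu_usage": 75},
    {"day": "Friday", "period": "afternoon", "cpu_usage": 35},
    {"day": "Friday", "period": "evening", "cpu_usage": 70},
    # ... more historical records
]

# Branch on (day, period) and keep the max cpu_usage ever seen in each segment
max_per_segment = defaultdict(float)
for rec in history:
    key = (rec["day"], rec["period"])
    max_per_segment[key] = max(max_per_segment[key], rec["cpu_usage"])

def is_anomalous(day, period, cpu_usage):
    # Flag cpu_usage that exceeds the max ever seen in this specific segment
    return cpu_usage > max_per_segment.get((day, period), 0)

# 60% on a Friday afternoon is caught even though it is below the global max of 82
print(is_anomalous("Friday", "afternoon", 60))   # True
print(is_anomalous("Monday", "forenoon", 60))    # False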
What is the complexity?
We have as many models as the number of branches. In this example we talked about a very simple model that uses the max value of cpu_usage. What if this model were an LSTM neural network? Imagine how many models would have to be trained, deployed, loaded, and switched. It is not uncommon to find such implementations when one is targeting high accuracy for a very specific metric. By the way, there is a practical way to build a single LSTM model taking into account all the context, but that approach warrants a separate discussion.
How do we describe the anomaly?
In this approach the anomaly would be described as follows: we observed the cpu_usage as being uncharacteristic during the forenoon on a Monday.
What is the challenge in generalizing this approach?
How do we know what properties to use in order to branch?
What if there were say 50 such properties? What would happen to the tree and how would the branching take place?
We are still detecting an anomaly only on one property, namely cpu_usage (of course with additional context), but what about the other properties? For example, what does it mean if the compute machine had a cpu_usage above normal on a Monday but a new program was found running?
This approach uses all the properties of the compute engine with any one of the anomaly detection algorithms described earlier. However, these algorithms would require the properties JSON to be converted into a vector. Mapping the JSON into a vector involves many tricks when different properties are of different domains and have different ranges. Further, we need to define a distance measure that is appropriate for each dimension.
Mapping to a vector representation
Group properties into 3 types of domains namely: numerical, enum, and sets
Reduce the range of numerical domains by discretizing the values into bins or by representing the numerical domain as a distribution
one could divide the range for a particular numerical property into N evenly spaced bins and use each bin as a dimension in the feature vector
another approach is to build a box-and-whisker model from the data in that dimension and use 3 dimensions corresponding to values greater than the upper whisker, within the IQR, and lower than the lower whisker. Every data value will now be converted into 3 boolean values.
Convert enums into an embedding for each of the enum properties
Keep sets as sets and if they are very big summarize them as a MinHash
Adjoining figure shows one such vector representation for one datapoint.
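Below is a minimal sketch of mapping one property JSON into such a mixed representation, using the evenly spaced-bin option for cpu_usage, one-hot vectors as a simple stand-in for enum embeddings, and a plain Python set for the program list. The bin edges and vocabularies are assumptions.

import numpy as np

CPU_BINS = np.linspace(0, 100, 6)    # 5 evenly spaced bins over the cpu_usage range
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
PERIODS = ["forenoon", "afternoon", "evening", "night"]

def one_hot(value, vocab):
    vec = np.zeros(len(vocab))
    vec[vocab.index(value)] = 1.0
    return vec

def to_representation(prop):
    cpu = float(prop["system_parameters"]["cpu_usage"].rstrip("%"))
    cpu_vec = np.zeros(len(CPU_BINS) - 1)
    idx = min(np.digitize(cpu, CPU_BINS) - 1, len(cpu_vec) - 1)   # bin membership as a dimension
    cpu_vec[idx] = 1.0
    numeric = np.concatenate([
        cpu_vec,
        one_hot(prop["when"]["day"], DAYS),
        one_hot(prop["when"]["period"], PERIODS),
    ])
    return numeric, set(prop["program"])    # sets stay as sets

example = {
    "system_parameters": {"cpu_usage": "82%"},
    "program": ["top", "ps", "safari", "pycharm"],
    "when": {"day": "Monday", "period": "forenoon"},
}
vec, programs = to_representation(example)
print(vec, programs)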
Defining the similarity across dimensions
The numerical and boolean representations lend themselves to standard numerical distance measures. However, the sets need to be handled using intersection measures such as the Jaccard distance.
These similarity measures need to be used within the affinity, density and distance models in the algorithms described earlier.
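A sketch of one such mixed measure over the representation above is shown below: Euclidean distance on the numerical/boolean block, Jaccard distance on the set, combined with assumed weights. For an algorithm like DBSCAN this could be used by precomputing a pairwise distance matrix and passing metric="precomputed".

import numpy as np

def jaccard_distance(a, b):
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def mixed_distance(x, y, w_numeric=1.0, w_set=1.0):
    # x and y are (numeric_vector, program_set) pairs from the representation above
    vec_x, set_x = x
    vec_y, set_y = y
    d_numeric = np.linalg.norm(vec_x - vec_y)
    d_set = jaccard_distance(set_x, set_y)
    return w_numeric * d_numeric + w_set * d_set

a = (np.array([0.0, 1.0, 0.0]), {"top", "ps", "safari"})
b = (np.array([0.0, 0.0, 1.0]), {"backup", "transfer"})
print(mixed_distance(a, b))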
How do we describe the anomaly?
In this approach the anomaly would be described as follows: this combination of factors looks abnormal. Digging deeper, it would be described as: during the forenoon session on a Monday, the combination of the number_of_users, the cpu_usage, the memory_usage, and the running programs is uncharacteristic.
Approach-1
Uses a predetermined combination of properties to find an anomaly in one property
Domain expertise is needed to decide what properties form the context
Because a single numerical property was the target variable and the context was a set of predetermined properties, the implementation is simple
As the number of factors in the context increases, the memory and computational complexity goes up.
Approach-2
Uses all properties to find an anomalous combination of properties
Context is latent in the combination of properties
In order to factor in multiple domains and ranges of properties one needs an elaborate transformation of data into vector-space representation
A large variety of ML approaches can be used to improve accuracy. For example: ensemble detection, chaining, non-linear dimensionality reduction followed by clustering etc.
Dimensionality can be controlled and we need not store all the datapoints thereby making the computation and storage much more efficient
People love to interfere with machine learning algorithms. The funny part is that the math behind the algorithms was derived assuming a self-driving mode, yet these algorithms often produce unexpected results, and that makes humans feel they need to control how the model works. However, humans cannot comprehend high-dimensional spaces and distances in these spaces. So the choice we have is to tune the importance of each property in the vector representation and then allow the model to build itself. Humans can select the weight for each property via a user interface, making sure that the weights are in a bounded range (say between 0 and 1). The vector representation now needs to use the assigned weights to change the base representation. The model's parameters should remain outside the control of humans.
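As a rough sketch of this workflow, the weights collected from the user interface can simply scale the per-property blocks of the vector before the model is rebuilt; the property names and block layout below are assumptions carried over from the earlier sketches.

import numpy as np

def apply_weights(blocks, weights):
    # blocks: {property_name: sub-vector}; weights: {property_name: value in [0, 1]}
    weighted = []
    for name, sub_vec in blocks.items():
        w = float(np.clip(weights.get(name, 1.0), 0.0, 1.0))   # keep weights bounded
        weighted.append(w * np.asarray(sub_vec, dtype=float))
    return np.concatenate(weighted)

blocks = {
    "cpu_usage": [0.0, 0.0, 0.0, 0.0, 1.0],
    "day":       [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "period":    [1.0, 0.0, 0.0, 0.0],
}
weights = {"cpu_usage": 1.0, "day": 0.5, "period": 0.25}   # chosen by a human via the UI
print(apply_weights(blocks, weights))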