Do you possess unique potential? Are you an outlier?
Just as Miles Morales, bitten by a radioactive spider from another dimension, stands out as "The Original Anomaly", in machine learning an "anomaly" is a data point that deviates significantly from the expected or normal behavior of the rest of the data.
Source: Marvel Entertainment
Unexpected: They are rare occurrences that don't fit the established patterns in the data.
Informative: They can signal suspicious activity like fraud, security breaches, or even interesting discoveries.
Point Anomalies: Single data points that are considerably different from the rest. Example: A sudden spike in network traffic that could indicate a cyberattack.
Contextual Anomalies: Data points that are unusual within a specific context or period. Example: A spike in heating usage during summer, which would be perfectly normal in winter.
Collective Anomalies: A group of data points that deviate together, indicating an unusual pattern or event. Example: An unexpected drop in sales across multiple product lines.
How to detect an Anomaly?
Supervised: Requires labeled data with examples of both normal and anomalous behavior. Algorithms like SVM, KNN, or Neural Networks can be trained to classify new data points.
Unsupervised: Works with unlabeled data, assuming anomalies are rare and different from the majority. Techniques like clustering, density estimation, or autoencoders can be used to identify unusual patterns.
Semi-Supervised: Uses a small amount of labeled data along with a larger set of unlabeled data to improve anomaly detection.
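The semi-supervised pattern above can be sketched with scikit-learn's One-Class SVM: train only on data assumed to be normal, then flag new points that don't fit. The dataset here is synthetic, invented purely for illustration:

```python
# A minimal sketch of semi-supervised anomaly detection (synthetic data).
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
# Training set assumed to contain only "normal" behavior
normal_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu roughly bounds the fraction of training points treated as outliers
model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal_train)

new_points = np.array([[0.1, -0.2],   # resembles the training data
                       [6.0, 6.0]])   # far from anything seen so far
print(model.predict(new_points))      # +1 = normal, -1 = anomaly
```

Anything far from the learned boundary comes back as -1, even though the model never saw a labeled anomaly during training.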
Anomalies are often identified based on -
Density: Normal points occur in dense regions, while anomalies occur in sparse regions. Easy to spot!
Distance: Normal points sit close to their neighbors, while anomalies sit far from theirs. Can't hide!
Isolation: Anomalies are more susceptible to isolation due to their rarity and difference from the majority. Lone wolf!
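These three ideas map directly onto standard scikit-learn detectors: Local Outlier Factor uses local density, and Isolation Forest uses how easily a point can be isolated. A minimal sketch on synthetic 2-D data:

```python
# Density-based (LOF) vs. isolation-based (Isolation Forest) detection.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),  # dense "normal" cluster
               [[8.0, 8.0]]])                    # one isolated point

lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
iso_labels = IsolationForest(random_state=0).fit_predict(X)

# Both should label the last point -1 (anomaly) and most others +1 (normal).
print(lof_labels[-1], iso_labels[-1])
```

Both detectors agree on the obvious lone wolf; on messier real data they often disagree, which is exactly why comparing multiple techniques (as recommended at the end of this article) pays off.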
They sound easy to find, so what could go wrong?
Defining Normal: Establishing what "normal" behavior means can be tricky, especially in complex or evolving systems.
Imbalanced Data: Anomalies are usually rare, leading to imbalanced datasets that can make training challenging.
Adaptability: Systems need to adapt to changing data distributions and emerging types of anomalies.
Let's break it down using some statistical methods.
Data anomalies can be like finding a needle in a haystack, except the needle is trying to sabotage your entire operation. Empirical and robust covariance methods are your trusty metal detectors, helping you sift through the noise and pinpoint those critical deviations.
Empirical Covariance (The Traditional Approach) -
It is a straightforward method for estimating the covariance matrix of a dataset. It calculates the pairwise covariances between variables based on their observed values.
To put it simply, think of this as taking the average height and weight of everyone in the room, and then seeing how much each person's height and weight deviate from those averages.
If most people are of similar height and weight, the deviations will be small, and the covariance will be low. However, if there's a giant or a very petite person in the room, their measurements will significantly skew the averages, leading to a higher covariance.
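The height-and-weight thought experiment is easy to reproduce numerically. This sketch (synthetic data, made up for illustration) shows how a single "giant" inflates the empirical covariance estimate:

```python
# One extreme point inflates the empirical covariance estimate.
import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.default_rng(1)
# Heights (cm) and weights (kg) of 50 "people in the room"
room = rng.normal(loc=[170, 70], scale=[8, 10], size=(50, 2))
room_with_giant = np.vstack([room, [[260, 200]]])  # add one giant

cov_clean = EmpiricalCovariance().fit(room).covariance_
cov_skewed = EmpiricalCovariance().fit(room_with_giant).covariance_

# The variances (diagonal entries) jump after adding a single point
print(np.diag(cov_clean))
print(np.diag(cov_skewed))
```

One point out of fifty-one is enough to distort both variances noticeably, which is the core weakness this article turns to next.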
Example of an anomaly. Source: Nagwa
Robust Covariance (The Outlier-Resistant Approach) -
The Minimum Covariance Determinant (MCD) estimator is designed to be less susceptible to the influence of outliers. It achieves this by identifying a subset of the data that is most likely to be free of outliers and then calculating the covariance matrix based on that subset.
Now, imagine there's a group of basketball players in the room. They are significantly taller and heavier than the average person. Using the traditional approach, their measurements would heavily influence the averages, making it seem like everyone in the room is much taller and heavier than they are.
Robust covariance is like having a bouncer at the door who only lets in people of "typical" height and weight. This way, the calculations of average height and weight are not influenced by the outliers (the basketball players), giving a more accurate representation of the majority of people in the room.
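The bouncer analogy can be sketched with scikit-learn's MinCovDet next to the plain empirical estimator. The crowd and the basketball players below are synthetic, invented for illustration:

```python
# MCD "bounces" the basketball players out of the estimate, so its center
# stays near the typical crowd while the empirical mean drifts toward them.
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(7)
crowd = rng.normal(loc=[170, 70], scale=[8, 10], size=(95, 2))   # typical people
players = rng.normal(loc=[210, 120], scale=[4, 6], size=(5, 2))  # basketball team
X = np.vstack([crowd, players])

emp = EmpiricalCovariance().fit(X)
mcd = MinCovDet(random_state=0).fit(X)

print(emp.location_)  # pulled toward the players
print(mcd.location_)  # stays close to the crowd's true center, ~[170, 70]
```

The MCD center lands closer to the crowd's actual mean than the empirical one does, because the five players never make it past the door.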
Empirical Covariance:
- Pros: Simple and computationally efficient; suitable for normally distributed data.
- Cons: Highly sensitive to outliers; assumes a linear (elliptical) relationship between variables.
Robust Covariance:
- Pros: Far less sensitive to outliers; more tolerant of heavy tails and contaminated data.
- Cons: Computationally more expensive; requires careful selection of parameters (e.g., the fraction of data assumed clean).
Both empirical and robust covariance can be used for outlier detection through the Mahalanobis distance. This distance measures how far a data point is from the center of the data, taking into account the covariance structure.
Outliers are those points with a Mahalanobis distance exceeding a certain threshold.
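One common threshold choice (an assumption here, not the only option) is a chi-squared quantile, since squared Mahalanobis distances of Gaussian data follow a chi-squared distribution with one degree of freedom per feature. A minimal sketch on synthetic data:

```python
# Flag points whose squared Mahalanobis distance exceeds a chi-squared cutoff.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),  # normal data
               [[7.0, 7.0]]])                    # one injected outlier (index 200)

mcd = MinCovDet(random_state=0).fit(X)
d2 = mcd.mahalanobis(X)                     # squared distances to the MCD center
threshold = chi2.ppf(0.975, df=X.shape[1])  # ~7.38 for 2 features
outliers = np.where(d2 > threshold)[0]
print(outliers)  # the injected point at index 200 should be among these
```

With a 97.5% cutoff, a few genuinely normal points will also land over the line; the threshold is a knob to tune for your tolerance of false alarms.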
How do you choose between them?
Well, the choice between empirical and robust covariance depends on the characteristics of your data.
- If your data is mostly normally distributed and free of significant outliers, empirical covariance may suffice.
- If you suspect the presence of outliers or heavy-tailed, contaminated data, robust covariance is the more reliable option.
I know it's a long article, but remember to:
- Always visualize your data to get a better understanding of its distribution and potential outliers.
- Consider using multiple outlier detection techniques and compare their results.
- Domain knowledge is key! Interpret the detected outliers in the context of the problem to derive meaningful conclusions.
Check out the project inspired by scikit-learn - GitHub
Get in touch at jain.van@northeastern.edu