Anomaly Detection
12/19/20
Intro to anomaly detection methods
What is anomaly detection?
Anomaly detection is the study of rare events [1]. Detecting and predicting rare events is extremely important in applications such as fraud detection and assessing the operating condition of machinery.
What industries care about anomaly detection?
Anomaly detection turns up in almost every field in some shape or form, from finance to engineering.
How do you detect anomalies?
There are a number of different methodologies you can use to detect outliers (also known as anomalies). In essence, an outlier is a data point that is 'different' from the other data points in the time series, where 'different' can be defined as, for example:
Data points that are further than two standard deviations from the mean.
Data points that lie outside the normal variance (e.g. using Principal Component Analysis).
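The first definition can be sketched in a few lines. This is a minimal illustration on a made-up series; the function name `two_sigma_outliers`, the `n_sigma` parameter and the toy values are assumptions for the example, not part of the analysis later in the post.

```python
import numpy as np

def two_sigma_outliers(values, n_sigma=2.0):
    """Return a boolean mask of points further than n_sigma standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    std = values.std()
    if std == 0:
        # All values are identical, so nothing can be flagged
        return np.zeros(values.shape, dtype=bool)
    return np.abs(values - mean) > n_sigma * std

# Hypothetical toy series: steady readings with one spike at position 7
series = np.array([10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 25.0, 10.0, 9.9])
mask = two_sigma_outliers(series)
print(np.flatnonzero(mask))  # positions of the flagged points
```

Note that the spike itself inflates the mean and standard deviation, which is one reason this simple rule can miss subtle anomalies.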
What anomaly detection methods are available?
For a detailed overview see Ahmad et al. (2017).
Some examples are:
Isolation forest - https://en.wikipedia.org/wiki/Isolation_forest
Long Short Term Memory networks
What Python packages are available for anomaly detection?
scikit-learn (Robust covariance, One-Class SVM, Isolation Forest, Local Outlier Factor)
Tensorflow and PyTorch (LSTM)
stumpy (discord discovery)
pyod (outlier detection)
Overview of the blog post
In this blog post we will take a time series dataset from the Numenta Anomaly Benchmark (NAB) and detect outliers using a number of different methods.
We'll look at the machine_temperature_system_failure.csv example. While this dataset does have labels, I'm treating it as an unsupervised problem and fitting a model to the data.
Analysis
Download the data
$ git clone https://github.com/numenta/NAB.git
$ cd NAB
Plot the data
Load packages
import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
Parse the data, label the anomalies and the windows associated with the anomalies:
df = pd.read_csv(
    "data/realKnownCause/machine_temperature_system_failure.csv", index_col="timestamp"
)
# Labelled anomaly timestamps (avoid shadowing the built-in `dict`)
with open("labels/combined_labels.json") as f:
    labels = json.load(f)
anoms = (
    pd.DataFrame.from_dict(labels, orient="index")
    .T["realKnownCause/machine_temperature_system_failure.csv"]
    .dropna()
).rename("anomalies")
anom_df = df[df.index.isin(anoms)]
# Start/stop timestamps of the windows around each anomaly
with open("labels/combined_windows.json") as f:
    window_labels = json.load(f)
windows = pd.DataFrame(
    (
        pd.DataFrame.from_dict(window_labels, orient="index")
        .T["realKnownCause/machine_temperature_system_failure.csv"]
        .dropna()
    ).to_list(),
    columns=["start", "stop"],
)
Manipulate the data (the window positions were found by hand) and plot:
_df = df.copy()
_df['ix'] = np.arange(0, len(_df))
_anom_df = _df[_df.index.isin(anoms)]
# Rectangle takes (x, y), width, height: use scalar limits and the data range as the height
y_min, y_max = df['value'].min(), df['value'].max()
rect1 = Rectangle((2126, y_min), 566, y_max - y_min, color='gray', alpha=0.5)
rect2 = Rectangle((3703, y_min), 566, y_max - y_min, color='gray', alpha=0.5)
rect3 = Rectangle((16057, y_min), 566, y_max - y_min, color='gray', alpha=0.5)
rect4 = Rectangle((19232, y_min), 566, y_max - y_min, color='gray', alpha=0.5)
fig, ax = plt.subplots(figsize=(15, 10))
df.plot(ax=ax)
_anom_df.plot(kind='scatter', x='ix', y='value', c='red', s=100, ax=ax);
fig.autofmt_xdate()
ax.add_patch(rect1)
ax.add_patch(rect2)
ax.add_patch(rect3)
ax.add_patch(rect4)
The blue line is the data, the red dots are the anomalous points and the gray boxes are the anomalous windows.
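The window positions used for the gray boxes were found by hand. As a rough sketch, they could instead be derived from the start and stop timestamps. The helper `window_positions`, the toy index and the toy window below are hypothetical stand-ins, and the sketch assumes the series is sorted by time.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the series: one reading every 5 minutes
index = pd.date_range("2020-01-01 00:00:00", periods=12, freq="5min")
toy_df = pd.DataFrame({"value": range(12)}, index=index)

# Hypothetical window, written with the microsecond suffix used in the window labels
toy_windows = pd.DataFrame(
    [["2020-01-01 00:10:00.000000", "2020-01-01 00:25:00.000000"]],
    columns=["start", "stop"],
)

def window_positions(frame_index, windows):
    """Map window start/stop timestamps to integer row positions (index must be sorted)."""
    idx = pd.to_datetime(frame_index).to_numpy()
    starts = np.searchsorted(idx, pd.to_datetime(windows["start"]).to_numpy())
    stops = np.searchsorted(idx, pd.to_datetime(windows["stop"]).to_numpy())
    return pd.DataFrame({"start_ix": starts, "width": stops - starts})

positions = window_positions(toy_df.index, toy_windows)
print(positions)
```

Each row of `positions` gives the x position and width that a `Rectangle` would need.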
Simple statistical models
K-means
Choose 10 clusters (based on a prior elbow curve) and get the distance of each data point from its cluster centroid. Assume a small fraction of the data is anomalous (0.1 %, i.e. outliers_fraction = 0.001) and take the points that are furthest from their centroid.
from sklearn.cluster import KMeans
km = KMeans(n_clusters=10)
km.fit(df)
_df = df.copy().reset_index()
_df['cluster'] = km.labels_
# Labels are 0-based, so index the centroids directly (no "- 1"); column 0 is the single feature
_df['centroid'] = km.cluster_centers_[km.labels_, 0]
_df['distance_from_centroid'] = (_df['value'] - _df['centroid']).abs()
outliers_fraction = 0.001
anom_points = _df.nlargest(int(len(_df) * outliers_fraction), 'distance_from_centroid')
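The elbow curve used to pick the number of clusters is not shown in this post. A minimal sketch of how one might be computed is given here, on synthetic data rather than the NAB series; the three-group toy data and the range of k are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 1-D data: three well-separated groups of readings
rng = np.random.default_rng(0)
values = np.concatenate([
    rng.normal(20.0, 1.0, 200),
    rng.normal(60.0, 1.0, 200),
    rng.normal(100.0, 1.0, 200),
]).reshape(-1, 1)

# Inertia is the sum of squared distances of the points to their nearest centroid
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(values)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

The "elbow" is the value of k after which the inertia stops dropping sharply; for this toy data that happens at k=3.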
Plot the anomalies
_anom_df2 = anom_points.reset_index().set_index('timestamp')[['value', 'index']]
_df = df.copy()
_df['ix'] = np.arange(0, len(_df))
_anom_df = _df[_df.index.isin(anoms)]
# Rectangle takes (x, y), width, height: use scalar limits and the data range as the height
y_min, y_max = df['value'].min(), df['value'].max()
rect1 = Rectangle((2126, y_min), 566, y_max - y_min, color='gray', alpha=0.5)
rect2 = Rectangle((3703, y_min), 566, y_max - y_min, color='gray', alpha=0.5)
rect3 = Rectangle((16057, y_min), 566, y_max - y_min, color='gray', alpha=0.5)
rect4 = Rectangle((19232, y_min), 566, y_max - y_min, color='gray', alpha=0.5)
fig, ax = plt.subplots(figsize=(15, 10))
df.plot(ax=ax)
_anom_df.plot(kind='scatter', x='ix', y='value', c='red', s=100, ax=ax);
_anom_df2.plot(kind='scatter', x='index', y='value', c='green', s=100, ax=ax);
fig.autofmt_xdate()
ax.add_patch(rect1)
ax.add_patch(rect2)
ax.add_patch(rect3)
ax.add_patch(rect4)
The green dots are the points flagged as anomalous.
This method finds the two anomalies which have the largest deviations from the mean but doesn't find the two more subtle anomalies.
Isolation forest
An isolation forest isolates points by recursively splitting the data at random; points that sit outside the bulk of an N-dimensional distribution need fewer splits to isolate and are flagged as anomalies.
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
isofor = IsolationForest(contamination=outliers_fraction, random_state=42)
_df = df.copy()
_df['index'] = np.arange(0, len(_df))
_df['ypred'] = isofor.fit_predict(df_scaled)
_anom_df2 = _df[_df['ypred'] == -1]
This method picks up the most extreme anomalous points (the second anomaly).
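To see why the most extreme points are picked up first, it can help to look at the raw anomaly scores instead of the -1/1 labels. The sketch below uses synthetic data (500 typical readings plus one extreme value), which is an assumption for illustration and not the NAB series.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical data: 500 typical readings plus one extreme value at the end
typical = rng.normal(0.0, 1.0, size=(500, 1))
extreme = np.array([[10.0]])
X = np.vstack([typical, extreme])

forest = IsolationForest(random_state=42).fit(X)

# score_samples: the lower the score, the more anomalous the point
scores = forest.score_samples(X)
print("index of the lowest score:", int(np.argmin(scores)))
```

The contamination parameter then simply sets the score threshold below which points are labelled -1.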
One Class SVM
A one-class Support Vector Machine (SVM) detects outliers by identifying whether a data point is "unlike" the other data points (hence the one class). The method is based on SVMs, where the data is projected through a non-linear function into a higher-dimensional space, which makes it easier to separate data points with a hyperplane.
from sklearn.svm import OneClassSVM
ocsvm = OneClassSVM(nu=outliers_fraction)
_df = df.copy()
_df['index'] = np.arange(0, len(_df))
# Use the one-class SVM defined above, not the isolation forest
_df['ypred'] = ocsvm.fit_predict(df_scaled)
_anom_df2 = _df[_df['ypred'] == -1]
It's difficult to pick out the extreme values with a One Class SVM, as this hands-on exercise shows. This is a known problem with One Class SVM (see the scikit-learn documentation).
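The scikit-learn documentation describes the One Class SVM as better suited to novelty detection, where the model is fitted on data that is assumed to be clean and then applied to new observations. A minimal sketch of that usage follows; the synthetic training data, the new observations and the choice of nu=0.05 are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Hypothetical "clean" training data, assumed to contain no anomalies
train = rng.normal(50.0, 2.0, size=(300, 1))
# New observations: two typical values and one far from the training data
new = np.array([[50.5], [49.0], [80.0]])

scaler = StandardScaler().fit(train)
novelty = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
novelty.fit(scaler.transform(train))

# predict returns +1 for inliers and -1 for outliers
predictions = novelty.predict(scaler.transform(new))
print(predictions)
```

Here nu is an upper bound on the fraction of training points treated as errors, which is why it plays a similar role to the outlier fraction used above.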
Python Packages
PyOD
PyOD is a Python package developed by Yue Zhao which provides a number of different methods for outlier detection, including Linear, Proximity-based, Probabilistic, Ensembles and Neural Networks. In addition, it offers ways to combine them. The example notebooks are also well fleshed out.
Here I'm going to use a novel method: the Copula-Based Outlier Detector, COPOD (Li et al., 2020).
from pyod.models.copod import COPOD
clf = COPOD(contamination=outliers_fraction)
_df['index'] = np.arange(0, len(_df))
_df['ypred'] = clf.fit_predict(df_scaled)
_anom_df2 = _df[_df['ypred'] == 1]
This one behaves more like k-means than the isolation forest and One Class SVM.
A more complex method is the deep learning technique of auto-encoding, here a variational autoencoder (VAE):
from pyod.models.vae import VAE
clf = VAE(encoder_neurons=[1], decoder_neurons=[1], contamination=outliers_fraction)
_df['index'] = np.arange(0, len(_df))
_df['ypred'] = clf.fit_predict(df_scaled)
_anom_df2 = _df[_df['ypred'] == 1]
The results are similar to other methods. In this case there is only one 'feature' so the encoding doesn't do much but it's an example of how easy it is to use.
TODS
TODS is an automated Time Series Outlier Detection System. It offers complex algorithms and is not trivial to use. It's more of a code dump, but you learn a lot by looking through the source code.
Further work
Other ways to expand on this blog include:
Treating it as a supervised problem and fitting a model to the anomalous points.
Treating it as streaming data and continually fitting and predicting at each time step. This may be the best approach to capture the subtle anomalies.
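As a rough illustration of the streaming idea, each point can be scored against only the data that came before it. The sketch below swaps in a simple rolling z-score rather than any of the models used above; the function name, the window of 50, the threshold of 4 and the synthetic stream are all assumptions.

```python
import numpy as np
import pandas as pd

def rolling_zscore_flags(series, window=50, threshold=4.0):
    """Flag each point using only the mean and std of the preceding window."""
    # shift(1) keeps the current point out of its own baseline
    history = series.shift(1).rolling(window, min_periods=window)
    z = (series - history.mean()) / history.std()
    # Points without a full window of history have z = NaN and are not flagged
    return z.abs() > threshold

# Hypothetical stream: a noisy level with a brief dip at position 300
rng = np.random.default_rng(0)
stream = pd.Series(rng.normal(80.0, 1.0, 400))
stream.iloc[300] = 60.0

flags = rolling_zscore_flags(stream)
print(flags[flags].index.tolist())
```

Because the baseline moves with the data, a deviation that is small relative to the whole series can still stand out against its recent history.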
Further reading
Anomaly detection is a huge field in itself, so there is no shortage of further reading:
Ahmad et al. (2017) Unsupervised real-time anomaly detection for streaming data, Neurocomputing, 262, pp. 134-147 https://www.sciencedirect.com/science/article/pii/S0925231217309864
https://arxiv.org/abs/1805.06725 GANomaly: Semi-Supervised Anomaly Detection via Adversarial Training
https://www.kaggle.com/victorambonati/unsupervised-anomaly-detection
https://towardsdatascience.com/anomaly-detection-with-isolation-forest-visualization-23cd75c281e2
https://towardsdatascience.com/anomaly-detection-with-lstm-in-keras-8d8d7e50ab1b
https://github.com/yzhao062/anomaly-detection-resources
https://awesomeopensource.com/projects/anomaly-detection
http://odds.cs.stonybrook.edu/#table1
https://github.com/numenta/NAB
https://towardsdatascience.com/time-series-of-price-anomaly-detection-13586cd5ff46 - https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Time%20Series%20of%20Price%20Anomaly%20Detection%20Expedia.ipynb
https://eng.uber.com/monitoring-data-quality-at-scale/
https://www.kaggle.com/victorambonati/unsupervised-anomaly-detection#2.7-RNN
https://www.kaggle.com/boltzmannbrain/nab
https://github.com/shunsukeaihara/changefinder
https://pyod.readthedocs.io/en/latest/
https://www.kaggle.com/caesarlupum/starter-anomaly-strategy
https://github.com/zillow/luminaire
https://towardsdatascience.com/outlier-detection-with-hampel-filter-85ddf523c73d
https://tods-doc.github.io/getting_started.html
https://ff12.fastforwardlabs.com/
https://www.kaggle.com/tikedameu/anomaly-detection-with-autoencoder-pytorch