Anomaly Detection

12/19/20

Intro to anomaly detection methods

What is anomaly detection?

Anomaly detection is the study of rare events [1]. Detecting and predicting rare events is extremely important for applications such as fraud detection and assessing the operating condition of machinery.

What industries care about anomaly detection?

Anomaly detection turns up in some shape or form in almost every field, from finance to engineering.

How do you detect anomalies?

There are a number of different methodologies you can use to detect outliers (also known as anomalies). In essence, an outlier is a data point 'different' from the other data points in the time series, where 'different' can be defined as:

  • Data points that are further than two standard deviations from the mean (see the sketch after this list).

  • Data points that lie outside the normal variance (e.g. found using Principal Component Analysis).
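The first definition is simple enough to implement directly. A minimal sketch in pandas (the helper name two_sigma_outliers is mine, not from the post):

import pandas as pd

def two_sigma_outliers(s: pd.Series) -> pd.Series:
    # Flag points further than two standard deviations from the mean
    z = (s - s.mean()) / s.std()
    return z.abs() > 2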

What anomaly detection methods are available?

For a detailed overview see Ahmad et al. (2017).

Some examples, all demonstrated below, are clustering-based methods such as K-means, isolation forests, one-class SVMs, and autoencoders.

What Python packages are available for anomaly detection?

PyOD and TODS are two examples; both are covered later in this post.

Overview of the blog post

In this blog post we will take a time series dataset from the Numenta Anomaly Benchmark (NAB) and detect outliers using a number of different methods.

We'll look at the machine_temperature_system_failure.csv example. While this dataset does have labels, I'm treating it as an unsupervised problem and fitting a model to the data.

Analysis

Download the data

$ git clone https://github.com/numenta/NAB.git

$ cd NAB

Plot the data

Load packages

import pandas as pd
import numpy as np
import json
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

Parse the data, label the anomalies and the windows associated with the anomalies:

df = pd.read_csv(
    "data/realKnownCause/machine_temperature_system_failure.csv", index_col="timestamp"
)

# Anomalous timestamps per dataset (renamed from 'dict', which shadows the built-in)
labels = json.load(open("labels/combined_labels.json"))
anoms = (
    pd.DataFrame.from_dict(labels, orient="index")
    .T["realKnownCause/machine_temperature_system_failure.csv"]
    .dropna()
).rename("anomalies")
anom_df = df[df.index.isin(anoms)]

# Anomaly windows: (start, stop) timestamp pairs around each anomaly
window_labels = json.load(open("labels/combined_windows.json"))
windows = pd.DataFrame(
    (
        pd.DataFrame.from_dict(window_labels, orient="index")
        .T["realKnownCause/machine_temperature_system_failure.csv"]
        .dropna()
    ).to_list(),
    columns=["start", "stop"],
)
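The window positions used below were searched by hand, but they could also be computed from the parsed windows. A sketch (assumes the timestamps are sorted, which they are in NAB):

# Positional indices of each window's start and stop in the series
idx = pd.to_datetime(df.index)
starts = idx.searchsorted(pd.to_datetime(windows['start']))
stops = idx.searchsorted(pd.to_datetime(windows['stop']))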

Manipulate the data (the window positions were searched by hand, as noted above) and plot:

_df = df.copy()
_df['ix'] = np.arange(len(_df))
_anom_df = _df[_df.index.isin(anoms)]

# Gray rectangles marking the anomaly windows (each is 566 points wide);
# Rectangle needs scalar coordinates, so take min/max of the 'value' column
ymin, ymax = df['value'].min(), df['value'].max()
rects = [Rectangle((start, ymin), 566, ymax - ymin, color='gray', alpha=0.5)
         for start in (2126, 3703, 16057, 19232)]

fig, ax = plt.subplots(figsize=(15, 10))
df.plot(ax=ax)
_anom_df.plot(kind='scatter', x='ix', y='value', c='red', s=100, ax=ax)
fig.autofmt_xdate()
for rect in rects:
    ax.add_patch(rect)

The blue line is the data; the red dots are the labelled anomalous points and the gray boxes are the anomalous windows.

Simple statistical models

K-means

Choose 10 clusters (from a prior elbow-curve analysis, sketched below) and compute the distance of each data point from its cluster centroid. Assume a small fraction of the data is anomalous (0.1%, i.e. outliers_fraction = 0.001) and flag the points that are furthest from their cluster centroid.
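The elbow curve itself isn't shown in the post; a minimal sketch of how it might be produced (the range of cluster counts is an assumption):

from sklearn.cluster import KMeans

# Plot inertia (within-cluster sum of squares) against the number of clusters
# and look for the 'elbow' where adding clusters stops helping much
inertias = [KMeans(n_clusters=k, random_state=42).fit(df).inertia_ for k in range(1, 21)]
plt.plot(range(1, 21), inertias, marker='o')
plt.xlabel('number of clusters')
plt.ylabel('inertia')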

from sklearn.cluster import KMeans

km = KMeans(n_clusters=10)
km.fit(df)

_df = df.copy().reset_index()
_df['cluster'] = km.labels_
# Labels are 0-based and index cluster_centers_ directly (no offset needed)
_df['centroid'] = km.cluster_centers_[_df['cluster']].ravel()
_df['distance_from_centroid'] = (_df['value'] - _df['centroid']).abs()

outliers_fraction = 0.001
anom_points = _df.nlargest(int(len(_df) * outliers_fraction), 'distance_from_centroid')

Plot the anomalies

_anom_df2 = anom_points.reset_index().set_index('timestamp')[['value', 'index']]
_df = df.copy()
_df['ix'] = np.arange(len(_df))
_anom_df = _df[_df.index.isin(anoms)]

# Matplotlib patches can't be reused across figures, so rebuild the window rectangles
rects = [Rectangle((start, ymin), 566, ymax - ymin, color='gray', alpha=0.5)
         for start in (2126, 3703, 16057, 19232)]

fig, ax = plt.subplots(figsize=(15, 10))
df.plot(ax=ax)
_anom_df.plot(kind='scatter', x='ix', y='value', c='red', s=100, ax=ax)
_anom_df2.plot(kind='scatter', x='index', y='value', c='green', s=100, ax=ax)
fig.autofmt_xdate()
for rect in rects:
    ax.add_patch(rect)

Green dots are the points detected as anomalies.

This method finds the two anomalies which have the largest deviations from the mean but doesn't find the two more subtle anomalies.

Isolation forest

An isolation forest finds anomalies defined as points lying outside an N-dimensional distribution. It works by recursively partitioning the data with random splits: anomalous points end up isolated after fewer splits than normal points, and the average path length is used as the anomaly score.

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

isofor = IsolationForest(contamination=outliers_fraction, random_state=42)
_df = df.copy()
_df['index'] = np.arange(len(_df))
_df['ypred'] = isofor.fit_predict(df_scaled)  # -1 marks an outlier
_anom_df2 = _df[_df['ypred'] == -1]

This method picks up only the most extreme anomalous points (the second anomaly).
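If you want more than the binary labels, IsolationForest also exposes a continuous anomaly score via decision_function (lower scores are more anomalous); a short sketch:

# Rank the ten most anomalous points by score (the 'score' column name is mine)
_df['score'] = isofor.decision_function(df_scaled)
print(_df.nsmallest(10, 'score')[['value', 'score']])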

One Class SVM

A one-class Support Vector Machine (SVM) is a way to detect outliers by identifying whether a data point is "unlike" the other data points (hence the one class). The method is based on SVMs, where the data are projected through a non-linear function into a higher-dimensional space, which makes it easier to separate the data points along a hyperplane.

from sklearn.svm import OneClassSVM

ocsvm = OneClassSVM(nu=outliers_fraction)
_df = df.copy()
_df['index'] = np.arange(len(_df))
_df['ypred'] = ocsvm.fit_predict(df_scaled)  # fit the one-class SVM, not the isolation forest
_anom_df2 = _df[_df['ypred'] == -1]

It's difficult to get a One-Class SVM to pick out only the extreme values, as this hands-on exercise shows. This is a known problem with One-Class SVMs (see the scikit-learn documentation).

Python Packages

PyOD

PyOD is a Python package developed by Yue Zhao which provides a number of different methods for outlier detection, including linear, proximity-based, and probabilistic models, ensembles, and neural networks. In addition, it offers ways to combine them. The example notebooks are also well fleshed out.
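As a taste of the combination utilities, a hedged sketch that averages the standardized scores of two KNN detectors (the choice of detectors and neighbour counts is mine, not from PyOD's docs):

import numpy as np
from pyod.models.knn import KNN
from pyod.models.combination import average
from pyod.utils.utility import standardizer

# Stack each detector's training scores as columns, standardize, then average
scores = np.column_stack(
    [KNN(n_neighbors=k).fit(df_scaled).decision_scores_ for k in (5, 10)]
)
combined = average(standardizer(scores))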

Here I'm going to use a novel method: the Copula-Based Outlier Detector (COPOD; Li et al., 2020).

from pyod.models.copod import COPOD

clf = COPOD(contamination=outliers_fraction)
_df = df.copy()
_df['index'] = np.arange(len(_df))
_df['ypred'] = clf.fit_predict(df_scaled)  # PyOD labels outliers with 1 rather than -1
_anom_df2 = _df[_df['ypred'] == 1]

This one behaves more like K-means than the isolation forest and the One-Class SVM.

A more complex method is a deep learning technique, the (variational) autoencoder:

from pyod.models.vae import VAE

clf = VAE(encoder_neurons=[1], decoder_neurons=[1], contamination=outliers_fraction)
_df = df.copy()
_df['index'] = np.arange(len(_df))
_df['ypred'] = clf.fit_predict(df_scaled)
_anom_df2 = _df[_df['ypred'] == 1]

The results are similar to the other methods. In this case there is only one 'feature', so the encoding doesn't do much, but it's an example of how easy the package is to use.

TODS

TODS is an automated Time Series Outlier Detection System. It offers complex algorithms but is not trivial to use; it's more of a code dump, although you can learn a lot by looking through the source code.

Further work

Other ways to expand on this blog include:

  • Treating it as a supervised problem and fitting a model to the anomalous points.

  • Treating it as streaming data and re-fitting and predicting at each time step (sketched below). This may be the best approach for capturing the subtle anomalies.
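A hedged sketch of the streaming idea, re-using the isolation forest from above (the window size is an assumption, and in practice you would re-fit far less often than every step):

# Fit on a trailing window and score only the newest point at each step
window = 1000
flags = []
for t in range(window, len(df_scaled)):
    model = IsolationForest(contamination=outliers_fraction, random_state=42)
    model.fit(df_scaled[t - window:t])
    flags.append(model.predict(df_scaled[t:t + 1])[0] == -1)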

Further reading

Anomaly detection is a huge field in itself, so there is no shortage of further reading:


Ahmad, S., Lavin, A., Purdy, S., & Agha, Z. (2017). Unsupervised real-time anomaly detection for streaming data. Neurocomputing, 262, 134-147. https://www.sciencedirect.com/science/article/pii/S0925231217309864

GANomaly: Semi-Supervised Anomaly Detection via Adversarial Training (https://arxiv.org/abs/1805.06725)

https://medium.com/pinterest-engineering/building-a-real-time-anomaly-detection-system-for-time-series-at-pinterest-a833e6856ddd

https://www.kaggle.com/victorambonati/unsupervised-anomaly-detection

https://towardsdatascience.com/anomaly-detection-with-isolation-forest-visualization-23cd75c281e2

https://towardsdatascience.com/anomaly-detection-with-lstm-in-keras-8d8d7e50ab1b

https://github.com/yzhao062/anomaly-detection-resources

https://awesomeopensource.com/projects/anomaly-detection

http://odds.cs.stonybrook.edu/#table1

https://github.com/numenta/NAB

https://towardsdatascience.com/lstm-autoencoder-for-extreme-rare-event-classification-in-keras-ce209a224cfb

https://towardsdatascience.com/time-series-of-price-anomaly-detection-13586cd5ff46 - https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Time%20Series%20of%20Price%20Anomaly%20Detection%20Expedia.ipynb

https://eng.uber.com/monitoring-data-quality-at-scale/


https://www.kaggle.com/boltzmannbrain/nab

https://github.com/shunsukeaihara/changefinder

https://pyod.readthedocs.io/en/latest/

https://www.kaggle.com/caesarlupum/starter-anomaly-strategy

https://github.com/zillow/luminaire

https://towardsdatascience.com/outlier-detection-with-hampel-filter-85ddf523c73d

https://tods-doc.github.io/getting_started.html

https://ff12.fastforwardlabs.com/

https://www.kaggle.com/tikedameu/anomaly-detection-with-autoencoder-pytorch

https://adtk.readthedocs.io/en/stable/