Anomaly Detection

12/19/20

Intro to anomaly detection methods

What is anomaly detection?

Anomaly detection is the study of rare events [1]. Detecting and predicting rare events is extremely important for applications such as fraud detection and assessing the operating condition of machinery.

What industries care about anomaly detection?

Anomaly detection turns up in some shape or form in almost every field, from finance to engineering.

How do you detect anomalies?

There are a number of different methodologies you can use to detect outliers (also known as anomalies). In essence, an outlier is a data point that is 'different' from the other data points in the time series, where 'different' can be defined as, for example:

Data points that are further than two standard deviations from the mean.
Data points that lie outside the normal variance (e.g. using Principal Component Analysis).
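As a minimal sketch of the first definition, the snippet below flags points more than two standard deviations from the mean. The synthetic data, the injected outliers and the variable names are assumptions made for illustration; they are not taken from the dataset used later in this post.

```python
import numpy as np

# Synthetic readings (assumed for illustration): mean 50, standard deviation 2
rng = np.random.default_rng(0)
values = rng.normal(loc=50.0, scale=2.0, size=1000)
values[[100, 500]] = [80.0, 15.0]  # inject two obvious outliers

# z-score: distance from the mean in units of standard deviation
z = (values - values.mean()) / values.std()
outlier_ix = np.flatnonzero(np.abs(z) > 2)
```

Note that roughly 5% of normally distributed points fall outside two standard deviations, so this rule also flags many ordinary points; a stricter threshold (e.g. three) is a common alternative.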

What anomaly detection methods are available?

For a detailed overview see Ahman et al. (2017).

Some examples are:

Isolation forest - https://en.wikipedia.org/wiki/Isolation_forest
Long Short Term Memory networks

What Python packages are available for anomaly detection?

scikit-learn (Robust covariance, One-Class SVM, Isolation Forest, Local Outlier Factor)
TensorFlow and PyTorch (LSTM)
stumpy (discord discovery)
pyod (outlier detection)

Overview of the blog post

In this blog post we will take a time series dataset from the Numenta Anomaly Benchmark (NAB) and detect outliers using a number of different methods.

We'll look at the machine_temperature_system_failure.csv example. While this dataset does have labels, I'm treating it as an unsupervised problem and fitting a model to the data.

Analysis

Download the data

$ git clone https://github.com/numenta/NAB.git
$ cd NAB

Plot the data

Load packages

import json

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib.patches import Rectangle

Parse the data, label the anomalies and the windows associated with the anomalies:

df = pd.read_csv(
    "data/realKnownCause/machine_temperature_system_failure.csv", index_col="timestamp"
)

key = "realKnownCause/machine_temperature_system_failure.csv"

# Labelled anomaly timestamps for this file
with open("labels/combined_labels.json") as f:
    labels = json.load(f)
anoms = pd.Series(labels[key], name="anomalies")
anom_df = df[df.index.isin(anoms)]

# Labelled anomaly windows (start, stop) for this file
with open("labels/combined_windows.json") as f:
    label_windows = json.load(f)
windows = pd.DataFrame(label_windows[key], columns=["start", "stop"])

Manipulate the data (the window start positions were found by hand) and plot:

_df = df.copy()
_df['ix'] = np.arange(0, len(_df))
_anom_df = _df[_df.index.isin(anoms)]

# Each window starts at a hand-picked row position and spans 566 rows
y_min, y_max = df['value'].min(), df['value'].max()
window_starts = [2126, 3703, 16057, 19232]
rects = [
    Rectangle((start, y_min), 566, y_max - y_min, color='gray', alpha=0.5)
    for start in window_starts
]

fig, ax = plt.subplots(figsize=(15, 10))
df.plot(ax=ax)
_anom_df.plot(kind='scatter', x='ix', y='value', c='red', s=100, ax=ax)
fig.autofmt_xdate()
for rect in rects:
    ax.add_patch(rect)

The blue line is the data, the red dots are the anomalous points and the gray boxes are the anomalous windows.
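The window positions above were looked up by hand. One way to derive them instead is to parse the timestamps and search for each window boundary, as in the sketch below. The small stand-in frame and the single window are made up for illustration; only the string formats are meant to resemble the NAB files.

```python
import pandas as pd

# Stand-in for the NAB frame (assumed): 5-minute readings indexed by timestamp strings
index = pd.date_range("2013-12-02 21:15:00", periods=1000, freq="5min")
df = pd.DataFrame({"value": range(1000)}, index=index.strftime("%Y-%m-%d %H:%M:%S"))

# Hypothetical window, written in the same style as combined_windows.json
windows = pd.DataFrame(
    [["2013-12-03 00:00:00.000000", "2013-12-03 02:00:00.000000"]],
    columns=["start", "stop"],
)

# Parse the index once, then locate each boundary by binary search
timestamps = pd.to_datetime(df.index)
starts = [int(timestamps.searchsorted(pd.Timestamp(s))) for s in windows["start"]]
stops = [int(timestamps.searchsorted(pd.Timestamp(s))) for s in windows["stop"]]

spans = pd.DataFrame(
    {"start_ix": starts, "width": [b - a for a, b in zip(starts, stops)]}
)
```

Each row of `spans` then gives the x position and width to pass to a `Rectangle`.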

Simple statistical models

K-means

Choose 10 clusters (based on a prior elbow curve) and get the distance of each data point from its cluster centroid. Assume a small fraction of the data is anomalous (0.1 %, i.e. a fraction of 0.001) and find the points that are furthest from their cluster centroid.

from sklearn.cluster import KMeans

km = KMeans(n_clusters=10)
km.fit(df)

_df = df.copy().reset_index()
_df['cluster'] = km.labels_
# labels_ are zero-based, so they index cluster_centers_ directly
_df['centroid'] = km.cluster_centers_[km.labels_, 0]
_df['distance_from_centroid'] = (_df['value'] - _df['centroid']).abs()

outliers_fraction = 0.001
anom_points = _df.nlargest(int(len(_df) * outliers_fraction), 'distance_from_centroid')

Plot the anomalies

_anom_df2 = anom_points.reset_index().set_index('timestamp')[['value', 'index']]

_df = df.copy()
_df['ix'] = np.arange(0, len(_df))
_anom_df = _df[_df.index.isin(anoms)]

# Each window starts at a hand-picked row position and spans 566 rows
y_min, y_max = df['value'].min(), df['value'].max()
window_starts = [2126, 3703, 16057, 19232]
rects = [
    Rectangle((start, y_min), 566, y_max - y_min, color='gray', alpha=0.5)
    for start in window_starts
]

fig, ax = plt.subplots(figsize=(15, 10))
df.plot(ax=ax)
_anom_df.plot(kind='scatter', x='ix', y='value', c='red', s=100, ax=ax)
_anom_df2.plot(kind='scatter', x='index', y='value', c='green', s=100, ax=ax)
fig.autofmt_xdate()
for rect in rects:
    ax.add_patch(rect)

The green dots are the points flagged as anomalous by k-means.

This method finds the two anomalies with the largest deviations from the mean, but it doesn't find the two more subtle anomalies.
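The choice of 10 clusters above came from an elbow curve. A minimal sketch of how such a curve can be produced is shown below: fit k-means for a range of cluster counts and record the inertia (the within-cluster sum of squared distances). The synthetic three-group data and the range of k are assumptions for illustration, not the settings used for the machine temperature data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 1-D data (assumed): three well-separated groups
rng = np.random.default_rng(0)
X = np.concatenate(
    [rng.normal(center, 0.5, 200) for center in (0.0, 10.0, 20.0)]
).reshape(-1, 1)

# Inertia for each candidate number of clusters
inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
```

Plotting `inertias` against k and looking for the point where the curve flattens (the "elbow") suggests a reasonable number of clusters; for this synthetic data the drop levels off at k = 3.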

Isolation forest

An isolation forest flags as anomalies the points that lie outside of an N-dimensional distribution.

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

isofor = IsolationForest(contamination=outliers_fraction, random_state=42)

_df = df.copy()
_df['index'] = np.arange(0, len(_df))
_df['ypred'] = isofor.fit_predict(df_scaled)
_anom_df2 = _df[_df['ypred'] == -1]

This method picks up the most extreme anomalous points (the second anomaly).
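The intuition behind an isolation forest is that an anomalous point can be separated from the rest of the data with fewer random splits than a typical point. The sketch below illustrates that idea on one-dimensional synthetic data. It is a simplified illustration and not scikit-learn's implementation; the helper names `isolation_depth` and `mean_depth` are made up for this example.

```python
import numpy as np

def isolation_depth(values, target, rng, max_depth=50):
    """Count the random splits needed until `target` is alone in its partition."""
    current = values
    depth = 0
    while len(current) > 1 and depth < max_depth:
        lo, hi = current.min(), current.max()
        if lo == hi:  # only duplicates left, cannot split further
            break
        split = rng.uniform(lo, hi)
        # keep the side of the split that contains the target
        current = current[current < split] if target < split else current[current >= split]
        depth += 1
    return depth

def mean_depth(values, target, rng, n_trees=200):
    """Average isolation depth over many random split sequences."""
    return float(np.mean([isolation_depth(values, target, rng) for _ in range(n_trees)]))

rng = np.random.default_rng(0)
values = np.append(rng.normal(0.0, 1.0, 500), 10.0)  # the last point is an outlier
typical = np.sort(values)[len(values) // 2]          # a point in the middle of the data

outlier_depth = mean_depth(values, values[-1], rng)
typical_depth = mean_depth(values, typical, rng)
```

The outlier is isolated after far fewer splits on average than the typical point, which is what an isolation forest turns into an anomaly score.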

One Class SVM

A one-class Support Vector Machine (SVM) detects outliers by identifying whether a data point is "unlike" the other data points (hence the one class). The method is based on SVMs, where the data is projected through a non-linear function into a higher-dimensional space, which makes it easier to separate data points with a hyperplane.

from sklearn.svm import OneClassSVM

ocsvm = OneClassSVM(nu=outliers_fraction)

_df = df.copy()
_df['index'] = np.arange(0, len(_df))
# fit and predict with the one-class SVM (not the isolation forest fitted earlier)
_df['ypred'] = ocsvm.fit_predict(df_scaled)
_anom_df2 = _df[_df['ypred'] == -1]

It's difficult to pick up extreme values with a One Class SVM, as this exercise shows. This is a known problem with One Class SVM (see the scikit-learn documentation).
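For comparison, a one-class SVM is often used in a novelty-detection style: fit on data assumed to be mostly normal, then predict on new points. The sketch below shows that usage on synthetic data; it differs from the `fit_predict` call above, and the data and the `nu=0.05` setting are assumptions for illustration. In scikit-learn, `nu` is an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic training data (assumed): one feature, standard normal
rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 1))

# Two new points: one close to the training data, one far away from it
new_points = np.array([[0.1], [10.0]])

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(train)

train_pred = ocsvm.predict(train)    # +1 = inlier, -1 = outlier
new_pred = ocsvm.predict(new_points)
flagged_fraction = (train_pred == -1).mean()
```

The far-away point is labelled -1, and `flagged_fraction` shows how much of the training data ends up outside the learned boundary for the chosen `nu`.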

Python Packages

PyOD

PyOD is a Python package developed by Yue Zhao which provides a number of different methods for outlier detection, including linear, proximity-based, probabilistic, ensemble and neural network models. In addition, it offers ways to combine them. The example notebooks are also well fleshed out.

Here I'm going to use a novel method: the Copula-Based Outlier Detector, COPOD (Li et al., 2020).

from pyod.models.copod import COPOD

clf = COPOD(contamination=outliers_fraction)

_df['index'] = np.arange(0, len(_df))
_df['ypred'] = clf.fit_predict(df_scaled)
_anom_df2 = _df[_df['ypred'] == 1]

This one behaves more like k-means than the isolation forest and the One Class SVM.

A more complex method is the deep learning technique of auto-encoding, here with a variational autoencoder (VAE):
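To give a feel for why a copula-based detector favours extreme values when there is only one feature, the sketch below scores each point by how far it sits in the tails of the empirical distribution. This is a heavily simplified, one-dimensional illustration of the tail-probability idea and not PyOD's COPOD implementation; the function name `tail_scores` and the synthetic data are assumptions.

```python
import numpy as np

def tail_scores(x):
    """Score each point by -log of its smaller empirical tail probability."""
    n = len(x)
    ranks = np.argsort(np.argsort(x)) + 1  # rank 1 = smallest value, n = largest
    left_tail = ranks / n                  # share of points <= x_i
    right_tail = (n - ranks + 1) / n       # share of points >= x_i
    return -np.log(np.minimum(left_tail, right_tail))

rng = np.random.default_rng(0)
x = np.append(rng.normal(0.0, 1.0, 200), 8.0)  # the last point is an extreme value

scores = tail_scores(x)
```

Because the score depends only on ranks, the largest and smallest values always receive the highest scores, which is consistent with the detector picking out the most extreme readings.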

from pyod.models.vae import VAE

clf = VAE(encoder_neurons=[1], decoder_neurons=[1], contamination=outliers_fraction)

_df['index'] = np.arange(0, len(_df))
_df['ypred'] = clf.fit_predict(df_scaled)
_anom_df2 = _df[_df['ypred'] == 1]

The results are similar to the other methods. In this case there is only one 'feature', so the encoding doesn't do much, but it's an example of how easy the package is to use.

TODS

TODS is an automated Time Series Outlier Detection System. It offers complex algorithms and is not trivial to use. It's more of a code dump, but you learn a lot by looking through the source code.

Further work

Other ways to expand on this blog include:

Treating it as a supervised problem and fitting a model to the anomalous points.
Treating it as streaming data and continually fitting and predicting at each time step. This may be the best approach to capture the subtle anomalies.
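As a rough sketch of the streaming idea, the snippet below scores each new reading against a trailing window of recent readings and flags it when it deviates by more than a chosen number of standard deviations. The window size, the threshold, the function name `streaming_zscore` and the synthetic signal are all assumptions for illustration; a real streaming detector would need tuning for the data at hand.

```python
import math
from collections import deque

def streaming_zscore(stream, window=50, threshold=3.0):
    """Flag readings that deviate strongly from the trailing window."""
    history = deque(maxlen=window)
    flags = []
    for x in stream:
        if len(history) == window:
            mean = sum(history) / window
            std = math.sqrt(sum((h - mean) ** 2 for h in history) / window)
            flags.append(std > 0 and abs(x - mean) > threshold * std)
        else:
            flags.append(False)  # not enough history yet
        history.append(x)
    return flags

# Synthetic signal (assumed): a slow oscillation with one injected spike
values = [math.sin(i / 5) for i in range(200)]
values[150] += 5.0

flags = streaming_zscore(values)
flagged_ix = [i for i, flag in enumerate(flags) if flag]
```

Because each reading is only compared with recent history, this kind of detector can react to local changes that a single global model fitted to the whole series would miss.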
