Paper Written by Shen-Shyang Ho, and Harry Wechsler
"Automatic anomaly detection is critical in today's world where the sheer volume of data makes it impossible to tag outliers manually."
The goal of this project is to implement the martingale algorithm into a dataset which has already been pre-processed for anomalies, and then compare the martingale detection algorithms performance with Yahoos anomaly flags.
The papers for the project propose a method to detect changes in a time-series data (data streams) by utilizing a martingale framework. A martingale is a model of fair game, where knowledge of the past events never help predict future outcomes. In particular, a fair game always, will have an expected value of 0, as the number of rounds approaches infinity.
The exchangeability method refers to a sequence of random variables that are independent and identically distributed that even when a finite permutation is applied, the joint probability distribution of the permuted sequence is the same as the original permuted sequence.
Applying the theory above change detection is sought out by violating the exchangeability method. A hypothesis test is initially set up:
H0: “no change in the data stream” against the alternative H1: “a change occurs in the data stream.”
The martingale value then is calculated for each new point in the data stream and is compared with the threshold value. If the martingale value is determined to be above the threshold then the H0 hypothesis is rejected. When a change is detected, a record of the detection will be made, and the data stream will be reset to 0, starting a new loop of the current point in the data stream as to not affect the values with the anomaly.
Calculating the martingale value is obtained by initially finding the p-value of each data point by using the strangeness measure. The strangeness measure is found through multiple possible methods from which the paper includes: kNN, SVM, Clustering, or Regression. Each one of these methods can be used to identify how much the data point varies from the previous data from the past.
Ideally when a time series data has no anomalies, the p-values calculated follow a uniformly distributed pattern on [0,1].
By Ramez Tarazi
CS109A - Harvard Project