Fast Anomaly Detection for Streaming Data

Abstract

This paper introduces Streaming Half-Space-Trees (HS-Trees), a fast one-class anomaly detector for evolving data streams. It requires only normal data for training and works well when anomalous data are rare. The model features an ensemble of random HS-Trees, and the tree structure is constructed without any data. This makes the method highly efficient because it requires no model restructuring when adapting to evolving data streams. Our analysis shows that Streaming HS-Trees has constant amortised time complexity and constant memory requirement. When compared with a state-of-the-art method, our method performs favourably in terms of detection accuracy and runtime performance. Our experimental results also show that the detection performance of Streaming HS-Trees is not sensitive to its parameter settings.
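
To illustrate what "constructed without any data" can mean in practice, here is a minimal sketch of a single half-space tree. It is not the released implementation. The class name, the assumption that every feature is scaled to [0, 1], the midpoint split on a randomly chosen dimension, the `sizeLimit` parameter, and the score (node mass times 2 to the power of node depth) are all illustrative assumptions; the windowed streaming update and the ensemble averaging are omitted. See the paper for the actual algorithm.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical sketch of one half-space tree; NOT the released implementation. */
final class HalfSpaceTreeSketch {
    private final int maxDepth;
    private final int[] splitDim;    // split dimension of each node (heap layout)
    private final double[] splitVal; // split value of each node
    private final int[] mass;        // number of training points that reached each node

    HalfSpaceTreeSketch(int dims, int maxDepth, long seed) {
        this.maxDepth = maxDepth;
        int nodes = (1 << (maxDepth + 1)) - 1; // full binary tree
        splitDim = new int[nodes];
        splitVal = new double[nodes];
        mass = new int[nodes];
        double[] lo = new double[dims];        // assumed workspace: [0, 1] per feature
        double[] hi = new double[dims];
        Arrays.fill(hi, 1.0);
        build(0, 0, lo, hi, new Random(seed)); // note: no data is used here
    }

    /** Recursively halves the workspace along a randomly chosen dimension. */
    private void build(int node, int depth, double[] lo, double[] hi, Random rng) {
        if (depth == maxDepth) return;
        int d = rng.nextInt(lo.length);
        double mid = (lo[d] + hi[d]) / 2.0;
        splitDim[node] = d;
        splitVal[node] = mid;
        double oldHi = hi[d];
        hi[d] = mid;                            // left child covers [lo, mid)
        build(2 * node + 1, depth + 1, lo, hi, rng);
        hi[d] = oldHi;
        double oldLo = lo[d];
        lo[d] = mid;                            // right child covers [mid, hi]
        build(2 * node + 2, depth + 1, lo, hi, rng);
        lo[d] = oldLo;
    }

    private int child(int node, double[] x) {
        return x[splitDim[node]] < splitVal[node] ? 2 * node + 1 : 2 * node + 2;
    }

    /** Records one (normal) training point: only counters change, never the structure. */
    void update(double[] x) {
        int node = 0;
        for (int depth = 0; ; depth++) {
            mass[node]++;
            if (depth == maxDepth) return;
            node = child(node, x);
        }
    }

    /** Lower scores suggest anomalies: the point falls in a sparsely populated region. */
    double score(double[] x, int sizeLimit) {
        int node = 0;
        int depth = 0;
        while (depth < maxDepth && mass[node] > sizeLimit) {
            node = child(node, x);
            depth++;
        }
        return mass[node] * Math.pow(2, depth);
    }

    public static void main(String[] args) {
        HalfSpaceTreeSketch tree = new HalfSpaceTreeSketch(2, 4, 42L);
        for (int i = 0; i < 100; i++) tree.update(new double[] {0.2, 0.2});
        System.out.println("near training data: " + tree.score(new double[] {0.2, 0.2}, 10));
        System.out.println("far from it:        " + tree.score(new double[] {0.9, 0.9}, 10));
    }
}
```

Because learning only increments counters in a fixed structure, adapting to new data needs no restructuring, which is the property the abstract highlights. An ensemble would average the scores of several such trees built with different seeds.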

Published In:

IJCAI'11: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Access the paper here:

http://dl.acm.org/citation.cfm?id=2283647

or http://ijcai.org/papers11/Papers/IJCAI11-254.pdf

Program for running the experiments

Program file name: fastanomaly_bytecode_released_at_google_site.zip

(See the file attached at the end of this page.)

Disclaimer: This program is provided for research purposes only, solely so that my experimental results can be replicated on a different computer. It is not designed for any commercial use. Use it at your own risk.

Datasets used

If you intend to make a head-to-head comparison with my method, you should use the datasets attached here, as they reproduce the exact sequence of instances read in by the programs.

Some additional information about the datasets:

The Covertype (aka Forest Covertype) dataset is available at the UCI repository. For more details see:

http://archive.ics.uci.edu/ml/datasets

In the attached dataset file, instances of the minority class have been moved to different parts of the stream. So it is

The Mulcross dataset was generated by a synthetic data generator. For more details see:

David M. Rocke and David L. Woodruff. Identification of outliers in multivariate data. Journal of the American Statistical Association, 91(435):1047–1061, 1996.

Other datasets and related references are as follows:

http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data.

http://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle).

Using the source code

Please find the Java source code for the outlier detection program developed for the IJCAI paper.

Just unzip the file "LevelJForest_outlierDetect_Ver4.8.1H_OneClassLogMass_Opt0.8_sizeLmt25.zip" and use NetBeans IDE 7.0.1 to load the project “Stochastic”.

The code assumes the following paths:

String dataPath = "G:\\myMonashComputerBackup\\C_drive\\My Documents\\RPC_ExternalData\\";

String resultPath = "G:\\myMonashComputerBackup\\D_drive\\iForest\\HSTreeLevelresults\\";

If your data and results are stored elsewhere, you can modify these paths, but you need to do so carefully.
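
One way to avoid editing the hard-coded strings is to read the paths from JVM system properties and fall back to the defaults above. This is only a sketch: the class `ExperimentPaths` and the property names `hstrees.dataPath` and `hstrees.resultPath` are hypothetical and are not part of the released code. Both default paths end with a separator, which suggests that file names are appended directly to them; the helper therefore appends a trailing separator when one is missing, but that is an assumption about how the code uses these strings.

```java
import java.io.File;

/** Hypothetical helper; NOT part of the released source code. */
final class ExperimentPaths {
    // Defaults copied from the paths assumed by the released code.
    static final String DEFAULT_DATA_PATH =
        "G:\\myMonashComputerBackup\\C_drive\\My Documents\\RPC_ExternalData\\";
    static final String DEFAULT_RESULT_PATH =
        "G:\\myMonashComputerBackup\\D_drive\\iForest\\HSTreeLevelresults\\";

    /**
     * Returns the value of the given system property if it is set and non-empty,
     * otherwise the default. The result always ends with a path separator.
     */
    static String resolve(String propertyName, String defaultPath) {
        String value = System.getProperty(propertyName);
        String path = (value == null || value.isEmpty()) ? defaultPath : value;
        boolean hasSeparator = path.endsWith("/") || path.endsWith("\\");
        return hasSeparator ? path : path + File.separator;
    }

    public static void main(String[] args) {
        // Property names are assumptions chosen for this example.
        String dataPath = resolve("hstrees.dataPath", DEFAULT_DATA_PATH);
        String resultPath = resolve("hstrees.resultPath", DEFAULT_RESULT_PATH);
        System.out.println("dataPath   = " + dataPath);
        System.out.println("resultPath = " + resultPath);
    }
}
```

With such a helper, a different location could be supplied at launch with the standard `-D` option of the `java` command, for example `-Dhstrees.dataPath=/home/me/data`, without recompiling.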

You can test run the program in two ways:

(1) Use a batch file. See path: LevelJForest_outlierDetect_Ver4.8.1H_OneClassLogMass_Opt0.8_sizeLmt25\Stochastic\build\classes

You will find many batch files there used for various experiments. For example, try this one: test_4002_03_04_09_10_40_ensemble_release_version.bat

(2) Define the arguments in the Project Property page.
