Abstract
This paper introduces Streaming Half-Space-Trees
(HS-Trees), a fast one-class anomaly detector for
evolving data streams. It requires only normal data
for training and works well when anomalous data
are rare. The model features an ensemble of random
HS-Trees, and the tree structure is constructed
without any data. This makes the method highly
efficient because it requires no model restructuring
when adapting to evolving data streams. Our analysis
shows that Streaming HS-Trees has constant
amortised time complexity and constant memory
requirement. When compared with a state-of-the-art
method, our method performs favourably in
terms of detection accuracy and runtime performance.
Our experimental results also show that
the detection performance of Streaming HS-Trees
is not sensitive to its parameter settings.
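To make the idea of trees "constructed without any data" more concrete, here is a minimal Java sketch of an HS-Tree ensemble. It is not the released program. The class and method names, the rule for choosing a random work range (inputs assumed scaled to [0, 1]), and the scoring rule (reference mass times 2^depth, summed over trees, with lower values suggesting anomalies) are all assumptions made for illustration; consult the paper for the exact algorithm and defaults.

```java
import java.util.Random;

/** Illustrative Half-Space-Tree ensemble; a sketch under stated assumptions, not the released program. */
final class HSTreeSketch {
    private static final class Node {
        int splitDim;        // attribute tested at this node (internal nodes only)
        double splitValue;   // midpoint of the current work range on splitDim
        int depth;
        int refMass;         // mass recorded during the reference window
        int latestMass;      // mass being recorded during the latest window
        Node left, right;    // both null for leaves
    }

    private final Node[] roots;
    private final int sizeLimit;

    HSTreeSketch(int numTrees, int maxDepth, int sizeLimit, int dims, long seed) {
        this.sizeLimit = sizeLimit;
        Random rnd = new Random(seed);
        roots = new Node[numTrees];
        for (int t = 0; t < numTrees; t++) {
            double[] min = new double[dims];
            double[] max = new double[dims];
            for (int q = 0; q < dims; q++) {
                // Assumed rule: a random work range that always covers [0, 1].
                double s = rnd.nextDouble();
                double half = 2 * Math.max(s, 1 - s);
                min[q] = s - half;
                max[q] = s + half;
            }
            roots[t] = build(min, max, 0, maxDepth, rnd);
        }
    }

    // The structure depends only on random choices, never on the data.
    private static Node build(double[] min, double[] max, int depth, int maxDepth, Random rnd) {
        Node n = new Node();
        n.depth = depth;
        if (depth == maxDepth) return n;
        int q = rnd.nextInt(min.length);
        double lo = min[q], hi = max[q], mid = (lo + hi) / 2;
        n.splitDim = q;
        n.splitValue = mid;
        max[q] = mid;                       // left child covers [lo, mid)
        n.left = build(min, max, depth + 1, maxDepth, rnd);
        max[q] = hi;
        min[q] = mid;                       // right child covers [mid, hi)
        n.right = build(min, max, depth + 1, maxDepth, rnd);
        min[q] = lo;
        return n;
    }

    private static Node next(Node n, double[] x) {
        if (n.left == null) return null;
        return x[n.splitDim] < n.splitValue ? n.left : n.right;
    }

    /** Adds x to the latest-window mass of every node on its path in every tree. */
    void update(double[] x) {
        for (Node n : roots) {
            while (n != null) {
                n.latestMass++;
                n = next(n, x);
            }
        }
    }

    /** Window boundary: the latest mass becomes the reference mass; no restructuring is needed. */
    void endWindow() {
        for (Node root : roots) roll(root);
    }

    private static void roll(Node n) {
        if (n == null) return;
        n.refMass = n.latestMass;
        n.latestMass = 0;
        roll(n.left);
        roll(n.right);
    }

    /** Sum over trees of refMass * 2^depth at the node where traversal stops. */
    double score(double[] x) {
        double total = 0;
        for (Node n : roots) {
            while (n.left != null && n.refMass > sizeLimit) n = next(n, x);
            total += n.refMass * Math.pow(2, n.depth);
        }
        return total;
    }
}
```

A caller would feed one window of normal instances through `update`, call `endWindow`, and then alternate `score` and `update` on each new instance, calling `endWindow` again whenever a window fills. Because adapting to the stream only copies and resets counters, the tree structure itself is never rebuilt.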
Published In:
IJCAI'11 Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence,
Volume Two, Pages 1511-1516 (AAAI Press © 2011)
Access the paper here:
http://dl.acm.org/citation.cfm?id=2283647
or http://ijcai.org/papers11/Papers/IJCAI11-254.pdf
Program for running the experiments
Program file name: fastanomaly_bytecode_released_at_google_site.zip
(see the file attached at the end of this page.)
Disclaimer: This program is provided for research purposes only, solely so that my experimental results can be replicated on a different computer. It is not designed for any commercial use. Use it at your own risk.
Datasets used
If you intend to make a head-to-head comparison with my method, you should use the datasets attached here, as they reproduce the exact sequence of instances read in by the programs.
Some additional information about the datasets:
The Covertype (aka Forest Covertype) dataset is available from the UCI repository. For more details see:
http://archive.ics.uci.edu/ml/datasets
In the attached dataset file, instances of the minority class have been moved to different parts of the stream. So it is
The Mulcross dataset was generated by a synthetic data generator. For more details see:
David M. Rocke and David L. Woodruff. Identification of outliers in multivariate data.
Journal of the American Statistical Association,
91(435):1047–1061, 1996.
Other datasets and related references are as follows:
http://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data
http://archive.ics.uci.edu/ml/datasets/Statlog+(Shuttle)
Using the source code
Please find the Java source code for the outlier detection program developed for the IJCAI paper.
Unzip the file "LevelJForest_outlierDetect_Ver4.8.1H_OneClassLogMass_Opt0.8_sizeLmt25.zip" and use NetBeans IDE 7.0.1 to load the project "Stochastic".
The code assumes the following paths:
String dataPath = "G:\\myMonashComputerBackup\\C_drive\\My Documents\\RPC_ExternalData\\";
String resultPath = "G:\\myMonashComputerBackup\\D_drive\\iForest\\HSTreeLevelresults\\";
If your directories differ, edit these two strings, but do so carefully: in a Java string literal each backslash must be written as \\, and both defaults end with a path separator.
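One possible way to avoid editing hard-coded paths is to read them from JVM system properties and fall back to a default. The sketch below is only an illustration: the class `PathConfigSketch`, the method `resolveDir`, and the property names are hypothetical and do not appear in the released code. It also assumes that file names are appended directly to the path, which is why it keeps a trailing separator.

```java
import java.io.File;

/** Hypothetical helper; not part of the released source code. */
final class PathConfigSketch {
    /**
     * Returns the directory given by the JVM system property, or the fallback
     * if the property is not set, always ending with a path separator.
     */
    static String resolveDir(String propertyName, String fallback) {
        String dir = System.getProperty(propertyName, fallback);
        // Assumption: callers concatenate file names onto this string.
        if (!dir.endsWith("\\") && !dir.endsWith("/")) {
            dir += File.separator;
        }
        return dir;
    }
}
```

With such a helper, the assignment could read `String dataPath = PathConfigSketch.resolveDir("hstree.dataPath", "G:\\myMonashComputerBackup\\C_drive\\My Documents\\RPC_ExternalData\\");`, and the directory could be overridden at launch with `-Dhstree.dataPath=...` (property name hypothetical).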
You can test run the program in two ways:
(1) Use a batch file. See path: LevelJForest_outlierDetect_Ver4.8.1H_OneClassLogMass_Opt0.8_sizeLmt25\Stochastic\build\classes
You will find many batch files there used for various experiments. For example, try this one: test_4002_03_04_09_10_40_ensemble_release_version.bat
(2) Define the program arguments in the Project Properties page.