Datasets

This website provides a set of public benchmark datasets for evaluating learning algorithms in nonstationary environments. In particular, it provides datasets with incremental and gradual concept drifts. These datasets are well suited to evaluating stream algorithms that do not require actual labels during the online classification phase, a condition known as extreme verification latency.
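To make the extreme verification latency setting concrete, the following is a minimal sketch of one way to evaluate under it: a model is trained once on a small initially labeled portion of the stream, and every later example is classified without the model ever seeing a label. The stream generator, the function names, and all parameter values (`n_initial`, `window`, `shift_per_step`) are hypothetical and chosen only for illustration. The data is generated on the fly rather than read from the benchmark files, and the classifier is a static nearest-centroid baseline, not SCARGC.

```python
import random

def make_drifting_stream(n=4000, shift_per_step=0.001, seed=0):
    """Hypothetical 2-class, 2-feature stream whose class centers move over time."""
    rng = random.Random(seed)
    stream = []
    for t in range(n):
        label = t % 2
        offset = shift_per_step * t  # incremental drift: both classes move right
        cx = offset + (0.0 if label == 0 else 3.0)
        point = (rng.gauss(cx, 0.5), rng.gauss(0.0, 0.5))
        stream.append((point, label))
    return stream

def centroids(examples):
    """Mean feature vector per class."""
    sums, counts = {}, {}
    for (x, y), label in examples:
        sx, sy = sums.get(label, (0.0, 0.0))
        sums[label] = (sx + x, sy + y)
        counts[label] = counts.get(label, 0) + 1
    return {c: (sums[c][0] / counts[c], sums[c][1] / counts[c]) for c in sums}

def predict(model, point):
    """Return the class whose centroid is closest to the point."""
    return min(
        model,
        key=lambda c: (point[0] - model[c][0]) ** 2 + (point[1] - model[c][1]) ** 2,
    )

def evaluate_static(stream, n_initial, window):
    """Train on the first n_initial labeled examples, then classify the rest
    without updating the model. True labels are used only for scoring."""
    model = centroids(stream[:n_initial])
    rest = stream[n_initial:]
    accuracies = []
    for start in range(0, len(rest), window):
        chunk = rest[start:start + window]
        hits = sum(predict(model, point) == label for point, label in chunk)
        accuracies.append(hits / len(chunk))
    return accuracies

stream = make_drifting_stream()
acc = evaluate_static(stream, n_initial=200, window=400)
for i, a in enumerate(acc):
    print(f"window {i}: accuracy {a:.3f}")
```

Under these assumptions, the per-window accuracy should start high and fall toward chance as the classes drift away from the initial centroids. That degradation is what algorithms designed for this setting try to avoid.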

We hope that this benchmark will encourage other researchers to share their data, code, and detailed results, and thereby improve reproducibility in the area.

For a better understanding of the properties of each dataset, see its animated visualization.
Dataset           Classes  Features  Examples  Drift every (# examples)  Kind       Results*
1CDT              2        2         16,000    400                       Synthetic  view
2CDT              2        2         16,000    400                       Synthetic  view
1CHT              2        2         16,000    400                       Synthetic  view
2CHT              2        2         16,000    400                       Synthetic  view
4CR               4        2         144,400   400                       Synthetic  view
4CRE-V1           4        2         125,000   1,000                     Synthetic  view
4CRE-V2           4        2         183,000   1,000                     Synthetic  view
5CVT              5        2         40,000    1,000                     Synthetic  view
1CSurr            2        2         55,283    600                       Synthetic  view
4CE1CF            5        2         173,250   750                       Synthetic  view
UG_2C_2D [1]      2        2         100,000   1,000                     Synthetic  view
MG_2C_2D [1]      2        2         200,000   2,000                     Synthetic  view
FG_2C_2D [2]      2        2         200,000   2,000                     Synthetic  view
UG_2C_3D [1]      2        3         200,000   2,000                     Synthetic  view
UG_2C_5D [1]      2        5         200,000   2,000                     Synthetic  view
GEARS_2C_2D [1]   2        2         200,000   2,000                     Synthetic  view
Keystroke [3]     4        10        1,600     200                       Real       view

* For a better understanding of the results, read the paper on our algorithm, SCARGC.
Download (all datasets ~15MB)

Stream Classification Algorithm Guided by Clustering - SCARGC


How to cite this benchmark?
Souza, V.M.A.; Silva, D.F.; Gama, J.; Batista, G.E.A.P.A.: Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency. SIAM International Conference on Data Mining (SDM), pp. 873-881, 2015.

@inproceedings{souzaSDM:2015,
  title={Data Stream Classification Guided by Clustering on Nonstationary Environments and Extreme Verification Latency},
  author={Souza, V. M. A. and Silva, D. F. and Gama, J. and Batista, G. E. A. P. A.},
  booktitle={Proceedings of SIAM International Conference on Data Mining (SDM)},
  pages={873--881},
  year={2015}
}


Dataset Donors

[1] - These datasets were kindly provided by the authors of the following paper: Dyer, K.B., Capo, R., Polikar, R.: COMPOSE: A Semisupervised Learning Framework for Initially Labeled Nonstationary Streaming Data. IEEE Transactions on Neural Networks and Learning Systems, Vol. 25, No. 1, pp. 12-26, 2014.

[2] - Ditzler, G., Polikar, R.: Incremental learning of concept drift from streaming imbalanced data. IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 10, pp. 2283-2301, 2013.

[3] - Dataset based on the CMU dataset first presented by the authors of the following paper: Killourhy, K., Maxion, R.: Why did my detector do that?! In Recent Advances in Intelligent Data Analysis X, pp. 222-233, 2011.


We are open to new data! Please send contributions by email (the address is given in the paper).