Datasets

This page lists the real datasets used in our papers to test the efficiency of the algorithms. These datasets do not necessarily contain meaningful clusters, but a good algorithm should be fast in any case: cluster analysis in practice often requires computing clusters under a wide range of parameter combinations, which calls for an algorithm that is efficient under all settings. For synthetic data, please refer to the manual page for how to generate "seed spreading" data.


PAMAP2

A 4-dimensional dataset with cardinality 3,850,505, obtained by taking the first 4 principal components after running PCA on the PAMAP2 database [Reiss and Stricker 2012] from the UCI machine learning archive [Bache and Lichman 2013].
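
The reduction described above can be sketched as follows. This is only a minimal illustration of projecting points onto their first 4 principal components, assuming NumPy is available. The function name project_onto_top_components and the random toy input are hypothetical, and the sketch makes no claim about how the original PAMAP2 attributes were selected, cleaned, or scaled before PCA.

```python
import numpy as np


def project_onto_top_components(data, k=4):
    """Project the rows of `data` onto their first k principal components."""
    data = np.asarray(data, dtype=float)
    # PCA works on mean-centered data.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing
    # singular value (that is, by decreasing explained variance).
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


# Toy stand-in for the real data: 1000 random points in 10 dimensions.
rng = np.random.default_rng(0)
toy = rng.normal(size=(1000, 10))

reduced = project_onto_top_components(toy, k=4)
print(reduced.shape)  # (1000, 4)
```

Each output row is one 4-dimensional point, and the output columns are ordered by decreasing variance.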


Farm

A 5-dimensional dataset with cardinality 3,627,086, which contains the VZ-features [Varma and Zisserman 2003] of a satellite image (showing a farm in Saudi Arabia) after rescaling it to a resolution of 1825 x 2000.


Household

A 7-dimensional dataset with cardinality 2,049,280, which includes all the attributes of the Household database from the UCI archive [Bache and Lichman 2013] except the temporal columns date and time. Points in the original database with missing coordinates were removed.
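
The preprocessing described above (drop the temporal columns, then discard points with missing coordinates) can be sketched as below. This is a hedged illustration, not the script used to build the dataset. The function name load_points, the ";" delimiter, the "?" missing-value marker, and the toy column names a and b are all assumptions made for the example.

```python
import csv
import io


def load_points(lines, drop=("Date", "Time"), missing="?", delimiter=";"):
    """Return (column names, points), skipping rows with missing values."""
    reader = csv.DictReader(lines, delimiter=delimiter)
    # Keep every attribute except the temporal columns.
    keep = [name for name in reader.fieldnames if name not in drop]
    points = []
    for row in reader:
        values = [row[name] for name in keep]
        # Discard the whole point if any coordinate is absent or marked missing.
        if any(v is None or v.strip() in ("", missing) for v in values):
            continue
        points.append([float(v) for v in values])
    return keep, points


# Toy input with hypothetical columns; the second data row has a missing value.
sample = io.StringIO(
    "Date;Time;a;b\n"
    "16/12/2006;17:24:00;4.216;0.418\n"
    "16/12/2006;17:25:00;?;0.436\n"
    "16/12/2006;17:26:00;5.374;0.498\n"
)
columns, points = load_points(sample)
print(columns)  # ['a', 'b']
print(points)   # [[4.216, 0.418], [5.374, 0.498]]
```

With a real file, an open file handle can be passed in place of the in-memory sample.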


References

  • K. Bache and M. Lichman. 2013. UCI Machine Learning Repository. (2013). http://archive.ics.uci.edu/ml

  • Attila Reiss and Didier Stricker. 2012. Introducing a New Benchmarked Dataset for Activity Monitoring. In International Symposium on Wearable Computers. 108–109.

  • Manik Varma and Andrew Zisserman. 2003. Texture Classification: Are Filter Banks Necessary? In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 691–698.