Abstract
Finding clusters in data is a challenging problem. Given a dataset, we usually do not know the number of natural clusters hidden in the dataset. The problem is exacerbated when there is little or no additional information except the data itself. This paper proposes a general stochastic clustering method (GSCM) that is a simplification of the nature-inspired ant-based clustering approach. It begins with a basic solution and then performs a stochastic search to incrementally improve the solution until the underlying clusters emerge, resulting in automatic cluster discovery in datasets. This method differs from several recent methods in that it does not require users to input the number of clusters and it makes no explicit assumption about the underlying distribution of a dataset. Our experimental results show that the proposed method performs better than several existing methods in terms of clustering accuracy and efficiency in the majority of the datasets used in this study. Our theoretical analysis shows that the proposed method has linear time and space complexities, and our empirical study shows that it can accurately and efficiently discover clusters in large datasets on which many existing methods fail to run.
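The general idea in the abstract (start from a basic solution, then improve it by stochastic search without being told the number of clusters) can be illustrated with a toy sketch. This is not the published GSCM algorithm: the merge rule, the function name stochastic_merge_clustering and the threshold parameter are all assumptions made for illustration only.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def stochastic_merge_clustering(points, threshold, iterations=2000, seed=0):
    """Toy stochastic clustering (NOT the published GSCM algorithm).

    Starts with every point in its own cluster, then repeatedly samples a
    random pair of points and merges their clusters when the pair is closer
    than `threshold` (a hypothetical parameter chosen for this sketch).
    """
    rng = random.Random(seed)
    n = len(points)
    labels = list(range(n))  # basic solution: one cluster per point
    for _ in range(iterations):
        i = rng.randrange(n)
        j = rng.randrange(n)
        if labels[i] == labels[j]:
            continue  # already in the same cluster (also covers i == j)
        if euclidean(points[i], points[j]) < threshold:
            old, new = labels[i], labels[j]
            # Merge: relabel every member of i's cluster with j's label.
            labels = [new if lab == old else lab for lab in labels]
    return labels
```

Note that the number of clusters is never passed in; in this sketch it emerges from the data and the assumed distance threshold. The real method's update rule and parameters are described in the paper.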
Strengths of GSCM:
(1) Automatic discovery of clusters in datasets
(2) Can work with large datasets
Weaknesses of GSCM:
(1) Quite a few parameters to set
(2) Does not work well for data with clusters of arbitrary shapes
Further research that can be done on GSCM:
(1) Apply GSCM to a new dataset. If you like, you can send me the dataset and I can analyse it for you. Alternatively, you can download the bytecode here and ask me if you need help generating the results.
(2) I have some ideas for improving this method. Email me and we can discuss them.
Published in: Pattern Recognition (journal)
Access the paper here: http://www.sciencedirect.com/science/article/pii/S0031320311001385
Program for running the experiments
Program file name: GSCM_Relased_at_googlesite.zip
(see the file attached at the end of this page.)
Disclaimer: This program is provided for research purposes only, solely for replicating my experimental results on a different computer. It is not designed for commercial use. Use it at your own risk.
Once you have downloaded GSCM_Relased_at_googlesite.zip, unzip it. The Readme.txt file inside explains where to place your files so that you can run the program. Once the files are in their respective folders, click 50Trials_testRunall_data.bat to run the program.
If you need the source code, please email me. (My email address is given on my home page.)
The real datasets can be obtained from the UCI Machine Learning Repository.
The synthetic datasets used are attached here.