Introduction

"Knowledge is good only if it is shared. I hope this guide will help those who are finding the way around, just like me"

Clustering analysis has been an emerging research issue in data mining due to its variety of applications. The advent of many data clustering algorithms in recent years, and their extensive use in a wide variety of applications, including image processing, computational biology, mobile communication, medicine and economics, has led to the popularity of these algorithms. The main problem with data clustering algorithms is that they cannot be standardized. An algorithm may give the best result with one type of data set but fail or give poor results with data sets of other types. Although there have been many attempts to standardize algorithms that perform well in all scenarios, no major accomplishment has been achieved so far. Many clustering algorithms have been proposed, but each has its own merits and demerits and none works for all real situations. Before exploring the various clustering algorithms in detail, let's have a brief overview of what clustering is.

Clustering is a process which partitions a given data set into homogeneous groups based on given features, such that similar objects are kept in the same group whereas dissimilar objects are placed in different groups. It can be considered the most important unsupervised learning problem, since it deals with finding structure in a collection of unlabeled data. For a better understanding, please refer to Fig I.

Fig I: Four clusters formed from a set of unlabeled data
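
To make the definition above concrete, here is a minimal sketch of k-means, one of the algorithms listed later in this guide, written in plain Python. It is only an illustration under stated assumptions: the sample points, the function names and the choice of the first k points as starting centroids are all made up for this example, and a real implementation would normally pick its starting centroids more carefully.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    """Group points into k clusters; returns (labels, centroids)."""
    # Assumption for this sketch: start from the first k points.
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # nothing moved, so the grouping is stable
        centroids = new_centroids
    return labels, centroids


# Unlabeled toy data: two visibly separate groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

No labels are given to the function. The grouping comes only from the distances between the points, which is what "unsupervised" means here.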

For a clustering algorithm to be advantageous and beneficial, some conditions need to be satisfied.

1) Scalability - The data must be scalable, otherwise we may get the wrong result. Fig II shows a simple graphical example of such a case.

Fig II: Example where scalability may lead to a wrong result

2) The clustering algorithm must be able to deal with different types of attributes.

3) The clustering algorithm must be able to find clusters of arbitrary shape.

4) The clustering algorithm must be insensitive to noise and outliers.

5) Interpretability and Usability - The result obtained must be interpretable and usable so that maximum knowledge about the input parameters can be obtained.

6) The clustering algorithm must be able to deal with data sets of high dimensionality.
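
Condition 4 can be illustrated with a small sketch that compares two ways of computing a cluster center when one noisy point is added. This is not a clustering algorithm, only a hedged illustration; the points and helper names are invented for the example. The mean-based center is dragged far away by a single outlier, while the median-based center barely moves.

```python
from math import dist
from statistics import mean, median


def center_by_mean(points):
    """Center computed as the per-coordinate mean."""
    return tuple(mean(axis) for axis in zip(*points))


def center_by_median(points):
    """Center computed as the per-coordinate median."""
    return tuple(median(axis) for axis in zip(*points))


# A tight, made-up cluster and one far-away noisy point.
cluster = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (1.1, 0.9), (0.8, 1.2)]
outlier = (50.0, 50.0)
with_outlier = cluster + [outlier]

# How far does each kind of center move once the outlier is added?
mean_shift = dist(center_by_mean(cluster), center_by_mean(with_outlier))
median_shift = dist(center_by_median(cluster), center_by_median(with_outlier))

print("shift of mean-based center:  ", round(mean_shift, 3))
print("shift of median-based center:", round(median_shift, 3))
```

An algorithm whose centers behave like the mean needs extra care with noisy data, for example by removing outliers before clustering.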

Clustering algorithms can be broadly classified into two categories:

1) Unsupervised linear clustering algorithms and

2) Unsupervised non-linear clustering algorithms

I. Unsupervised linear clustering algorithm

k-means clustering algorithm

Fuzzy c-means clustering algorithm

Hierarchical clustering algorithm

Gaussian (EM) clustering algorithm

Quality threshold clustering algorithm

II. Unsupervised non-linear clustering algorithm

MST based clustering algorithm

Kernel k-means clustering algorithm

Density based clustering algorithm
