Introduction

"Knowledge is good only if it is shared. I hope this guide will help those who are finding the way around, just like me"

Clustering analysis has been an emerging research issue in data mining due to its variety of applications. The advent of many data clustering algorithms in recent years, and their extensive use in a wide variety of applications, including image processing, computational biology, mobile communication, medicine and economics, has led to the popularity of these algorithms. The main problem with data clustering algorithms is that they cannot be standardized. An algorithm may give the best result with one type of data set but may fail or give poor results with data sets of other types. Although there have been many attempts to standardize algorithms so that they perform well in all scenarios, no major accomplishment has been achieved so far. Many clustering algorithms have been proposed, but each has its own merits and demerits and cannot work for all real situations. Before exploring the various clustering algorithms in detail, let's have a brief overview of what clustering is.

Clustering is a process that partitions a given data set into homogeneous groups based on given features, such that similar objects are kept in one group whereas dissimilar objects are in different groups. It can be considered the most important unsupervised learning problem. It deals with finding structure in a collection of unlabeled data. For better understanding, please refer to Fig I.

Fig I: Four clusters formed from a set of unlabeled data
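
To make the idea of partitioning unlabeled data concrete, here is a minimal sketch of k-means, one of the algorithms listed later on this page. The sample points, the choice of two clusters, the starting centroids and the function name kmeans are all assumptions made for illustration; this is not a tuned or definitive implementation.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical unlabeled 2-D data: two visually separate groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Starting centroids are picked by hand here (an assumption; real code usually randomizes).
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

No labels were supplied; the grouping comes only from the distances between points, which is what makes the problem unsupervised.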
 

For a clustering algorithm to be advantageous and beneficial, some conditions need to be satisfied.

1) Scalability - Data must be scalable; otherwise we may get the wrong result. Fig II shows a simple graphical example of such a case.


Fig II: Example where scalability may lead to a wrong result

2) A clustering algorithm must be able to deal with different types of attributes.
3) A clustering algorithm must be able to find clusters of arbitrary shape.
4) A clustering algorithm must be insensitive to noise and outliers.
5) Interpretability and Usability - The result obtained must be interpretable and usable so that maximum knowledge about the input parameters can be obtained.
6) A clustering algorithm must be able to deal with data sets of high dimensionality.
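
One possible reading of the first condition, assumed here and not confirmed by the text, is that the scale of each feature affects which objects look similar. The sketch below uses hypothetical records (height in metres, income in currency units) and two helper functions, nearest and min_max_scale, that are made up for this illustration. It shows that the nearest neighbour of a record can change once the features are rescaled to a common range.

```python
import math

def nearest(target, candidates):
    """Return the index of the candidate closest to target (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(target, candidates[i]))

def min_max_scale(rows):
    """Rescale every column to the range [0, 1]."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# Hypothetical records: (height in metres, income in currency units).
people = [(1.50, 30000.0), (1.95, 30500.0), (1.52, 33000.0), (1.80, 60000.0)]

# On raw values, income dominates the distance because its numbers are much larger.
raw_neighbor = nearest(people[0], people[1:])

# After rescaling, both features contribute on the same footing.
scaled = min_max_scale(people)
scaled_neighbor = nearest(scaled[0], scaled[1:])

print(raw_neighbor, scaled_neighbor)  # 0 1
```

On the raw values the first person is matched with someone 45 cm taller, simply because their incomes differ by only 500 units. After rescaling, the match is the person of nearly the same height with a moderately different income. Any distance-based clustering algorithm inherits this sensitivity.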

Clustering algorithms can be broadly classified into two categories:
1) Unsupervised linear clustering algorithms and
2) Unsupervised non-linear clustering algorithms


I. Unsupervised linear clustering algorithms

k-means clustering algorithm
Fuzzy c-means clustering algorithm
Hierarchical clustering algorithm
Gaussian (EM) clustering algorithm
Quality threshold clustering algorithm

II. Unsupervised non-linear clustering algorithms

MST based clustering algorithm
Kernel k-means clustering algorithm
Density based clustering algorithm
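
The page does not define what separates the two categories. A common reading, assumed here, is that the second group can recover clusters that no straight boundary separates. The sketch below builds two concentric rings (hypothetical data) and groups points by connectivity: any two points closer than a threshold eps end up in the same cluster, directly or through a chain of neighbours. This is a simplified stand-in for the density and MST based ideas, not an implementation of any specific published algorithm; the names ring and connectivity_clusters and the value of eps are illustrative choices.

```python
import math

def ring(radius, n):
    """n points evenly spaced on a circle of the given radius, centred at the origin."""
    return [
        (radius * math.cos(2 * math.pi * i / n), radius * math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]

def connectivity_clusters(points, eps):
    """Label points so that points closer than eps share a cluster (transitively)."""
    labels = [-1] * len(points)  # -1 means "not visited yet"
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        # Grow a new cluster outward from this unvisited point.
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                if labels[j] == -1 and math.dist(points[i], points[j]) < eps:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

inner, outer = ring(1.0, 12), ring(5.0, 12)

# eps is chosen by hand: larger than the spacing along the outer ring (about 2.6),
# smaller than the gap between the two rings (4.0).
labels = connectivity_clusters(inner + outer, eps=3.0)
print(labels[:12], labels[12:])
```

Both rings share the same centre, so a method that assigns each point to the nearer of two cluster centres splits the plane with a straight line and cannot give each ring its own cluster. Following chains of nearby points recovers each ring as one group.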

 
