Non-hierarchical cluster analysis

The main idea...

Non-hierarchical cluster analysis aims to find a grouping of objects which maximises or minimises some evaluating criterion. Many of these algorithms will iteratively assign objects to different groups while searching for some optimal value of the criterion. Below, a popular example of a non-hierarchical cluster analysis is described.

K-means clustering

K-means clustering aims to assign objects to a user-defined number of clusters (k) in such a way that maximises the separation of those clusters while minimising intra-cluster distances relative to the cluster's mean or centroid (Figure 1). The algorithm typically defaults to Euclidean distances, however, alternate criteria, such as different distance or dissimilarity measures, can be accepted by many implementations. If not, alternate distances or dissimilarities may be transformed to be compatible with Euclidean representation. The user is usually expected to set 3 parameters: 1) the number of clusters expected (k), 2) the initialisation method, and 3) the distance metric to be used.

This algorithm has a variety of implementations (see Jain, 2010), however, one of the most popular uses an iterative refinement approach (Figure 2).

While simple in implementation, the K-means algorithm has several features which are important to grasp to use this method appropriately. Please go through the Warnings section of this page and Davidson (2002). 

The clustering algorithm can be constrained by rules governing object membership. For exampling, must-link and cannot-link constraints can be imposed on individual objects.

 Figure 1: A schematic of a two-dimensional k-means clustering solution with (a) k = 3 and (b) k = 6.

Figure 2: A schematic illustration of a common k-means clustering algorithm. (a) After an initialisation step which places k centroids (filled diamonds) among the objects (filled circles), objects are assigned to a cluster based on their distances to these centroids (b). An update step then recalculates the position of the centroids to reflect the mean location of the objects in a given cluster (c). Steps (b) and (c) are repeated until a stable solution is found (d).

Warnings

References