Non-hierarchical cluster analysis
The main idea...
Non-hierarchical cluster analysis aims to find a grouping of objects which maximises or minimises some evaluation criterion. Many of these algorithms iteratively re-assign objects to different groups while searching for an optimal value of the criterion. A popular example of non-hierarchical cluster analysis is described below.
K-means clustering
K-means clustering aims to assign objects to a user-defined number of clusters (k) in a way that maximises the separation between clusters while minimising the distances of objects to their cluster's mean, or centroid (Figure 1). The algorithm typically defaults to Euclidean distances; however, many implementations accept alternative distance or dissimilarity measures. If not, alternative distances or dissimilarities may be transformed to be compatible with a Euclidean representation. The user is usually expected to set three parameters: 1) the number of clusters expected (k), 2) the initialisation method, and 3) the distance metric to be used.
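As a concrete illustration, the minimal sketch below applies scikit-learn's KMeans implementation to a small synthetic data set. The data and parameter values are arbitrary choices made for illustration, and note that this particular implementation supports Euclidean distances only.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two loose groups of objects described by two variables
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])

# The user sets the number of clusters (k) and the initialisation method;
# scikit-learn's KMeans uses Euclidean distances only.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
labels = km.fit_predict(X)

print(labels[:5])            # cluster membership of the first five objects
print(km.cluster_centers_)   # final centroid (mean) positions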
This algorithm has a variety of implementations (see Jain, 2010); one of the most popular uses an iterative refinement approach (Figure 2), which proceeds as follows.
1. An initialisation step creates k centroids. Some implementations use existing objects as the initial centroids, while others randomly assign objects to k clusters and calculate the first centroids from those assignments. It is often possible to define custom initial placements for the centroids.
2. An assignment step places each object into the cluster whose centroid (mean) is closest to it.
3. An update (re-assignment) step then calculates new centroids based on the current membership of each cluster.
4. Steps 2 and 3 are repeated until the solution converges, i.e. until the centroid positions no longer change. A minimal sketch of this loop is shown below.
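The following NumPy sketch implements this iterative refinement loop, initialising the centroids with randomly chosen objects. It is illustrative only and omits safeguards such as handling clusters that become empty.

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Initialisation: use k randomly chosen objects as the first centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) Assignment: each object joins the cluster with the nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 3) Update: each centroid becomes the mean of its current members
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) Converged once the centroid positions no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids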
While simple to implement, the k-means algorithm has several features which must be understood in order to use the method appropriately. Please go through the Warnings section of this page and Davidson (2002).
The clustering algorithm can be constrained by rules governing object membership. For example, must-link and cannot-link constraints can be imposed on individual objects.
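As an illustration, a cannot-link constraint can be respected by skipping any cluster that already contains a conflicting object during the assignment step. The helper below is a simplified, hypothetical sketch in the spirit of constrained k-means algorithms (e.g. COP-k-means), not a complete implementation; must-link constraints and infeasible cases are not handled.

import numpy as np

def constrained_assignment(X, centroids, cannot_link):
    # cannot_link: pairs of object indices that may not share a cluster
    labels = np.full(len(X), -1)
    for i in range(len(X)):
        # Try centroids from nearest to farthest
        order = np.argsort(((X[i] - centroids) ** 2).sum(axis=1))
        for c in order:
            conflict = any((a == i and labels[b] == c) or
                           (b == i and labels[a] == c)
                           for a, b in cannot_link)
            if not conflict:
                labels[i] = c
                break
    return labels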
Figure 1: A schematic of a two-dimensional k-means clustering solution with (a) k = 3 and (b) k = 6.
Figure 2: A schematic illustration of a common k-means clustering algorithm. (a) After an initialisation step which places k centroids (filled diamonds) among the objects (filled circles), objects are assigned to a cluster based on their distances to these centroids (b). An update step then recalculates the position of the centroids to reflect the mean location of the objects in a given cluster (c). Steps (b) and (c) are repeated until a stable solution is found (d).
Warnings
The k-means algorithm can become 'stuck' in local optima. Repeating the clustering from different random initialisations, or after adding small amounts of noise to the data, can help evaluate the robustness of a solution.
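In practice, one can rerun the algorithm from several random initialisations and keep the solution with the lowest total within-cluster sum of squares. The sketch below (with an arbitrary placeholder data matrix) does this explicitly; scikit-learn's n_init argument performs such restarts automatically.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))          # placeholder data matrix

best = None
for seed in range(20):
    km = KMeans(n_clusters=3, n_init=1, random_state=seed).fit(X)
    # inertia_ is the total within-cluster sum of squared distances
    if best is None or km.inertia_ < best.inertia_:
        best = km
print(best.inertia_)                   # criterion value of the best run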
The k-means criterion (the total within-cluster sum of squares) always decreases as k increases, so the algorithm will favour higher values of k. This is not necessarily desirable, and users should consider carefully which values of k are sensible for their data set.
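One common, informal check is to compute the within-cluster sum of squares for a range of k values and look for an 'elbow' beyond which additional clusters yield little improvement. A minimal sketch, again with arbitrary placeholder data, follows.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))                      # placeholder data matrix

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ always decreases as k grows, so look for the point of
    # diminishing returns rather than the minimum value
    print(k, round(km.inertia_, 1))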
The full k-means algorithm is computationally intensive. Heuristic variants are available for large data sets.
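For example, scikit-learn offers MiniBatchKMeans, which updates centroids from small random subsets of the data and usually approximates the full solution at a fraction of the cost. The sketch below assumes an arbitrary large placeholder data matrix.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(3)
X_big = rng.normal(size=(100_000, 10))     # placeholder for a large data set

# Centroids are updated from mini-batches rather than the full data set
mbk = MiniBatchKMeans(n_clusters=5, batch_size=1024, random_state=0).fit(X_big)
print(mbk.cluster_centers_.shape)          # (5, 10)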
Solutions may vary with different distance measures.
The distances of objects to their cluster centroids are the primary factor contributing to a k-means solution; the algorithm is insensitive to other features of an object population, such as cluster shape or density. Thus, verify that the populations of interest can be classified well by their distance to their multivariate mean.
In the original k-means algorithm, each object must be assigned to exactly one cluster. This may be problematic when an object is equidistant from two or more centroids. Outliers are also problematic: they will be forced into a cluster and will affect the position of its centroid.
If populations of objects overlap, the k-means algorithm may provide biased estimates of their centroids, which are 'pulled' towards regions where there is less (or no) overlap.
Standardising variables may change the solution. Prior to standardisation, variables with larger ranges will contribute more to the distances between objects.
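If the variables are measured on very different scales, they can be standardised (e.g. to zero mean and unit variance) before clustering. A minimal sketch using scikit-learn's StandardScaler, with arbitrary placeholder data, is shown below.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Two variables on very different scales (placeholder data)
X = np.column_stack([rng.normal(0, 1, 100), rng.normal(0, 1000, 100)])

labels_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
X_std = StandardScaler().fit_transform(X)   # each variable: mean 0, variance 1
labels_std = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
# The two label vectors will generally differ, as the large-range variable
# dominates the unstandardised distances.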
Comparing clustering solutions with different values of k may be problematic, as many runs for each value of k would have to be examined to avoid comparing solutions that merely represent local optima.
When using Euclidean distances, variables which are correlated will naturally influence object positioning in similar ways. Thus, if a large proportion of the variables in an analysis are strongly correlated, the influence of other variables may be poorly represented. To avoid this, Davidson (2002) recommends performing factor analysis on the variables and, if it is successful, using the extracted factors as input for k-means clustering.
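A hedged sketch of this workflow using scikit-learn's FactorAnalysis is shown below; the number of factors and the data are arbitrary, and the factor model should of course be evaluated before its scores are used as clustering input.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(5)
X = rng.normal(size=(150, 8))                 # placeholder data matrix

# Extract a small number of factors from the (possibly correlated) variables
scores = FactorAnalysis(n_components=3, random_state=0).fit_transform(X)

# Cluster the objects on their factor scores rather than the raw variables
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores)
print(labels[:10])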
References
Jain AK (2010) Data clustering: 50 years beyond K-means. Pattern Recogn Lett. 31(8):651–666.
Davidson I (2002) Understanding K-means non-hierarchical clustering. Technical Report. Computer Science Department, State University of New York (SUNY), Albany.