Clustering is a method for grouping data into clusters whose members are "similar" to one another (and "dissimilar" to members of other clusters). More precisely, it can be defined as the task of identifying subgroups based on:
Similarity between data points within the same cluster
Dissimilarity between data points in different clusters
How we define similarity (or dissimilarity) depends on what we mean by the difference between data entries. The choice of a similarity or dissimilarity measure is, to some extent, application-specific. For instance, the similarity (or dissimilarity) of two data entries can be measured by their Euclidean distance or, if they are for instance time series, by a measure based on their correlation.
Once the similarity between two individual data points is introduced, one can define the similarity between two "clusters" of data. For instance, one can define the distance between two clusters as the distance between their centroids (the averages of their members), the maximum distance between any two members of the two clusters, or the minimum distance between any two members of the two clusters.
Definition. Consider a matrix whose rows and columns are each assigned to a member of a population. A proximity matrix records the distance between any two members.
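For illustration, a proximity matrix of Euclidean distances can be computed as follows; the three-member data set is our own small example, assuming NumPy and SciPy are available.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# A tiny illustrative population of 3 members, each a vector of 2 features.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Proximity matrix: entry (i, j) is the Euclidean distance between members i and j.
P = squareform(pdist(X, metric="euclidean"))
print(P)
# [[ 0.  5. 10.]
#  [ 5.  0.  5.]
#  [10.  5.  0.]]
```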
There are two popular methods:
k-means clustering
Hierarchical clustering
k-means clustering is a method that partitions a given data set into k groups, where the distance used is simply the Euclidean distance. The result of k-means clustering is k clusters in which the sum of squared distances of each cluster's points from its centroid (the cluster average) is minimized. In other words, the clustering minimizes the so-called within-cluster sum of squares (WCSS), which quantifies the similarity among the members of each cluster. Equivalently, the clustering maximizes the so-called between-cluster sum of squares (BCSS), which quantifies the dissimilarity among members of different clusters.
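For reference, the within-cluster sum of squares can be written as follows (a standard formulation; the notation C_i, mu_i is ours, not from the text):

```latex
\mathrm{WCSS} \;=\; \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i \;=\; \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```

Here C_1, ..., C_k denote the clusters and mu_i the centroid of cluster C_i; k-means seeks the assignment of points to k clusters that minimizes this quantity.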
Let us consider two-dimensional data (two features), like the one shown in the picture. It is quite clear that there are seemingly four clusters in the picture. In the following, we describe an algorithm that can identify the four clusters we recognize in the picture. The idea of the algorithm is to start with k points and then iteratively update them until we obtain clusters with minimal total within-cluster variance.
There is a simple algorithm that constructs such clusters quite efficiently (a code sketch follows the steps below):
1. Choose k initial points (vectors) and call them centers.
2. Assign each data point to the closest center; the points assigned to a center form its cluster.
3. Update each center to the centroid of its cluster, i.e., the average of the points in the cluster.
4. Go back to step 2 if the termination condition has not been reached yet.
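The following is a minimal NumPy sketch of these four steps; the function name kmeans, the random initialization, and the tolerance-based stopping rule are our own illustrative choices.

```python
import numpy as np

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Minimal k-means following steps 1-4 above. X is an (n, p) array of n points."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: choose k initial centers (here, k distinct data points picked at random).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to the cluster of its closest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the centroid (average) of its cluster.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:        # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        # Step 4: stop once the centers no longer move appreciably; otherwise repeat step 2.
        if np.linalg.norm(new_centers - centers) < tol:
            break
        centers = new_centers
    return labels, centers
```

For the two-dimensional data described above, calling kmeans(X, 4) would return a label for each point together with the four final centroids.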
As shown in the first picture, for any two neighboring centers we find the perpendicular bisector of the segment connecting them and use it to determine which points are closest to each center. Once this is done, we take the average of the points in each cluster, which gives the centroids. We then repeat the procedure: connect the centroids, find the perpendicular bisectors, and assign the points closest to each centroid. We continue until a termination condition is reached; for instance, if the centroids do not move much over a few steps, we can stop.
Here we show how it may converge in three steps (in reality it takes more steps).
There is always a question: what is the best choice for k? First of all, it is clear that a larger k always gives a smaller total variance. However, the best option is not to take a very large k, as a very large k fragments the data and reveals less of its hidden structure (in the extreme, every point becomes its own cluster). To decide, we consider different values of k (for instance k = 1, ..., 7) and perform the clustering for each one. Then we compute the total within-cluster variance for each and plot it as a function of k (as shown here). There is always a point, which we call the elbow, after which the total variance is not reduced significantly. This essentially means that a larger k would not benefit us in terms of a smaller variance. Usually, the elbow is taken as the best k.
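A minimal sketch of this elbow procedure, assuming scikit-learn and matplotlib are available; the synthetic four-group data is a stand-in for the pictured example, and inertia_ is scikit-learn's name for the WCSS.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic two-dimensional data with four underlying groups (stand-in for the picture).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ks = range(1, 8)
wcss = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)          # inertia_ is exactly the WCSS

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("k")
plt.ylabel("total within-cluster sum of squares")
plt.title("Elbow plot")
plt.show()
```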
Hierarchical clustering builds a hierarchy of clusters. It can be done in two manners, Agglomerative and Divisive:
Agglomerative is a "bottom-up" approach in which each sample starts as its own cluster, and pairs of clusters are merged as one moves up the hierarchy. This is shown step by step in the picture: first, every single observation is its own cluster; then we pair the closest observations; and then we continue by merging the closest clusters again, and so on.
The agglomerative algorithm is as follows (see the code sketch after these steps):
Compute the proximity matrix
Let each data point be its own cluster
Repeat: merge the two closest clusters and update the proximity matrix
Until only a single cluster remains
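As a sketch, SciPy's hierarchical-clustering routines implement exactly this loop; the small data set and the choice of single linkage (minimum member-to-member distance) below are our own illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# A small illustrative data set: 6 observations with p = 2 features.
X = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0],
              [5.5, 5.0], [1.0, 0.5], [2.0, 1.5]])

# Compute the proximity matrix, here in SciPy's condensed (pairwise) form.
D = pdist(X, metric="euclidean")

# Repeatedly merge the two closest clusters until only a single cluster remains.
# "single" linkage measures cluster distance by the minimum member-to-member distance.
Z = linkage(D, method="single")
print(Z)  # each row: the two clusters merged, their distance, and the new cluster's size
```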
Divisive is the reverse method (a "top-down" approach), where all observations start in one cluster, which is split recursively until only singletons remain.
Let us consider a population of 6 members whose proximity matrix is given as shown in the picture. The observations are essentially vectors in the feature space of dimension p. We simply refer to the observations by their numbers: 1, 2, 3, 4, 5, and 6.
We start by taking each point as its own cluster: (1), (2), (3), (4), (5), (6).
Then we join the closest observations into one cluster. In our example, the two pairs (1,2) and (3,4) each form a cluster, because within each pair the distance between the observations is 0.5.
Then we join (5) to (1,2), as the distance between these two clusters is the smallest remaining one. The distance between them is 1, since (2) is the member closest to (5).
Then the closest clusters are (1,2,5) and (6), with distance 1.5 (from (5) to (6)), and they merge into (1,2,5,6). Finally, we are left with the cluster containing all members, (1,2,3,4,5,6).
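The figure's exact distances are not reproduced here, so the following sketch uses a hypothetical proximity matrix chosen only so that single linkage reproduces the merge order described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical 6 x 6 proximity matrix for members 1..6 (the figure's exact values are
# not available; these numbers are chosen only to reproduce the merges described above).
D = np.array([
    [0.0, 0.5, 4.0, 4.2, 1.2, 2.5],
    [0.5, 0.0, 4.1, 4.3, 1.0, 2.4],
    [4.0, 4.1, 0.0, 0.5, 3.8, 3.6],
    [4.2, 4.3, 0.5, 0.0, 3.9, 3.7],
    [1.2, 1.0, 3.8, 3.9, 0.0, 1.5],
    [2.5, 2.4, 3.6, 3.7, 1.5, 0.0],
])

# Single linkage reproduces the merge order: (1,2) and (3,4) at 0.5, then (1,2,5) at 1.0,
# then (1,2,5,6) at 1.5, and finally the full cluster (1,2,3,4,5,6).
Z = linkage(squareform(D), method="single")
print(Z)
```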
One interesting way to present hierarchical clustering is to use a heat map of the proximity matrix. The strength of the colors in a heat map represents the value (higher = darker). As you can see, when we first color the proximity matrix there is no apparent structure. But let us use the hierarchical clustering to reorder the rows and columns so that closer clusters sit next to each other. For that, we first keep (1,2) and (3,4) together, then move (5) next to (1,2), and then (6) next to (1,2,5). As done in the figure, the rows and columns are now ordered 1, 2, 5, 6, 3, 4. The interesting fact is that a structure is now visible, as similar colors sit closer together.
A heat map is especially useful when the data set is very large, where this kind of visualization helps reveal structure that would otherwise be hard to see.
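A sketch of this reordering with matplotlib, reusing the hypothetical proximity matrix from the previous sketch; the order 1, 2, 5, 6, 3, 4 follows the text.

```python
import matplotlib.pyplot as plt
import numpy as np

# The same hypothetical proximity matrix as in the previous sketch (members 1..6).
D = np.array([
    [0.0, 0.5, 4.0, 4.2, 1.2, 2.5],
    [0.5, 0.0, 4.1, 4.3, 1.0, 2.4],
    [4.0, 4.1, 0.0, 0.5, 3.8, 3.6],
    [4.2, 4.3, 0.5, 0.0, 3.9, 3.7],
    [1.2, 1.0, 3.8, 3.9, 0.0, 1.5],
    [2.5, 2.4, 3.6, 3.7, 1.5, 0.0],
])

order = [0, 1, 4, 5, 2, 3]  # members 1, 2, 5, 6, 3, 4 (0-based indices)
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
axes[0].imshow(D, cmap="Greys")                        # original order: little visible structure
axes[0].set_title("original order 1,2,3,4,5,6")
axes[1].imshow(D[np.ix_(order, order)], cmap="Greys")  # reordered: block structure appears
axes[1].set_title("reordered 1,2,5,6,3,4")
plt.show()
```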
Once the clustering is done, we can create the dendrogram plot, which is very useful for understanding the clusters. In this plot, all clusters are connected according to their hierarchy. It represents both the similarity between clusters and the order in which the clusters are formed (or merged).
In the picture we show the dendrogram of the clustering we saw earlier. From bottom to top, the clusters that are formed first are connected first. For instance, in the first step clusters (1,2) and (3,4) are formed; then we see clusters (1,2,5), (1,2,5,6), and finally (1,2,3,4,5,6). The lengths of the branches also represent the distances between the clusters.
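A sketch of drawing such a dendrogram with SciPy, again on the hypothetical proximity matrix used in the worked example above; the branch heights are the merge distances.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

# The same hypothetical proximity matrix as in the worked example above.
D = np.array([
    [0.0, 0.5, 4.0, 4.2, 1.2, 2.5],
    [0.5, 0.0, 4.1, 4.3, 1.0, 2.4],
    [4.0, 4.1, 0.0, 0.5, 3.8, 3.6],
    [4.2, 4.3, 0.5, 0.0, 3.9, 3.7],
    [1.2, 1.0, 3.8, 3.9, 0.0, 1.5],
    [2.5, 2.4, 3.6, 3.7, 1.5, 0.0],
])

Z = linkage(squareform(D), method="single")
dendrogram(Z, labels=[1, 2, 3, 4, 5, 6])  # branch heights are the merge distances
plt.ylabel("distance")
plt.show()
```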
One benefit of plotting a dendrogram is that it helps us identify the best cut-off, which in turn suggests the best number of clusters. It lets us see where a significant change in the total variance would occur. As one can see, we have made two cut-offs, one giving 3 clusters and one giving 2 clusters. There is no specific rule for choosing a cut-off, but one may rely on rules of thumb from experience. One approach is to cut at the point where the distance between clusters becomes very large. In this example the distance grows quickly beyond 3 clusters, so three clusters look reasonable: (1,2,5), (6), and (3,4).
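A sketch of cutting the dendrogram with SciPy's fcluster; the threshold 1.2 is a hypothetical choice that falls between the merges at 1.0 and 1.5 for the matrix used above, and therefore yields the three clusters (1,2,5), (6), and (3,4).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# The same hypothetical proximity matrix as in the worked example above.
D = np.array([
    [0.0, 0.5, 4.0, 4.2, 1.2, 2.5],
    [0.5, 0.0, 4.1, 4.3, 1.0, 2.4],
    [4.0, 4.1, 0.0, 0.5, 3.8, 3.6],
    [4.2, 4.3, 0.5, 0.0, 3.9, 3.7],
    [1.2, 1.0, 3.8, 3.9, 0.0, 1.5],
    [2.5, 2.4, 3.6, 3.7, 1.5, 0.0],
])
Z = linkage(squareform(D), method="single")

# Cut the dendrogram at distance 1.2 (a hypothetical threshold): members 1, 2, 5 share
# one label, 3 and 4 share another, and 6 is on its own.
labels = fcluster(Z, t=1.2, criterion="distance")
print(labels)
```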
A dendrogram is useful when we have few clusters. When the data is large and we have lots of clusters, a heat map is more useful.
Silhouette analysis is used to study the separation distance between clusters. The silhouette plot compares how strongly each sample belongs to its own cluster versus the neighboring (closest) cluster. More precisely, we associate a number between -1 and 1 with every sample. A value of 0 means the sample is on or very close to the decision boundary between two neighboring clusters; a value close to 1 means the sample is far away from the neighboring cluster; and a value close to -1 means the sample has likely been assigned to the wrong cluster.
This can help us decide how many clusters to use: we look for the number of clusters that gives the highest average silhouette value.
In the picture, you can see the silhouette plots for k = 3 and k = 4. As one can see, the average silhouette for k = 4 is much larger than for k = 3.
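A sketch of computing these silhouette values with scikit-learn; the synthetic data and the comparison of k = 3 versus k = 4 only loosely mirror the pictured example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic data with four underlying groups (stand-in for the pictured example).
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in (3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    avg = silhouette_score(X, labels)            # average over all samples, in [-1, 1]
    per_sample = silhouette_samples(X, labels)   # one value per sample, used for the plot
    print(f"k = {k}: average silhouette = {avg:.3f}")
```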