Clustering and Silhouette score
Given some unlabelled data, after cleaning and other preprocessing, suppose we end up with a number of data points in an n-dimensional space. Our intention is to group similar items into one bucket, other similar items into another bucket, and so on for all the data points.
Clustering is the appropriate technique for this use case. Clustering algorithms are based on different underlying principles (centroid-based, hierarchical, distribution-based, density-based).
To evaluate clustering models, we have various evaluation metrics (homogeneity, completeness, V-measure, Adjusted Rand Index (ARI), Adjusted Mutual Information, Silhouette). Most of these require ground-truth labels; of the metrics listed here, only Silhouette can evaluate a clustering model without labelled data.
Silhouette - advantage: Does not require labelled data
measures how similar an object is to its own cluster and how different it is from objects in other clusters
the silhouette coefficient is defined for each sample point
the overall silhouette score averages the silhouette coefficients of all sample points
S(i) = (b(i)-a(i))/max(a(i), b(i))
a(i) = mean distance of point i from all other points in same cluster
b(i) = mean distance of point i from all points in the nearest neighbouring cluster (the other cluster for which this mean distance is smallest)
S = Average(S(i))
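The definitions above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names and the toy points are made up for the example, Euclidean distance is assumed, and a point that is alone in its cluster is given S(i) = 0 by convention.

```python
from math import dist          # Euclidean distance (Python 3.8+)
from statistics import mean


def silhouette_coefficients(points, labels):
    """Return S(i) for every point, following the definitions above."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    coeffs = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            coeffs.append(0.0)  # assumption: singleton cluster scores 0
            continue
        # a(i): mean distance to the other points in the same cluster.
        a = mean(dist(points[i], points[j]) for j in own)
        # b(i): smallest mean distance to the points of any other cluster.
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        coeffs.append((b - a) / denom if denom > 0 else 0.0)
    return coeffs


def silhouette(points, labels):
    """Overall score S: the average of S(i) over all points."""
    return mean(silhouette_coefficients(points, labels))


# Hypothetical toy data: two tight, far-apart pairs of points.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = [0, 0, 1, 1]        # sensible assignment
bad_labels = [0, 1, 0, 1]    # deliberately mixed-up assignment

coeffs = silhouette_coefficients(points, labels)
score = silhouette(points, labels)
bad_score = silhouette(points, bad_labels)
print(score, bad_score)
```

With the sensible labels every S(i) is close to +1, while the mixed-up labels give a negative average, because each point is then closer on average to the other cluster than to its own.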
bounded between -1 (incorrect clustering) and +1 (perfect clustering)
scores around 0 indicate overlapping clusters
tends to be higher for dense and well-separated clusters
can be used for hyperparameter tuning in a clustering model, e.g. choosing the number of clusters
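As a rough illustration of the tuning use, the sketch below tries several values of k for k-means and keeps the one with the highest silhouette score. It assumes scikit-learn is installed; the synthetic blobs, their centres, and the range of k are choices made only for this example.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic, unlabelled-style data: four well-separated blobs.
# The centres are hypothetical values picked for the demo.
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):  # the silhouette is only defined for 2 or more clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean S(i) over all points

# Pick the number of clusters with the highest average silhouette.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

The ground-truth labels returned by make_blobs are discarded, so the selection relies on the silhouette alone. On real data the peak is often less clear-cut, so the score is better treated as one signal among several than as a definitive answer.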