K-means clustering is a popular unsupervised machine learning algorithm that partitions a dataset into K distinct, non-overlapping clusters.
In simple terms, the goal is to group similar data points and discover underlying patterns in the data.
Each cluster is represented by a centroid, which is the mean of all the data points assigned to that cluster.
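To make the "centroid is the mean" idea concrete, here is a minimal numpy sketch with three toy points chosen purely for illustration:

```python
import numpy as np

# A centroid is simply the per-dimension mean of the points in the cluster.
points = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
centroid = points.mean(axis=0)
print(centroid)  # [3. 4.]
```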
The algorithm uses a distance metric (usually Euclidean distance) to measure the similarity between data points. Data points closer to each other are considered more similar.
The objective of k-means is to minimize the within-cluster sum of squares (WCSS), also known as inertia. WCSS is the sum of squared distances between each data point and its assigned cluster centroid.
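The WCSS objective can be written directly as a few lines of numpy. This is a sketch for illustration; `wcss` is a hypothetical helper name, and the points, labels, and centroids are toy values, not output from a fitted model:

```python
import numpy as np

def wcss(X, labels, centroids):
    """Within-cluster sum of squares (inertia): the sum of squared
    distances from each point to its assigned cluster centroid."""
    return sum(np.sum((X[labels == k] - c) ** 2)
               for k, c in enumerate(centroids))

# Tiny 1-D example: two obvious clusters, centroids at the cluster means.
X = np.array([[0.0], [1.0], [10.0], [11.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.5], [10.5]])
print(wcss(X, labels, centroids))  # 1.0 (four points, each 0.5 from its centroid)
```

Each point sits 0.5 from its centroid, contributing 0.25, so the total is 1.0. K-means iteratively lowers this quantity by alternating point assignment and centroid updates.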
How do you choose the optimal K?
The Elbow method and Silhouette analysis are common techniques used to determine the optimal number of clusters, K.
The elbow method is a heuristic for finding the optimal number of clusters. It's based on the observation that within-cluster variation (how tightly data points are grouped within a cluster) decreases as you add more clusters, but with diminishing returns; past a certain point, additional clusters mostly over-fit the data, and the "elbow" in the curve marks that point.
While reading the "elbow" or "knee" of the curve off a plot is easy, it is subjective and less reliable, especially when the curve bends gradually.
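A minimal sketch of the elbow procedure using scikit-learn's `KMeans`, assuming synthetic data with three well-separated blobs (so the elbow should appear near K = 3):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs of 50 points each.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ((0, 0), (5, 5), (10, 0))])

# Fit k-means for a range of K and record the inertia (WCSS).
inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

for k, w in zip(range(1, 7), inertias):
    print(f"K={k}: WCSS={w:.1f}")
# WCSS drops steeply up to K=3, then flattens -- that bend is the elbow.
```

In practice you would plot K against inertia and look for the bend; here the steep-then-flat pattern in the printed values plays the same role.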
Silhouette analysis focuses on how well each data point fits within its assigned cluster (cohesion), compared to how well it would fit in neighboring clusters (separation).
The silhouette coefficient ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. A clustering with an average silhouette width above 0.7 is generally considered strong.
While the Elbow Method is simpler and faster to compute, it may not work well if the "elbow" isn't obvious.
On the other hand, Silhouette Analysis is computationally more expensive but provides a more comprehensive view of cluster quality at the individual data point level.
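Silhouette analysis is straightforward with scikit-learn's `silhouette_score`. This sketch reuses the same kind of synthetic three-blob data as above, so the average silhouette should peak at K = 3:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Three well-separated 2-D blobs of 50 points each.
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ((0, 0), (5, 5), (10, 0))])

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"K={k}: mean silhouette = {scores[k]:.2f}")
# K=3 should score highest, and well above the 0.7 "strong" threshold.
```

Note the cost: `silhouette_score` needs pairwise distances, which is why this method scales worse than the elbow method on large datasets.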
While the standard k-means algorithm is widely used, several variants offer improvements in different areas.
- k-means++: Generally a good default choice for faster convergence. It improves upon standard k-means by choosing initial centroids more strategically: each new centroid is sampled with probability proportional to its squared distance from the centroids already chosen, which spreads the seeds out.
- Mini-batch k-means: A great option for large datasets where speed is critical. It processes small random subsets of data (mini-batches) at each iteration, rather than the entire dataset.
- Kernel k-means: Consider this if you suspect your data has non-linear patterns. It uses a kernel function to implicitly map data into a higher-dimensional space that might be linearly separable.
- Fuzzy c-means: Useful when you want to allow partial cluster membership. Instead of a hard assignment, it gives each data point a degree of membership in every cluster.
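The k-means++ seeding rule from the list above can be sketched in a few lines of numpy. This is an illustrative implementation, not scikit-learn's; `kmeans_pp_init` is a hypothetical helper name, and the two-blob data is synthetic:

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: pick the first centroid uniformly at random,
    then pick each subsequent centroid with probability proportional to
    the squared distance to the nearest centroid chosen so far."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
# Two tight blobs far apart; seeding should almost surely pick one from each.
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(30, 2))
               for c in ((0, 0), (8, 8))])
seeds = kmeans_pp_init(X, 2, rng)
print(seeds)
```

Because far-away points dominate the sampling weights, the seeds land in different blobs, which is exactly why k-means++ converges faster and more reliably than uniform random initialization.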
Get in touch at jain.van@northeastern.edu