Layman's explanation
K-Means clustering is a simple yet powerful algorithm in data science. If you want to know how K-Means works and where it is applied in the real world, this document will help.
K-means is a clustering technique that tries to minimize the distance between the points in a cluster and their centroid. The main objective of the algorithm is to minimize the sum of distances between the points and their respective cluster centroids. The iterations stop when any of the following holds (a runnable sketch follows this list):
Centroids of newly formed clusters do not change
Points remain in the same cluster
The maximum number of iterations is reached
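These stopping criteria map directly onto parameters of common implementations. A minimal sketch with scikit-learn's KMeans (assumed available): tol bounds how much the centroids may still move, and max_iter caps the iterations.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # toy data

# tol ~ "centroids do not change"; max_iter ~ "maximum iterations reached"
km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.n_iter_)    # iterations actually used before convergence
print(km.inertia_)   # WCSS: sum of squared distances to the centroids
```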
Compared to other clustering algorithms, K-means is simple to understand and simple to implement.
For streaming data (continuous online data), an online version of K-means is available; see the sketch below.
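A hedged sketch of the online variant using scikit-learn's MiniBatchKMeans, which updates the centroids incrementally via partial_fit; the stream is simulated here with random batches.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=3, random_state=0)
rng = np.random.default_rng(0)

for _ in range(100):            # pretend each batch arrives over time
    batch = rng.normal(size=(50, 2))
    mbk.partial_fit(batch)      # update centroids with this batch only

print(mbk.cluster_centers_)
```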
Vanilla K-means can’t handle non-convex sets
The algorithm is sensitive to outliers. There are ways to mitigate this issue, however; for example, k-medoids, sketched below.
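k-medoids restricts the cluster centers to actual data points and works with plain distances rather than squared ones, which damps the pull of outliers. A minimal NumPy sketch of the alternating k-medoids heuristic (illustrative, not the full PAM algorithm):

```python
import numpy as np

def k_medoids(X, k, n_iter=50, seed=0):
    """Toy alternating k-medoids: centers ("medoids") are real data
    points, so a single extreme outlier cannot drag a center away."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # all pairwise Euclidean distances, computed once
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)   # nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # the member with the smallest total distance to its cluster
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)
```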
Unfortunately, due to its iterative, locally optimizing nature, the algorithm is highly sensitive to the initial placement of the cluster centers. Several initialization methods are used in practice to address this issue; Astrahan's method is one of them. Refer to this paper for a comparison of these and other methods.
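The sensitivity is easy to demonstrate: plain random initialization can land in different local optima on different seeds. The sketch below compares random initialization against k-means++ (scikit-learn's default initialization); Astrahan's method is not available in scikit-learn, so it is not shown here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated blobs along the diagonal
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])

for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(f"random init, seed={seed}: WCSS={km.inertia_:.1f}")  # varies by seed

km_pp = KMeans(n_clusters=3, init="k-means++", n_init=1, random_state=0).fit(X)
print(f"k-means++ init: WCSS={km_pp.inertia_:.1f}")
```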
For a given dataset, k may not be unique; it depends on the problem at hand. The optimal number of clusters is subjective and depends on the method used for measuring similarity: two users may legitimately want to group the same dataset in different ways based on their interests. Concretely, the optimal k depends on:
Dataset in hand
Similarity measure
The elbow method is one of the most popular ways to determine the optimal value of k. Note that it works only if, for the problem at hand, the cost-function graph actually looks like an elbow (see the picture referenced below). Here WCSS stands for within-cluster sum of squares.
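A minimal sketch of the elbow computation, taking scikit-learn's inertia_ attribute as the WCSS for each candidate k; one then plots WCSS against k and looks for the bend.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])

for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # WCSS drops sharply until the elbow (k=3 here)
```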
The silhouette method determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
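A short sketch of the silhouette method using sklearn.metrics.silhouette_score, which returns the average silhouette width (higher is better):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])

for k in range(2, 8):   # the silhouette needs at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # highest value wins
```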
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data.
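A minimal gap-statistic sketch under stated assumptions: the null reference is sampled uniformly inside the data's bounding box, and within-cluster dispersion is measured by k-means inertia (WCSS); larger gap values favor that k.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                    .fit(X).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_wks = [
        np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
        for _ in range(n_refs)                      # uniform null references
    ]
    return np.mean(ref_log_wks) - log_wk            # larger gap favors this k

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])
for k in range(1, 7):
    print(k, round(gap_statistic(X, k), 3))
```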
Which method to use depends on the criterion function we select. For example, if the criterion is the following, then the elbow method is a good choice:
Intra-group similarity should be high and inter-group dissimilarity should be high.
The silhouette method can be used when the elbow method does not produce a clear elbow.
Consider two clusters where the distance from a particular point (x, y) to both centroids is the same. In the case of such ties, k-means will assign the ambiguous point to one of the clusters arbitrarily.
Kernel k-means can handle non-convex clusters; refer here for the details. This paper also discusses the topic. A compact sketch follows.
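A compact, illustrative kernel k-means sketch (independent of the cited papers): squared distances are computed in the feature space induced by an RBF kernel, so non-convex shapes such as concentric rings can be separated. The gamma value is an assumption and may need tuning.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

def kernel_kmeans(X, k, gamma=10.0, n_iter=100, seed=0):
    K = rbf_kernel(X, gamma=gamma)              # Gram matrix
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))       # random initial assignment
    for _ in range(n_iter):
        dist = np.full((len(X), k), np.inf)
        for c in range(k):
            mask = labels == c
            if not mask.any():
                continue
            # squared feature-space distance to the implicit centroid of c
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, mask].mean(axis=1)
                          + K[np.ix_(mask, mask)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Two concentric rings: vanilla k-means fails here, kernel k-means can succeed
X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
print(kernel_kmeans(X, k=2))
```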
After clustering, check the diameter of each cluster. Merge two clusters if they are very close, and split clusters whose points are only loosely attached; a sketch of the merge step is below.
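A hedged sketch of this post-processing, assuming "very close" means centroid distance below a problem-specific threshold; the split rule is omitted since it is equally problem-specific.

```python
import numpy as np

def cluster_diameter(X, labels, c):
    """Largest pairwise distance within cluster c (its 'diameter')."""
    P = X[labels == c]
    if len(P) < 2:
        return 0.0
    return np.linalg.norm(P[:, None] - P[None, :], axis=-1).max()

def merge_close_clusters(centers, labels, threshold):
    """Merge any two clusters whose centroids are closer than
    `threshold` (an assumed, problem-specific value)."""
    centers = [np.asarray(c, dtype=float) for c in centers]
    merged = True
    while merged and len(centers) > 1:
        merged = False
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                if np.linalg.norm(centers[i] - centers[j]) < threshold:
                    labels[labels == j] = i           # fold cluster j into i
                    labels[labels > j] -= 1           # keep labels contiguous
                    centers[i] = (centers[i] + centers.pop(j)) / 2
                    merged = True
                    break
            if merged:
                break
    return np.array(centers), labels
```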
Recommender systems use it to identify groups of users or items with similar properties
Data condensation: reducing the total dataset size by replacing groups of points with representatives
Document classification
Customer segmentation
The neural-network-based self-organizing map (SOM) and k-means are among the most widely applied clustering algorithms. This paper compares the performance of the different models.
It is usually not enough to just run the elbow method (or one of the other methods above), pick the number of clusters, and run standard K-means with that.
In general, we have to understand the data and get subject-matter experts' opinions on how many clusters there should be and what their approximate centroids might be. This can help fix the initial cluster centers, as sketched below.
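A minimal sketch of seeding scikit-learn's KMeans with expert-provided centroids: passing an ndarray as init fixes the starting points (with n_init=1, since there is nothing to re-sample). The expert_centers values here are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

expert_centers = np.array([[0.0, 0.0],     # assumed domain-expert guesses
                           [5.0, 5.0],
                           [10.0, 10.0]])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, init=expert_centers, n_init=1).fit(X)
print(km.cluster_centers_)                 # refined from the expert guesses
```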
References
https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
https://youtu.be/LxjL_yLaFS8?t=1799
https://pafnuty.wordpress.com/2013/08/14/non-convex-sets-with-k-means-and-hierarchical-clustering/
https://dzone.com/articles/10-interesting-use-cases-for-the-k-means-algorithm
https://stats.stackexchange.com/questions/58855/why-do-we-use-k-means-instead-of-other-algorithms
https://stats.stackexchange.com/questions/222235/explain-streaming-k-means
https://en.wikipedia.org/wiki/Data_stream_clustering
http://www.mit.edu/~9.54/fall14/slides/Class13.pdf
https://youtu.be/pBAbMbgaBAk?t=299
https://youtu.be/pBAbMbgaBAk?t=1268
https://stackoverflow.com/questions/52467382/break-tie-between-k-means-cluster
https://stackoverflow.com/questions/21619794/what-makes-the-distance-measure-in-k-medoid-better-than-k-means
https://youtu.be/AxARUMZh0sk?t=1232
https://youtu.be/q8gVpKl1f-4?t=392
https://www.researchgate.net/publication/220512399_An_initialization_method_for_the_K_-Means_algorithm_using_neighborhood_model
https://youtu.be/q8gVpKl1f-4?t=949
https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/
https://images.app.goo.gl/119e7h1oUX8BUnvs8
https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/
https://cse.iitk.ac.in/users/piyush/courses/ml_autumn16/771A_lec10_slides.pdf
https://towardsdatascience.com/kmeans-hyper-parameters-explained-with-examples-c93505820cd3
https://dzone.com/articles/k-means-and-som-gentle-introduction-to-worlds-most
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.167.671&rep=rep1&type=pdf