Layman's explanation
K-Means clustering is a simple yet powerful algorithm in data science. If you want to know how K-Means works and where it is applied in the real world, this document will help.
K-means is a clustering technique that tries to minimize the distance between the points in a cluster and their centroid. The main objective of the algorithm is to minimize the sum of distances between the points and their respective cluster centroids. The iterations stop when any of the following holds (a runnable sketch follows this list):
Centroids of newly formed clusters do not change
Points remain in the same cluster
The maximum number of iterations is reached
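These stopping criteria map directly onto parameters of common implementations. A minimal sketch with scikit-learn's KMeans (assumed available): tol bounds how much the centroids may still move, and max_iter caps the iterations.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # toy data

# tol ~ "centroids do not change"; max_iter ~ "maximum iterations reached"
km = KMeans(n_clusters=3, max_iter=300, tol=1e-4, n_init=10, random_state=0)
labels = km.fit_predict(X)

print(km.n_iter_)    # iterations actually used before convergence
print(km.inertia_)   # WCSS: sum of squared distances to the centroids
```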
Compared to other clustering algorithms, K-means is simple to understand and simple to implement.
For streaming data (continuous online data), an online version of K-means is available; see the sketch below.
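A hedged sketch of the online variant using scikit-learn's MiniBatchKMeans, which updates the centroids incrementally via partial_fit; the stream is simulated here with random batches.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

mbk = MiniBatchKMeans(n_clusters=3, random_state=0)
rng = np.random.default_rng(0)

for _ in range(100):            # pretend each batch arrives over time
    batch = rng.normal(size=(50, 2))
    mbk.partial_fit(batch)      # update centroids with this batch only

print(mbk.cluster_centers_)
```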
Vanilla K-means can’t handle non-convex sets
The algorithm is sensitive to outliers. There are ways to mitigate this issue, however; for example, k-medoids, sketched below.
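k-medoids restricts the cluster centers to actual data points and works with plain distances rather than squared ones, which damps the pull of outliers. A minimal NumPy sketch of the alternating k-medoids heuristic (illustrative, not the full PAM algorithm):

```python
import numpy as np

def k_medoids(X, k, n_iter=50, seed=0):
    """Toy alternating k-medoids: centers ("medoids") are real data
    points, so a single extreme outlier cannot drag a center away."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # all pairwise Euclidean distances, computed once
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)   # nearest medoid
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:
                continue
            # the member with the smallest total distance to its cluster
            costs = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, np.argmin(D[:, medoids], axis=1)
```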
Unfortunately, due to its iterative, locally optimizing nature, the algorithm is highly sensitive to the initial placement of the cluster centers. Several initialization methods are used in practice to address this issue; Astrahan's method is one of them. Refer to this paper for a comparison of these and other methods.
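The sensitivity is easy to demonstrate: plain random initialization can land in different local optima on different seeds. The sketch below compares random initialization against k-means++ (scikit-learn's default initialization); Astrahan's method is not available in scikit-learn, so it is not shown here.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated blobs along the diagonal
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])

for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X)
    print(f"random init, seed={seed}: WCSS={km.inertia_:.1f}")  # varies by seed

km_pp = KMeans(n_clusters=3, init="k-means++", n_init=1, random_state=0).fit(X)
print(f"k-means++ init: WCSS={km_pp.inertia_:.1f}")
```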
For a given dataset, k may not be unique; it depends on the problem at hand. The optimal number of clusters is subjective and depends on the method used for measuring similarity: two users may legitimately want to group the same dataset in different ways based on their interests. Concretely, the optimal k depends on:
Dataset in hand
Similarity measure
The elbow method is one of the most popular ways to determine the optimal value of k. Note that it works only if, for the problem at hand, the cost-function graph actually looks like an elbow (see the picture referenced below). Here WCSS stands for within-cluster sum of squares.
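A minimal sketch of the elbow computation, taking scikit-learn's inertia_ attribute as the WCSS for each candidate k; one then plots WCSS against k and looks for the bend.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])

for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # WCSS drops sharply until the elbow (k=3 here)
```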
The silhouette method determines how well each object lies within its cluster. A high average silhouette width indicates a good clustering.
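A short sketch of the silhouette method using sklearn.metrics.silhouette_score, which returns the average silhouette width (higher is better):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])

for k in range(2, 8):   # the silhouette needs at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # highest value wins
```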
The gap statistic compares the total within-cluster variation for different values of k with its expected value under a null reference distribution of the data.
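A minimal gap-statistic sketch under stated assumptions: the null reference is sampled uniformly inside the data's bounding box, and within-cluster dispersion is measured by k-means inertia (WCSS); larger gap values favor that k.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k, n_refs=10, seed=0):
    rng = np.random.default_rng(seed)
    log_wk = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                    .fit(X).inertia_)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref_log_wks = [
        np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
               .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
        for _ in range(n_refs)                      # uniform null references
    ]
    return np.mean(ref_log_wks) - log_wk            # larger gap favors this k

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])
for k in range(1, 7):
    print(k, round(gap_statistic(X, k), 3))
```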
Which method to use depends on the criterion function we select. For example, if the criterion is the following, then the elbow method is a good choice:
Intra-group similarity should be high and inter-group dissimilarity should be high.
The silhouette method can be used when the elbow method does not produce a clear elbow.
Consider two clusters where the distance from a particular point (x, y) to both centroids is the same. In the case of such ties, k-means will assign the ambiguous point to one of the clusters arbitrarily.
Kernel k-means can handle non-convex clusters; refer here for the details. This paper also discusses the topic. A compact sketch follows.
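A compact, illustrative kernel k-means sketch (independent of the cited papers): squared distances are computed in the feature space induced by an RBF kernel, so non-convex shapes such as concentric rings can be separated. The gamma value is an assumption and may need tuning.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

def kernel_kmeans(X, k, gamma=10.0, n_iter=100, seed=0):
    K = rbf_kernel(X, gamma=gamma)              # Gram matrix
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(X))       # random initial assignment
    for _ in range(n_iter):
        dist = np.full((len(X), k), np.inf)
        for c in range(k):
            mask = labels == c
            if not mask.any():
                continue
            # squared feature-space distance to the implicit centroid of c
            dist[:, c] = (np.diag(K)
                          - 2 * K[:, mask].mean(axis=1)
                          + K[np.ix_(mask, mask)].mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Two concentric rings: vanilla k-means fails here, kernel k-means can succeed
X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)
print(kernel_kmeans(X, k=2))
```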
After clustering, check the diameter of each cluster. Merge two clusters if they are very close, and split clusters whose points are only loosely attached; a sketch of the merge step is below.
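A hedged sketch of this post-processing, assuming "very close" means centroid distance below a problem-specific threshold; the split rule is omitted since it is equally problem-specific.

```python
import numpy as np

def cluster_diameter(X, labels, c):
    """Largest pairwise distance within cluster c (its 'diameter')."""
    P = X[labels == c]
    if len(P) < 2:
        return 0.0
    return np.linalg.norm(P[:, None] - P[None, :], axis=-1).max()

def merge_close_clusters(centers, labels, threshold):
    """Merge any two clusters whose centroids are closer than
    `threshold` (an assumed, problem-specific value)."""
    centers = [np.asarray(c, dtype=float) for c in centers]
    merged = True
    while merged and len(centers) > 1:
        merged = False
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                if np.linalg.norm(centers[i] - centers[j]) < threshold:
                    labels[labels == j] = i           # fold cluster j into i
                    labels[labels > j] -= 1           # keep labels contiguous
                    centers[i] = (centers[i] + centers.pop(j)) / 2
                    merged = True
                    break
            if merged:
                break
    return np.array(centers), labels
```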
Recommender systems use it to identify groups of users or items with similar properties
Data condensation: reducing the total dataset size by replacing groups of points with representatives
Document classification
Customer segmentation
The neural-network-based self-organizing map (SOM) and k-means are among the most widely applied clustering algorithms. This paper compares the performance of the different models.
It is usually not enough to just run the elbow method (or one of the other methods above), pick the number of clusters, and run standard K-means with that.
In general, we have to understand the data and get subject-matter experts' opinions on how many clusters there should be and what their approximate centroids might be. This can help fix the initial cluster centers, as sketched below.
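A minimal sketch of seeding scikit-learn's KMeans with expert-provided centroids: passing an ndarray as init fixes the starting points (with n_init=1, since there is nothing to re-sample). The expert_centers values here are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

expert_centers = np.array([[0.0, 0.0],     # assumed domain-expert guesses
                           [5.0, 5.0],
                           [10.0, 10.0]])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, size=(100, 2)) for c in (0, 5, 10)])

km = KMeans(n_clusters=3, init=expert_centers, n_init=1).fit(X)
print(km.cluster_centers_)                 # refined from the expert guesses
```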
References
https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
https://youtu.be/LxjL_yLaFS8?t=1799
https://pafnuty.wordpress.com/2013/08/14/non-convex-sets-with-k-means-and-hierarchical-clustering/
https://dzone.com/articles/10-interesting-use-cases-for-the-k-means-algorithm
https://stats.stackexchange.com/questions/58855/why-do-we-use-k-means-instead-of-other-algorithms
https://stats.stackexchange.com/questions/222235/explain-streaming-k-means
https://en.wikipedia.org/wiki/Data_stream_clustering
http://www.mit.edu/~9.54/fall14/slides/Class13.pdf
https://youtu.be/pBAbMbgaBAk?t=299
https://youtu.be/pBAbMbgaBAk?t=1268
https://stackoverflow.com/questions/52467382/break-tie-between-k-means-cluster
https://stackoverflow.com/questions/21619794/what-makes-the-distance-measure-in-k-medoid-better-than-k-means
https://youtu.be/AxARUMZh0sk?t=1232
https://youtu.be/q8gVpKl1f-4?t=392
https://www.researchgate.net/publication/220512399_An_initialization_method_for_the_K_-Means_algorithm_using_neighborhood_model
https://youtu.be/q8gVpKl1f-4?t=949
https://www.geeksforgeeks.org/elbow-method-for-optimal-value-of-k-in-kmeans/
https://images.app.goo.gl/119e7h1oUX8BUnvs8
https://www.datanovia.com/en/lessons/determining-the-optimal-number-of-clusters-3-must-know-methods/
https://cse.iitk.ac.in/users/piyush/courses/ml_autumn16/771A_lec10_slides.pdf
https://towardsdatascience.com/kmeans-hyper-parameters-explained-with-examples-c93505820cd3
https://dzone.com/articles/k-means-and-som-gentle-introduction-to-worlds-most
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.167.671&rep=rep1&type=pdf