Clustering with K-means

K-means is an iterative algorithm that has become very popular in machine learning because of its simplicity, speed, and scalability to large numbers of data points. The K-means algorithm relies on the idea that the data contains a specific number of groups, called clusters.

The K-means algorithm expects to find those clusters in your data: all you have to do is specify the number of groups you expect, and K-means will look for them.
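To see this in action, here is a minimal sketch using scikit-learn's KMeans on an invented toy dataset with two obvious groups (the data points are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two obvious groups of points in 2-D (invented for illustration)
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8]], dtype=float)

# Ask K-means for exactly two clusters
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print(km.labels_)           # each point's cluster assignment
print(km.cluster_centers_)  # the centroid of each cluster
```

The only decision you make here is `n_clusters=2`; K-means does the rest, assigning the first three points to one cluster and the last three to the other.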


The example begins by loading the digits CSV file and assigning the data to a variable.

The next step is to process the data using PCA (principal component analysis).

Even though PCA can recreate the same number of variables as in the input data, the example code drops a few of them using the n_components parameter. Retaining 30 components, as compared to the original 64 variables, allows the example to keep most of the original information (about 90 percent of the original variation in the data) while simplifying the dataset: the components are uncorrelated, and redundant variables and their noise are removed.


Documentation: PCA, scale

Looking for optimal solutions

As mentioned in the previous part, the example clusters ten different numbers, so it makes sense to start by checking the solution with K = 10. The following code compares the clustering result to the ground truth (the true labels) in order to determine whether there is any correspondence.


Documentation: PCA, scale, KMeans
> Converting the solution, given by the labels_ attribute of the fitted clustering class, into a pandas Series allows you to apply a cross-tabulation and compare the original labels with the labels derived from clustering.
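A self-contained sketch of that comparison, again using scikit-learn's built-in digits dataset in place of the example's CSV file:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

# Built-in digits dataset standing in for the example's CSV file
digits = load_digits()
Cx = PCA(n_components=30).fit_transform(scale(digits.data))

# Cluster with K = 10, matching the ten digit classes
clustering = KMeans(n_clusters=10, n_init=10, random_state=1)
clustering.fit(Cx)

# Cross-tabulate cluster assignments against the true labels
ms = pd.Series(clustering.labels_, name='clusters')
ct = pd.crosstab(digits.target, ms)
print(ct)
```

Each row of the table is a true digit and each column a cluster; a strong correspondence shows up as one dominant cell per row.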

Another observation you can make is that even though there are just ten numbers in this example, each of them appears in several distinct handwritten forms, hence the necessity of looking for more than ten clusters.

You use inertia to measure the viability of a clustering solution. Inertia is the sum of the squared distances between every cluster member and its centroid.
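To make the definition concrete, this sketch computes inertia by hand on an invented toy dataset and checks it against the inertia_ attribute that KMeans reports after fitting:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy dataset with two clear groups (invented for illustration)
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

# Inertia by hand: squared distance of each point to its
# assigned centroid, summed over all points
manual = sum(np.sum((x - km.cluster_centers_[label]) ** 2)
             for x, label in zip(X, km.labels_))
print(manual, km.inertia_)  # the two values agree
```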

To obtain the inertia rate of change in Python, you create a loop. Inside the loop, try progressively larger cluster solutions and record each solution's inertia value.
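A sketch of such a loop, once more substituting scikit-learn's built-in digits dataset for the CSV file used in the example:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

# Built-in digits dataset standing in for the example's CSV file
digits = load_digits()
Cx = PCA(n_components=30).fit_transform(scale(digits.data))

# Fit progressively larger cluster solutions, recording each inertia
inertia = list()
for k in range(1, 21):
    clustering = KMeans(n_clusters=k, n_init=10, random_state=1)
    clustering.fit(Cx)
    inertia.append(clustering.inertia_)

# Rate of change: how much inertia drops when adding one more cluster
delta_inertia = -np.diff(inertia)
print(delta_inertia)
```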


Documentation: Clustering, scale, KMeans

After fitting, you read each solution's inertia from the inertia_ attribute of the clustering class. From the inertia values recorded in the loop, you then build a list containing the rate of change of inertia between each solution and the previous one. Here is some code that prints a line graph of the rate of change.
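A self-contained sketch of that graph, again using the built-in digits dataset in place of the CSV file; the Agg backend and savefig call stand in for plt.show() so the script also runs without a display:

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; use plt.show() interactively
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA

# Built-in digits dataset standing in for the example's CSV file
digits = load_digits()
Cx = PCA(n_components=30).fit_transform(scale(digits.data))

# Record inertia for solutions from 1 to 20 clusters
inertia = [KMeans(n_clusters=k, n_init=10, random_state=1)
           .fit(Cx).inertia_ for k in range(1, 21)]

# Rate of change: inertia saved by adding one more cluster
delta = -np.diff(inertia)

plt.plot(range(2, 21), delta, 'ko-')
plt.xlabel('Number of clusters')
plt.ylabel('Rate of change of inertia')
plt.savefig('inertia_rate.png')
```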

> When examining inertia's rate of change, look for jumps in the rate itself. If the rate jumps up, adding one more cluster than in the previous solution brings much more benefit than expected; if it jumps down instead, you're likely forcing more clusters than necessary.

Exercise 4.4

Clustering on 32-dimensional data
Using data.txt and ground_truth.txt, try to do the following.

  1. Scale the data and perform PCA, then print out the explained variance ratio. Decide on the optimal number of components (n_components). Save the transformed data in a variable.

  2. On the new data, perform K-means clustering with k = 16. Then, print out the cross-tabulation against the original ground_truth.

  3. Find the inertia rate of change for solutions ranging from 12 to 20 clusters.

  4. Plot the rate of change of inertia and, from there, guess the most promising peak for the number of clusters. (This might differ a bit every time you run the program.)