K-means clustering

Background

K-means clustering is an example of partition-based clustering. Partition-based algorithms require the user to supply the expected number of clusters in the data set, usually denoted k. Here, the optimal value of k is chosen by simulation: the sum of squared errors (SSE) is computed for every value of k within a reasonable range, and the elbow method is then applied to the resulting SSE values to select k.

Once a value of k is supplied, the algorithm randomly selects k initial centroids and assigns each scaled data point to its nearest centroid. This assignment is the expectation step of expectation-maximization. In the maximization step, the mean of the points assigned to each centroid is calculated and becomes that cluster's new centroid. The two steps repeat until the assignments stop changing.
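The two steps can be illustrated with a minimal, standard-library-only sketch. This is not the project's code, which follows the Real Python workflow. The function names `squared_distance` and `kmeans`, the fixed `seed`, and the six toy points are assumptions made for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: alternate assignment and centroid update until stable."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Expectation step: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change, so the algorithm has converged
        labels = new_labels
        # Maximization step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels


# Toy data (illustrative only): two well-separated groups of three points.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

On data this cleanly separated, the first three points end up sharing one label and the last three the other. Real data sets would also need the scaling step mentioned above, which this sketch omits.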

Learn more about k-means clustering

Implementation

The workflow for this portion of the project draws heavily from https://realpython.com/k-means-clustering-python/ .

Finding the optimal value of k

Import libraries

Note: kneed requires python >3.5

Setup inputs and lists

Data preparation and application of the elbow method

K-means clustering

Import libraries

Note: kneed requires python >3.5

Setup inputs and lists

Data preparation and application of the elbow method

Results

Finding the optimal value of k

The graphs of the sum of squared errors (SSE) for values of k ranging from 1 to 25 indicate that the optimal number of groups is somewhere between 5 and 10, but they are insufficient to determine k by visual examination alone. Calculating the elbow mathematically from the values plotted in the graphs gave the following optimal values of k: "1Shape" : 7, "2Shape" : 7, "1306 expression" : 8, "88069_expression" : 7
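One simple way to locate an elbow numerically is to pick the point on the SSE curve that lies farthest from the straight line joining its first and last points. The sketch below shows that heuristic only. It is not necessarily the calculation performed by the kneed package used in this project. The function name `find_elbow` and the SSE values are made up for illustration and are not the project's results.

```python
import math


def find_elbow(ks, sse):
    """Return the k whose (k, SSE) point is farthest from the chord joining
    the first and last points of the curve.

    Assumes ks is increasing and sse[0] > sse[-1] (a decreasing curve).
    """
    k_first, k_last = ks[0], ks[-1]
    s_first, s_last = sse[0], sse[-1]
    # Rescale both axes to [0, 1] so that neither dominates the distance.
    xs = [(k - k_first) / (k_last - k_first) for k in ks]
    ys = [(s - s_last) / (s_first - s_last) for s in sse]
    # After rescaling, the chord runs from (0, 1) to (1, 0), i.e. x + y = 1.
    distances = [abs(x + y - 1) / math.sqrt(2) for x, y in zip(xs, ys)]
    best = max(range(len(ks)), key=distances.__getitem__)
    return ks[best]


# Hypothetical SSE values for k = 1..10 (illustrative, not measured data).
ks = list(range(1, 11))
sse = [1000, 600, 350, 200, 120, 100, 90, 82, 76, 70]
print(find_elbow(ks, sse))  # → 4
```

Rescaling both axes first matters because raw SSE values are usually orders of magnitude larger than k, which would otherwise distort the distance to the chord.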

1306 expression data

88069 expression data

1Shape data

2Shape data

K-means clustering

The plots demonstrate that k-means clustering was unable to separate the data into realistic, feasible clusters. This is further supported by the silhouette coefficients. A silhouette coefficient ranges from -1 to 1. Values near 1 indicate that data points fit well within their assigned cluster and are distant from other clusters; values near 0 indicate overlapping clusters; negative values indicate points that may be assigned to the wrong cluster. The computed values for the expression datasets fall in the lowest 15% of the positive range, and the values for the DNAShapeR datasets are smaller still, slightly below 0. None of the silhouette coefficients is large enough to indicate well-separated clusters.
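For reference, the silhouette coefficient of a single point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The overall coefficient is the average over all points. The sketch below computes it with the standard library only. The function names and toy data are assumptions for illustration. The values reported on this page were produced by the project's own workflow, not by this code.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (requires at least 2 clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for single-point clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, hence the len(own) - 1).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (illustrative only): two tight, well-separated groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
good = silhouette_coefficient(points, [0, 0, 0, 1, 1, 1])  # matches the groups
poor = silhouette_coefficient(points, [0, 1, 0, 1, 0, 1])  # ignores the groups
print(round(good, 3), round(poor, 3))
```

The labeling that matches the two groups scores close to 1, while the alternating labeling scores near 0, which is the regime the coefficients reported below fall into.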

1306 expression data

silhouette coefficient: 0.1182

88069 expression data

silhouette coefficient: 0.1068

1Shape data

silhouette coefficient: -0.0004

2Shape data

silhouette coefficient: -0.0050
