K-means clustering is an example of partition-based clustering. Partition-based algorithms require the user to supply the expected number of clusters in the data set, usually denoted k. The optimal value of k is chosen by simulation: the sum of squared errors (SSE) is computed for every value of k within a reasonable range, and the elbow method is then applied to the resulting SSE values to select k.
Once a value of k is supplied, the algorithm randomly initializes k centroids and assigns each scaled data point to its nearest centroid. This assignment is the expectation step of expectation-maximization. In the maximization step, the mean of the points assigned to each cluster is computed and becomes that cluster's new centroid. The two steps repeat until the centroids stop changing.
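To make the two steps concrete, the following is a minimal NumPy sketch of the expectation and maximization loop. It is not the project's code: the function name `kmeans_sketch`, the six toy points, and the choice to initialize centroids from randomly selected data points are all assumptions made for illustration, and the points are assumed to be already scaled.

```python
import numpy as np


def nearest_centroid(points, centroids):
    """Index of the closest centroid for every point (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Illustrative k-means loop; returns (centroids, labels, sse)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids as k distinct, randomly chosen data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Expectation step: assign each point to its nearest centroid.
        labels = nearest_centroid(points, centroids)
        # Maximization step: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    labels = nearest_centroid(points, centroids)
    # Sum of squared errors: squared distance of each point to its centroid.
    sse = float(((points - centroids[labels]) ** 2).sum())
    return centroids, labels, sse


# Toy data (hypothetical): two small, well-separated groups of three points.
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, labels, sse = kmeans_sketch(toy_points, k=2)
print(labels, round(sse, 4))
```

On this toy input the first three points end up in one cluster and the last three in the other. In practice a library implementation such as scikit-learn's `KMeans` would normally be preferred over a hand-written loop.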
The workflow for this portion of the project draws heavily from https://realpython.com/k-means-clustering-python/ .
Import libraries
Note: kneed requires python >3.5
Set up inputs and lists
Data preparation and application of the elbow method
The plots of the sum of squared errors for k ranging from 1 to 25 indicate that the optimal number of clusters lies somewhere between 5 and 10, but visual examination alone is insufficient to pin down k. Computing the elbow mathematically from the plotted values gave the following optimal values of k: "1Shape": 7, "2Shape": 7, "1306 expression": 8, "88069_expression": 7.
The plots demonstrate that k-means clustering was unable to separate the data into realistic, feasible clusters. This is further supported by the silhouette coefficients. A silhouette coefficient ranges from -1 to 1: values near 1 indicate that data points fit well within their assigned cluster and are distant from other clusters, values near 0 indicate overlapping clusters, and negative values indicate points that may have been assigned to the wrong cluster. The computed values for the expression datasets are low, at most about 0.12, and the values for the DNAShapeR datasets are smaller still, essentially zero. None of the silhouette coefficients is large enough to indicate meaningful cluster structure.
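The mathematical elbow calculation can be illustrated with a small pure-Python sketch. The import note suggests the project relied on the kneed package for this step; the code below does not call kneed. It uses a simplified chord-distance heuristic in the same spirit as the Kneedle method that kneed implements, without kneed's smoothing or thresholding. The function name `elbow_by_chord` and the SSE values are invented for illustration and are not the project's results.

```python
def elbow_by_chord(ks, sse):
    """Return the k whose SSE falls farthest below the chord joining the curve's endpoints."""
    # Rescale both axes to [0, 1] so that neither k nor SSE dominates.
    x = [(k - ks[0]) / (ks[-1] - ks[0]) for k in ks]
    lo, hi = min(sse), max(sse)
    y = [(s - lo) / (hi - lo) for s in sse]
    # Height of the straight line from the first to the last point at each x.
    chord = [y[0] + (y[-1] - y[0]) * xi for xi in x]
    # The elbow is where the curve sags farthest below that line.
    gaps = [c - yi for c, yi in zip(chord, y)]
    return ks[gaps.index(max(gaps))]


# Made-up SSE values for k = 1..10 (illustration only, not project results).
ks = list(range(1, 11))
sse = [1000, 600, 400, 300, 250, 230, 215, 205, 198, 192]
elbow = elbow_by_chord(ks, sse)
print("elbow at k =", elbow)
```

For these made-up values the function selects k = 4, the point after which adding clusters yields only small reductions in SSE.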
silhouette coefficient: 0.1182
silhouette coefficient: 0.1068
silhouette coefficient: -0.0004
silhouette coefficient: -0.0050
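For reference, the silhouette coefficient can be computed directly from its definition: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the point's score is (b - a) / max(a, b). The sketch below averages these scores over all points. The function name `silhouette_sketch` and the toy data are assumptions for illustration; scikit-learn's `sklearn.metrics.silhouette_score` computes the same quantity and would normally be used instead.

```python
import numpy as np


def silhouette_sketch(points, labels):
    """Mean silhouette coefficient, computed directly from its definition."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i, own in enumerate(labels):
        same = labels == own
        if same.sum() == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the zero distance to itself is excluded by dividing by n - 1).
        a = dists[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != own)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Toy data (hypothetical): two well-separated groups of three points.
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
good = silhouette_sketch(toy_points, [0, 0, 0, 1, 1, 1])   # labels match the groups
mixed = silhouette_sketch(toy_points, [0, 1, 0, 1, 0, 1])  # labels cut across the groups
print(round(good, 3), round(mixed, 3))
```

The labelling that matches the groups scores above 0.9, while the interleaved labelling scores close to zero, which is the regime the coefficients reported above fall into.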