Quicklinks
Earlier analysis showed that the raw datasets contain too much noise for any of the clustering algorithms to discern groupings effectively. Using PCA, the expression dataset from strain 1306 was reduced to the features that capture the majority of the variation within the dataset. The projected expression dataset was then clustered using the same algorithms as before to see whether the reduction in dimensions improves clustering performance.
The following site was consulted in the development of this code:
https://medium.com/@prasadostwal/multi-dimension-plots-in-python-from-2d-to-6d-9a2bf7b8cc74
Set up sbatch and Python environment
Import and format the dataset
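The import/format code itself is not reproduced here; a minimal sketch, assuming a hypothetical tab-delimited file expression_1306.tsv with observations as rows and expression features as columns (the actual path and orientation are not shown on this page), might look like:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical filename and layout for the 1306 expression matrix.
expression = pd.read_csv("expression_1306.tsv", sep="\t", index_col=0)

# Standardize each feature to zero mean and unit variance so that
# high-magnitude features do not dominate the decomposition.
scaled = StandardScaler().fit_transform(expression.values)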
Decompose the scaled data to its orthogonal components
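One way to perform this step is scikit-learn's PCA, which obtains the orthogonal components via singular value decomposition; this sketch continues from the scaled array above:

from sklearn.decomposition import PCA

# Fit a full decomposition first so that the variance explained by every
# component can be inspected before the feature space is chosen.
pca = PCA()
pca.fit(scaled)

# Eigenvectors of the covariance matrix, ordered by decreasing eigenvalue,
# and the fraction of total variance each one explains.
eigenvectors = pca.components_
explained = pca.explained_variance_ratio_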
Plot variance explained by the principal components to determine the dimensions of the feature space
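A sketch of the variance-explained (scree) plot used to pick the cutoff, continuing from the fitted pca object above:

import numpy as np
import matplotlib.pyplot as plt

# Per-component and cumulative variance explained for the first components;
# the cutoff is read off where the cumulative curve flattens.
k = min(20, len(explained))
xs = np.arange(1, k + 1)
plt.bar(xs, explained[:k], label="per component")
plt.plot(xs, np.cumsum(explained[:k]), color="red", label="cumulative")
plt.xlabel("Principal component")
plt.ylabel("Fraction of variance explained")
plt.legend()
plt.savefig("variance_explained.png")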
Select the eigenvalue, eigenvector pairs that will define the new feature space.
Project the expression data into the principal component feature space
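Because the components returned above are already ordered by decreasing eigenvalue, the selection in the previous step and the projection in this one reduce to a slice and a matrix product; a sketch continuing from the decomposition above (the output filename is a placeholder):

# The first 7 rows of the eigenvector matrix define the new feature space.
n_components = 7
basis = eigenvectors[:n_components]              # shape: (7, n_features)

# Project: each observation becomes its coordinates along the 7 components.
projected = scaled @ basis.T                     # shape: (n_obs, 7)

# Saved for plotting and clustering.
pd.DataFrame(
    projected,
    index=expression.index,
    columns=[f"PC{i}" for i in range(1, n_components + 1)],
).to_csv("projected_expression_1306.csv")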
Reformat output of previous step in bash
Import packages
Import data
Compress principal components 5, 6, and 7
Plot using plotly scatter_3d
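A sketch of the plotting script under the encoding described in the figure notes below (components 1-3 on the axes, component 4 as marker fill color, components 5-7 as marker outline color). The rescale-to-RGB packing is an assumption about how components 5, 6, and 7 were compressed into one color, and the filenames are placeholders:

import pandas as pd
import plotly.express as px

df = pd.read_csv("projected_expression_1306.csv", index_col=0)

def to_255(s):
    # Rescale one component to the 0-255 range used by RGB channels.
    return ((s - s.min()) / (s.max() - s.min()) * 255).astype(int)

# Pack components 5-7 into a single rgb(r,g,b) string per point.
line_colors = [
    f"rgb({r},{g},{b})"
    for r, g, b in zip(to_255(df["PC5"]), to_255(df["PC6"]), to_255(df["PC7"]))
]

# Components 1-3 on the axes, component 4 as the marker fill color.
fig = px.scatter_3d(df, x="PC1", y="PC2", z="PC3", color="PC4")

# Components 5-7 as the marker outline (line) color.
fig.update_traces(marker=dict(line=dict(color=line_colors, width=2)))
fig.write_html("pca_7d_scatter.html")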
See the relevant clustering algorithm page for the code used to cluster the transposed dataset.
The graph to the right shows that the majority of the variation is explained by the first 3 components. Principal components beyond the 7th explain little of the variation in the dataset. Thus, the principal component feature space will be constructed from the first 7 components.
The first three principal components are mapped to the x, y, and z axes, with x representing the first, y the second, and z the third. The fourth principal component is represented by the fill color of the markers, and the fifth, sixth, and seventh principal components are represented by the line (outline) color of the markers.
Summary
Reducing the dimensions of the dataset using PCA and then projecting the expression data into the principal component feature space improved the clustering of the Kmeans and EM-BGMM algorithms. DBSCAN's performance also improved, although it is a more complex case in which the improvement is less evident and less substantial.
For DBSCAN, the number of clusters detected after filtering is 2, whereas before filtering it was 1. Regions of low density created by projecting the data into the principal component feature space may have caused this increase; such low-density regions are a primary factor in how DBSCAN defines a cluster. Thus, even though DBSCAN's silhouette coefficient* decreased when the data was projected, the projection improved DBSCAN's ability to detect clusters. That is to say, the silhouette score of DBSCAN on the original data was artificially inflated by the presence of only one cluster, because part of the computation of the score relies on the distance from each data point to the nearest cluster to which it is not assigned. When only one cluster is detected, as was the case with the original data, that distance does not exist. Most likely the program, to avoid arithmetic errors, set that distance to nearly zero. Thus, even though the program was able to compute a number and label it the silhouette score, the score does not actually exist, because the distance to the nearest non-assigned cluster does not exist.
Reducing the dimensions of the 1306 expression dataset also decreased the run time of the EM-BGMM algorithm from approximately 7 hours to roughly 2 hours. Similar reductions were seen for the other algorithms.
*The silhouette coefficient is a metric of how well a data point fits within its assigned cluster and is frequently used to evaluate the performance of clustering algorithms.
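As a reference for the discussion above, a minimal sketch of computing the coefficient with scikit-learn, reusing the projected array from the sketches above (eps and min_samples are placeholders, not the values used in this analysis). Note that scikit-learn raises an error for a single-cluster labeling rather than substituting a value, so the exact behavior described above depends on the implementation:

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Placeholder parameters; see the relevant clustering algorithm page for
# the values actually used.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(projected)

# The silhouette coefficient averages (b - a) / max(a, b) over all points,
# where a is the mean distance to points in the same cluster and b is the
# mean distance to points in the nearest cluster the point is not assigned
# to. With one cluster, b is undefined; scikit-learn raises a ValueError
# ("Number of labels is 1") instead of returning a score.
if len(set(labels)) > 1:
    print("silhouette:", silhouette_score(projected, labels))
else:
    print("silhouette undefined: only one cluster detected")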
[Figures: average posterior probability and standard deviation of posterior probabilities]