Earlier analysis showed that the raw datasets are too noisy for any of the clustering algorithms to discern groupings effectively. The values in the DNAShapeR dataset for the five DNA shape features were therefore reduced to summary statistics. PCA was then used to reduce the summarized shape dataset to the components that capture most of the variation within the dataset. The projected shape dataset was clustered with the same algorithms as before to see whether the reduction in dimensions improves their performance.
Setup sbatch and import data
Compute summary statistics
Rscript to compute summary statistics
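As a minimal Python sketch of this kind of summary (not the R script itself), the snippet below collapses the per-position values of each shape feature into a handful of statistics per sequence. The choice of statistics (mean, standard deviation, minimum, maximum, median), the feature names `MGW` and `ProT`, the helper names, and the example numbers are all assumptions for illustration. Missing values at the sequence ends are assumed to be encoded as NaN and are skipped.

```python
import math
import statistics


def summarize_shape(values):
    """Summarize one shape feature along one sequence, skipping missing values."""
    clean = [v for v in values if v is not None and not math.isnan(v)]
    if not clean:
        raise ValueError("no usable values to summarize")
    return {
        "mean": statistics.fmean(clean),
        # Sample standard deviation needs at least two values.
        "sd": statistics.stdev(clean) if len(clean) > 1 else 0.0,
        "min": min(clean),
        "max": max(clean),
        "median": statistics.median(clean),
    }


def summarize_sequence(shapes):
    """Flatten the per-feature summaries into one row, e.g. {'MGW_mean': ...}."""
    row = {}
    for feature, values in shapes.items():
        for stat, value in summarize_shape(values).items():
            row[f"{feature}_{stat}"] = value
    return row


# Hypothetical per-position values for one short sequence (made-up numbers).
nan = float("nan")
example = {
    "MGW": [nan, nan, 5.0, 4.0, 6.0, nan, nan],
    "ProT": [nan, nan, -6.0, -8.0, -7.0, nan, nan],
}
row = summarize_sequence(example)
```

Each sequence becomes one fixed-length row regardless of its length, which is what makes the summarized dataset suitable as input to PCA.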
This was done in a similar fashion to the clustering of the projected expression dataset. Please consult that page for the code.
The graph to the right shows that most of the variation is explained by the first 4 components. Principal components beyond these explain little of the variation in the dataset.
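The quantity plotted in such a graph is the fraction of total variance captured by each component. A rough NumPy sketch of how it can be computed is given here; the function names, the 0.9 threshold, and the synthetic data are assumptions for illustration and are not taken from this analysis. scikit-learn's `PCA` exposes the same quantity as `explained_variance_ratio_`.

```python
import numpy as np


def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Singular values of the centered matrix give the component variances:
    # variance_k = s_k**2 / (n_samples - 1).
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2 / (X.shape[0] - 1)
    return variances / variances.sum()


def components_needed(ratios, threshold=0.9):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratios)
    count = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against floating-point shortfall when threshold is close to 1.
    return min(count, len(cumulative))


# Synthetic stand-in data: six features with very different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6)) * np.array([10.0, 5.0, 3.0, 2.0, 0.1, 0.1])
ratios = explained_variance_ratio(X)
n_keep = components_needed(ratios, threshold=0.9)
```

The point at which the cumulative ratio flattens out is the usual basis for deciding how many components to keep.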
The first three principal components are plotted on the x, y, and z axes, with x representing the first, y the second, and z the third. The fourth principal component is represented by the fill color of the traces. The fifth principal component is represented by the line color of the traces.
Summary
Summarizing the five shape features with statistics and reducing the dimensions with PCA improved the clustering of the K-means and EM-BGMM algorithms. The performance of DBSCAN also appears to improve, but its case is more complex and the improvement is less evident and less substantial.
See the page on clustering the projected expression data for an explanation of the complexities and flaws of DBSCAN.
*The silhouette coefficient measures how well each data point fits within its assigned cluster compared with the other clusters, and is frequently used to evaluate the performance of clustering algorithms.
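For reference, the standard definition gives each point the score s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is the lowest mean distance to the points of any other cluster. Scores near 1 indicate a good fit, and negative scores suggest the point sits closer to a different cluster. The plain-Python sketch below illustrates that definition with Euclidean distance. The function name and toy points are made up for illustration, and this is not necessarily the implementation used to produce the results on this page.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette score s = (b - a) / max(a, b)."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Convention: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


# Toy example: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_scores(points, [0, 0, 1, 1])
bad = silhouette_scores(points, [0, 1, 0, 1])
mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
```

Averaging the per-point scores gives a single number for a clustering. Here the labeling that matches the two pairs scores much higher than the labeling that splits them.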