Clustering a projected DNAShapeR dataset

Introduction

An earlier analysis showed that the raw datasets contain too much noise for any of the clustering algorithms to discern groupings effectively. To address this, the values in the DNAShapeR dataset for the 5 DNA shape features were condensed into summary statistics. PCA was then used to reduce the summarized shape dataset to the components that capture the majority of the variation within the dataset. The projected shape dataset was clustered using the same algorithms as before to see whether the reduction in dimensions improves their performance.
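The summarizing step can be sketched roughly as follows. This is a minimal Python illustration, not the Rscript used for the analysis. The choice of statistics (mean, standard deviation, minimum, maximum, median), the feature names MGW and Roll, the toy values, and the helper names summarize_profile and summarize_sequence are all assumptions made for the example.

```python
import statistics


def summarize_profile(values):
    """Collapse one per-position shape profile into a few summary statistics."""
    # Drop missing positions (assumed here to be encoded as None).
    clean = [v for v in values if v is not None]
    return {
        "mean": statistics.fmean(clean),
        "sd": statistics.stdev(clean),  # sample standard deviation
        "min": min(clean),
        "max": max(clean),
        "median": statistics.median(clean),
    }


def summarize_sequence(profiles):
    """Flatten {feature: profile} into one row keyed as 'feature_stat'."""
    row = {}
    for feature, values in profiles.items():
        for stat, value in summarize_profile(values).items():
            row[f"{feature}_{stat}"] = value
    return row


# Toy input: hypothetical per-position values for two shape features
# of a single sequence (not taken from the real dataset).
example = {
    "MGW": [None, None, 5.0, 4.0, 6.0, None, None],
    "Roll": [None, 1.0, -2.0, 3.0, 2.0, None],
}
summary = summarize_sequence(example)
```

Each sequence then contributes one fixed-length row regardless of its length, which is the form PCA and the clustering algorithms expect.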

Implementation

Summarize the DNAShapeR predictions for the 5 feature types

Set up sbatch and import data

Compute summary statistics

Rscript to compute summary statistics

Projection and running clustering algorithms

This was done in a similar fashion to the clustering of the projected expression dataset. Please consult that page for the code.

Results

Projection onto a principal component defined feature space

Constructing a feature space of reduced dimensions

The graph to the right shows that the first 4 components explain the majority of the variation. Principal components beyond these explain little of the variation in the dataset.
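For readers who want to see how the variance explained by each component can be obtained, here is a minimal sketch assuming NumPy. It is not the code that produced the graph. The toy matrix, the 0.9 threshold, and the function names pca_explained_variance and components_needed are hypothetical and chosen only for illustration.

```python
import numpy as np


def pca_explained_variance(X):
    """Return (explained variance ratio per component, principal directions)."""
    # PCA is defined on mean-centered data, so center each column first.
    Xc = X - X.mean(axis=0)
    # SVD of the centered matrix: the rows of Vt are the principal directions.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # Squared singular values are proportional to the variance per direction.
    ratio = s**2 / np.sum(s**2)
    return ratio, Vt


def components_needed(ratio, threshold=0.9):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    idx = int(np.searchsorted(np.cumsum(ratio), threshold))
    return min(idx + 1, len(ratio))


# Toy matrix with three orthogonal, zero-mean columns so that the
# variance split (36 : 16 : 4) is easy to verify by hand.
X = np.array(
    [[3.0, 2.0, 1.0],
     [-3.0, 2.0, -1.0],
     [3.0, -2.0, -1.0],
     [-3.0, -2.0, 1.0]]
)
ratio, Vt = pca_explained_variance(X)
k = components_needed(ratio, threshold=0.9)

# Project the centered data onto the first k principal components.
scores = (X - X.mean(axis=0)) @ Vt[:k].T
```

Plotting ratio against the component index gives a graph of the kind described above, and scores is the projected dataset that would be passed on to clustering.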

The shape data represented in the principal component feature space

The first three principal components are plotted on the x, y, and z axes, respectively. The fourth principal component is represented by the fill color of the traces, and the fifth by the line color of the traces.

Clustering of the dataset

Summary

Summarizing the data for the five shape features with statistics and reducing the dimensions with PCA improved the clustering produced by the Kmeans and EM-BGMM algorithms. DBScan also appears to improve, but its case is more complex and the improvement is less evident and less substantial.

See the page on clustering the projected expression data for an explanation of the complexity and flaws of DBScan.

*The silhouette coefficient measures how well each data point fits within its assigned cluster compared with the nearest other cluster, and is frequently used to evaluate the performance of clustering algorithms.
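To make the silhouette coefficient concrete, the sketch below computes it from its standard definition, s(i) = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. It is a plain-Python illustration using Euclidean distance, not the code used for the results on this page. The function name silhouette_scores and the toy points are hypothetical.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette values; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Common convention: a singleton cluster scores 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores


# Toy 1-D data: two tight, well-separated groups.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_scores(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_scores(points, [0, 1, 0, 1])   # labels cut across the groups

mean_good = sum(good) / len(good)
mean_bad = sum(bad) / len(bad)
```

Values near 1 indicate points that sit well inside their cluster, values near 0 indicate points on a boundary, and negative values indicate points that are closer to another cluster than to their own. Averaging over all points gives a single score per clustering, so mean_good comes out higher than mean_bad here.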