Quicklinks
Earlier analysis showed that the raw datasets contain too much noise for any of the clustering algorithms to discern groupings effectively. Using PCA, the expression dataset from strain 1306 was reduced to the features that capture the majority of the variation within the dataset. The projected expression dataset was then clustered using the same algorithms as before to see whether the reduction in dimensions improves clustering performance.
The following site was consulted in the development of this code:
https://medium.com/@prasadostwal/multi-dimension-plots-in-python-from-2d-to-6d-9a2bf7b8cc74
Set up sbatch and Python environment
Import and format the dataset
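The import/format code itself is not reproduced here; a minimal sketch, assuming a hypothetical tab-delimited file expression_1306.tsv with observations as rows and expression features as columns (the actual path and orientation are not shown on this page), might look like:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical filename and layout for the 1306 expression matrix.
expression = pd.read_csv("expression_1306.tsv", sep="\t", index_col=0)

# Standardize each feature to zero mean and unit variance so that
# high-magnitude features do not dominate the decomposition.
scaled = StandardScaler().fit_transform(expression.values)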
Decompose the scaled data to its orthogonal components
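One way to perform this step is scikit-learn's PCA, which obtains the orthogonal components via singular value decomposition; this sketch continues from the scaled array above:

from sklearn.decomposition import PCA

# Fit a full decomposition first so that the variance explained by every
# component can be inspected before the feature space is chosen.
pca = PCA()
pca.fit(scaled)

# Eigenvectors of the covariance matrix, ordered by decreasing eigenvalue,
# and the fraction of total variance each one explains.
eigenvectors = pca.components_
explained = pca.explained_variance_ratio_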
Plot variance explained by the principal components to determine the dimensions of the feature space
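A sketch of the variance-explained (scree) plot used to pick the cutoff, continuing from the fitted pca object above:

import numpy as np
import matplotlib.pyplot as plt

# Per-component and cumulative variance explained for the first components;
# the cutoff is read off where the cumulative curve flattens.
k = min(20, len(explained))
xs = np.arange(1, k + 1)
plt.bar(xs, explained[:k], label="per component")
plt.plot(xs, np.cumsum(explained[:k]), color="red", label="cumulative")
plt.xlabel("Principal component")
plt.ylabel("Fraction of variance explained")
plt.legend()
plt.savefig("variance_explained.png")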
Select the eigenvalue, eigenvector pairs that will define the new feature space.
Project the expression data into the principal component feature space
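Because the components returned above are already ordered by decreasing eigenvalue, the selection in the previous step and the projection in this one reduce to a slice and a matrix product; a sketch continuing from the decomposition above (the output filename is a placeholder):

# The first 7 rows of the eigenvector matrix define the new feature space.
n_components = 7
basis = eigenvectors[:n_components]              # shape: (7, n_features)

# Project: each observation becomes its coordinates along the 7 components.
projected = scaled @ basis.T                     # shape: (n_obs, 7)

# Saved for plotting and clustering.
pd.DataFrame(
    projected,
    index=expression.index,
    columns=[f"PC{i}" for i in range(1, n_components + 1)],
).to_csv("projected_expression_1306.csv")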
Reformat output of previous step in bash
Import packages
Import data
Compress principal components 5, 6, and 7
Plot using plotly scatter_3d
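A sketch of the plotting script under the encoding described in the figure notes below (components 1-3 on the axes, component 4 as marker fill color, components 5-7 as marker outline color). The rescale-to-RGB packing is an assumption about how components 5, 6, and 7 were compressed into one color, and the filenames are placeholders:

import pandas as pd
import plotly.express as px

df = pd.read_csv("projected_expression_1306.csv", index_col=0)

def to_255(s):
    # Rescale one component to the 0-255 range used by RGB channels.
    return ((s - s.min()) / (s.max() - s.min()) * 255).astype(int)

# Pack components 5-7 into a single rgb(r,g,b) string per point.
line_colors = [
    f"rgb({r},{g},{b})"
    for r, g, b in zip(to_255(df["PC5"]), to_255(df["PC6"]), to_255(df["PC7"]))
]

# Components 1-3 on the axes, component 4 as the marker fill color.
fig = px.scatter_3d(df, x="PC1", y="PC2", z="PC3", color="PC4")

# Components 5-7 as the marker outline (line) color.
fig.update_traces(marker=dict(line=dict(color=line_colors, width=2)))
fig.write_html("pca_7d_scatter.html")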
See the relevant clustering algorithm page for the code used to cluster the transposed dataset.
The graph to the right shows that the majority of the variation is explained by the first 3 components. Principal components beyond the 7th explain little of the variation in the dataset. Thus, the principal component feature space will be constructed from the first 7 components.
The first three principal components are mapped to the x, y, and z axes, with x representing the first, y the second, and z the third. The fourth principal component is represented by the fill color of the markers, and the fifth, sixth, and seventh principal components are represented by the line (outline) color of the markers.
Summary
Reducing the dimensions of the dataset using PCA and then projecting the expression data into the principal component feature space improved the clustering of the Kmeans and EM-BGMM algorithms. DBSCAN's performance also improved, although it is a more complex case in which the improvement is less evident and less substantial.
For DBSCAN, the number of clusters detected after filtering is 2, whereas before filtering it was 1. Regions of low density created by projecting the data into the principal component feature space may have caused this increase; such low-density regions are a primary factor in how DBSCAN defines a cluster. Thus, even though DBSCAN's silhouette coefficient* decreased when the data was projected, the projection improved DBSCAN's ability to detect clusters. That is to say, the silhouette score of DBSCAN on the original data was artificially inflated by the presence of only one cluster, because part of the computation of the score relies on the distance from each data point to the nearest cluster to which it is not assigned. When only one cluster is detected, as was the case with the original data, that distance does not exist. Most likely the program, to avoid arithmetic errors, set that distance to nearly zero. Thus, even though the program was able to compute a number and label it the silhouette score, the score does not actually exist, because the distance to the nearest non-assigned cluster does not exist.
Reducing the dimensions of the 1306 expression dataset also decreased the run time of the EM-BGMM algorithm from approximately 7 hours to roughly 2 hours. Similar reductions were seen for the other algorithms.
*The silhouette coefficient is a metric of how well a data point fits within its assigned cluster and is frequently used to evaluate the performance of clustering algorithms.
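As a reference for the discussion above, a minimal sketch of computing the coefficient with scikit-learn, reusing the projected array from the sketches above (eps and min_samples are placeholders, not the values used in this analysis). Note that scikit-learn raises an error for a single-cluster labeling rather than substituting a value, so the exact behavior described above depends on the implementation:

from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Placeholder parameters; see the relevant clustering algorithm page for
# the values actually used.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(projected)

# The silhouette coefficient averages (b - a) / max(a, b) over all points,
# where a is the mean distance to points in the same cluster and b is the
# mean distance to points in the nearest cluster the point is not assigned
# to. With one cluster, b is undefined; scikit-learn raises a ValueError
# ("Number of labels is 1") instead of returning a score.
if len(set(labels)) > 1:
    print("silhouette:", silhouette_score(projected, labels))
else:
    print("silhouette undefined: only one cluster detected")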
[Figures: average posterior probability and standard deviation of posterior probabilities]