This project was completed as part of my doctoral coursework. It is not part of my dissertation research.
Summary: Two datasets were projected into a lower-dimensional space using principal component analysis (PCA) before clustering. One dataset contained FDR expression values for the genes of an organism. The other contained DNA shape predictions from DNAShapeR. Projecting the data before clustering improved the ability of Kmeans, DBScan, and Expectation Maximization (EM) using Bayesian Gaussian Mixture Models (BGMM) to detect clusters in the two datasets. It also reduced run time by more than 50%.
Many algorithms exist for clustering features. They can be divided into classes such as partition-based, density-based, distribution-based, and hierarchical clustering. Kmeans is the standard partition-based clustering algorithm. DBScan is the standard density-based clustering algorithm. Expectation Maximization (EM) using Gaussian Mixture Models (GMM), a distribution-based approach, is quickly becoming another standard for clustering analysis.
This project explores the ability of Kmeans, DBScan, and a Bayesian variant of EM-GMM (EM-BGMM) to cluster data. It also investigates how projecting the data into a lower-dimensional space determined by principal component analysis (PCA) affects the performance of the clustering algorithms.
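The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: it assumes scikit-learn, uses synthetic blobs in place of the expression and DNAShapeR data, and the 90% variance threshold, `eps`, `min_samples`, and the number of clusters are all illustrative values. The helper `cluster_all` is a hypothetical name.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 300 samples, 50 features.
X, _ = make_blobs(n_samples=300, n_features=50, centers=3, random_state=0)

# Standardize, then keep enough principal components to explain 90% of the
# variance (the threshold is an illustrative choice).
X_scaled = StandardScaler().fit_transform(X)
X_proj = PCA(n_components=0.90, random_state=0).fit_transform(X_scaled)


def cluster_all(data, n_clusters=3):
    """Run the three algorithms on one data matrix and return their labels."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    dbscan = DBSCAN(eps=0.5, min_samples=5)  # label -1 marks noise points
    bgmm = BayesianGaussianMixture(
        n_components=n_clusters, max_iter=500, random_state=0
    )
    return {
        "kmeans": kmeans.fit_predict(data),
        "dbscan": dbscan.fit_predict(data),
        "em_bgmm": bgmm.fit_predict(data),
    }


labels_raw = cluster_all(X_scaled)   # clustering in the original dimensions
labels_proj = cluster_all(X_proj)    # clustering after PCA projection
print(X_scaled.shape, "->", X_proj.shape)
```

DBScan parameters in particular usually need retuning after projection, because distances between points change when dimensions are dropped.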
Kmeans, DBScan, and EM-BGMM all struggled to cluster the dataset in its raw dimensions. After reducing the dimensions of the dataset using PCA, the performance of the Kmeans and EM-BGMM algorithms improved. It is worth noting that the Kmeans and EM-BGMM algorithms roughly agree on the clusters in the projected data.
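One way to put a number on how closely two clusterings agree is the adjusted Rand index, which compares label assignments while ignoring how the clusters happen to be numbered. The sketch below is an assumption about how such a check could be done with scikit-learn on synthetic data; it is not the comparison performed in this project.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in data (assumption), projected onto two principal components.
X, _ = make_blobs(n_samples=300, n_features=20, centers=3, random_state=1)
X_proj = PCA(n_components=2, random_state=1).fit_transform(X)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X_proj)
bgmm_labels = BayesianGaussianMixture(
    n_components=3, random_state=1
).fit_predict(X_proj)

# 1.0 means identical partitions; values near 0 mean chance-level agreement.
ari = adjusted_rand_score(kmeans_labels, bgmm_labels)
print(f"Adjusted Rand index (Kmeans vs EM-BGMM): {ari:.2f}")
```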
However, projecting the data had a mixed effect on the DBScan algorithm. Its ability to detect groups within the data increased, but the silhouette coefficient* decreased. The improved group detection likely results from changes in the topology of the input caused by projection. The decreased silhouette coefficient results from flawed computations when the number of detected clusters is one. Although the program was able to compute a number and label it as the silhouette score, the score is undefined for a single cluster: it depends on each point's distance to the nearest cluster to which the point is not assigned, and no such cluster exists.
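A guard against this pitfall might look like the sketch below. It assumes scikit-learn, where DBSCAN marks noise points with the label -1 and `silhouette_score` treats that label like any other, so one real cluster plus noise can still yield a number. The helper name `safe_silhouette` and the toy data are hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def safe_silhouette(X, labels, noise_label=-1):
    """Silhouette score over non-noise points, or None when it is undefined."""
    X = np.asarray(X)
    labels = np.asarray(labels)
    keep = labels != noise_label          # drop DBSCAN noise points
    n_kept = int(keep.sum())
    n_clusters = np.unique(labels[keep]).size
    # The score is only defined for 2 <= n_clusters <= n_samples - 1.
    if n_clusters < 2 or n_clusters >= n_kept:
        return None
    return float(silhouette_score(X[keep], labels[keep]))


# Toy data (assumption): two tight groups and one far-away outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
one_cluster_plus_noise = np.array([0, 0, 0, 0, 0, -1])
two_clusters_plus_noise = np.array([0, 0, 0, 1, 1, -1])

score_one = safe_silhouette(X, one_cluster_plus_noise)    # undefined case
score_two = safe_silhouette(X, two_clusters_plus_noise)   # well-defined case
print(score_one, score_two)
```

Returning `None` instead of a number makes the undefined case explicit, so it cannot be mistaken for a poor score when results are compared across runs.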
Reducing the dimensions of the 1306 expression dataset also cut run time by more than half for all three algorithms.
*The silhouette coefficient measures how well each data point fits within its assigned cluster compared with the nearest neighboring cluster. It is frequently used to evaluate the performance of clustering algorithms.
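For reference, the standard definition of the silhouette value for a point $i$ assigned to cluster $C_I$, with $d$ a distance function, is given below. The notation is mine, not taken from the project code.

```latex
a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\; j \neq i} d(i, j),
\qquad
b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j),
\qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```

The silhouette coefficient of a clustering is the mean of $s(i)$ over all points. With only one cluster, the minimum in $b(i)$ runs over an empty set, which is why the score is undefined in that case.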
[Figures: clustering results for Kmeans, DBScan, and EM-BGMM, each shown for the Expression dataset and the DNAShapeR dataset.]