Many algorithms exist for clustering features. They can be divided into classes such as partition-based, density-based, distribution-based, and hierarchical clustering. Kmeans is the standard partition-based clustering algorithm. DBScan is the standard density-based clustering algorithm. Expectation Maximization (EM) using Gaussian Mixture Models (GMM), a distribution-based approach, is quickly becoming another standard for clustering analysis.
This project explores the ability of Kmeans, DBScan, and a Bayesian variant of EM-GMM (EM-BGMM) to cluster data. It also investigates how projecting the data into a lower-dimensional space determined by principal component analysis (PCA) affects the performance of the clustering algorithms.
Kmeans, DBScan, and EM-BGMM all struggled to cluster the dataset in its raw dimensions. After the dimensions of the dataset were reduced using PCA, the performance of the Kmeans and EM-BGMM algorithms improved. Notably, Kmeans and EM-BGMM roughly agree on the clusters in the projected data.
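The text does not say how agreement between the two algorithms was assessed. As one possible way to put a number on "roughly agree", the sketch below computes the Rand index, the fraction of point pairs that two labelings treat the same way (both together or both apart). The function name and the example labels are hypothetical and are not taken from the project.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the fraction of point pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two points
    in the same cluster, or both put them in different clusters. The
    cluster ids themselves do not need to match between labelings.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same points")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two points")
    agreements = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agreements += 1
    return agreements / len(pairs)


# Hypothetical labels standing in for Kmeans and EM-BGMM output.
kmeans_labels = [0, 0, 1, 1, 2, 2]
bgmm_labels = [1, 1, 0, 0, 2, 0]
print(rand_index(kmeans_labels, bgmm_labels))
```

A value of 1.0 means the two labelings describe the same partition. The plain Rand index is not corrected for chance, so an adjusted variant may be preferable when the number of clusters differs considerably between algorithms.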
The DBScan algorithm, by contrast, showed a mixed change in performance after the data was projected. Its ability to detect groups within the data increased, but the silhouette coefficient* decreased. The improved group detection likely results from changes in the topology of the input caused by the projection. The lower silhouette coefficient results from a flawed computation when only one cluster is detected. Although the program computed a number and labeled it as the silhouette score, the score is not defined for a single cluster: it depends on the distance from each data point to the nearest cluster to which the point is not assigned, and no such cluster exists.
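To illustrate why the score is undefined for a single cluster, the following is a minimal pure-Python sketch of the mean silhouette coefficient that returns `None` instead of a number when fewer than two clusters are present. It is not the project's code; the function name, the choice of Euclidean distance, and the convention of scoring singleton clusters as 0 are assumptions made for this example.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient, or None when it is undefined.

    For each point, a is the mean distance to the other members of its
    own cluster and b is the smallest mean distance to the members of
    any other cluster. The point's score is (b - a) / max(a, b).
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # With a single cluster there is no "nearest other cluster",
    # so b does not exist and no score can be reported.
    if len(clusters) < 2:
        return None

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # The point's distance to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(points, [0, 0, 1, 1]))  # close to 1
print(silhouette_score(points, [0, 0, 0, 0]))  # None: one cluster only
```

Returning `None` (or raising an error) makes the single-cluster case explicit, so it cannot be mistaken for a genuinely low score when results are compared across algorithms.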
Reducing the dimensions of the 1306 expression dataset also cut run time by more than half for all of the algorithms.
*The silhouette coefficient is a measure of how well each data point fits within its assigned cluster compared with the nearest other cluster. It is frequently used to evaluate the performance of clustering algorithms.
Kmeans
Expression dataset
DNAShapeR dataset
DBScan
Expression dataset
DNAShapeR dataset
EM-BGMM
Expression dataset
DNAShapeR dataset