Initial Resolution & Clustering Analysis on Layer 1 of the MTG Dataset
Identifying Metrics for Refinement, i.e. Silhouette Scores and Shannon Entropy
Implementing Iterative Clustering with Quality Metrics
Our metric-based iterative clustering pipeline starts by running Leiden clustering on a pre-processed dataset at a very low resolution. It then evaluates each cluster with a selected quality metric, such as Shannon entropy or silhouette score, computed in a specified feature space (e.g., PCA or the cell-by-gene matrix).
Once the initial clustering has run, the pipeline identifies low-quality clusters from what is ideally a broad set of assignments. For example, clusters with high entropy or low silhouette scores are flagged; the algorithm re-clusters those subsets with the Leiden algorithm and replaces them with finer subclusters to improve overall cluster quality. Additionally, NS-Forest can be used to restrict re-clustering to a reduced matrix containing only binary genes.
This process can be repeated for a fixed number of iterations or until convergence, which we detect by tracking global and cluster-wise silhouette scores and stopping when the scores no longer improve. We also generate UMAP plots and bar plots of quality scores to monitor progress.
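The flag, re-cluster, and stop-on-convergence loop described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code. The names `refine_once`, `iterative_refinement`, `toy_score`, and `toy_recluster` are hypothetical, and the global score is assumed to be the mean of the per-cluster scores. The toy scoring function stands in for silhouette scores, and the toy gap-based split stands in for Leiden; in practice `recluster` would wrap a Leiden call on the flagged subset.

```python
from statistics import mean


def refine_once(labels, scores, recluster, threshold):
    """Replace each cluster scoring below `threshold` with finer subclusters."""
    refined = list(labels)
    for cluster, score in scores.items():
        if score >= threshold:
            continue  # cluster is good enough, keep it as is
        members = [i for i, lab in enumerate(labels) if lab == cluster]
        sub = recluster(members)  # one sub-label per member index
        if len(set(sub)) < 2:
            continue  # the subset could not be split any further
        for i, s in zip(members, sub):
            refined[i] = f"{cluster}.{s}"
    return refined


def iterative_refinement(labels, score_fn, recluster, threshold,
                         max_iter=5, tol=1e-6):
    """Refine flagged clusters until the global score stops improving."""
    labels = list(labels)
    scores = score_fn(labels)
    global_score = mean(scores.values())  # assumption: global = mean of clusters
    for _ in range(max_iter):
        candidate = refine_once(labels, scores, recluster, threshold)
        if candidate == labels:
            break  # nothing was flagged, or nothing could be split
        cand_scores = score_fn(candidate)
        cand_global = mean(cand_scores.values())
        if cand_global - global_score <= tol:
            break  # converged: the refinement did not improve the score
        labels, scores, global_score = candidate, cand_scores, cand_global
    return labels, global_score


# Toy 1-D data: cluster "0" mixes two well-separated groups.
values = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0, 9.1, 9.2]
start = ["0"] * 5 + ["1"] * 3


def toy_score(labels):
    """Stand-in quality score: tight clusters score close to 1."""
    out = {}
    for cluster in sorted(set(labels)):
        xs = [v for v, lab in zip(values, labels) if lab == cluster]
        out[cluster] = 1.0 / (1.0 + max(xs) - min(xs))
    return out


def toy_recluster(members):
    """Stand-in for Leiden: split the subset at its largest gap."""
    if len(members) < 2:
        return [0] * len(members)
    order = sorted(members, key=lambda i: values[i])
    gaps = [values[b] - values[a] for a, b in zip(order, order[1:])]
    cut = gaps.index(max(gaps)) + 1
    side = {i: (0 if rank < cut else 1) for rank, i in enumerate(order)}
    return [side[i] for i in members]


final_labels, final_score = iterative_refinement(
    start, toy_score, toy_recluster, threshold=0.5
)
print(final_labels, round(final_score, 3))
```

Passing the scoring and re-clustering steps in as functions keeps the loop independent of the metric and feature space, so the same loop could in principle be driven by silhouette scores or by an entropy-based score.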
Testing on Layer 1 MTG Data using Silhouette Scores
Exploring Results
Iterative Pipeline Results
Leiden Clustering at Resolution 1
Reference Annotated Clusters
Compared with default Leiden clustering, the iterative pipeline does not over-cluster the excitatory cells and resolves the inhibitory cells at a finer granularity.
We used the adjusted Rand index (ARI) to compare both the Leiden clustering at resolution 1 and the iterative clustering results with the manual annotations. The iterative pipeline reaches an ARI of 0.89, versus 0.63 for Leiden at its default resolution, which indicates that our pipeline's assignments are closer to the manual annotations.
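For reference, the adjusted Rand index can be computed from the contingency table of two label assignments. The sketch below is a plain-Python illustration of the standard formula, not the code used to produce the numbers above. The function name `adjusted_rand_index` and the example labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two label assignments of the same cells."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both assignments must cover the same cells")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs of cells that share a cluster in both assignments.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs that share a cluster within each assignment separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both assignments are trivial
    return (sum_cells - expected) / (max_index - expected)


# Cluster names do not need to match; only the grouping matters.
same_grouping = adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"])
crossed_grouping = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_grouping, crossed_grouping)
```

A value of 1 means the two assignments group the cells identically, and values near 0 mean agreement no better than chance. scikit-learn's `sklearn.metrics.adjusted_rand_score` computes the same quantity and would normally be preferred on full datasets.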
Metrics:
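Silhouette scores are a common metric that measures how close each point is to the other points in its own cluster (cohesion) relative to the points in the nearest neighboring cluster (separation). Values range from -1 to 1, and higher values indicate better-defined clusters.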
Shannon entropy was explored as a measure of cluster purity, but it showed little variance in our dataset because of overall noise. NS-Forest was implemented to filter for binary genes, but the variance remained low.
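Both metrics can be illustrated with a small plain-Python sketch. It is illustrative only. The function names are hypothetical, and Euclidean distance is assumed for the silhouette. The distribution the entropy was computed over is not specified here, so the entropy example uses a hypothetical count vector (for instance, counts of categories within one cluster).

```python
from collections import defaultdict
from math import dist, log2
from statistics import mean


def silhouette_per_cluster(points, labels):
    """Mean silhouette score of each cluster (Euclidean distance assumed)."""
    groups = defaultdict(list)
    for i, lab in enumerate(labels):
        groups[lab].append(i)
    if len(groups) < 2:
        raise ValueError("silhouette needs at least two clusters")
    per_cluster = defaultdict(list)
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            per_cluster[lab].append(0.0)  # usual convention for singletons
            continue
        # a: mean distance to the other members of the same cluster (cohesion)
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the closest other cluster (separation)
        b = min(
            mean(dist(points[i], points[j]) for j in idx)
            for other, idx in groups.items()
            if other != lab
        )
        denom = max(a, b)
        per_cluster[lab].append(0.0 if denom == 0 else (b - a) / denom)
    return {lab: mean(vals) for lab, vals in per_cluster.items()}


def shannon_entropy(counts):
    """Shannon entropy in bits of a vector of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    # p * log2(1/p) is the same as -p * log2(p)
    return sum(p * log2(1.0 / p) for p in probs)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = ["a", "a", "b", "b"]
scores = silhouette_per_cluster(points, labels)
print(scores)

pure = shannon_entropy([10, 0])   # all counts in one category
mixed = shannon_entropy([5, 5])   # evenly split between two categories
print(pure, mixed)
```

Low silhouette scores and high entropy both point to clusters that may need refinement. On real data, scikit-learn's `sklearn.metrics.silhouette_samples` is a far more efficient way to obtain per-cell silhouette values.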
Page Leader: Sriya Paleti