Team 7 - Project Page

Design Overview

Project Goals:

1. Initial Resolution & Clustering Analysis on Layer 1 of the MTG Dataset

2. Identifying Metrics for Refinement, i.e. Silhouette Scores and Shannon Entropy

3. Implementing an Iterative Clustering Pipeline with Quality Metrics

Iterative Clustering Pipeline Overview

1. Our metric-based iterative clustering pipeline starts by running Leiden clustering on a pre-processed dataset at a very low resolution. Each cluster is then evaluated with a selected quality metric, such as Shannon entropy or silhouette score, in a specified feature space (e.g., PCA space or the cell-by-gene matrix).

2. After the initial clustering, the pipeline identifies low-quality clusters within this ideally broad set of assignments. For example, clusters with high entropy or low silhouette scores are flagged; the algorithm re-clusters those subsets with the Leiden algorithm and replaces them with finer subclusters to improve overall cluster quality. Optionally, NS-Forest can be used so that re-clustering runs on a reduced matrix containing only binary genes.

3. This process repeats for a fixed number of iterations or until convergence, which we detect by tracking global and cluster-wise silhouette scores and stopping when they no longer improve. We also generate UMAP plots and bar plots of quality scores to monitor progress.
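The control flow of the three steps above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not our actual implementation: `refine_clusters`, `split_at_largest_gap`, and `compactness` are hypothetical names, the 1-D gap splitter stands in for Leiden re-clustering, the compactness score stands in for a per-cluster silhouette score, and the mean of the per-cluster scores stands in for the global score.

```python
from statistics import mean


def split_at_largest_gap(values):
    """Toy stand-in for Leiden: cut a 1-D subset in two at its widest gap."""
    if len(values) < 2:
        return [0] * len(values)
    ordered = sorted(values)
    width, k = max((ordered[i + 1] - ordered[i], i) for i in range(len(ordered) - 1))
    if width == 0:
        return [0] * len(values)  # all values identical, nothing to split
    cut = (ordered[k] + ordered[k + 1]) / 2
    return [0 if v < cut else 1 for v in values]


def compactness(points, labels):
    """Toy stand-in for a per-cluster quality score (higher is better)."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: 1.0 / (1.0 + max(v) - min(v)) for lab, v in groups.items()}


def refine_clusters(points, labels, cluster_fn, score_fn, threshold, max_iter=5):
    """Re-cluster low-scoring clusters until the mean score stops improving."""
    labels = list(labels)
    best = mean(score_fn(points, labels).values())
    for _ in range(max_iter):
        scores = score_fn(points, labels)
        flagged = [lab for lab, s in scores.items() if s < threshold]
        if not flagged:
            break  # every cluster passes the quality threshold
        candidate = list(labels)
        for lab in flagged:
            idx = [i for i, current in enumerate(labels) if current == lab]
            sub = cluster_fn([points[i] for i in idx])
            if len(set(sub)) < 2:
                continue  # subset could not be split further
            for i, s in zip(idx, sub):
                candidate[i] = f"{lab}.{s}"  # replace with a finer subcluster
        new = mean(score_fn(points, candidate).values())
        if new <= best:
            break  # convergence: the global score no longer improves
        labels, best = candidate, new
    return labels


# Made-up 1-D data: coarse cluster "0" merges two well-separated groups.
points = [1.0, 1.1, 1.2, 5.0, 5.1, 9.0, 9.1, 9.2]
coarse = ["0"] * 5 + ["1"] * 3
refined = refine_clusters(points, coarse, split_at_largest_gap, compactness, threshold=0.5)
print(refined)
```

Passing the clustering and scoring steps in as functions keeps the loop independent of the metric, so a Leiden call and a silhouette or entropy computation could be swapped in without changing the stopping logic.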

Testing on Layer 1 MTG Data using Silhouette Scores

Exploring Results

Figure panels: Iterative Pipeline Results; Leiden Clustering at Resolution 1; Reference Annotated Clusters

Compared with default Leiden clustering, the iterative pipeline does not over-cluster the excitatory cells and resolves the inhibitory cells at finer granularity.

We used the Adjusted Rand Index (ARI) to compare each set of clustering assignments against the manual annotations. Our iterative pipeline reaches an ARI of 0.89, versus 0.63 for Leiden at its default resolution of 1, which shows that our pipeline's assignments agree more closely with the manual annotations.
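For readers unfamiliar with the Adjusted Rand Index, the sketch below computes it from the contingency counts of two labelings using only the standard library. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions, not our project data; scikit-learn's `sklearn.metrics.adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI between two labelings of the same items (1.0 = identical partitions)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat the partitions as identical
    # Pairs of items that share a cluster in both labelings, in A, and in B.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / total_pairs  # agreement expected by chance
    maximum = (in_a + in_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are one big cluster
    return (both - expected) / (maximum - expected)


# Made-up labels: cluster names differ, but the partitions are the same.
reference = ["exc", "exc", "inh", "inh"]
renamed = [0, 0, 1, 1]
crossed = [0, 1, 0, 1]
print(adjusted_rand_index(reference, renamed))
print(adjusted_rand_index(reference, crossed))
```

Because the index depends only on which items are grouped together, cluster names do not need to match between the two labelings, and values near 0 or below indicate agreement no better than chance.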

Metrics:

Silhouette scores are a common metric for how close each point is to the other points in its own cluster (cohesion) and how far it is from the nearest neighboring cluster (separation).
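As a concrete illustration, the silhouette of a point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. The sketch below assumes Euclidean distance and uses made-up 2-D points; the name `silhouette_per_point` is hypothetical, and scikit-learn provides an equivalent `sklearn.metrics.silhouette_samples`.

```python
import math


def silhouette_per_point(points, labels):
    """Silhouette value of every point, assuming Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the rest of the point's own cluster (cohesion)
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster (separation)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return scores


# Made-up data: two tight groups that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_per_point(points, [0, 0, 1, 1])  # labels match the groups
bad = silhouette_per_point(points, [0, 1, 0, 1])   # labels cut across them
print(sum(good) / len(good), sum(bad) / len(bad))
```

Values close to 1 indicate compact, well-separated clusters, while negative values suggest that points sit closer to another cluster than to their own.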

Shannon entropy was explored as a measure of cluster purity, but it showed little variance in our dataset because of overall noise. NS-Forest was used to filter for binary genes, but the variance remained low.
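To make the purity idea concrete, the sketch below computes Shannon entropy over the empirical distribution of a categorical label within one cluster. What the entropy is taken over is an assumption here (our pipeline may use a different distribution, such as normalized expression values), and the name `shannon_entropy` and the toy labels are illustrative only.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy in bits of the empirical distribution of `labels`."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty cluster carries no information
    # H = sum over categories of -p * log2(p), with p the category frequency
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Made-up clusters: one pure, one evenly mixed between two categories.
pure_cluster = ["A"] * 8
mixed_cluster = ["A"] * 4 + ["B"] * 4
print(shannon_entropy(pure_cluster), shannon_entropy(mixed_cluster))
```

A pure cluster has zero entropy, and entropy grows as the cluster mixes more categories evenly, which is why high-entropy clusters are candidates for re-clustering.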

Page Leader: Sriya Paleti
