scRNA-seq datasets contain thousands of cells, each profiled for the expression of many genes. The goal of clustering is to group cells by their gene expression patterns. This is challenging because the cells are not pre-labelled, so they must be grouped by an unsupervised clustering algorithm, and the results can vary with the dataset, the parameters chosen, and the biological complexity of the sample.
Another challenge is choosing the granularity, which is the level of detail in the clustering. With too little granularity, we risk failing to identify rare cell types. For example, one study that motivated us examined the middle temporal gyrus (MTG) region of the human brain, where researchers discovered an entirely new neuron type, the rosehip cell. This cell is important for inhibiting excitatory activity in the brain, yet it has not been found in mice and had been overlooked until careful transcriptomic analysis was performed on a limited amount of human brain tissue. Finding such hidden cell types is difficult because of cellular heterogeneity and the complexity of scRNA-seq data, but it is important for understanding tissues like the brain and how disease develops.
If we cluster too broadly (under-clustering), we can miss these biologically meaningful differences, but if we cluster too finely, we can over-partition the data into irrelevant categories. Currently, many researchers refine and adjust clusters manually in an ad hoc way, which can be time-consuming and subjective. At JCVI specifically, there is a large number of single-cell RNA-seq datasets that researchers would like to cluster quickly as a cross-check.
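The trade-off between clustering too broadly and too finely can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the twelve toy "cells", their two made-up gene values, and the `cluster_cells` and `count_clusters` helpers are all hypothetical. The sketch uses a simple distance-threshold linkage as a stand-in, not the graph-based methods that scRNA-seq tools typically use, but the threshold plays a role similar to a granularity setting.

```python
import math

# Hypothetical toy "cells": each tuple holds the expression of two made-up genes.
COMMON_A = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (1.0, 0.5)]
COMMON_B = [(10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.5, 10.5), (11.0, 10.5)]
RARE_C = [(10.0, 13.0), (10.5, 13.0)]  # a rare type that sits close to COMMON_B
CELLS = COMMON_A + COMMON_B + RARE_C


def cluster_cells(cells, threshold):
    """Label cells by linking any two whose Euclidean distance is <= threshold.

    Cells connected through a chain of such links receive the same label.
    """
    labels = [-1] * len(cells)
    next_label = 0
    for start in range(len(cells)):
        if labels[start] != -1:
            continue  # already assigned to a cluster
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(cells)):
                if labels[j] == -1 and math.dist(cells[i], cells[j]) <= threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


def count_clusters(cells, threshold):
    """Return how many distinct clusters a given threshold produces."""
    return len(set(cluster_cells(cells, threshold)))


# A large threshold is coarse, a small threshold is fine-grained.
for threshold in (3.0, 1.0, 0.1):
    print(f"threshold={threshold}: {count_clusters(CELLS, threshold)} clusters")
```

In this toy data, the coarse threshold merges the rare group into its larger neighbour, the intermediate threshold separates all three groups, and the very fine threshold puts every cell in its own cluster. No single setting is correct in general, which is why the choice is usually revisited for each dataset.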
Page Contributor: Alexandra Wood