Evaluation methods
Here we describe the implementation details of both the single-cell LLMs and the task-specific methods.
Single-cell LLMs
tGPT is a single-cell LLM based on the GPT-2 architecture. It uses large-scale scRNA-seq datasets for pre-training, with the pre-training task defined as predicting gene expression rankings. Its downstream applications follow the zero-shot learning framework and include clustering, batch effect correction, and bulk RNA-seq analysis.
scBERT is a pre-training-based single-cell LLM focusing on cell-type prediction. It is based on the Performer architecture, with gene embeddings initialized by Gene2vec, and contains six self-attention blocks. The default fine-tuning process of scBERT on downstream datasets freezes the penultimate layer. scBERT is considered for the Cell-type Annotation task.
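As a toy illustration of this partial fine-tuning, the sketch below applies a gradient step that skips any parameter marked as frozen; the parameter names and values are hypothetical, not scBERT's.

```python
# Toy illustration of freezing during fine-tuning: a gradient-descent
# step that updates only parameters not marked as frozen.
# All names and values here are hypothetical.
def sgd_step(params, grads, frozen, lr=0.1):
    return {name: (val if name in frozen else val - lr * grads[name])
            for name, val in params.items()}

params = {"encoder.w": 1.0, "head.w": 1.0}
grads = {"encoder.w": 0.5, "head.w": 0.5}
new = sgd_step(params, grads, frozen={"encoder.w"})
# "encoder.w" is frozen and stays at 1.0; "head.w" is updated
```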
Geneformer is a single-cell LLM using transfer learning to predict cell types and gene functions. Geneformer tokenizes each cell by ranking its gene expression values after scaling them across the whole training dataset; each cell is thus represented as a string of gene tokens ordered by rank. Geneformer is used for the Cell-type Annotation task and the Gene Function Prediction task.
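The rank-based tokenization described above can be sketched as follows; the gene names and corpus-wide scaling factors are hypothetical stand-ins for Geneformer's actual vocabulary and statistics.

```python
# Illustrative sketch of rank-value tokenization in the style of
# Geneformer; gene names and scaling factors are hypothetical.
def rank_tokenize(cell_counts, gene_factors):
    """Scale each gene's count by its corpus-wide factor, then emit
    genes as tokens ordered by scaled expression (descending)."""
    scaled = {
        gene: count / gene_factors[gene]
        for gene, count in cell_counts.items()
        if count > 0  # zero-count genes contribute no token
    }
    # Highest scaled expression first; ties broken alphabetically
    # so the token string is deterministic.
    return sorted(scaled, key=lambda g: (-scaled[g], g))

cell = {"GeneA": 10, "GeneB": 4, "GeneC": 0, "GeneD": 4}
factors = {"GeneA": 5.0, "GeneB": 1.0, "GeneC": 2.0, "GeneD": 8.0}
tokens = rank_tokenize(cell, factors)
# scaled values: GeneB 4.0, GeneA 2.0, GeneD 0.5; GeneC is dropped
```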
CellLM is a single-cell LLM using three different pre-training strategies. Its pre-training loss function includes: 1) masked gene expression level reconstruction; 2) cell condition discrimination; and 3) self-supervised contrastive learning. Moreover, it incorporates protein-protein interaction networks as prior information during pre-training. The downstream tasks of CellLM are all related to Cell-type Annotation.
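A minimal sketch of two of these objectives (masked reconstruction and an InfoNCE-style contrastive term) is given below; the toy vectors and similarity scores are hypothetical, and the condition-discrimination term is omitted for brevity.

```python
import math

# Toy illustration of combining pre-training objectives; the inputs
# are hypothetical, not CellLM's actual values or architecture.
def masked_recon_loss(pred, target, mask):
    """Mean squared error computed over masked positions only."""
    terms = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(terms) / len(terms)

def contrastive_loss(positive_sim, negative_sims, temperature=0.1):
    """InfoNCE-style loss: pull the positive pair together and
    push negatives apart."""
    logits = [positive_sim / temperature] + \
             [s / temperature for s in negative_sims]
    log_z = math.log(sum(math.exp(l) for l in logits))
    return -(logits[0] - log_z)

recon = masked_recon_loss([1.0, 2.0, 0.0], [1.5, 2.0, 3.0], [1, 0, 1])
contrast = contrastive_loss(0.9, [0.1, -0.2])
total = recon + contrast  # condition-discrimination term omitted
```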
scFoundation employs a pre-training methodology similar to BERT and introduces Bayesian down-sampling as a data pre-processing step. Its input also contains the target total counts and the input total counts as extra information. The downstream tasks of scFoundation include clustering (a function of cell embeddings across all models), drug response prediction (grouped under Cell-type Annotation), and Perturbation Prediction. We did not evaluate the performance of scFoundation because it was closed-source.
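The depth-reduction idea behind down-sampling can be illustrated by simple binomial thinning; scFoundation's actual hierarchical Bayesian scheme is more elaborate, and the counts below are toy values.

```python
import random

# Minimal sketch of binomial "thinning" to downsample a count vector
# toward a target sequencing depth. This only illustrates the
# depth-reduction idea, not scFoundation's Bayesian procedure.
def downsample_counts(counts, target_total, rng):
    input_total = sum(counts)
    p = target_total / input_total
    # Keep each individual read independently with probability p.
    return [sum(1 for _ in range(c) if rng.random() < p) for c in counts]

rng = random.Random(0)
cell = [50, 30, 20]  # input total counts = 100
down = downsample_counts(cell, target_total=40, rng=rng)
# the downsampled total fluctuates around the target of 40
```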
SCimilarity is presented as a foundation model for querying or searching new data against cell embeddings generated from known large-scale scRNA-seq datasets. Its downstream tasks include Batch Effect Correction and Cell-type Annotation. We did not evaluate the performance of SCimilarity because it was closed-source.
Task-specific methods
ResPAN is a batch effect correction tool based on a Generative Adversarial Network (GAN). The high-level idea of ResPAN is distribution alignment, or domain adaptation, across data from different batches. This requirement can be framed as an optimal transport problem, which ResPAN solves by training a GAN. ResPAN is used for the Batch Effect Correction task and the Multi-omics Data Integration task.
scVI is a batch effect correction tool based on variational inference and the variational auto-encoder. scVI encodes the gene expression data together with batch information using a neural network and sets the network's output as the parameters of a distribution over the latent space. Based on this latent distribution, scVI can correct the batch effect in the latent space as well as in the original space, provided we take the output of the decoder model.
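The variational-autoencoder step that scVI builds on can be sketched as follows; the toy "encoder" here is a hypothetical stand-in for scVI's neural network.

```python
import math
import random

# Conceptual sketch of the VAE step underlying scVI: the encoder
# outputs distribution parameters (mean, log-variance) for the latent
# space, and a latent sample is drawn via the reparameterization trick.
# The toy "encoder" below is hypothetical, not scVI's network.
def encode(expression, batch_id):
    # Stand-in for a neural network: returns (mu, log_var) per latent dim.
    mu = [sum(expression) / len(expression), float(batch_id)]
    log_var = [0.0, 0.0]
    return mu, log_var

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

rng = random.Random(42)
mu, log_var = encode([2.0, 4.0, 6.0], batch_id=1)
z = reparameterize(mu, log_var, rng)  # latent representation of the cell
```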
Vanilla NNs is a baseline neural network containing three MLP layers with batch normalization, using Mish as the activation function. Vanilla NNs are used for the Cell-type Annotation task and the Gene Function Prediction task.
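A minimal sketch of one such layer with the Mish activation, mish(x) = x · tanh(softplus(x)), is shown below; the weights are hypothetical and batch normalization is omitted.

```python
import math

# Sketch of the Mish activation used by the vanilla NN baseline:
# mish(x) = x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x).
def mish(x):
    return x * math.tanh(math.log1p(math.exp(x)))

# One MLP layer (hypothetical weights): linear map followed by Mish.
# The baseline stacks three such layers with batch normalization.
def mlp_layer(inputs, weights, bias):
    pre = [sum(w * x for w, x in zip(row, inputs)) + b
           for row, b in zip(weights, bias)]
    return [mish(v) for v in pre]

hidden = mlp_layer([1.0, -1.0],
                   weights=[[0.5, 0.5], [1.0, 0.0]],
                   bias=[0.0, 0.0])
# mish(0) = 0, so the first unit outputs exactly 0.0
```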
TOSICA is a deep learning-based method for one-stop cell-type annotation. TOSICA is built on a multi-head self-attention Transformer without pre-training. It also offers interpretability: its attention embeddings can be inspected by researchers and used for downstream biological analysis.
GEARS is a tool for single- and multi-gene perturbation prediction based on single-cell RNA sequencing datasets. It incorporates a gene-gene interaction network as prior information and uses a cross-gene neural network together with a graph neural network to predict post-perturbation gene expression.
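The role of the graph neural network can be illustrated by a single message-passing step over a toy gene-gene interaction graph; the genes, edges, and embeddings below are hypothetical.

```python
# Toy sketch of one message-passing step over a gene-gene interaction
# graph, the mechanism GEARS uses to propagate perturbation
# information; the graph and embeddings here are hypothetical.
def message_pass(embeddings, edges):
    """Update each gene's embedding as the average of itself and
    its neighbors' embeddings."""
    neighbors = {g: [] for g in embeddings}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    updated = {}
    for gene, emb in embeddings.items():
        group = [emb] + [embeddings[n] for n in neighbors[gene]]
        updated[gene] = [sum(vals) / len(group) for vals in zip(*group)]
    return updated

emb = {"GeneA": [1.0, 0.0], "GeneB": [0.0, 1.0], "GeneC": [1.0, 1.0]}
out = message_pass(emb, edges=[("GeneA", "GeneB")])
# GeneA and GeneB average with each other; isolated GeneC is unchanged
```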
Tangram is a toolbox for spatial transcriptomic data analysis based on neural networks. The key idea behind Tangram is to use neural networks to learn a good mapping from the single-cell data space to the spatial data space. After this mapping step, by integrating information at the single-cell and spatial levels, it can perform several downstream tasks, including data imputation, cell-type deconvolution, and others.
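Tangram's soft cell-to-spot mapping can be sketched as follows; the mapping scores stand in for a learned assignment and are hypothetical.

```python
import math

# Schematic of Tangram's core idea: learn a soft assignment of cells
# to spatial spots, then project cell-level measurements into space.
# The scores below stand in for a learned mapping (hypothetical).
def soft_assign(scores):
    """Softmax each cell's scores over spots -> mapping probabilities."""
    out = []
    for row in scores:
        exps = [math.exp(s) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def project(mapping, cell_values):
    """Spot value = sum over cells of mapping probability * cell value."""
    n_spots = len(mapping[0])
    return [sum(m[s] * v for m, v in zip(mapping, cell_values))
            for s in range(n_spots)]

mapping = soft_assign([[2.0, 0.0], [0.0, 2.0]])  # 2 cells x 2 spots
spatial = project(mapping, cell_values=[10.0, 0.0])
# cell 1 maps mostly to spot 0, so spot 0 receives most of its signal
```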
scDesign3 is a copula-based model for generating diverse single-cell datasets, which can be multimodal. Moreover, depending on its input parameters and requirements, scDesign3 can also generate datasets with specific conditions, including batch effects, cell conditions, and stages of cell differentiation. Its data generation is anchored to real datasets.
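A minimal Gaussian-copula sketch of the dependence modeling behind scDesign3 is shown below: correlated normals are mapped to uniforms, then pushed through each gene's marginal quantile function. The Poisson marginals and the correlation value are hypothetical toy choices.

```python
import math
import random

# Minimal Gaussian-copula sketch: correlated normals -> uniforms ->
# per-gene marginal quantiles. Marginals and correlation are toy
# choices, not fitted scDesign3 parameters.
def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def poisson_quantile(u, lam):
    """Smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    k, pmf = 0, math.exp(-lam)
    cdf = pmf
    while cdf < u:
        k += 1
        pmf *= lam / k
        cdf += pmf
    return k

def sample_pair(rho, lam1, lam2, rng):
    z1 = rng.gauss(0.0, 1.0)
    z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
    return (poisson_quantile(normal_cdf(z1), lam1),
            poisson_quantile(normal_cdf(z2), lam2))

rng = random.Random(0)
counts = [sample_pair(0.8, lam1=5.0, lam2=2.0, rng=rng) for _ in range(5)]
# each pair is a (gene1, gene2) count drawn with positive dependence
```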