Imputation and Simulation Analysis

Figure 5 shows the experimental results for the imputation and simulation analyses. Panels (a) and (b) evaluate imputation, and panels (c)-(e) evaluate simulation. Both evaluations show that single-cell LLMs need further improvement on these tasks.

Figure 5. Experimental results of the Imputation task and the Simulation task. The significance level was computed with a paired Student's t-test. The number of stars represents the significance level (***: p-value<0.005, **: p-value<0.05). (a): Comparison of the average bio score between the raw data and the data imputed by scGPT in the scRNA-seq imputation task. (b): Comparison of the average bio score, average correlation score, and average significance level score among Tangram, scGPT, and scGPT (zero-shot) in the spatial transcriptomics imputation task. (c): Comparison of the average bio score between scDesign3 and scGPT for simulation. (d): Gene-gene correlation heatmap from the raw HumanPBMC dataset, restricted to the subset of the top 100 highly variable genes. (e): Comparison of different simulation methods by correlation. The heatmap represents the top 100 highly variable genes (for raw and scDesign3) or the subset of the top 100 highly variable genes (for scGPT) based on the HumanPBMC dataset. The correlation "r" is the Pearson correlation between the gene-gene correlation of the raw data and the gene-gene correlation of the simulated data, both based on the HumanPBMC dataset.
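The significance annotation described in the caption can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes the paired test is run on per-dataset scores of two methods using `scipy.stats.ttest_rel`, and the score values and the helper names `significance_stars` and `paired_comparison` are made up for the example.

```python
from scipy import stats


def significance_stars(p_value):
    """Map a p-value to the star notation used in the Figure 5 caption."""
    if p_value < 0.005:
        return "***"
    if p_value < 0.05:
        return "**"
    return ""  # no star: not significant at the 0.05 level


def paired_comparison(scores_a, scores_b):
    """Paired Student's t-test on matched scores of two methods."""
    result = stats.ttest_rel(scores_a, scores_b)
    return result.statistic, result.pvalue, significance_stars(result.pvalue)


# Hypothetical per-dataset bio scores (illustrative numbers only).
raw_scores = [0.71, 0.65, 0.80, 0.74, 0.69, 0.77]
imputed_scores = [0.52, 0.47, 0.61, 0.58, 0.50, 0.57]

t_stat, p_value, stars = paired_comparison(raw_scores, imputed_scores)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}, stars = '{stars}'")
```

The test is paired because each dataset contributes one score per method, so the comparison is made on the within-dataset differences rather than on two independent samples.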

Imputation Analysis

We considered scGPT and Tangram for this task. The imputation results for scRNA-seq are summarized in Figure 5 (a). They suggest that the imputation function of scGPT introduced more noise into the original sequencing data, which indicates that the decoder's output is unreliable.

According to Figure 5 (b), scGPT performed well in the spatial transcriptomic data imputation task compared to Tangram, the SOTA spatial imputation method. On both the correlation score and the significance proportion, the imputation results of scGPT were better than those of Tangram. Moreover, the zero-shot version scored even higher on these two metrics than the version pre-trained with scRNA-seq data. However, on the average bio score, the raw data scored better. This could be caused by the source of the spatial clustering labels, which were generated from gene expression clusters rather than expert annotation. Labels derived in this way could bias the comparison between data before and after imputation.

Extended Data Figure 17. Differentially expressed gene discovery based on results before and after imputation. We used the Mouse scRNA-seq and the Mouse spatial transcriptomic datasets as examples. (a): Differentially expressed genes by cell type for scRNA-seq data based on pre-imputation data (top) and post-imputation data (bottom). (b): Differentially expressed genes by cluster type for spatial transcriptomic data based on pre-imputation data (top), post-imputation data based on zero-shot learning (middle) and post-imputation data based on fine-tuning (bottom). (c): Differentially expressed genes by cluster type for spatial transcriptomic data based on Tangram.

Extended Data Figure 17 shows the results of Differentially Expressed Gene (DEG) discovery based on pre-imputation and post-imputation data. Extended Data Figure 17 (a) shows that scRNA-seq imputation was not reliable, because the expression patterns of all genes were similar after imputation with scGPT. However, based on Extended Data Figure 17 (b), we found that the DEGs after imputation with scGPT did not contain mitochondrial (MT) genes. Moreover, based on Extended Data Figure 17 (b) and (c), the DEG patterns after imputation with scGPT and with Tangram were similar. Thus, scGPT has the potential to produce biologically meaningful imputation results.

Simulation Analysis

scRNA-seq simulation is a data generation task. Leveraging the generative pre-training process of scGPT, we can generate new gene expression profiles based on real datasets. Since a prevalent issue with scRNA-seq data simulation is the considerable divergence between simulated datasets and real datasets, direct generation from real datasets is preferred. By varying the sequence of masked genes or the random seed, we can generate new simulated scRNA-seq datasets from real ones. The quality of our simulated datasets can be evaluated by comparing them with the outputs of current simulation methods.
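The idea of deriving different simulated datasets from one real dataset by changing the masked genes or the seed can be sketched as below. This is only an illustration under stated assumptions: `predict_masked` is a hypothetical callable standing in for a generative model, `toy_predictor` is a placeholder and not scGPT, and the masking fraction of 0.5 is an arbitrary choice that does not come from the text.

```python
import numpy as np


def make_mask(n_genes, mask_fraction, seed):
    """Choose which genes to hide; a different seed gives a different mask."""
    rng = np.random.default_rng(seed)
    n_masked = int(round(mask_fraction * n_genes))
    masked_idx = rng.choice(n_genes, size=n_masked, replace=False)
    mask = np.zeros(n_genes, dtype=bool)
    mask[masked_idx] = True
    return mask


def simulate_from_real(expression, predict_masked, mask_fraction=0.5, seed=0):
    """Replace masked genes of a real cells-by-genes matrix with model output.

    `predict_masked(expression, mask)` is a hypothetical interface that
    returns a matrix of the same shape as `expression`.
    """
    mask = make_mask(expression.shape[1], mask_fraction, seed)
    predicted = predict_masked(expression, mask)
    simulated = expression.copy()
    simulated[:, mask] = predicted[:, mask]  # visible genes stay untouched
    return simulated, mask


def toy_predictor(expression, mask):
    """Placeholder model: predict every gene as the cell's mean visible value."""
    visible_mean = expression[:, ~mask].mean(axis=1, keepdims=True)
    return np.repeat(visible_mean, expression.shape[1], axis=1)


# Toy "real" count matrix: 5 cells by 10 genes (synthetic, for illustration).
rng = np.random.default_rng(42)
real = rng.poisson(lam=2.0, size=(5, 10)).astype(float)

sim_a, mask_a = simulate_from_real(real, toy_predictor, seed=1)
sim_b, mask_b = simulate_from_real(real, toy_predictor, seed=2)
print("masked genes (seed 1):", np.flatnonzero(mask_a))
print("masked genes (seed 2):", np.flatnonzero(mask_b))
```

Because the mask depends only on the seed, each simulated dataset is reproducible, while the unmasked genes keep their real values and anchor the simulation to the original data.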

Extended Data Figure 18. UMAPs for the simulation results with batch effect, using the HumanPBMC dataset as an example.

We considered scGPT and scDesign3 for this task. For conditions incorporating batch effects, we employed the same metrics used in the evaluation of batch effect correction. In scenarios without batch effects, our metrics focus primarily on the preservation of biological information. As shown in Figure 5 (c)-(e), scDesign3 outperformed scGPT under both conditions of the simulation task. Its advantage over scGPT was more pronounced when generating simulated data without batch effects. This is consistent with the results shown in Figure 5 (d)-(e), where the gene-gene correlation from scDesign3 is also more similar to the gene-gene correlation of the raw data. Therefore, single-cell LLMs need to improve on the simulation task.
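The correlation "r" used to compare gene-gene correlation structure can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes cells-by-genes matrices with matching gene order, and it compares only the off-diagonal upper triangle of the two correlation matrices, which is an assumption the text does not specify. Selection of the highly variable genes is left out, and the toy data are synthetic.

```python
import numpy as np


def gene_gene_correlation(expression):
    """Pearson correlation between genes for a cells-by-genes matrix."""
    return np.corrcoef(expression, rowvar=False)


def correlation_structure_similarity(raw, simulated):
    """Pearson r between the gene-gene correlations of two datasets."""
    corr_raw = gene_gene_correlation(raw)
    corr_sim = gene_gene_correlation(simulated)
    # Assumption: use each gene pair once and skip the trivial diagonal.
    upper = np.triu_indices_from(corr_raw, k=1)
    return np.corrcoef(corr_raw[upper], corr_sim[upper])[0, 1]


# Synthetic "raw" data: 200 cells, 8 genes driven by one shared latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
loadings = rng.normal(size=(1, 8))
raw = latent @ loadings + 0.5 * rng.normal(size=(200, 8))

# A faithful simulation keeps the structure; shuffling each gene destroys it.
faithful = raw + 0.1 * rng.normal(size=raw.shape)
shuffled = np.column_stack(
    [rng.permutation(raw[:, j]) for j in range(raw.shape[1])]
)

r_faithful = correlation_structure_similarity(raw, faithful)
r_shuffled = correlation_structure_similarity(raw, shuffled)
print(f"faithful r = {r_faithful:.3f}, shuffled r = {r_shuffled:.3f}")
```

A value of r close to 1 means the simulated data reproduce the pairwise gene relationships of the raw data, which is the sense in which one method's heatmap is "more similar" to the raw heatmap.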

Extended Data Figure 19. UMAPs for the simulation results with batch effect, using the HumanPBMC dataset as an example.

In addition, Extended Data Figures 18 and 19 present UMAPs of the output produced by the different methods, which illustrate the advantage of scDesign3. The embeddings of scGPT in the setting without batch effect tended to preserve the batch effect, while the embeddings in the setting with batch effect tended to remove it.