Figure 6 shows the experimental results for the emergent ability analysis and the stability analysis. In Figure 6, panels (a)-(d) correspond to the analysis of emergent abilities, and panel (e) corresponds to the evaluation of stability. We found that single-cell LLMs exhibit emergent abilities, but their stability should be improved.
Figure 6. Different comparison groups for emergent ability analysis and stability analysis. (a): The model scale of different methods. (b): Accuracy of LLMs and the vanilla NN in the Cell-type Annotation task. The dataset here is the Pancreas cross dataset. (c): Accuracy of LLMs and the vanilla NN in the Cell-type Annotation task. The datasets here are MB Spatial and MCA. (d): Overall score comparison including ResPAN and different settings of scGPT. The dataset here is the human spatial transcriptomic dataset. (e): Batch correction scores of different models across changing random seeds (left) and average classification scores of different models across changing random seeds (right). The bold black line represents the median value, while the length of each box reflects the spread (variance level) of the scores.
We discuss the emergent abilities of single-cell LLMs, focusing on scBERT, Geneformer, and scGPT. We considered three scenarios to investigate emergent abilities:
Cross-data cell type prediction: the anticipated emergent ability would be a significant improvement in prediction accuracy for single-cell LLMs compared to Vanilla NNs of a smaller size. Figure 6 (a) provides an overview of the different model sizes. Figure 6 (b) shows that scGPT and Geneformer outperformed NNs in accuracy in cross-data scenarios, suggesting emergent abilities in the cross-data cell-type annotation task. The low accuracy of scBERT may be caused by its default fine-tuning settings and/or its pre-trained model weights.
Cross-species analysis: the desired emergent ability mirrors that of the first task. Figure 6 (c) shows that the Vanilla NN outperformed scGPT on both the MCA and MB Spatial datasets. The Vanilla NN was comparable to Geneformer on the MCA dataset and worse than Geneformer on the MB Spatial dataset. Moreover, the Vanilla NN was better than scBERT on the MCA dataset and worse than scBERT on the MB Spatial dataset.
Spatial transcriptomics analysis: Figure 6 (d) suggests an emergent ability for batch effect correction. The fine-tuning process appeared beneficial in reducing the batch effect inherent in the spatial data, whereas settings without pre-training yielded subpar results. However, the performance of scGPT in the integration of spatial data was still worse than that of ResPAN. In addition, Figure 6 (c) shows that we did not detect any emergent ability for the second component, the MB Spatial data annotation task.
To analyze the stability of single-cell LLMs, we selected Batch Effect Correction and Cell-type Annotation as two representative tasks and varied the random seeds of single-cell LLMs to investigate model stability. These two tasks are central to single-cell data analysis and have well-established evaluation metrics. Ideally, the results of different single-cell LLMs should not vary substantially across runs with different random seeds. We also assessed the stability of other benchmarking tools. Our experimental results showed that the stability of single-cell LLMs is task-specific.
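The seed-variation protocol can be sketched as follows. Here `train_and_evaluate` is a hypothetical stand-in for fine-tuning one model (e.g. scGPT) under a given random seed and returning its task metric; the summary statistics match what the boxplots in Figure 6 (e) display:

```python
import statistics

def stability_summary(train_and_evaluate, seeds):
    """Run one model across several random seeds and summarize the spread
    of its evaluation metric (accuracy or batch-correction score)."""
    scores = [train_and_evaluate(seed=s) for s in seeds]
    q1, _, q3 = statistics.quantiles(scores, n=4)  # quartiles of the per-seed scores
    return {
        "scores": scores,
        "median": statistics.median(scores),  # the bold line in a boxplot
        "iqr": q3 - q1,                       # the box length (spread)
    }
```

A model is considered more stable when its inter-quartile range across seeds is small while its median score stays high.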
Based on the left panel of Figure 6 (e), the variance of scVI and ResPAN was generally lower than that of scGPT, and scVI and ResPAN also achieved higher scores on average. Therefore, single-cell LLMs were not as stable as SOTA deep-learning-based methods for the Batch Effect Correction task. The right panel of Figure 6 (e) suggests that the variance of Geneformer was generally smaller than that of scGPT and scBERT. All three models had high median average scores. Moreover, the variance of scBERT was relatively large in the experiments based on the MCA dataset, which implies that single-cell LLMs might fail under certain random seeds.
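The observation that a model may "fail under certain random seeds" can be operationalized with a simple check; the function name and the `drop` threshold below are illustrative assumptions, not part of the benchmark:

```python
import statistics

def failing_seeds(scores_by_seed, drop=0.1):
    """Return the seeds whose score falls more than `drop` below the
    median over all seeds, i.e. runs where the model effectively failed."""
    median = statistics.median(scores_by_seed.values())
    return sorted(s for s, v in scores_by_seed.items() if median - v > drop)
```

Applied to per-seed classification scores, an empty result indicates a stable model, while any flagged seed marks an outlier run of the kind seen for scBERT on the MCA dataset.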