Cell-type Annotation is another key step following single-cell data pre-processing. This step assigns each cell an accurate cell-type label, which can be obtained through prior knowledge or computational methods. The annotated cell-type labels provide essential biological information for downstream analyses, such as cell-type-specific network analysis. In addition, drug response prediction and single-cell disease classification can be treated as variations of this task. A common approach for single-cell LLMs in the cell-type annotation task is to train the model on annotated single-cell datasets and treat the unannotated datasets as testing datasets.
In the cell-type annotation task, we chose datasets with batch effects covering two different cases. The intra-dataset case allows batch intersection, meaning that the training and testing datasets can contain cells from the same batch; the inter-dataset (cross-dataset) case, in contrast, trains and tests on cells from different datasets. We considered Precision, Recall, and F1 scores in the ablation analysis.
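For reference, these metrics can be computed directly from predicted and ground-truth labels. The snippet below is a minimal sketch using scikit-learn; the toy labels and the macro averaging (equal weight per cell type) are illustrative assumptions, not the exact aggregation used in Appendix C.2.

```python
# Minimal sketch of the evaluation metrics; y_true / y_pred are toy
# ground-truth and predicted cell-type labels, and macro averaging
# (equal weight per cell type) is an assumed aggregation choice.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["T cell", "B cell", "T cell", "NK cell", "B cell"]
y_pred = ["T cell", "B cell", "NK cell", "NK cell", "B cell"]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Accuracy={accuracy:.3f}  Precision={precision:.3f}  "
      f"Recall={recall:.3f}  F1={f1:.3f}")
```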
Extended Data Figure 7. UMAPs for ground truth cell types and prediction results based on scGPT.
We considered Geneformer, scGPT, scBERT, CellLM, and TOSICA for this task. We assessed the performance of the different single-cell LLMs in assigning cell types based on the four metrics discussed in Appendix C.2. The UMAPs for the raw data and for scGPT are shown in Extended Data Figures 7 and 8, and they display accurate annotation results. Figure 3 (e) shows the accuracy on different datasets for the five models. On average, models with pre-training performed better than those without, although CellLM did not perform well across all the datasets. Moreover, for the intra-dataset prediction task, all the single-cell LLMs were comparable despite their different pre-training settings, although their performance diverged considerably from dataset to dataset.
Extended Data Figure 8. UMAPs for ground truth cell types and prediction results based on scGPT.
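UMAP visualizations of this kind can be reproduced with scanpy. The sketch below builds a toy AnnData object; the `cell_type` and `predicted` columns in `.obs` are hypothetical stand-ins for the ground-truth labels and a model's predictions.

```python
# Sketch of the annotation UMAPs; the AnnData object and its label
# columns are toy stand-ins for a real annotated dataset.
import numpy as np
import anndata as ad
import scanpy as sc

rng = np.random.default_rng(0)
adata = ad.AnnData(rng.poisson(1.0, size=(300, 100)).astype(np.float32))
adata.obs["cell_type"] = rng.choice(["T cell", "B cell", "NK cell"], size=300)
adata.obs["predicted"] = adata.obs["cell_type"].values  # placeholder predictions

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.pca(adata, n_comps=30)
sc.pp.neighbors(adata)        # k-NN graph used by UMAP
sc.tl.umap(adata)
sc.pl.umap(adata, color=["cell_type", "predicted"])  # side-by-side panels
```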
In Figure 3 (f) and Extended Data Figure 9, we compared the performance of models under different hyper-parameter settings. A higher loss weight, learning rate, ECS threshold, or mask ratio, as well as fewer epochs, tended to worsen the performance of scGPT, while the number of bins showed little correlation with its performance. The different single-cell LLMs behaved consistently when their shared hyper-parameters were altered. For Geneformer and scBERT, a lower learning rate and more epochs likewise tended to yield better performance.
Extended Data Figure 9. Tuning parameters for cell-type annotation. Sub-figures represent the score of scGPT under different hyper-parameters after training.
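Operationally, a tuning experiment of this kind amounts to a grid search over the shared hyper-parameters. The sketch below illustrates the loop; `fine_tune_and_score` is a hypothetical stand-in for one fine-tuning run, here returning a dummy value so the loop runs as-is.

```python
# Schematic grid search over shared hyper-parameters; fine_tune_and_score
# is a hypothetical stand-in for a full fine-tuning run.
from itertools import product

def fine_tune_and_score(learning_rate, epochs, mask_ratio, n_bins):
    """Placeholder: fine-tune one model and return a validation score.
    Returns a dummy value here; replace with an actual training run."""
    return 1.0 / (1.0 + learning_rate * 1e4) + 0.01 * epochs

grid = {
    "learning_rate": [1e-5, 1e-4, 1e-3],
    "epochs": [5, 10, 20],
    "mask_ratio": [0.25, 0.4, 0.75],
    "n_bins": [21, 51, 101],
}

scores = {}
for values in product(*grid.values()):
    config = dict(zip(grid, values))
    scores[values] = fine_tune_and_score(**config)

best = max(scores, key=scores.get)
print("Best configuration:", dict(zip(grid, best)))
```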
We also considered different initial settings for model training. Extended Data Figure 10 (a) shows the score versus initial settings across different datasets. Here we considered scGPT and scBERT, and omitted Geneformer because it requires pre-training weights as input. Pre-training consistently improved the results of scGPT, especially under the cross-dataset conditions, whereas it brought little benefit to scBERT. For both models, freezing the pre-trained layers, that is, excluding them from the fine-tuning process, is not recommended: in some cases, the fine-tuning performance with frozen layers was worse than training from scratch. Transfer learning across species is possible; for the MCA dataset, pre-training on human data helped predict cell types for the mouse.
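The freezing setting corresponds to excluding the pre-trained parameters from gradient updates. The PyTorch sketch below illustrates the idea on a hypothetical two-part model (a transformer encoder plus a classification head); it is not the actual architecture of any of the benchmarked LLMs.

```python
# Sketch of the "freeze pre-trained layers" setting in PyTorch; the
# encoder/head split is a hypothetical stand-in for a single-cell LLM.
import torch
import torch.nn as nn

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(64, 10)  # 10 hypothetical cell-type classes

for param in encoder.parameters():
    param.requires_grad = False  # frozen: excluded from fine-tuning

# Only still-trainable parameters are handed to the optimizer.
trainable = [p for p in encoder.parameters() if p.requires_grad]
trainable += list(head.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```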
For the same type of GPU, the training process of scGPT was faster than that of scBERT and Geneformer, albeit with higher GPU memory usage, according to Extended Data Figure 10 (b).
Extended Data Figure 10. Results of different settings, running time, and memory usage for cell-type annotation task. a: Accuracy of scGPT and scBERT for the Cell-type Annotation task across different datasets. b: Scaled running time (up) and scaled memory usage (down) statistics for all three single-cell LLMs.
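Running-time and memory statistics such as these can be collected with PyTorch's built-in CUDA counters. The sketch below is a minimal profiling wrapper; `train_fn` is a hypothetical callable representing one complete fine-tuning run.

```python
# Minimal profiling wrapper; train_fn is a hypothetical callable that
# performs one complete fine-tuning run on the GPU.
import time
import torch

def profile_run(train_fn):
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    train_fn()
    torch.cuda.synchronize()        # flush queued GPU work before timing
    elapsed_s = time.perf_counter() - start
    peak_gib = torch.cuda.max_memory_allocated() / 1024**3
    return elapsed_s, peak_gib
```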
In Extended Data Figure 11, we explored the performance of scGPT with different optimizers across four datasets. Adam, AdamW, and Lion were comparable; SGD was worse than these three but better than Sophia-G, which was unstable. Moreover, we explored the contribution of different loss function components to the cell-type annotation task, and Figure 3 (g) shows that the mask loss is important and that the default setting is generally good across different tasks. Based on the precision and recall in Figure 3 (g), the choice of loss function terms had less effect on precision and more effect on recall, and this difference propagated into the final F1 score. Removing the GEPC loss term improved cell-type prediction for the DC, MHSP, and MB Spatial datasets, and did not affect the prediction performance for the other datasets.
Extended Data Figure 11. Benchmarking results for different optimizers for cell-type annotation.
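Swapping optimizers in such an ablation is a one-line change per run. The sketch below instantiates the built-in PyTorch optimizers on a placeholder model; Lion and Sophia-G come from third-party packages (e.g. the `lion-pytorch` and `Sophia` repositories) and are therefore only indicated in a comment.

```python
# Optimizer ablation sketch; the linear model is a placeholder for a
# fine-tuned single-cell LLM. Lion and Sophia-G require third-party
# packages and are omitted here.
import torch

model = torch.nn.Linear(64, 10)

optimizers = {
    "Adam":  torch.optim.Adam(model.parameters(), lr=1e-4),
    "AdamW": torch.optim.AdamW(model.parameters(), lr=1e-4),
    "SGD":   torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9),
    # "Lion" / "Sophia-G": added analogously from their respective packages.
}
```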
Therefore, single-cell LLMs can handle the Cell-type Annotation task given suitable pre-training data and model structure, but no single pre-training framework performed consistently well across all the datasets.