Overview

This page provides an overview of the key results from the paper. Although we summarize them here, we highly encourage you to read the paper for a more in-depth discussion.

In this manuscript, we present a framework for assessing single-cell LLMs and their tasks, termed Single-cell Large Language Model Evaluation (scEval), which evaluates the factors affecting single-cell LLM training across various tasks.

Figure 1. Overview of single-cell LLMs, landscape of scEval, and factors affecting single-cell LLMs. (a): Overview of single-cell LLMs, describing the typical structure of LLMs and general tasks for single-cell data analysis. The right two blocks represent two types of downstream tasks. Yellow block: Sub-task 1, including Cell-type Annotation and Gene Function Prediction (top to bottom). Blue block: Sub-task 2, including Batch Effect Correction, Multi-omics Data Integration, and Imputation (left to right, top row), and Perturbation Prediction, Gene Network Analysis, and Simulation (left to right, bottom row). (b): The landscape of scEval, showing the workflow of our systematic evaluation analysis. (c): Factors that can affect the performance of single-cell LLMs. The known factors can be classified into four different types.

We evaluated the performance of five open-source single-cell LLMs (scGPT, Geneformer, scBERT, CellLM, and tGPT) by assessing each LLM on eight tasks with 22 datasets. The tasks that can be performed by the different models, as well as their overall ranks, are summarized in Figure 2. We also compared their performance with state-of-the-art (SOTA) task-specific methods. For each task, we discuss the effect of different factors on the performance of single-cell LLMs. For emergent abilities, we consider the contribution of model size to the performance of LLMs. Finally, we evaluate the stability and usability of different single-cell LLMs and make recommendations for preferred models.

Figure 2. Table of criteria to consider when choosing a model based on the breadth of tasks and usability. White blanks indicate that a model does not meet the criterion because it lacks a design for that specific task. Green circles represent functions implemented by the original model, and purple circles represent functions added by scEval for evaluation. The top three methods are highlighted.

In summary, the goal of studying single-cell LLMs is to develop a foundation model capable of performing multiple tasks with stable and reliable results. Such a model should also be user-friendly, with detailed tutorials or websites.