Gene Function Prediction

Gene function prediction is important for identifying the properties of genes across different conditions. Humans have approximately 20,000 protein-encoding genes, and only some of them are annotated with functions, so accurate prediction of gene function can help us understand and infer the roles of genes in biological systems. Here we consider three types of functions for genes:

We evaluated Geneformer, scGPT and Vanilla NN for this task. On average, Geneformer and scGPT performed well, and there was a performance gap between the single-cell LLMs and Vanilla NN.

Extended Data Figure 12. Tuning hyper-parameters for gene function prediction.

Figure 4 (b) and Extended Data Figure 12 show the accuracy under different hyper-parameter settings. Smaller learning rates and loss weights tended to lead to more accurate results. Geneformer was more sensitive to the number of epochs than scGPT. For scGPT, pre-training contributed more than fine-tuning in this task, as increasing the number of epochs did not affect model performance. Tuning only the number of bins, mask ratio, dropout rate and ECS threshold did not affect the prediction results.
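A hyper-parameter sweep of this kind can be organised as a simple grid search. The following is a minimal sketch, not the benchmarking code used here: the function `sweep`, the grid values and the scoring function `placeholder_score` are all hypothetical, and the placeholder returns a made-up number only so that the sketch runs. In practice it would be replaced by a routine that fine-tunes the model with the given configuration and returns its accuracy.

```python
from itertools import product


def sweep(grid, train_and_score):
    """Score every combination in `grid`; return (score, config) pairs, best first."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        results.append((train_and_score(config), config))
    # Sort by score only, so ties keep their original grid order.
    results.sort(key=lambda item: item[0], reverse=True)
    return results


# Hypothetical grid: the values are illustrative, not those used in the benchmark.
grid = {
    "learning_rate": [1e-5, 1e-4, 1e-3],
    "epochs": [5, 10, 20],
    "loss_weight": [0.1, 1.0],
}


def placeholder_score(config):
    # NOT a real model: a made-up score standing in for
    # "fine-tune with `config`, then return accuracy on held-out genes".
    return -config["learning_rate"] - 0.001 * config["loss_weight"]


results = sweep(grid, placeholder_score)
best_score, best_config = results[0]
print(best_config)
```

Keeping the scoring routine as an argument means the same loop can be reused for each model, with one hyper-parameter varied at a time or several varied jointly.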

Extended Data Figure 13. Benchmarking results of different initial settings, optimizers and loss components for gene function prediction. T1-T3 represent different gene prediction cases. (a): Results of adjusting initial settings. (b): Results of adjusting optimizers. (c): Ablation tests based on different loss components.

In Extended Data Figure 13 (a), we considered different initial settings for model training. Pre-training always improved the results for scGPT, and freezing the whole model did not affect its performance. Extended Data Figure 13 (b) shows the performance of scGPT with different optimizers: Adam and AdamW were comparable, while Lion was worse than both but better than SGD and Sophia-G. Extended Data Figure 13 (c) shows the ablation test results of scGPT for this task. There was no significant difference between the default setting and the models without certain loss components. Therefore, the task-specific loss function is the most important design choice for this task.
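One way to picture the frozen setting is as training only a small prediction head on top of fixed gene embeddings. The sketch below illustrates that idea in plain Python under several assumptions: the logistic-regression head, the synthetic four-dimensional embeddings and the names `train_linear_probe`, `predict` and `make_gene` are illustrative and are not taken from scGPT or from the benchmark code. The text above does not specify which parameters remain trainable when the model is frozen.

```python
import math
import random


def train_linear_probe(embeddings, labels, lr=0.1, epochs=200):
    """Fit a logistic-regression head on fixed (frozen) embeddings.

    Only the head's weights `w` and bias `b` are updated; the embeddings
    are read but never modified.
    """
    dim = len(embeddings[0])
    n = len(embeddings)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(embeddings, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            err = p - y                      # gradient of the log loss w.r.t. z
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        # Full-batch gradient descent step on the head only.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 if the head assigns a positive logit to embedding `x`, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


# Synthetic stand-in for pre-trained gene embeddings (assumption: two
# well-separated classes), used only to make the sketch runnable.
random.seed(0)


def make_gene(label):
    center = 1.0 if label == 1 else -1.0
    return [random.gauss(center, 0.5) for _ in range(4)]


labels = [i % 2 for i in range(40)]
embeddings = [make_gene(y) for y in labels]
frozen_copy = [list(x) for x in embeddings]

w, b = train_linear_probe(embeddings, labels)
accuracy = sum(predict(w, b, x) == y for x, y in zip(embeddings, labels)) / len(labels)
print(f"training accuracy of the head on frozen embeddings: {accuracy:.2f}")
```

If the pre-trained embeddings already separate the functional classes, a head of this kind can reach good accuracy without updating the backbone, which is consistent with the observation that freezing did not affect scGPT's performance.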