Batch effect correction is an essential step following scRNA-seq data pre-processing. Batch effects refer to distribution disparities among scRNA-seq datasets derived from the same tissue, which can be attributed to various technical factors. Reducing batch effects is critical not only for allowing researchers to discern genuine biological signals but also for facilitating integrated analyses across different studies.
Extended Data Figure 1. Visualization of batch score and biology conservation score for each dataset.
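As a point of reference for how such scores can be obtained, the sketch below derives an scIB-style batch score, biology conservation score, and weighted final score from a corrected embedding. The `X_corrected` key, the metric choices, and the 0.4/0.6 weighting are illustrative assumptions, not the exact evaluation pipeline used here.

```python
# Minimal sketch of scIB-style scoring on a corrected embedding.
# The obsm key, metric choices, and weighting are assumptions for illustration.
import numpy as np
import scanpy as sc
from sklearn.metrics import silhouette_score, normalized_mutual_info_score

def benchmark_correction(adata, embed_key="X_corrected",
                         batch_key="batch", label_key="cell_type"):
    emb = adata.obsm[embed_key]

    # Batch score: 1 - |ASW| on batch labels, so better mixing gives a higher score.
    batch_asw = silhouette_score(emb, adata.obs[batch_key])
    batch_score = 1.0 - abs(batch_asw)

    # Biology conservation: cell-type ASW rescaled to [0, 1], plus NMI between
    # Leiden clusters on the corrected embedding and the annotated cell types.
    bio_asw = (silhouette_score(emb, adata.obs[label_key]) + 1.0) / 2.0
    sc.pp.neighbors(adata, use_rep=embed_key)
    sc.tl.leiden(adata, key_added="leiden")
    nmi = normalized_mutual_info_score(adata.obs[label_key], adata.obs["leiden"])
    bio_score = np.mean([bio_asw, nmi])

    # Final score: weighted average (scIB convention: 0.4 batch, 0.6 bio).
    return 0.4 * batch_score + 0.6 * bio_score
```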
We considered scGPT, tGPT, and ResPAN for this task, and we also provide a detailed analysis of how various hyper-parameters influence the performance of scGPT in batch effect correction. As shown in Figure 3 (a), scGPT outperformed ResPAN in three of the nine datasets and outperformed tGPT in all datasets, while ResPAN achieved the best overall correction. In addition, scGPT outperformed the full scGPT model overall, calling into question whether larger pre-training datasets are needed. Moreover, scGPT performed worse at reducing batch effects in large-scale datasets, as their biology conservation scores after correction were lower than those of the raw data (Extended Data Figure 1). Extended Data Figures 2 and 3 show UMAP plots of the raw data and the scGPT results.
Extended Data Figure 2. UMAPs for raw data and embeddings of scGPT after batch effect correction.
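For readers reproducing such plots, a minimal scanpy sketch is given below. The `X_scGPT` key is a placeholder for wherever the corrected embedding is stored in the AnnData object, and the plotting choices are not necessarily those used for the Extended Data Figures.

```python
# Sketch of side-by-side UMAPs for raw data and a corrected scGPT embedding.
import scanpy as sc

def plot_umaps(adata, corrected_key="X_scGPT",
               batch_key="batch", label_key="cell_type"):
    # Raw data: neighbors graph on a standard PCA representation.
    sc.pp.pca(adata, n_comps=50)
    sc.pp.neighbors(adata, use_rep="X_pca")
    sc.tl.umap(adata)
    sc.pl.umap(adata, color=[batch_key, label_key],
               title=["raw: batch", "raw: cell type"])

    # Corrected data: neighbors graph built on the scGPT output embedding.
    sc.pp.neighbors(adata, use_rep=corrected_key)
    sc.tl.umap(adata)
    sc.pl.umap(adata, color=[batch_key, label_key],
               title=["scGPT: batch", "scGPT: cell type"])
```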
Based on Figure 3 (b) and Extended Data Figure 4, we provide a detailed analysis of the impact of various hyper-parameters on the performance of scGPT in batch effect correction. A smaller learning rate tended to yield better performance across all datasets. The optimal number of training epochs varied by dataset, with more epochs being beneficial for most datasets. Increasing the number of bins was generally associated with a higher final score. The impact of the mask ratio and dropout rate on model performance was unclear, suggesting that further investigation is needed to understand their influence.
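A minimal sketch of such a hyper-parameter sweep is shown below. Here `train_scgpt` and `evaluate` are hypothetical placeholders for the fine-tuning and scoring routines, and the grid values are illustrative rather than the exact settings benchmarked.

```python
# Hypothetical grid over the hyper-parameters discussed above.
# train_scgpt and evaluate are placeholders, not real API calls.
import itertools

grid = {
    "lr": [1e-3, 1e-4, 1e-5],          # smaller values tended to work better
    "epochs": [10, 20, 30],            # optimum was dataset-dependent
    "n_bins": [21, 51, 101],           # more bins generally raised the score
    "mask_ratio": [0.25, 0.4, 0.75],   # effect unclear in our runs
    "dropout": [0.0, 0.2, 0.5],        # effect unclear in our runs
}

results = []
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    model = train_scgpt(adata_train, **config)   # placeholder fine-tuning call
    score = evaluate(model, adata_test)          # placeholder, e.g. scIB final score
    results.append((config, score))

best_config, best_score = max(results, key=lambda r: r[1])
```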
Extended Data Figure 3. UMAPs for raw data and embeddings of scGPT after batch effect correction.
Extended Data Figure 5 (a) presents a comparison of scores across different initial settings for the batch effect correction task using scGPT. We can see that scGPT is capable of performing zero-shot learning in this task. Pre-training contributed significantly to the performance of scGPT in batch effect correction; without pre-training, the model's performance decreased notably. Using cross-entropy as the loss function for gene expression reconstruction yielded better results than mean squared error loss for most datasets. Freezing weights was not crucial for batch effect correction. Interestingly, the encoder structure appears to play a more significant role in the training process, as freezing the encoder layers led to a larger decrease in score. Incorporating cell type as a human-annotated label into the training process enhanced performance for most datasets.
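The freezing ablation can be expressed as a short PyTorch sketch like the one below. The `encoder.` and `decoder.` parameter-name prefixes are assumptions about module naming for illustration, not scGPT's actual layout.

```python
# Sketch of the weight-freezing ablation: parameters whose names match a
# prefix are excluded from gradient updates during fine-tuning.
import torch

def freeze_submodule(model: torch.nn.Module, prefix: str) -> None:
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad = False

# e.g. freeze_submodule(model, "encoder.") to test how much the encoder
# contributes to fine-tuning; the optimizer then only updates trainable weights:
# optimizer = torch.optim.Adam(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```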
Extended Data Figure 4. Benchmarking results for different hyper-parameters for Batch Effect Correction.
Extended Data Figure 5 (b) shows the performance metrics for different optimizer choices in batch effect correction. Extended Data Figure 5 (c) illustrates the impact of different loss function components on the performance of batch effect correction using scGPT. Using all components of the loss function did not always yield the best results; the exceptions were the Pancrm, MCA, and MHSP datasets. Using only the gradient reversal approach resulted in the worst performance, whereas the GEPC loss appeared to play a crucial role in the batch effect correction task.
Extended Data Figure 5. Overall evaluation of different components of scGPT. (a): The final score of different settings across different datasets. (b): The final score of different optimizers across different datasets. (c): The final score of including different loss function terms across different datasets. The number of stars represents the significance level (***: p-value < 0.005; **: p-value < 0.05; *: p-value < 0.1). The numbers on the left side of each sub-figure represent the average score across datasets for each condition.
These results suggest the need for a careful composition of the loss function when training single-cell LLMs for batch effect correction, with each loss function component contributing differently to the model performance.
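As an illustration of such a composed objective, the sketch below combines a gene expression reconstruction term, a GEPC-style term, and a DANN-style gradient reversal term for adversarial batch mixing. The term weights and exact formulations are assumptions for illustration, not scGPT's implementation.

```python
# Minimal sketch of a composed loss with a gradient reversal (DANN-style) term.
# Weights and formulations are illustrative assumptions.
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing back into the cell embedding,
        # pushing the encoder toward batch-invariant representations.
        return -ctx.lambd * grad_output, None

# Upstream, the batch discriminator is fed the reversed embedding, e.g.:
# batch_logits = discriminator(GradReverse.apply(cell_emb, 1.0))

def total_loss(pred_expr, true_expr, pred_expr_from_cell,
               batch_logits, batch_labels,
               w_gep=1.0, w_gepc=1.0, w_adv=1.0):
    loss_gep = F.mse_loss(pred_expr, true_expr)             # gene expression reconstruction
    loss_gepc = F.mse_loss(pred_expr_from_cell, true_expr)  # GEPC: predict expression from cell embedding
    loss_adv = F.cross_entropy(batch_logits, batch_labels)  # adversarial batch term via GradReverse
    return w_gep * loss_gep + w_gepc * loss_gepc + w_adv * loss_adv
```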