Perturbation Prediction

Perturbation prediction is a task built on gene editing and single-cell sequencing technologies. After silencing selected genes, sequencing yields both unperturbed and perturbed gene expression levels, which allows us to explore the interactions between genes. For this task, we construct paired input-target datasets by selecting cells with a non-control guide identity, randomly sampling cells under the control condition, and combining the two groups into training and testing datasets.
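A minimal sketch of this pairing step is shown below, assuming an AnnData object `adata` whose `obs["condition"]` column marks control cells as "ctrl"; all names and the split ratio are illustrative, not the exact pipeline used in the benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)

# adata.obs["condition"] is assumed to hold "ctrl" for control cells
# and the guide identity for perturbed cells.
perturbed = adata[adata.obs["condition"] != "ctrl"]
control = adata[adata.obs["condition"] == "ctrl"]

# Randomly pair each perturbed cell with a control cell: the control
# expression is the input and the perturbed expression is the target.
idx = rng.integers(0, control.n_obs, size=perturbed.n_obs)
inputs = control.X[idx]
targets = perturbed.X

# Split the pairs into training and testing sets (illustrative 80/20 split).
order = rng.permutation(perturbed.n_obs)
n_train = int(0.8 * perturbed.n_obs)
train_idx, test_idx = order[:n_train], order[n_train:]
```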

We considered scGPT and GEARS for this task. During training, we masked the genes under perturbation and reconstructed the expression levels of all genes of the input cells, rather than only the masked genes. We used Mean Pearson Correlation (MPC) to evaluate the performance of scGPT under different hyper-parameters and initial settings; details can be found in Appendix \ref{appendix: metrics information}. The datasets cover two perturbation conditions: single-gene perturbations and double-gene perturbations. In our experiments, scGPT predicted gene perturbation with a lower MPC than GEARS, but its default setting remains the best design for its current architecture.
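For orientation, the sketch below shows one common formulation of MPC: the Pearson correlation between the predicted and measured mean expression profiles of each perturbation condition, averaged across conditions. The exact definition used in the benchmark is given in Appendix \ref{appendix: metrics information}; the array shapes here are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def mean_pearson_correlation(preds, truths):
    """Average the per-condition Pearson correlation between predicted
    and measured mean expression profiles.

    preds, truths: dicts mapping condition -> (n_cells, n_genes) arrays.
    """
    corrs = []
    for cond in truths:
        pred_mean = preds[cond].mean(axis=0)
        true_mean = truths[cond].mean(axis=0)
        corrs.append(pearsonr(pred_mean, true_mean)[0])
    return float(np.mean(corrs))
```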

Extended Data Figure 14. Tuning hyper-parameters for perturbation prediction.

Figure 4 (c) summarizes results for different initial settings of scGPT on the Norman, Adamson and Dixit datasets. The default setting performed best across these datasets, indicating that the initial configuration of scGPT works well for this task. Performance was comparable between training from scratch and training from pre-trained weights. Freezing the decoder performed better than freezing the encoder and did not lead to a large increase in error, suggesting that the encoder carries a significant portion of the information needed for the prediction task.
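A minimal PyTorch sketch of the freezing strategy follows; the "encoder" and "decoder" name prefixes are illustrative and depend on the model's actual module names.

```python
import torch

def freeze_submodule(model: torch.nn.Module, prefix: str) -> None:
    """Freeze all parameters whose names start with `prefix`
    (e.g., "encoder" or "decoder")."""
    for name, param in model.named_parameters():
        if name.startswith(prefix):
            param.requires_grad = False

# Only parameters that still require gradients are passed to the optimizer:
# model = ...  # a pre-trained scGPT-style model
# freeze_submodule(model, "decoder")
# optimizer = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4
# )
```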

Regarding the effect of hyper-parameters, Extended Data Figure 14 shows that scGPT is sensitive to the learning rate and the number of epochs: decreasing the learning rate and increasing the number of epochs improved MPC, while increasing the dropout rate decreased MPC. The remaining hyper-parameters contributed little to this task.
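The snippet below summarizes these trends as a configuration sketch; the concrete values are examples only, not the benchmarked settings.

```python
# Illustrative hyper-parameter settings reflecting the observed trends.
default_config = {"lr": 1e-4, "epochs": 15, "dropout": 0.2}
adjusted_config = {
    "lr": 1e-5,      # a lower learning rate improved MPC
    "epochs": 30,    # more epochs improved MPC
    "dropout": 0.1,  # a higher dropout rate hurt MPC, so keep it modest
}
```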

Extended Data Figure 15. Benchmarking results of different optimizers and loss components for perturbation prediction. (a): Results of adjusting optimizers. (b): Ablation tests of different loss components.

Extended Data Figure 15 (a) shows that AdamW and Adam perform comparably for scGPT in perturbation prediction, while the remaining optimizers significantly reduced its performance. Moreover, the ablation tests in Extended Data Figure 15 (b) show that loss components other than the masked gene expression reconstruction loss contributed little to this task. The task-specific loss component is therefore the most important design choice for perturbation prediction.
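To make the ablation concrete, the sketch below shows the loss structure being probed: a masked-reconstruction term plus optional auxiliary terms. The function names and weighting scheme are illustrative, not scGPT's exact implementation.

```python
import torch
import torch.nn.functional as F

def perturbation_loss(pred, target, mask, aux_losses=(), aux_weight=0.0):
    """pred, target: (batch, n_genes) expression tensors.
    mask: boolean tensor marking the masked (perturbed) genes.
    aux_losses: optional auxiliary loss terms (e.g., contrastive terms)."""
    # The task-specific component: reconstruction error on masked genes.
    recon = F.mse_loss(pred[mask], target[mask])
    # The ablation suggests the auxiliary components can be down-weighted
    # or dropped without hurting MPC on this task.
    return recon + aux_weight * sum(aux_losses)
```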