Multi-omics Data Integration

Multi-omics data integration is key to multi-omics data analysis and can be viewed as an advanced form of batch effect correction. For unpaired multi-omics data, the objective is to map the different datasets into a shared space for subsequent analysis. For paired multi-omics data, the goal is to assess whether using multiple omics contributes to learning a more comprehensive representation of the data. A significant challenge here is how to align omics at the feature level.
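To make the unpaired setting concrete, the sketch below maps two modalities into a shared space with a CCA-like projection over shared features, in the spirit of common single-cell integration pipelines. This is a toy illustration, not scGPT's method: the matrices are random stand-ins, and it assumes ATAC peaks have already been summarized into gene-level activity scores so both modalities share one feature axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy unpaired data over 50 shared gene-level features (random stand-ins):
# 100 RNA cells and 80 ATAC cells, with no cell-to-cell correspondence.
rna = rng.standard_normal((100, 50))
atac = rng.standard_normal((80, 50))

# Center each modality's features, then take the SVD of the cell-by-cell
# cross-product -- a CCA-like construction that yields one embedding per
# cell from each modality in a common low-dimensional space.
rna_c = rna - rna.mean(axis=0)
atac_c = atac - atac.mean(axis=0)
m = rna_c @ atac_c.T                      # (100 RNA cells) x (80 ATAC cells)
u, s, vt = np.linalg.svd(m, full_matrices=False)

k = 10                                    # shared latent dimensionality
rna_emb = u[:, :k]                        # RNA cells in the shared space
atac_emb = vt.T[:, :k]                    # ATAC cells in the shared space
```

Once both modalities live in the same k-dimensional space, downstream steps such as joint clustering or label transfer can operate on the concatenated embeddings.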

In this task, we seek to integrate single-cell RNA-sequencing (scRNA-seq) datasets with single-cell ATAC-sequencing (scATAC-seq) datasets. We assessed integration quality using the same score as in the Batch Effect Correction task. The results in Figure 3 (c) summarize the impact of initial setting choices on the performance of scGPT for multi-omics integration. As with batch effect correction, the cross-entropy loss led to better performance than the Mean Squared Error (MSE) loss on this task. Interestingly, pre-training did not significantly influence performance here. The encoder of the single-cell LLM played a more important role than the decoder. Including cell types or human labels in training proved beneficial, likely providing the model with more precise and useful information for the task. The zero-shot learning approach did not perform as well for this task as it did for batch effect correction.
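As a rough illustration of how integration quality can be scored, the snippet below computes a simplified kNN batch-mixing score on a joint embedding: the average fraction of each cell's nearest neighbors that come from the other modality. This is not the exact composite score used in the benchmark, just a minimal stand-in on synthetic data; a well-mixed embedding scores near the expected cross-batch proportion, while a poorly integrated one scores near zero.

```python
import numpy as np

def knn_mixing_score(emb, batch, k=15):
    """Average fraction of each cell's k nearest neighbors drawn from a
    different batch; ~0.5 for two equal, well-mixed batches, ~0 when the
    batches occupy separate regions of the embedding."""
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)               # exclude self-neighbors
    nn = np.argsort(d, axis=1)[:, :k]         # indices of k nearest cells
    return float((batch[nn] != batch[:, None]).mean())

rng = np.random.default_rng(0)
batch = np.repeat([0, 1], 100)                # 100 cells per modality
mixed = rng.standard_normal((200, 10))        # batch-independent embedding
separated = mixed + batch[:, None] * 10.0     # shift one batch far away

score_mixed = knn_mixing_score(mixed, batch)          # close to 0.5
score_separated = knn_mixing_score(separated, batch)  # close to 0.0
```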

Extended Data Figure 6. Tuning parameters for multi-omics data integration. (a)-(g) show the scores of scGPT under different hyper-parameter settings after training.

We illustrate the evaluation metrics under different parameter settings in Figure 3 (d) and Extended Data Figure 6. scGPT did not perform well on this task, as shown by the low score (below 0.5); the UMAP results reflect this as well. Certain parameters affected the training process: a smaller weight for the loss function, a smaller mask ratio, and more epochs improved the model's performance. Setting the learning rate too high caused the model to collapse.
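To show where the loss weight and mask ratio enter a masked-value objective, the sketch below masks a toy expression matrix at a given ratio, computes a reconstruction loss only on the masked positions, and scales it by a loss weight before it would be added to the total objective. All names and values here are hypothetical stand-ins for the tuned hyper-parameters, not scGPT's actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.poisson(5.0, size=(4, 20)).astype(float)  # toy expression matrix

mask_ratio = 0.15     # fraction of entries hidden from the model (tunable)
loss_weight = 0.1     # down-weights this term in the total loss (tunable)

# Hide a random subset of entries; the model must reconstruct them.
mask = rng.random(expr.shape) < mask_ratio
pred = expr + rng.standard_normal(expr.shape)        # stand-in predictions

# Reconstruction loss is computed only on masked positions, then scaled
# by the loss weight before entering the combined training objective.
mse_masked = ((pred - expr) ** 2)[mask].mean()
weighted_loss = loss_weight * mse_masked
```

A smaller `loss_weight` shrinks this term's gradient contribution, and a smaller `mask_ratio` leaves more context visible per cell, which is one plausible reading of why reducing both helped in this task.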