Gene Network Analysis

Gene network analysis is a downstream task for single-cell datasets. The objective is to infer specific gene networks, i.e., a Gene Regulatory Network (GRN) or a Gene Co-expression Network (GCN), from different datasets. A GRN can assist in understanding the regulatory relationships between genes and in predicting perturbation outcomes. A GCN can be used to analyze genes with similar functions or to uncover the characteristics of genes in some diseases. Inferring a GCN and inferring a GRN are two different tasks because correlation does not imply causation. This limitation means that we cannot determine which genes are the "causes" of expression-level changes in other genes based only on embedding similarity or correlation.
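To illustrate why embedding similarity alone yields an undirected network, the following minimal Python sketch builds a toy co-expression-style network from gene embeddings using cosine similarity. The function name, the threshold, the placeholder gene names, and the random embeddings are all illustrative assumptions; this is not the scGPT pipeline.

```python
import numpy as np

def cosine_similarity_network(embeddings, gene_names, threshold=0.4):
    """Return the cosine similarity matrix and undirected edges above a threshold.

    Hypothetical helper for illustration only.
    """
    emb = np.asarray(embeddings, dtype=float)
    # Normalize each gene embedding to unit length (guarding against zero norms).
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    edges = []
    n = len(gene_names)
    # Only the upper triangle is visited: sim[i, j] == sim[j, i],
    # so an edge carries no direction and no causal information.
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                edges.append((gene_names[i], gene_names[j], float(sim[i, j])))
    return sim, edges

# Toy data: placeholder gene names and random embeddings (assumptions).
rng = np.random.default_rng(0)
toy_genes = ["gene_1", "gene_2", "gene_3", "gene_4", "gene_5"]
toy_embeddings = rng.normal(size=(5, 16))

similarity, gcn_edges = cosine_similarity_network(
    toy_embeddings, toy_genes, threshold=0.0
)
```

Because the similarity matrix is symmetric, every edge is undirected. A GRN would additionally require directional evidence, such as perturbation data, which this kind of similarity cannot supply.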

We used the Immune Human Atlas dataset to evaluate the performance of inferring these two types of GCNs. Known information, including marker genes, cell types, and Reactome pathways, was used to evaluate the performance of scGPT on GCN inference.

In Figure 4 (d), only the marker genes from two cell types, Monocyte-derived dendritic cells and Megakaryocyte progenitors, showed a co-embedded and isolated relationship. Figure 4 (e), on the other hand, shows the cluster labels obtained with the Leiden clustering method. These clusters can be interpreted as groups of genes that share common functions. For the marker genes from other cell types, some fall into different clusters in Figure 4 (e), and some are co-embedded with the marker genes of other cell types.
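The comparison between marker genes and cluster labels described above can be sketched as a simple check. In this hedged Python example, a cell type's markers count as "co-embedded" when they all fall into one cluster and as "isolated" when that cluster contains no markers from any other cell type. The function name, these two working definitions, and the toy labels are assumptions for illustration, not the procedure used to produce Figure 4.

```python
from collections import defaultdict

def marker_cluster_summary(gene_to_cluster, markers_by_cell_type):
    """Summarize how each cell type's marker genes are spread over clusters.

    Hypothetical helper; genes without a cluster label are ignored.
    """
    # Record which cell types contribute marker genes to each cluster.
    cluster_to_types = defaultdict(set)
    for cell_type, markers in markers_by_cell_type.items():
        for gene in markers:
            if gene in gene_to_cluster:
                cluster_to_types[gene_to_cluster[gene]].add(cell_type)

    summary = {}
    for cell_type, markers in markers_by_cell_type.items():
        clusters = {gene_to_cluster[g] for g in markers if g in gene_to_cluster}
        # Co-embedded: all labeled markers share a single cluster.
        co_embedded = len(clusters) == 1
        # Isolated: that cluster holds markers of this cell type only.
        isolated = co_embedded and all(
            cluster_to_types[c] == {cell_type} for c in clusters
        )
        summary[cell_type] = {
            "clusters": sorted(clusters),
            "co_embedded": co_embedded,
            "isolated": isolated,
        }
    return summary

# Toy labels with placeholder gene and cell-type names (assumptions).
toy_gene_to_cluster = {"g1": 0, "g2": 0, "g3": 1, "g4": 2, "g5": 1, "g6": 2}
toy_markers = {
    "type_A": ["g1", "g2"],
    "type_B": ["g3", "g4"],
    "type_C": ["g5", "g6"],
}
toy_summary = marker_cluster_summary(toy_gene_to_cluster, toy_markers)
```

In this toy input, only `type_A` is both co-embedded and isolated. The markers of `type_B` and `type_C` are split across clusters 1 and 2 and share those clusters with each other, mirroring the mixed pattern described for most cell types.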

Extended Data Figure 16. Examples of GCNs for the Immune Human Atlas dataset. (a): An example GCN for the human immunology dataset. It is a network with CD- genes as major nodes. (b): Cell-type-level gene embeddings colored by cell type. (c): Leiden clustering results based on the cell-type-level gene embeddings.

Extended Data Figure 16 (a) and (b) focus on gene embeddings categorized by cell types. Extended Data Figure 16 (a) shows that gene embeddings from different cell types tended to be co-embedded, with no apparent difference between them. There was specific gene enrichment among the genes from Erythroid progenitors and CD16+ Monocytes. The distribution of the remaining genes in the UMAP results was relatively random, which could be due to two reasons: 1) the quality of the gene embeddings was unsatisfactory; 2) the complex biological network of the human immune system makes cell-cell or gene-gene interactions difficult to decompose. Additional analysis of the gene embeddings is needed. Extended Data Figure 16 (b) shows the clustering results based on the Leiden algorithm, which can be interpreted as co-functional gene groups. Most of the clusters contained marker genes of different cell types. Despite these challenges, the scGPT model still demonstrates its potential in identifying functional similarities between different cell types.

Extended Data Figure 16 (c) shows the GCN inferred by scGPT for the major CD- genes. Our results highlight the importance of critical evaluation and cross-referencing in the development of gene regulatory networks, as well as the potential and limitations of using machine learning models like scGPT for this purpose.
