Cai lab - Research

Research

Our goal is to understand the nature of complex traits and diseases, with a particular focus on psychiatric disorders like major depressive disorder (MDD).

Figure: Phenotype definitions for EHR phenotypes, deep phenotypes and predicted liability phenotypes in the EDGAR framework, and the EDGAR framework for predicting lifetime disease liabilities [4].

Maximizing the utility of electronic health record (EHR) and biobank data for genetics research

Genetic studies rely heavily on meta-analysis of multiple cohorts to reach sample sizes that provide adequate statistical power. Recent studies of psychiatric disorders, in particular, mostly attempt to meta-analyse as many cohorts as possible, without regard to phenotypic heterogeneity between cohorts that may arise from misdiagnosis involving other disorders and from other heritable confounders [1]. Much of our previous work went into demonstrating that this is an important problem [1] and showing how it may be solved by integrating information from other phenotypically and genetically correlated phenotypes [2,3]. Yet most of these previous efforts focused on relatively rich questionnaire data. Our newest efforts tackle sparse, sequential and systemically biased electronic health record diagnostic codes [4]. Overall, we believe that the ideal phenotypes to use in genetic studies are lifetime disease liabilities that are free of the systemic bias that is prevalent in EHRs and biobanks. As such, our ongoing work focuses on developing deep-learning models to predict lifetime disease liabilities, disease trajectories, and treatment response phenotypes from raw and biased EHR/biobank data, leveraging disease-specific biomarkers and gold-standard labels in small numbers of individuals for model training.
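To make the heterogeneity concern concrete, the following is a minimal sketch of a standard fixed-effect inverse-variance meta-analysis with Cochran's Q and the I² statistic. It is a generic textbook procedure, not the method used in any of the referenced papers; the function name and the per-cohort effect sizes are hypothetical and chosen only for illustration.

```python
import math


def inverse_variance_meta(betas, ses):
    """Fixed-effect inverse-variance meta-analysis of per-cohort effect estimates.

    Returns the pooled effect, its standard error, Cochran's Q and I^2.
    """
    if len(betas) != len(ses) or len(betas) < 2:
        raise ValueError("need matching effects and standard errors from >= 2 cohorts")
    # Each cohort is weighted by the inverse of its sampling variance.
    weights = [1.0 / se ** 2 for se in ses]
    total_w = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_w
    pooled_se = math.sqrt(1.0 / total_w)
    # Cochran's Q: weighted squared deviations of cohort effects from the pooled effect.
    q = sum(w * (b - pooled) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I^2: share of the variation across cohorts that exceeds what sampling error explains.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return pooled, pooled_se, q, i2


# Hypothetical per-cohort log odds ratios for one variant (made-up numbers).
homogeneous = inverse_variance_meta([0.10, 0.10, 0.10], [0.02, 0.03, 0.04])
heterogeneous = inverse_variance_meta([0.10, 0.02, -0.05], [0.02, 0.03, 0.04])
print("homogeneous:", homogeneous)
print("heterogeneous:", heterogeneous)
```

In the second call the cohorts disagree on the effect, so Q and I² are large even though the pooled estimate still looks precise. This is the kind of signal that goes unnoticed when cohorts with different phenotype definitions are pooled without checking for heterogeneity.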

References:

[1] Minimal phenotyping yields genome-wide association signals of low specificity for major depression. Cai, N., Revez, J.A., Adams, M.J. et al. Nat Genet (2020). doi: 10.1038/s41588-020-0594-5

[2] Phenotype integration improves power and preserves specificity in biobank-based genetic studies of major depressive disorder. Dahl, A., Thompson, M., An, U. et al. Nat Genet (2023). doi: 10.1038/s41588-023-01559-9

[3] Deep learning-based phenotype imputation on population-scale biobank data increases genetic discoveries. An, U., Pazokitoroudi, A., Alvarez, M. et al. Nat Genet (2023). doi: 10.1038/s41588-023-01558-w

[4] Learning lifetime disease liability reveals and removes genetic confounding in electronic health records. Di Y. and Cai N. MedRxiv (2026)

Identifying relevant cell types and genes for complex traits and diseases

The ultimate goal of genetic association studies is to uncover the biology of a disease. To do this, we need ways to link genetic effects identified through GWAS or rare-variant association studies (RVAS) to molecular processes, such as the regulation of gene expression [1] or effects on particular gene functions [2], both of which we have previously explored. We are now interested in identifying which of these enrichments of genetic effects or rare variant-based gene associations are shared across cell types and which are specific to particular cell types relevant to disease, and we are extending our previous models to simultaneously identify the genes, gene features, and cell types relevant to complex traits and diseases. As many diseases, including our main focal disease MDD, are heterogeneous in pathology, one future extension of our work would be to adapt our model so that it maximizes the identification of "sets" of genes that lead to different disease subtypes [3]. Overall, our work aims to identify the genetic effects, the molecular layers they affect, and the cell or tissue types they act in, all at once.

Figure: Overview of the BayesRVAT framework. (A) In rare variant association tests (RVATs), rare variants X and their annotations A are aggregated into a gene burden score, which is tested for association with the phenotype y.
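The aggregation step described in the caption can be illustrated with a plain weighted burden test. This is a minimal sketch, not BayesRVAT itself: here the annotations A are assumed to have already been turned into one fixed weight per variant, whereas BayesRVAT aggregates functional annotations in a Bayesian way [2]. The function names, genotypes, weights and phenotype values are all hypothetical.

```python
import math


def burden_scores(genotypes, weights):
    """Collapse per-variant genotypes into one weighted burden score per individual."""
    return [sum(g * w for g, w in zip(row, weights)) for row in genotypes]


def burden_association(burden, phenotype):
    """Simple linear regression of phenotype on burden; returns slope, SE and z."""
    n = len(burden)
    if n != len(phenotype) or n < 3:
        raise ValueError("need matching burden and phenotype values for >= 3 individuals")
    mean_b = sum(burden) / n
    mean_y = sum(phenotype) / n
    sxx = sum((b - mean_b) ** 2 for b in burden)
    if sxx == 0:
        raise ValueError("burden score has no variance")
    sxy = sum((b - mean_b) * (y - mean_y) for b, y in zip(burden, phenotype))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_b
    # Residual variance with n - 2 degrees of freedom gives the slope's standard error.
    rss = sum((y - intercept - slope * b) ** 2 for b, y in zip(burden, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    z = slope / se if se > 0 else float("inf")
    return slope, se, z


# Hypothetical data: 6 individuals x 3 rare variants (X), one weight per variant
# derived from annotations (A), and a quantitative phenotype (y).
X = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
annotation_weights = [1.0, 0.5, 0.1]
y = [0.1, 1.2, 0.4, 0.3, 1.4, -0.2]

burden = burden_scores(X, annotation_weights)
slope, se, z = burden_association(burden, y)
print(burden, slope, se, z)
```

Fixing the weights in advance is the simplest design choice; the test's power then depends on how well those weights reflect each variant's true effect.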

References:

[1] Leveraging eQTLs to identify individual-level tissue of interest for a complex trait. Majumdar, A., et al. PLoS Comp Biol (2021). doi: 10.1371/journal.pcbi.1008915

[2] BayesRVAT enhances rare-variant association testing through Bayesian aggregation of functional annotations. Nappi, A., et al. Genome Research (2025). doi: 10.1101/gr.280689.125

[3] Genetic risk effects on psychiatric disorders act in sets. Rietkerk J., et al. MedRxiv (2025). doi: 10.1101/2025.07.23.25332043

Figure: Use of our SpaceDX model [4] to simultaneously identify the genes and the regions of a spatial transcriptomic sample that show differential gene expression between stressed and non-stressed mice.

Spatio-temporal gene regulatory changes due to genetic and environmental perturbations

Stress is known to be a major risk factor for MDD and to interact with genetic effects on MDD [1]. We therefore think stress mimics some genetically regulated biological mechanism in its contribution to MDD; studying how stress affects gene expression in the brain (of lab mice) may therefore give us insights into the biological underpinnings of MDD. To do this, we have generated matched single-cell RNA sequencing (scRNA-seq, 10x Chromium) and spatial sequencing (10x Visium) data from mice placed in six different contexts [3], as well as across a period of chronic stress in both adult [4] and young mice. In addition to producing spatio-temporally resolved single-cell gene expression data from mice subjected to different environmental contexts and stressors, we have developed methods to identify, simultaneously, the regions (in a spatial transcriptomics slice) and the genes showing differential expression between mice subjected to different contexts (e.g. stressed vs. non-stressed) [4]. Our ongoing work focuses on a) generalizing these approaches to any form of perturbation, including gene-level Perturb-seq, b) exploring changes in intercellular interactions due to perturbations using spatial transcriptomics data, and c) asking which, if any, of the gene expression changes that result from environmental or genetic perturbations are relevant to disease, and how.

References:

[1] Molecular genetic analysis subdivided by adversity exposure suggests etiologic heterogeneity in major depression. Peterson, RE., et al. AJP (2018). doi: 10.1176/appi.ajp.2017.1706062

[2] Tripartite extended amygdala–basal ganglia CRH circuit drives locomotor activation and avoidance behavior. Chang, S., et al. Science Advances (2022). doi: 10.1126/sciadv.abo1023

[3] Molecular and neural mechanisms of behavioural integration in the extended-amygdala. Chang, S., et al. bioRxiv (2024). doi: 10.1101/2024.04.29.591588

[4] SpaceDX: A Bayesian test for localized differential expression in population-level spatial transcriptomics datasets. Stotzem, N., et al. ICLR Workshop MLGenX (2025)
