Research

1. Statistical methods in spatial transcriptomics

Shi X, Yang Y, Ma X, ..., Liu J*. Probabilistic cell/domain-type assignment of spatial transcriptomics data with SpatialAnno. Nucleic Acids Research, 2023, 51(22), e115. [software]

     In the analysis of both single-cell RNA sequencing (scRNA-seq) and spatially resolved transcriptomics (SRT) data, classifying cells/spots into cell/domain types is an essential analytic step for many secondary analyses. Most of the existing annotation methods have been developed for scRNA-seq datasets without any consideration of spatial information. Here, we present SpatialAnno, an efficient and accurate annotation method for spatial transcriptomics datasets, with the capability to effectively leverage a large number of non-marker genes as well as ‘qualitative’ information about marker genes without using a reference dataset. Uniquely, SpatialAnno estimates low-dimensional embeddings for a large number of non-marker genes via a factor model while promoting spatial smoothness among neighboring spots via a Potts model. Using both simulated and four real spatial transcriptomics datasets from the 10x Visium, ST, Slide-seqV1/2, and seqFISH platforms, we showcase the method’s improved spatial annotation accuracy, including its robustness to the inclusion of marker genes for irrelevant cell/domain types and to various degrees of marker gene misspecification. SpatialAnno is computationally scalable and applicable to SRT datasets from different platforms. Furthermore, the estimated embeddings for cellular biological effects facilitate many downstream analyses.

Liu Ws, Liao Xs, Luo Z, Yang Y, Lau MC, ..., Liu J*. Probabilistic embedding, clustering, and alignment for integrating spatial transcriptomics data with PRECAST. Nature Communications, 2023, 14(1), 296. [software]

     Spatially resolved transcriptomics involves a set of emerging technologies that enable the transcriptomic profiling of tissues with the physical location of expressions. Although a variety of methods have been developed for data integration, most of them are for single-cell RNA-seq datasets without consideration of spatial information. Thus, methods that can integrate spatial transcriptomics data from multiple tissue slides, possibly from multiple individuals, are needed. Here, we present PRECAST, a data integration method for multiple spatial transcriptomics datasets with complex batch effects and/or biological effects between slides. PRECAST unifies spatial factor analysis simultaneously with spatial clustering and embedding alignment, while requiring only partially shared cell/domain clusters across datasets. Using both simulated and four real datasets, we show improved cell/domain detection with outstanding visualization, and the estimated aligned embeddings and cell/domain labels facilitate many downstream analyses. We demonstrate that PRECAST is computationally scalable and applicable to spatial transcriptomics datasets from different platforms.

Liu Ws, Liao Xs, Yang Ys, Lin H, Yeong J, Zhou X*, Shi X*, & Liu J*. Joint dimension reduction and clustering analysis of single-cell RNA-seq and spatial transcriptomics data. Nucleic Acids Research, 2022, 50(12): e72-e72. [software]

     Dimension reduction and (spatial) clustering are usually performed sequentially; however, the low-dimensional embeddings estimated in the dimension-reduction step may not be relevant to the class labels inferred in the clustering step. We therefore developed a computational method, Dimension-Reduction Spatial-Clustering (DR-SC), that performs dimension reduction and (spatial) clustering simultaneously within a unified framework. Joint analysis by DR-SC produces accurate (spatial) clustering results and ensures the effective extraction of biologically informative low-dimensional features. DR-SC is applicable to spatial clustering in spatial transcriptomics, which characterizes the spatial organization of a tissue by segregating it into multiple tissue structures. Here, DR-SC relies on a latent hidden Markov random field model to encourage the spatial smoothness of the detected spatial cluster boundaries. Underlying DR-SC is an efficient expectation-maximization algorithm based on an iterative conditional mode. As such, DR-SC is scalable to large sample sizes and can optimize the spatial smoothness parameter in a data-driven manner. With comprehensive simulations and real data applications, we show that DR-SC outperforms existing clustering and spatial clustering methods: it extracts more biologically relevant features than conventional dimension reduction methods, improves clustering performance, and improves visualization and downstream trajectory inference.

Yang Ys, Shi X, Zhou Q, ..., Liu J*. SC-MEB: spatial clustering with hidden Markov random field using empirical Bayes. Briefings in Bioinformatics,  2022, 23(1): bbab466. [software]

     Spatial transcriptomics has been emerging as a powerful technique for resolving gene expression profiles while retaining tissue spatial information. These spatially resolved transcriptomics make it feasible to examine the complex multicellular systems of different microenvironments. To answer scientific questions with spatial transcriptomics and expand our understanding of how cell types and states are regulated by the microenvironment, the first step is to identify cell clusters by integrating the available spatial information. Here, we introduce SC-MEB, an empirical Bayes approach for spatial clustering analysis using a hidden Markov random field. We have also derived an efficient expectation-maximization algorithm based on an iterative conditional mode for SC-MEB. In contrast to BayesSpace, a recently developed method, SC-MEB is not only computationally efficient and scalable to large sample sizes but is also capable of choosing the smoothness parameter and the number of clusters. We performed comprehensive simulation studies to demonstrate the superiority of SC-MEB over some existing methods. We applied SC-MEB to analyze the spatial transcriptome of human dorsolateral prefrontal cortex tissues and mouse hypothalamic preoptic region. Our analysis results showed that SC-MEB can achieve clustering performance similar to or better than that of BayesSpace, which uses the true number of clusters and a fixed smoothness parameter. We then employed SC-MEB to analyze a colon dataset from a patient with colorectal cancer (CRC) and COVID-19, and further performed differential expression analysis to identify signature genes related to the clustering results. The heatmap of identified signature genes showed that the clusters identified using SC-MEB were more separable than those obtained with BayesSpace. Using pathway analysis, we identified three immune-related clusters, and in a further comparison, found that the mean expression of COVID-19 signature genes was greater in immune than in non-immune regions of colon tissue. SC-MEB provides a valuable computational tool for investigating the structural organization of tissues from spatial transcriptomic data.
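     To make the idea of an iterative-conditional-mode update under a hidden Markov random field more concrete, here is a minimal Python sketch. It is not the SC-MEB implementation: it assumes Gaussian clusters with known means and a single shared isotropic variance, a Potts prior with a fixed smoothness parameter, and a user-supplied neighbor list. The function name `icm_potts` and all of its arguments are hypothetical.

```python
import numpy as np

def icm_potts(y, neighbors, mu, sigma2, beta, labels, n_iter=10):
    """Iterated conditional modes for a Gaussian HMRF with a Potts prior.

    Illustrative sketch only (assumed names and simplifications):
    y         : (n, d) array of features, one row per spot
    neighbors : list of index lists; neighbors[i] holds the spots adjacent to i
    mu        : (K, d) array of cluster means, held fixed here
    sigma2    : shared isotropic variance
    beta      : Potts smoothness parameter (larger means smoother labels)
    labels    : (n,) array of initial non-negative integer labels
    """
    labels = np.asarray(labels).copy()
    K = mu.shape[0]
    for _ in range(n_iter):
        changed = False
        for i in range(len(y)):
            # Negative Gaussian log-likelihood of spot i under each cluster
            nll = ((y[i] - mu) ** 2).sum(axis=1) / (2.0 * sigma2)
            # Potts term: number of neighbors currently carrying each label
            if len(neighbors[i]) > 0:
                agree = np.bincount(labels[neighbors[i]], minlength=K)
            else:
                agree = np.zeros(K)
            # Pick the label that minimizes the local conditional energy
            new_label = int(np.argmin(nll - beta * agree))
            if new_label != labels[i]:
                labels[i] = new_label
                changed = True
        if not changed:
            break  # a full sweep with no change is a local mode
    return labels


# Toy chain of five spots; the middle spot is noisy.
y = np.array([[0.0], [0.0], [0.6], [0.0], [0.0]])
neighbors = [[1], [0, 2], [1, 3], [2, 4], [3]]
mu = np.array([[0.0], [1.0]])
init = np.array([0, 0, 1, 0, 0])
smoothed = icm_potts(y, neighbors, mu, sigma2=1.0, beta=1.0, labels=init)
unsmoothed = icm_potts(y, neighbors, mu, sigma2=1.0, beta=0.0, labels=init)
```

     With `beta=0.0` each spot is labeled by likelihood alone, so the noisy middle spot keeps its own label; with a positive `beta` its neighbors pull it into the surrounding cluster. In a full method the means, variances, and smoothness parameter would be estimated rather than fixed.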

2. Causal inference with applications in genetics

Cheng Qs, Zhang Xs, Chen L*, Liu J*. Mendelian randomization accounting for complex correlated horizontal pleiotropy while elucidating shared genetic etiology. Nature Communications, 2022, 13(1): 6490. [software]

     Mendelian randomization (MR) harnesses genetic variants as instrumental variables (IVs) to study the causal effect of exposure on outcome using summary statistics from genome-wide association studies. Classic MR assumptions are violated when IVs are associated with unmeasured confounders, i.e., when correlated horizontal pleiotropy (CHP) arises. Such confounders could be a shared gene or inter-connected pathways underlying exposure and outcome. We propose MR-CUE (MR with Correlated horizontal pleiotropy Unraveling shared Etiology and confounding), for estimating causal effect while identifying IVs with CHP and accounting for estimation uncertainty. For those IVs, we map their cis-associated genes and enriched pathways to inform shared genetic etiology underlying exposure and outcome. We apply MR-CUE to study the effects of interleukin 6 on multiple traits/diseases and identify several S100 genes involved in shared genetic etiology. We assess the effects of multiple exposures on type 2 diabetes across European and East Asian populations.
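     As background for how summary statistics enter an MR analysis, the sketch below computes the classic fixed-effect inverse-variance-weighted (IVW) estimate. This is a textbook baseline, not MR-CUE: it assumes independent variants and no horizontal pleiotropy, which are exactly the assumptions that the methods in this section relax. The function name `ivw_estimate` and its argument names are hypothetical.

```python
import numpy as np

def ivw_estimate(beta_x, beta_y, se_y):
    """Fixed-effect IVW causal estimate from GWAS summary statistics.

    Illustrative sketch (assumed names):
    beta_x : per-variant effect estimates on the exposure
    beta_y : per-variant effect estimates on the outcome
    se_y   : standard errors of beta_y
    Returns (estimate, standard_error).
    """
    beta_x = np.asarray(beta_x, dtype=float)
    beta_y = np.asarray(beta_y, dtype=float)
    w = 1.0 / np.asarray(se_y, dtype=float) ** 2   # inverse-variance weights
    denom = np.sum(w * beta_x ** 2)
    # Weighted regression of beta_y on beta_x through the origin
    est = np.sum(w * beta_x * beta_y) / denom
    se = np.sqrt(1.0 / denom)
    return est, se


# Toy data in which every variant implies the same causal effect of 0.5.
bx = np.array([0.1, 0.2, 0.3])
by = 0.5 * bx
est, se = ivw_estimate(bx, by, se_y=[0.01, 0.01, 0.01])
```

     When some variants also affect the outcome through a confounder shared with the exposure, the per-variant ratios no longer agree and this estimate is biased, which is the situation that motivates modeling correlated horizontal pleiotropy explicitly.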

Cheng Qs, Qiu T, Chai X, Sun B, Xia Y, Shi X, Liu J*. MR-Corr2: a two-sample Mendelian randomization method that accounts for correlated horizontal pleiotropy using correlated instrumental variants. Bioinformatics, 2022, 38 (2): 303-310. [software]

       To account for correlated horizontal pleiotropy (HP), we propose a Bayesian approach, MR-Corr2, that uses an orthogonal projection to reparameterize the bivariate normal distribution for the instrument-exposure effects and the horizontal pleiotropic effects, together with a spike-slab prior to mitigate the impact of correlated HP. We have also developed an efficient algorithm with parallel Gibbs sampling. To demonstrate the advantages of MR-Corr2 over existing methods, we conducted comprehensive simulation studies comparing both type-I error control and point estimates in various scenarios. When applying MR-Corr2 to study the relationships between exposure–outcome pairs in complex traits, we did not identify the contradictory causal relationship between HDL-c and CAD. Moreover, the results provide a new perspective on the causal network among complex traits.

Cheng Qs, Yang Ys, Shi Xs, Yeung Ks, Yang C, Peng H, Liu J*. MR-LDP: a two-sample Mendelian randomization for GWAS summary statistics accounting for linkage disequilibrium and horizontal pleiotropy. NAR Genomics and Bioinformatics, 2020, 2(2): lqaa028. [software]

       The proliferation of genome-wide association studies (GWAS) has prompted the use of two-sample Mendelian randomization (MR) with genetic variants as instrumental variables (IVs) for drawing reliable causal relationships between health risk factors and disease outcomes. However, the unique features of GWAS demand that MR methods account for both linkage disequilibrium (LD) and the ubiquitous horizontal pleiotropy among complex traits, the phenomenon wherein a variant affects the outcome through mechanisms other than the exposure. Statistical methods that fail to consider LD and horizontal pleiotropy can therefore lead to biased estimates and false-positive causal relationships. To overcome these limitations, we propose MR-LDP, a probabilistic model for MR analysis that identifies the causal effects between risk factors and disease outcomes using GWAS summary statistics in the presence of LD while properly accounting for horizontal pleiotropy among genetic variants, and we develop a computationally efficient algorithm for causal inference. We then conducted comprehensive simulation studies to demonstrate the advantages of MR-LDP over existing methods. Moreover, we used two real exposure–outcome pairs to validate the results from MR-LDP against alternative methods, showing that our method is more efficient because it uses all instrumental variants in LD. By further applying MR-LDP to lipid traits and body mass index (BMI) as risk factors for complex diseases, we identified multiple pairs of significant causal relationships, including a protective effect of high-density lipoprotein cholesterol on peripheral vascular disease and a positive causal effect of BMI on hemorrhoids.

3. Machine learning/statistical methods

 1. Building a foundation in statistics and machine learning techniques. Ultra-high-dimensional data with mixed-type variables are becoming increasingly common in many fields. One example is the data collected by multimodal sequencing technologies in genomics, which include count-based gene expression data, binary chromatin accessibility data, and standardized continuous protein marker data. Developing new statistical and machine learning methods that integrate multiple data modalities and fully use the information in each can provide better insight into the characteristics of cells shared across modalities and help identify the sources of heterogeneity in tissues.

 2. Design a deep generative method for sampling from conditional distributions, based on a unified formulation of the conditional distribution and a generalized nonparametric regression function using the noise-outsourcing lemma.

 3. We propose a deep dimension reduction approach to learning representations with the properties of sufficiency, low dimensionality, and disentanglement.

4. Transcriptome-wide association studies

Shi Xs, Chai Xs, Yang Ys, Cheng Qs, Jiao Y, ... & Liu J*. A tissue-specific collaborative mixed model for jointly analyzing multiple tissues in transcriptome-wide association studies. Nucleic Acids Research, 2020, 48(19): e109-e109. 

       Transcriptome-wide association studies (TWASs) integrate expression quantitative trait loci (eQTL) studies with genome-wide association studies (GWASs) to prioritize candidate target genes for complex traits. Several statistical methods have recently been proposed to improve the performance of TWASs in gene prioritization by integrating the expression regulatory information imputed from multiple tissues, and they have substantially improved the ability to detect gene-trait associations. Unfortunately, most existing multi-tissue methods focus on prioritizing candidate genes and cannot directly infer the specific functional effects of candidate genes across different tissues. Here, we propose a tissue-specific collaborative mixed model (TisCoMM) for TWASs, which explicitly leverages the co-regulation of genetic variants across different tissues via a unified probabilistic model. TisCoMM not only performs hypothesis testing to prioritize gene-trait associations, but also detects the tissue-specific role of candidate target genes in complex traits. To make full use of widely available GWAS summary statistics, we extend TisCoMM to use summary-level data, namely, TisCoMM-S2. Using extensive simulation studies, we show that the type I error is controlled at the nominal level, the statistical power to identify associated genes is greatly improved, and the false-positive rate (FPR) for non-causal tissues is well controlled. We further illustrate the benefits of our methods in applications to summary-level GWAS data for 33 complex traits. Notably, apart from better identifying potential trait-associated genes, we can elucidate the tissue-specific role of candidate target genes. The follow-up pathway analysis of tissue-specific genes for asthma shows that the immune system plays an essential role in asthma development in both thyroid and lung tissues.

5. Studies in statistical genetics

Lu Y, Oliva M, Pierce B, Liu J*, Chen L*. Integrative cross-omics and cross-context analysis elucidates molecular links underlying genetic effects on complex traits. Nature Communications, 2024, 15(1), 2383. [software]

       Motivated by a multi-tissue multi-omics analysis using genetics, methylome and transcriptome data from the Genotype-Tissue Expression (GTEx) project, we propose a method, X-ING (Cross-INtegrative Genomics), for cross-omics and cross-context integrative analysis of summary-level data. X-ING takes as input the statistic matrices from multiple omics studies, each with multivariate contexts. It models the latent binary association status of each statistic, and captures the omics-shared and context-shared major patterns in a hierarchical Bayesian model. Via the modeling of latent binary status, X-ING enables the cross-feature integration of effects from different effect distributions. The analysis of cis-genetic effects on methylome and transcriptome from GTEx characterizes the tissue- and omics-effect sharing patterns. The analysis of trans-genetic effects demonstrates enrichment of trans-associations in many disease/trait-relevant tissues. Many associations identified by X-ING are replicated using external data, with higher replication rates for multi-tissue or multi-omics effects.

Shi Xs, Jiao Y, Yang Ys, Cheng CY, Yang C, Lin X*, Liu J*. VIMCO: variational inference for multiple correlated outcomes in genome-wide association studies. Bioinformatics, 2019, 35(19):3693-3700. [software]

       We propose a novel method, Variational Inference for Multiple Correlated Outcomes (VIMCO), that focuses on identifying the specific traits associated with genetic loci in a joint GWAS analysis of multiple traits, while accounting for the correlation among those traits. We performed extensive numerical studies and also applied VIMCO to analyze two datasets. The numerical studies and real data analysis demonstrate that VIMCO improves statistical power over single-trait analysis strategies when the traits are correlated and has comparable performance when they are not.

Liu J*, Wan X, Wang C, Yang C, Zhou X, Yang C. LLR: a latent low-rank approach to colocalizing genetic risk variants in multiple GWAS. Bioinformatics, 2017, 33(24):3878-86. [software]

       We propose a latent low-rank (LLR) approach to colocalizing genetic risk variants using summary statistics. In the presence of pleiotropy, there exist risk loci that affect multiple phenotypes. To leverage pleiotropy, we introduce a low-rank structure to modulate the probabilities of the latent association statuses between loci and phenotypes. Regarding the computational efficiency of LLR, a novel expectation-maximization-path (EM-path) algorithm has been developed to greatly reduce the computational cost and facilitate model selection and inference. We demonstrate the advantages of LLR over competing approaches through simulation studies and joint analysis of 18 GWAS datasets.

Liu J*, Huang J, Ma S, Wang K. Incorporating group correlations in genome-wide association studies using smoothed group Lasso. Biostatistics. 2012, 14(2):205-19.

       We propose a new penalization method for group variable selection that can properly accommodate the correlation between adjacent groups. The method combines the group Lasso penalty with a quadratic penalty on the difference of regression coefficients of adjacent groups, and is referred to as the smoothed group Lasso (SGL). It encourages group sparsity and smooths regression coefficients across adjacent groups. The weights between groups in the quadratic difference penalty are based on canonical correlations. We first derive a group coordinate descent (GCD) algorithm for computing the solution path under a linear regression model. The SGL method is then extended to logistic regression for a binary response. With the majorize–minimization algorithm, SGL-penalized logistic regression becomes an iteratively penalized least-squares problem.
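       The structure of such a penalty can be sketched in a few lines of Python. The exact form used here is an assumption made for illustration: the group Lasso term is weighted by the square root of each group size, and the smoothing term penalizes the squared difference of size-scaled group norms between adjacent groups. The function name `sgl_penalty`, the tuning parameters `lam1` and `lam2`, and the pairwise weights `zeta` are hypothetical names, and the published method may define these quantities differently.

```python
import numpy as np

def sgl_penalty(beta_groups, lam1, lam2, zeta):
    """Group Lasso term plus a quadratic smoothing term over adjacent groups.

    Illustrative sketch under assumed definitions:
    beta_groups : list of 1-D arrays, one coefficient vector per group, in order
    lam1, lam2  : non-negative tuning parameters
    zeta        : weights for the len(beta_groups) - 1 adjacent group pairs
    """
    sizes = np.array([len(b) for b in beta_groups], dtype=float)
    norms = np.array([np.linalg.norm(b) for b in beta_groups])
    # Group Lasso: drives whole groups of coefficients to zero together
    group_lasso = lam1 * np.sum(np.sqrt(sizes) * norms)
    # Smoothing: penalizes abrupt changes in group strength between neighbors
    scaled = norms / np.sqrt(sizes)
    smooth = 0.5 * lam2 * np.sum(np.asarray(zeta, dtype=float) * np.diff(scaled) ** 2)
    return group_lasso + smooth


# Two adjacent groups of size 2: one active, one inactive.
groups = [np.array([3.0, 4.0]), np.array([0.0, 0.0])]
value = sgl_penalty(groups, lam1=1.0, lam2=2.0, zeta=[1.0])
```

       Setting `lam2` to zero recovers an ordinary group Lasso penalty, while a larger weight in `zeta` for a pair of strongly correlated adjacent groups makes it more costly for one of them to be selected without the other.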