Research opportunities are available for undergraduate, MSc, and PhD students in Statistics, Biostatistics, and Data Science related projects. Interested students are welcome to contact me via email at: sanjeena dot dang at carleton dot ca.

Microbiome Data

Our skin, mouth, and gut are host to a tremendous diversity of bacteria, archaea, fungi, and viruses collectively known as the microbiome. Studies have demonstrated an imbalance in the microbiota of individuals with several diseases: specifically, reduced microbial diversity and increased temporal instability in that diversity compared to healthy individuals. Emerging therapeutic approaches target microbiome biomarkers, aiming to restore the microbial diversity and functionality that is lost in patients with disease. Our research group has developed various model-based clustering frameworks for microbiome data and extended them to high dimensional data and to incorporate biological/environmental covariates (factors), e.g., diet, treatments, sex, and age.

Image Source: Fang and Subedi (2020)

Multivariate Discrete Data

Multivariate count data are commonly encountered through high-throughput sequencing technologies in bioinformatics, text mining, or in sports analytics. Although the Poisson distribution seems a natural fit to these count data, its multivariate extension is computationally expensive. Our research group has developed mixtures of multivariate Poisson-lognormal (MPLN) distributions for clustering multivariate count measurements with a dependence structure. The MPLN distribution can account for over-dispersion as opposed to the traditional Poisson distribution and allows for correlation between the variables. These models have been used to cluster RNA-seq data and extended for matrix variate data.
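The two properties mentioned above (over-dispersion and correlation between variables) can be illustrated with a minimal simulation sketch. It assumes the usual hierarchical form of a single MPLN component: a latent multivariate normal vector whose exponential gives the Poisson rates. The function name `sample_mpln` and the parameter values are hypothetical and chosen only for illustration; this is not the group's clustering code.

```python
import numpy as np


def sample_mpln(n, mu, sigma, rng):
    """Draw n count vectors from one multivariate Poisson-lognormal component.

    Assumed hierarchy (illustrative):
        theta ~ N(mu, sigma)          latent layer, carries the correlation
        y_j | theta ~ Poisson(exp(theta_j)), independently for each variable j
    """
    theta = rng.multivariate_normal(mu, sigma, size=n)
    return rng.poisson(np.exp(theta))


rng = np.random.default_rng(0)

# Hypothetical parameters for two count variables.
mu = np.array([1.0, 1.5])
sigma = np.array([[0.5, 0.3],
                  [0.3, 0.5]])

counts = sample_mpln(20000, mu, sigma, rng)

means = counts.mean(axis=0)
variances = counts.var(axis=0)
corr = np.corrcoef(counts, rowvar=False)[0, 1]

# A plain Poisson variable has variance equal to its mean; here the
# variance exceeds the mean and the two variables are correlated.
print("means:      ", means)
print("variances:  ", variances)
print("correlation:", corr)
```

Under these assumptions the sample variances come out larger than the sample means, and the positive off-diagonal entry of `sigma` induces a positive correlation between the two count variables, neither of which independent Poisson variables would show.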

Image Source: Revised version of Silva et al. (2018)

Biclustering

Biclustering is used for simultaneous clustering of the observations and variables when there is no group structure known a priori. Traditional clustering algorithms aim to group observations based on similarities across all variables at the same time. This can be too restrictive as observations may be similar under some variables, but different for others. Additionally, identifying groups of variables that behave similarly among different clusters can provide valuable information. For example, suppose Genes X, Y, and Z have a moderate to strong positive correlation in the diseased subgroup, whereas they have weak to no correlation in a healthy subpopulation. Identifying and characterizing such differences may provide key information for a comprehensive understanding of the differences in the underlying disease-development related pathways. Our research group has developed model-based biclustering frameworks for continuous and compositional data.

Image Source: Tu and Subedi (2020)

Multi-view Data Integration

Multi-view datasets are becoming increasingly common in bioinformatics. In such datasets, measurements on the same set of individuals are collected from different sources/platforms. Each individual dataset provides a unique but partial view of an underlying biological process. Different omics datasets such as genomics, transcriptomics, microbiomics, and metabolomics target different aspects of biological processes. For example, genomic data provides information on the variations in the DNA sequences of genes while transcriptomics data provides a measure of the expression levels of those genes. Integrating these multi-view omics datasets is challenging because of the heterogeneity in the data types. Ongoing work focuses on developing efficient and scalable models for integrative clustering of high dimensional multi-view datasets that provide a comprehensive understanding of complex biological systems.

Image Source: Hasin et al. (2017)

High Dimensional Data

In bioinformatics, the datasets typically have many attributes (high-dimensionality) but the number of samples (or individuals) is often much smaller. Hence, clustering high dimensional data efficiently has been an area of great interest in the field. High dimensional mixture models tend to be highly parameterized. When the models are highly parameterized, the estimates tend to be less reliable, especially when the sample size is low. Our research group has developed cluster-weighted factor analyzers, a family of regression-based mixture models for high dimensional data built on mixtures of Gaussian distributions. By imposing a factor analysis structure on the predictor variables and restricting the number of factors to be sufficiently small, the number of parameters that needs to be estimated is greatly reduced. We then extended this framework further using mixtures of multivariate t-distributions, a distribution that is more robust to outliers than a Gaussian distribution.
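To give a sense of the scale of this reduction, the sketch below counts the free covariance parameters of a single component under two structures. It assumes the standard factor-analysis decomposition Sigma = Lambda Lambda' + Psi, with a p x q loading matrix Lambda and a diagonal Psi, and the usual count that removes the q(q-1)/2 rotational constraints. The function names and the values of p and q are hypothetical, and the count is for one covariance matrix only, not for the full cluster-weighted model.

```python
def full_cov_params(p):
    """Free parameters in an unrestricted p x p covariance matrix."""
    return p * (p + 1) // 2


def factor_cov_params(p, q):
    """Free parameters in Sigma = Lambda Lambda' + Psi.

    p * q loadings plus p diagonal noise variances, minus q(q-1)/2
    for the rotational invariance of the loadings.
    """
    return p * q + p - q * (q - 1) // 2


# Hypothetical setting: 100 variables, 3 latent factors.
p, q = 100, 3
print(full_cov_params(p), factor_cov_params(p, q))  # prints: 5050 397
```

In this illustrative setting the factor structure needs 397 covariance parameters per component instead of 5050, and the gap widens as p grows because the unrestricted count is quadratic in p while the factor count is linear in p for fixed q.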

Image Source: Subedi et al. (2013)

Skewed Data

Non-Gaussian mixture models are gaining increasing attention for mixture model-based clustering, particularly when dealing with data that exhibit features such as skewness and heavy tails. One such mixture model is the mixture of multivariate normal inverse Gaussian (MNIG) distributions; these models have the flexibility to represent both skewed and symmetric populations as well as mixtures thereof. Our research group has developed two frameworks for parameter estimation for mixtures of MNIG distributions: a variational approximation based approach and a fully Bayesian approach. We have also developed an infinite mixture of MNIG distributions that infers the number of components along with the parameter estimates.

Image Source: Fang et al. (2020)