Dirichlet process mixture models for clustering with applications to flow-cytometry and transcriptomics
Boris Hejblum, Inserm U1219 Bordeaux Population Health research center.
28th of November 2025
Abstract:
Dirichlet process mixture models (DPMMs) offer a powerful Bayesian nonparametric framework for clustering high-dimensional biological data, providing the flexibility to infer the number of clusters directly from the data. This is particularly valuable in modern flow cytometry, where technological advances allow the measurement of hundreds of thousands of single cells at once, and in RNA-seq data, which feature expression measurements for thousands of genes. Such datasets are inherently heterogeneous, often exhibit non-Gaussian structure, and challenge traditional parametric clustering methods that rely on pre-specifying the number of cell populations in flow cytometry data or the number of biological groups in transcriptomics. To address these complexities, we aim to develop scalable inference for adaptive DPMMs. We propose a Dirichlet process mixture of multivariate skew-t distributions, enabling robust modeling of asymmetric and heavy-tailed cell population shapes in flow cytometry data. Inference is performed via an efficient slice-sampling-based Markov chain Monte Carlo (MCMC) algorithm, coupled with a sequential strategy that propagates posterior information across longitudinal measurements. Unfortunately, this MCMC approach does not scale to the thousands of dimensions of transcriptomics data, where it becomes computationally prohibitive. Therefore, we also introduce a new collapsed variational inference procedure for Gaussian DPMMs with unknown covariance structure and adaptive estimation of the concentration parameter. Using a stick-breaking representation and weakly informative priors, this approach provides fast convergence while preserving key hierarchical dependencies typically lost in mean-field approximations. Both inference methods achieve strong performance in numerical simulations.
Better suited to flow cytometry, our sequential skew-t DPMM outperforms existing tools on benchmark datasets and enables improved characterization of longitudinal immune responses in the DALIA-1 HIV therapeutic vaccine trial. Our collapsed variational inference of Gaussian DPMMs accurately recovers all known leukemia subtypes from a reduced signature drawn from an initial 2,194 gene expression profiles across 72 patients, and additionally identifies a novel, biologically meaningful sub-cluster when the larger signature is used.
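
For readers unfamiliar with the stick-breaking representation mentioned in the abstract, the following is a minimal illustrative sketch of how mixture weights can be drawn from a truncated stick-breaking construction with concentration parameter alpha. It is not the speaker's implementation: the function name, the truncation level, and the seed are assumptions made for illustration only.

```python
import random


def stick_breaking_weights(alpha, truncation, seed=0):
    """Draw mixture weights from a truncated stick-breaking construction.

    Hypothetical helper for illustration; `truncation` caps the number of
    components, which is an approximation of the infinite construction.
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for k in range(truncation):
        # Each break proportion is Beta(1, alpha); the last one is set to 1
        # so that the truncated weights sum to one.
        v = 1.0 if k == truncation - 1 else rng.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


weights = stick_breaking_weights(alpha=1.0, truncation=20)
```

Smaller values of alpha tend to concentrate the weight on the first few components, while larger values spread it over more components, which is why estimating the concentration parameter affects the inferred number of clusters.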