Keynote Lectures
Conférences plénières
Probabilistic reasoning on modern hardware
Prof. Alexandre Bouchard-Côté
Over the past ten years, GPGPU (general-purpose computing on graphics processing units) and distributed computing have been fuelling rapid progress in machine learning and statistics. These computing architectures can perform massive computations at a fraction of the cost and time that would be required on classic CPUs. However, GPUs and distributed computing require us to rethink how statistical computation is performed, and they may favour different estimators than the serial CPU computing paradigm does.
In this talk, Prof. Bouchard-Côté will cover some background on statistical computation in the context of GPUs, distributed CPUs, and multi-GPU systems, focussing on probabilistic reasoning tasks. He will provide examples of success stories in particle filtering and Monte Carlo methods, where the adoption of GPUs and distributed computing was instrumental to statistical applications in real-world scientific problems.
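As a rough illustration of why these architectures suit Monte Carlo methods so well (a minimal sketch, not material from the talk), the basic pattern is embarrassingly parallel: each worker produces an independent batch of draws, and only the batch summaries are combined at the end. In R, for instance, estimating pi across CPU cores:

```r
## Each worker computes the mean of an independent Monte Carlo batch;
## combining batch means at the end requires almost no communication.
library(parallel)

batch_mean <- function(seed, n = 1e6) {
  set.seed(seed)
  x <- runif(n); y <- runif(n)
  mean(x^2 + y^2 <= 1)   # fraction of points inside the quarter circle
}

## One task per batch; forked workers (on Windows, use parLapply instead).
means <- mclapply(1:8, batch_mean, mc.cores = 4)
4 * mean(unlist(means))  # Monte Carlo estimate of pi
```

The same map-then-combine structure is what makes GPU and multi-machine versions attractive: the expensive inner loop is independent across workers.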
Prof. Stef van Buuren
Missing data is a fact of life in real-world research, but ignoring it or handling it poorly can lead to biased results and flawed conclusions. This lecture introduces the concept of multiple imputation, a principled framework that replaces missing values with plausible alternatives, preserving uncertainty and improving inference. We take a deep dive into the R package mice, a widely used tool for flexible and transparent imputation workflows.
Beyond the basics, we explore two cutting-edge innovations: how to train, store, and reuse imputation models to boost efficiency and reproducibility, and how to scale imputation through parallel computation. These experimental extensions aim to bring multiple imputation closer to real-time, high-volume data challenges without compromising statistical integrity.
The lecture offers practical tools and fresh perspectives on dealing with one of the trickiest parts of data analysis: the gaps.
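For readers new to the package, a minimal sketch of the standard mice workflow (impute, analyze each completed dataset, pool with Rubin's rules); the nhanes example data ship with the package, and the experimental extensions mentioned above are not shown:

```r
library(mice)

imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)  # 5 imputed datasets
fit <- with(imp, lm(chl ~ bmi + age))                      # analysis on each one
summary(pool(fit))                                         # pooled estimates (Rubin's rules)
```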
Prof. Janie Coulombe
Longitudinal observational data are a gold mine of information on patients' health and the effect of different interventions (treatments, exposures, etc.) on their diseases and comorbidities. However, these data are typically observed at irregular times that vary across patients: some patients (e.g., those living with a chronic disease) are seen by their physician more often than others. It has been shown that irregular, patient-dependent observation times can affect causal inference by acting much like selection bias. Unfortunately, methods that address irregular observation times remain largely unknown, and the problem is still often ignored in statistical applications.
In this presentation, Prof. Coulombe will discuss the problem of irregular observation in causal inference, the questions related to it that have already been answered, and the challenges ahead. She will present ongoing projects that aim to address some of these questions, covering different methods, theoretical and practical results, and future directions.
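A toy illustration of the selection-bias mechanism described above (a sketch, not from the talk): when sicker patients are observed more often, the naive mean of the observed outcomes is biased, and weighting each observation by the inverse of its (here, known) observation probability removes the bias, in the spirit of inverse-intensity-of-visit weighting.

```r
set.seed(1)
n <- 1e5
y <- rnorm(n)                              # outcome; true population mean is 0
p_obs <- plogis(-1 + 1.5 * y)              # higher y -> more likely to be observed
seen <- rbinom(n, 1, p_obs) == 1

mean(y[seen])                              # naive mean of observed outcomes: biased upward
weighted.mean(y[seen], 1 / p_obs[seen])    # inverse-probability weighted mean: near 0
```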
Student Talks
Directional effects in latent structures of disease mapping models
Mariana Carmona-Baez
Understanding the spatial dynamics of infectious diseases often involves modeling disease counts as Poisson-distributed variables with spatially structured latent effects. Conditional Autoregressive (CAR) models are commonly used for this purpose but assume stationary spatial dependence, which may not hold in heterogeneous environments. Building on a model that captures residual spatio-temporal variation through a spatially discrete process whose correlation structure derives from a spatially continuous Gaussian process, we propose a flexible framework that incorporates covariate information into the covariance structure of latent spatial effects, allowing for local variations in spatial dependence. To address the computational challenges of Gaussian processes, we employ a Nearest Neighbor Gaussian Process for covariance approximation. Our model is applied to the weekly incidence of dengue fever in Rio de Janeiro during 2015, incorporating the proportion of green area in the covariance structure. Implemented in a Bayesian framework using Hamiltonian Monte Carlo, the model effectively captures complex spatio-temporal latent effects. This approach improves spatial modeling by integrating local covariates into the covariance structure while remaining computationally feasible for large datasets, providing a novel perspective on infectious disease dynamics.
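One simple way to let a covariate enter a latent covariance, sketched here for concreteness (the talk's exact parameterization may differ): modulate the marginal scale of an exponential kernel by a local covariate such as the proportion of green area, so that spatial dependence is no longer stationary.

```r
## Covariance = local scales (driven by covariate z) times a stationary
## exponential correlation; diag(s) %*% R %*% diag(s) is positive definite.
cov_nonstat <- function(coords, z, sigma2 = 1, phi = 1, beta = 0.5) {
  D <- as.matrix(dist(coords))          # pairwise distances between areas
  s <- sqrt(sigma2 * exp(beta * z))     # covariate-driven local scales
  outer(s, s) * exp(-D / phi)
}

coords <- cbind(runif(10), runif(10))
z <- runif(10)                          # e.g., proportion of green area per unit
Sigma <- cov_nonstat(coords, z)
```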
Multi-ancestry approaches for colocalization analysis
Cathy Shen
Genome-wide association studies (GWAS) aim to identify associations between genetic variants and traits, particularly disease traits such as type 2 diabetes (T2D). Although GWAS have facilitated many discoveries in disease biology, the interpretability of GWAS results may be further enhanced by integrating functional genomics, such as expression quantitative trait loci (eQTL) studies. Colocalization methods assess whether two association signals obtained from two phenotypes, such as a disease and a gene expression phenotype, share a common causal variant. However, current colocalization methods support only single-ancestry analysis, and are unable to leverage the differences in linkage disequilibrium (LD) between ancestry groups and the greater sample sizes of multi-ancestry investigations to improve colocalization resolution. In this thesis, the multi-ancestry fine-mapping methods SuSiEx and MsCAVIAR are integrated with the colocalization methods coloc and eCAVIAR to perform multi-ancestry colocalization analysis. Specifically, SuSiEx and MsCAVIAR are each integrated with both coloc and eCAVIAR, resulting in four proposed approaches: coloc_SuSiEx, MseCAVIAR, SuSiEx_eCAVIAR and MsCAVIAR_coloc. A simulation study compares the performance of these approaches. In a multi-ancestry single-causal-variant setting (where one variant is associated with both traits across all ancestries), all four methods produced credible sets of comparable size and reported comparable conditional variant-level colocalization posterior probabilities (CLPP) for the true causal variant. Under low-power settings, the SuSiEx-based colocalization methods (SuSiEx_eCAVIAR and coloc_SuSiEx) often struggled to identify the presence of a causal variant. Compared with the coloc-based methods (coloc_SuSiEx and MsCAVIAR_coloc), the eCAVIAR-based methods (MseCAVIAR and SuSiEx_eCAVIAR) reported lower CLPPs at the locus level. In the multi-ancestry multiple-causal-variant setting, the colocalization results across the proposed methods are not all comparable, owing to differences in how SuSiEx and MsCAVIAR construct credible sets. Finally, the proposed methods were applied to a multi-ancestry colocalization analysis of T2D GWAS from the DIAMANTE Consortium with a protein quantitative trait loci (pQTL) study of European individuals in the INTERVAL study. This work addresses the increasing need for multi-ancestry colocalization methods as data from multi-ancestry GWAS become more widely available.
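For intuition about the CLPP summary referenced above (a minimal sketch following eCAVIAR's definition, not the thesis code): per variant, the colocalization posterior probability is the product of that variant's posterior causal probability in each trait, and summing across a locus gives a common locus-level summary.

```r
## Variant-level CLPP from two fine-mapping posteriors over the same variants.
clpp <- function(pcp1, pcp2) pcp1 * pcp2

pcp_gwas <- c(0.70, 0.20, 0.05, 0.05)  # hypothetical posterior causal probs, T2D GWAS
pcp_qtl  <- c(0.60, 0.10, 0.25, 0.05)  # hypothetical posterior causal probs, QTL study
clpp(pcp_gwas, pcp_qtl)                # per-variant colocalization evidence
sum(clpp(pcp_gwas, pcp_qtl))           # locus-level summary
```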
Identifying treatment effect heterogeneity with Bayesian hierarchical adjustable random partition (BHARP) in adaptive enrichment trials
Xianglin Zhao
In precision medicine, identifying sensitive populations and guiding treatment decisions require investigating treatment effect heterogeneity via subgroup-specific responses and homogeneity patterns. However, comparing multiple interventions across subgroups is challenging. To improve power and precision, many Bayesian models partition subgroups for information borrowing, yet two challenges persist: capturing uncertainty in the partitioning and adapting the borrowing strength. We propose a flexible Bayesian hierarchical model with a finite mixture having a variable number of components. For each intervention, subgroups are partitioned into clusters, with information borrowed within each cluster. Using a reversible jump MCMC algorithm, the model explores partitions while adjusting borrowing strength based on within-cluster variability. We also introduce a Bayesian adaptive enrichment design to merge equivalent subgroups, enrich responsive subgroups, and terminate futile arms, improving efficiency and flexibility.
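A side note on why the partition space calls for a reversible jump sampler rather than enumeration (an illustration, not part of the proposed method): the number of ways to partition G subgroups is the Gth Bell number, which grows super-exponentially.

```r
## Bell numbers via the standard recurrence B(n) = sum_k C(n-1, k) B(k).
bell <- function(G) {
  B <- numeric(G + 1); B[1] <- 1   # B[1] holds Bell(0) = 1
  for (n in 1:G)
    B[n + 1] <- sum(choose(n - 1, 0:(n - 1)) * B[1:n])
  B[G + 1]
}
sapply(1:10, bell)                 # 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975
```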
Evaluating real-time probabilistic forecasters for sports games outcome prediction
Chi-Kuang Yeh
Probabilistic forecasts (PFs) are ubiquitous in modern society for predicting uncertain outcomes. Over time, the number and scope of PFs readily accessible to the public have increased at a steady pace, and they now cover phenomena spanning many fields. In the past, the available information for making forecasts was fixed and unchanging throughout the decision-making process. Nowadays, many such forecasts are made initially well before the event in question occurs and are then continuously updated (CU) as new information becomes available. Decision-makers rely heavily on these forecasts, so forecast quality is crucial; model selection is also a challenging question. To address these issues, we develop new tools for measuring the quality of CUPFs, including a significance test and simple graphical summaries. Following a description of the methodology, we present a summary of results from a comprehensive simulation study and an application to NBA game data.
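As a toy version of the evaluation problem (an assumed scoring-rule sketch, not the paper's test): score a continuously updated win-probability path at a grid of time points, so that forecast quality early in a game can be compared with quality late in the game.

```r
## Brier score of forecast paths at each time point (rows = games).
brier_at <- function(forecasts, outcomes) colMeans((forecasts - outcomes)^2)

f <- rbind(c(0.55, 0.60, 0.70, 0.90, 1),  # updated home-win forecasts, game 1 (home won)
           c(0.50, 0.40, 0.35, 0.20, 0))  # updated home-win forecasts, game 2 (home lost)
y <- c(1, 0)                              # observed outcomes
brier_at(f, y)                            # scores shrink as information accumulates
```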
Standardizing to target populations in multisite studies using inverse odds and augmented inverse probability weighting
Shiyao Tang
Distributed network studies assess treatment effects across heterogeneous populations by pooling data from multiple sites. We show how baseline covariate data from the entire cohort, together with treatment and outcome data from source sites, can be used to estimate average treatment effects in target sites. We propose an inverse odds weighted (IOW) augmented inverse probability weighting (AIPW) framework, improving robustness in transportability settings. Unlike standard AIPW, IOW models source population membership from observational data, introducing a third model beyond the traditional doubly robust estimator. We assess finite-sample performance in simulations and find that correct specification of the propensity score model leads to less biased estimates, while different model assumptions yield varied results. IOW-based standardization enhances precision and interpretability in multisite studies and helps explain differences in estimates across study segments.
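The core of the membership model, sketched in a hedged form (simulated data; the AIPW augmentation and variance estimation are omitted): fit a model for target-site membership given baseline covariates, then weight source-site subjects by their odds of target membership, which standardizes the source covariate distribution to the target population.

```r
set.seed(2)
n <- 2000
x <- rnorm(n)
target <- rbinom(n, 1, plogis(-0.5 + x))        # 1 = target site, 0 = source site

ps  <- fitted(glm(target ~ x, family = binomial))
iow <- ps / (1 - ps)                            # inverse odds weights (used on source rows)

mean(x[target == 1])                            # target covariate mean
weighted.mean(x[target == 0], iow[target == 0]) # reweighted source: close to the target mean
```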
Identifying cell density marker genes with a Bayesian hierarchical marked point process model
Mingchi Xu
Image-based spatially resolved transcriptomics (SRT) offers a unique opportunity to explore the relationship between cell distribution and gene expression. However, rigorous statistical models that capture this relationship remain underexplored, primarily due to the inherent randomness of cell locations and the computational complexity involved. We assume cell occurrence and gene expression are partial realizations of a marked point process. In particular, a Bayesian hierarchical model is proposed such that gene expression depends on the spatial pattern of cells. Further, associations between the two processes and across the tissue are captured through a linear model of coregionalization. The inference procedure follows the Bayesian paradigm, and efficient methods are proposed to approximate the resultant posterior distribution. Examples include artificial data and SRT data arising from mouse brain tissues, from which we identify cell-density-specific genes across different cell types.
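For readers unfamiliar with the coregionalization device mentioned above, a sketch of its simplest separable special case (one shared spatial correlation; the talk's model is more general): latent spatial factors are mixed by a matrix A, so the cross-covariance between processes is A A' scaled by the spatial correlation.

```r
## Joint covariance of two processes at shared locations under a separable
## (intrinsic) coregionalization: kronecker(spatial correlation, A %*% t(A)).
lmc_cov <- function(coords, A, phi = 1) {
  D <- as.matrix(dist(coords))
  R <- exp(-D / phi)                   # shared exponential spatial correlation
  kronecker(R, A %*% t(A))
}

coords <- cbind(runif(5), runif(5))
A <- matrix(c(1, 0.5, 0, 0.8), 2, 2)   # mixes 2 latent factors into 2 processes
Sigma <- lmc_cov(coords, A)            # 10 x 10 joint covariance
```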
Bayesian outcome weighted learning
Sophia Yazzourh
One of the primary goals of statistical precision medicine is to learn optimal individualized treatment rules (ITRs). The classification-based, or machine learning-based, approach to estimating optimal ITRs was first introduced in outcome-weighted learning (OWL). OWL recasts the optimal ITR learning problem as a weighted classification problem, which can be solved using machine learning methods, e.g., support vector machines. In this paper, we introduce a Bayesian formulation of OWL. Starting from the OWL objective function, we generate a pseudo-likelihood that can be expressed as a scale mixture of normal distributions. A Gibbs sampling algorithm is developed to sample from the posterior distribution of the parameters. In addition to providing a strategy for learning an optimal ITR, Bayesian OWL offers a natural, probabilistic approach to estimating uncertainty in the ITR treatment recommendations themselves. We demonstrate the performance of our method through several simulation studies.
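To make the weighted-classification recast concrete (a frequentist toy sketch, not the paper's Bayesian sampler): label each subject by the treatment received, weight by outcome over propensity, and fit any weighted classifier; here a weighted logistic regression stands in for the SVM, with the usual label-flip trick for negative outcomes.

```r
set.seed(3)
n <- 500
x <- rnorm(n)
a <- rbinom(n, 1, 0.5)                   # randomized treatment, propensity 0.5
y <- 1 + x * (2 * a - 1) + rnorm(n)      # reward; the optimal rule treats when x > 0

w   <- abs(y) / 0.5                      # OWL weights: |outcome| / propensity
lab <- ifelse(y > 0, a, 1 - a)           # flip the label when the outcome is negative
rule <- glm(lab ~ x, family = quasibinomial, weights = w)
treat <- predict(rule, type = "response") > 0.5   # estimated ITR: roughly (x > 0)
```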
Disease mapping of COVID-19 incidence using digital mobility data in Toronto
Chi Zhang
This study examined how geospatial mobility influenced COVID-19 incidence across five pandemic waves in Toronto (March 2020–February 2022). Using anonymized commercial GPS data, we tracked monthly origin-destination travel between 94 forward sortation areas and linked this with PCR-confirmed COVID-19 cases. A Bayesian spatial model (Besag-York-Mollié) was applied, incorporating both traditional census data (e.g., median income) and mobility data with three spatial structures: (1) a traditional contiguity-based binary weighting structure; (2) a mobility-based binary weighting structure; and (3) a mobility-based continuous weighting structure. Mobility was strongly associated with COVID-19 incidence across all waves in the mobility-adjusted spatial structures. For instance, during the first wave, estimates based on the mobility-based binary weighting structure indicated that a unit increase in mobility intensity was associated with a 73% (95% credible interval: 41%–112%) increase in the relative risk (RR) of COVID-19 incidence. This association was stronger than that of traditional area-level census predictors of COVID-19.
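A small sketch of how the mobility-based weighting structures above can be built from an origin-destination matrix (toy numbers; the contiguity structure would come from boundary files and is not shown):

```r
## Symmetrize toy monthly O-D trip counts between three areas, then derive
## binary (thresholded) and continuous mobility-based spatial weights.
M <- matrix(c(0, 50, 5,
              40, 0, 2,
              6, 3, 0), nrow = 3, byrow = TRUE)

flow <- (M + t(M)) / 2                 # undirected average flow
W_binary <- 1 * (flow >= 10)           # mobility-based binary weights
W_cont   <- flow                       # mobility-based continuous weights
diag(W_binary) <- diag(W_cont) <- 0
```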
An efficient Bayesian data augmentation algorithm for continuous-time stochastic epidemic models fit to partially observed incidence data
Haoyu Wu
Statistical modeling of infectious disease outbreaks with incidence data is crucial to understanding epidemic dynamics and informing public health interventions. Stochastic epidemic models (SEMs) account for the inherent stochasticity in epidemic evolution and are particularly useful in small- to medium-sized populations or low-prevalence settings. In SEMs, the epidemic is naturally represented as a continuous-time Markov jump process (MJP), while surveillance data are usually collected at discrete times. Fitting SEMs to partially observed epidemic trajectories requires integrating over all possible transition paths, which is often intractable. The inference task is further complicated by imperfect case detection. Individual-level data augmentation (DA) methods have been developed to address these challenges. However, existing approaches either adopted a per-subject sampling strategy with limited scalability or relied on typically unrealistic assumptions to simplify the complex dimension-change problem for state-space SEMs. We present an efficient DA Markov chain Monte Carlo block-sampling framework for fitting continuous-time state-space SEMs to incidence data subject to underdetection. We evaluate the statistical and computational performance of the proposed framework through simulations and apply it to infer the epidemic characteristics of past Ebola and H1N1 outbreaks.
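For readers who have not met the underlying process, a minimal Gillespie simulation of the SIR Markov jump process (illustrative only; the talk's inference machinery for partially observed incidence is far more involved):

```r
## Exact simulation of the continuous-time SIR MJP: draw the time to the
## next event, then choose infection (S -> I) or recovery (I -> R).
sir_gillespie <- function(S, I, R, beta, gamma, t_end) {
  t <- 0; N <- S + I + R
  path <- list(c(t, S, I, R))
  while (t < t_end && I > 0) {
    rate_inf <- beta * S * I / N
    rate_rec <- gamma * I
    t <- t + rexp(1, rate_inf + rate_rec)
    if (runif(1) < rate_inf / (rate_inf + rate_rec)) {
      S <- S - 1; I <- I + 1
    } else {
      I <- I - 1; R <- R + 1
    }
    path[[length(path) + 1]] <- c(t, S, I, R)
  }
  do.call(rbind, path)                 # columns: time, S, I, R
}

out <- sir_gillespie(S = 990, I = 10, R = 0, beta = 0.3, gamma = 0.1, t_end = 100)
```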