2022 IMS International Conference on Statistics and Data Science (ICSDS)

December 13-16, 2022, Florence, Italy







Plenary Speakers

Conformal Prediction in 2022

Emmanuel Candès Stanford University

December 13, 2022

Conformal inference methods are becoming all the rage in academia and industry alike. In a nutshell, these methods deliver exact prediction intervals for future observations without making any distributional assumption whatsoever other than having iid or, more generally, exchangeable data. This talk will review the basic principles underlying conformal inference and survey some major contributions that have occurred in the last 2-3 years or so. We will discuss enhanced conformity scores applicable to quantitative as well as categorical labels. We will also survey novel methods that deal with situations where the distribution of observations can shift drastically: think of finance or economics, where market behavior can change over time in response to new legislation or major world events, or public health, where changes occur because of geography and/or policies. All along, we shall illustrate the methods with examples including the prediction of election results or COVID-19 case trajectories.
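As a rough illustration of the basic principle mentioned in the abstract, the following is a minimal sketch of split conformal prediction with absolute-residual scores. It is not taken from the talk. The function name `split_conformal_interval` and its parameters are hypothetical, and the sketch assumes a point predictor has already been fitted on data separate from the calibration set.

```python
import math


def split_conformal_interval(cal_residuals, prediction, alpha=0.1):
    """Return a (lower, upper) prediction interval around `prediction`.

    `cal_residuals` are absolute residuals |y_i - f(x_i)| computed on a
    held-out calibration set (assumed exchangeable with the new point).
    """
    n = len(cal_residuals)
    # Finite-sample corrected rank: the ceil((n + 1) * (1 - alpha))-th
    # smallest calibration score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: only the trivial
        # (unbounded) interval carries the coverage guarantee.
        return (-math.inf, math.inf)
    q = sorted(cal_residuals)[k - 1]
    return (prediction - q, prediction + q)


# Hypothetical usage: nine calibration residuals, point prediction 10.0.
residuals = [3.0, 1.0, 9.0, 5.0, 2.0, 7.0, 4.0, 8.0, 6.0]
interval = split_conformal_interval(residuals, prediction=10.0, alpha=0.5)
```

Under exchangeability, an interval built this way covers the new observation with probability at least 1 - alpha, whatever the data distribution. The enhanced conformity scores and shift-robust methods surveyed in the talk go beyond this simple version.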


Multiple Randomization Designs

Guido Imbens Stanford University

December 14, 2022

In this talk I will discuss a new class of experimental designs, Multiple Randomization Designs. In a classical randomized controlled trial (RCT), or A/B test, a randomly selected subset of a population of units (e.g., individuals, plots of land, or experiences) is assigned to a treatment (treatment A), and the remainder of the population is assigned to the control treatment (treatment B). The difference in average outcome by treatment group is an estimate of the average effect of the treatment. However, motivating this talk, the setting for modern experiments is often different, with the outcomes and treatment assignments indexed by multiple populations. For example, outcomes may be indexed by buyers and sellers, by content creators and subscribers, by drivers and riders, or by travelers and airlines and travel agents, with treatments potentially varying across these indices. Spillovers or interference can arise from interactions between units across populations. For example, sellers' behavior may depend on buyers' treatment assignment, or vice versa. This can invalidate the simple comparison of means as an estimator for the average effect of the treatment in classical RCTs. I discuss new experimental designs for settings in which multiple populations interact. I show how these designs allow us to study questions about interference that cannot be answered by classical randomized experiments. Finally, I discuss new statistical methods for analyzing these Multiple Randomization Designs.
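For reference, the classical comparison of means described in the abstract can be sketched as follows. This is only the baseline estimator for a single-population RCT without interference, not one of the new methods for Multiple Randomization Designs. The function name and the toy data are hypothetical.

```python
from statistics import mean


def difference_in_means(outcomes, treated):
    """Estimate the average treatment effect in a classical RCT.

    `outcomes` holds one observed outcome per unit; `treated` holds the
    matching assignment flags (True = treatment A, False = control B).
    """
    treated_outcomes = [y for y, t in zip(outcomes, treated) if t]
    control_outcomes = [y for y, t in zip(outcomes, treated) if not t]
    if not treated_outcomes or not control_outcomes:
        raise ValueError("both treatment groups must be non-empty")
    return mean(treated_outcomes) - mean(control_outcomes)


# Hypothetical toy data: four units, two assigned to each arm.
estimate = difference_in_means([3.0, 5.0, 1.0, 2.0], [True, True, False, False])
```

When one unit's outcome depends on the assignments of other units, as in the buyer and seller example, this simple difference no longer isolates the average effect of the treatment, which is the motivation for the designs discussed in the talk.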

Inference for Longitudinal Data After Adaptive Sampling

Susan Murphy Harvard University

December 15, 2022

Adaptive sampling methods, such as reinforcement learning (RL) and bandit algorithms, are increasingly used for the real-time personalization of interventions in digital applications like mobile health and education. As a result, there is a need to be able to use the resulting adaptively collected user data to address a variety of inferential questions, including questions about time-varying causal effects. However, current methods for statistical inference on such data (a) make strong assumptions regarding the environment dynamics, e.g., assume the longitudinal data follows a Markovian process, or (b) require data to be collected with one adaptive sampling algorithm per user, which excludes algorithms that learn to select actions using data collected from multiple users. These are major obstacles preventing the use of adaptive sampling algorithms more widely in practice. In this work, we provide statistical inference for the common Z-estimator based on adaptively sampled data. The inference is valid even when observations are non-stationary and highly dependent over time, and allows the online adaptive sampling algorithm to learn using the data of all users. Furthermore, our inference method is robust to misspecification of the reward models used by the adaptive sampling algorithm. This work is motivated by our work in designing the Oralytics oral health clinical trial in which an RL adaptive sampling algorithm will be used to select treatments, yet valid statistical inference is essential for conducting primary data analyses after the trial is over.


Scaling up Bayesian Modeling and Computation for real-world biomedical and public health applications

Sylvia Richardson University of Cambridge

December 16, 2022


The fast expansion of biomedical data resources is underpinning advances in medical research. However, it has brought a number of challenges for Bayesian inferential approaches. The “large n” data setting, such as encountered in the analysis of large cohorts or electronic health records, often creates computational bottlenecks, precluding model search endeavours. The “large p” setting, inherent to the modelling of high-dimensional data arising, for example, from the development of precision medicine strategies and new techniques to probe biomolecular mechanisms, can make joint analysis unreliable or intractable. Public health emergencies, like the Covid-19 pandemic, have shown the value of performing data synthesis at pace to carry out disease tracking. These varied contexts call for combining Bayesian hierarchical modelling with scalable approximate algorithms capable of producing accurate and robust inferences.


In this talk, I will first discuss the adaptation of the divide-and-conquer approaches for large n to the inferential context of model choice and of mixture models, an adaptation which goes beyond the well-established divide-and-conquer approaches developed for posterior inference on a chosen model with a fixed number of parameters. I will next introduce some current analysis needs in biomedicine and discuss modelling and computational strategies for implementing joint regression modelling of a large number p of features and responses, and for joint network analyses. In both cases, information is borrowed through suitable hierarchical formulations. If time permits, I will end with a brief discussion of the challenges to conventional statistical and data science practice brought into focus by health surveillance in the recent pandemic.