Session of 14 November 2022

Session organized by Emilie Lebarbier and Nicolas Marie on high-dimensional statistics.

Venue: IHP, amphi Hermite


14.00 : Pierre Alquier (RIKEN AIP)

Title: Concentration and robustness of discrepancy-based ABC via Rademacher complexity

Abstract: Classical implementations of approximate Bayesian computation (ABC) employ summary statistics to measure the discrepancy between the observed data and the synthetic samples generated from each proposed value of the parameter of interest. However, finding effective summaries is challenging for most of the complex models for which ABC is required. This issue has motivated a growing literature on summary-free versions of ABC that leverage the discrepancy between the empirical distributions of the observed and synthetic data, rather than focusing on selected summaries. The effectiveness of these solutions has led to an increasing interest in the properties of the corresponding ABC posteriors, with a focus on concentration and robustness in asymptotic regimes. Although recent contributions have made key advancements, current theory mostly relies on existence arguments which are not immediate to verify and often yield bounds that are not readily interpretable, thus limiting the methodological implications of theoretical results. In this talk, we address these aspects by developing a novel unified and constructive framework, based on the concept of Rademacher complexity, to study the concentration and robustness of ABC posteriors within the general class of integral probability semimetrics (IPS), which includes routinely implemented discrepancies such as the Wasserstein distance and MMD, and naturally extends classical summary-based ABC. For rejection ABC based on the IPS class, we prove that the theoretical properties of the ABC posterior, in terms of concentration and robustness, relate directly to the asymptotic behavior of the Rademacher complexity of the class of functions associated with each discrepancy. This result yields a novel understanding of the practical performance of ABC with specific discrepancies, as also shown in empirical studies, and makes it possible to develop new theory guiding ABC calibration.
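As a concrete illustration of the summary-free setting discussed above, here is a minimal sketch of rejection ABC with the MMD discrepancy (one instance of the IPS class) on a toy Gaussian location model. The flat prior, Gaussian-kernel bandwidth, and quantile-based acceptance rule are illustrative choices, not prescriptions from the talk.

```python
import numpy as np

def mmd2(x, y, bw=1.0):
    """Biased (V-statistic) estimate of the squared MMD with a Gaussian kernel."""
    def gram(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bw ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

def abc_mmd(obs, n_sims=500, keep=50, seed=1):
    """Summary-free rejection ABC: draw theta from the prior, simulate a
    synthetic sample, and keep the draws closest to the data in MMD."""
    rng = np.random.default_rng(seed)
    thetas = rng.uniform(-5.0, 5.0, size=n_sims)   # flat prior on the location
    dists = np.array([mmd2(obs, rng.normal(t, 1.0, size=obs.size))
                      for t in thetas])
    return thetas[np.argsort(dists)[:keep]]        # accepted posterior draws

rng = np.random.default_rng(0)
obs = rng.normal(1.0, 1.0, size=100)               # observed data, true location 1
post = abc_mmd(obs)                                # concentrates around 1
```

Replacing `mmd2` with any other IPS-type discrepancy (e.g. the 1-Wasserstein distance between empirical distributions) changes only the distance computation; the concentration behavior of the accepted draws is precisely what the Rademacher-complexity framework quantifies.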


15.00 : Stephan Clémençon (Télécom Paris)

Titre : A Bipartite Ranking Approach to the Two-Sample Problem

Abstract: The problem of testing whether two independent i.i.d. random samples X_1, ..., X_n and Y_1, ..., Y_m are drawn from the same (unknown) probability distribution on a measurable space Z or not, usually referred to as the two-sample problem, is ubiquitous. It finds applications in many areas, ranging from clinical trials to data attribute matching through psychometrics, for instance. Its study in high-dimensional settings is the subject of much attention, in particular because the data acquisition processes at work in the Big Data era involve various sources of information and are often poorly controlled, leading to datasets that may exhibit a strong sampling bias that can jeopardize their use for statistical learning purposes. In such situations, classic methods relying on the computation of a discrepancy measure between empirical versions of the distributions of X and Y (e.g. integral probability metrics) are naturally confronted with the curse of dimensionality. In this talk, we will explain how to develop an alternative approach extending rank tests, known to be asymptotically optimal for univariate distributions when appropriately designed, to the multivariate setup. To overcome the lack of a natural order on R^d as soon as d >= 2, the proposed method is implemented in two steps. It consists in dividing each of the two samples into two halves: a preorder on R^d defined by a real-valued scoring function is first learned by means of a bipartite ranking algorithm applied to the first halves of the samples, and a rank test is then applied to the scores of the observations of the second halves in order to detect possible differences between their (univariate) distributions.
Because it learns how to project the data onto the real line as (any monotone transform of) the likelihood ratio between the original multivariate distributions would do, the approach is not affected by the curse of dimensionality (ranking model bias issues aside) and preserves the asymptotic optimality of univariate rank tests, being capable of detecting small departures from the null (homogeneity) assumption. Beyond a theoretical analysis establishing non-asymptotic bounds for the two types of error of the method, based on recent concentration results for two-sample linear R-processes, an extensive experimental study will be presented, showing that the proposed approach outperforms alternative methods standing as natural competitors.
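The two-step procedure described above can be sketched as follows. For simplicity, a mean-difference linear scorer stands in for a genuine bipartite ranking algorithm, and the Mann-Whitney (Wilcoxon rank-sum) test plays the role of the univariate rank test; this is a toy illustration, not the speaker's implementation.

```python
import numpy as np
from scipy.stats import mannwhitneyu

def rank_two_sample_test(X, Y, seed=1):
    """Two-step test: (1) learn a real-valued scoring function on the first
    halves of the two samples, (2) run a univariate rank test on the scores
    of the held-out second halves."""
    rng = np.random.default_rng(seed)
    X, Y = rng.permutation(X), rng.permutation(Y)  # shuffle rows before splitting
    n, m = len(X) // 2, len(Y) // 2
    w = Y[:m].mean(axis=0) - X[:n].mean(axis=0)    # step 1: scoring direction
    s_x, s_y = X[n:] @ w, Y[m:] @ w                # step 2: held-out scores
    return mannwhitneyu(s_x, s_y).pvalue           # univariate rank test

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 5))
Y = rng.normal(0.5, 1.0, size=(200, 5))            # location shift: H0 is false
p_value = rank_two_sample_test(X, Y)               # small p-value expected
```

The sample splitting is what keeps the test valid: the scoring function is learned on data that the rank test never sees, so under homogeneity the held-out scores of the two samples are exchangeable whatever the learned direction.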

Keywords: bipartite ranking; nonparametric statistical hypothesis testing; two-sample linear rank statistic/process; two-sample problem.


16.00 : Julien Chiquet (Université Paris-Saclay, AgroParisTech, INRAE)

Title: Variational Inference in the Poisson-Lognormal Model: optimisation and estimation

Abstract: The multivariate Poisson-lognormal (PLN) model is a popular latent variable model commonly used to describe abundance tables, which relies on a Gaussian latent layer to encode the dependencies between count-valued variables in a covariance matrix. It can be viewed as a multivariate mixed Poisson regression model. The PLN model turns out to be a versatile framework, within which a variety of analyses can be performed, including multivariate sample comparison, clustering of sites or samples, dimension reduction (ordination) for visualization purposes, and the inference of interaction networks. Inferring such models raises both statistical and computational issues, many of which were solved in recent contributions using variational techniques and convex optimization tools. I will present the general PLN framework and describe some recent advances in the variational optimization of such models, along with the statistical properties of the associated estimators.
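The generative layer of the PLN model is simple to write down; a minimal simulation sketch follows (the dimensions, mean, and latent covariance are illustrative, and the variational inference machinery discussed in the talk is not reproduced here).

```python
import numpy as np

def sample_pln(n, mu, sigma, seed=0):
    """Simulate an abundance table from the Poisson-lognormal model:
    latent Gaussian layer Z_i ~ N(mu, sigma) encodes the dependencies,
    and counts are drawn as Y_ij ~ Poisson(exp(Z_ij)) given Z."""
    rng = np.random.default_rng(seed)
    Z = rng.multivariate_normal(mu, sigma, size=n)   # latent Gaussian layer
    return rng.poisson(np.exp(Z))                    # count-valued observations

p = 3                                 # number of species / count variables
mu = np.zeros(p)
sigma = 0.5 * np.eye(p) + 0.2         # positive off-diagonals: correlated species
counts = sample_pln(500, mu, sigma)   # 500 x 3 abundance table
```

The covariance matrix sigma is the object of interest in most PLN analyses (e.g. network inference reads interactions off its inverse), and the intractability of the Poisson-lognormal likelihood is precisely what motivates the variational approximations covered in the talk.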