CREST-CMAP Statistics Seminar

The seminar is normally held on Mondays at 14:00 (2:00 PM CET).

Most sessions are held in person (usually in room 3001), except those that are remote-only (for which Zoom links are circulated via the mailing list).

Organizing committee

Anna Korba (CREST, ENSAE)

Karim Lounici (CMAP, École polytechnique)

Jaouad Mourtada (CREST, ENSAE)

Upcoming sessions


Estimation of Functionals of High-Dimensional Parameters: Bias Reduction and Concentration

Monday, May 6; Monday, May 13; Thursday, May 16; Thursday, May 23; 1:00-4:15pm (in person)

This course will focus on a circle of problems related to the estimation of real-valued functionals of the parameters of high-dimensional statistical models. In such problems, it is of interest to estimate one-dimensional features of a high-dimensional parameter, often represented by nonlinear functionals of a certain degree of smoothness defined on the parameter space. The functionals of interest can often be estimated at faster convergence rates than the full parameter (sometimes even at parametric rates). Examples include the estimation of linear functionals of principal components (which are nonlinear functionals of the unknown covariance) in high-dimensional PCA. The goal is to discuss several mathematical methods that provide a way to develop estimators of functionals of high-dimensional parameters with optimal error rates over classes of functionals of given Hölder smoothness. Full description here.
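
As a concrete example of such a functional (our notation, for illustration only): in high-dimensional PCA one may wish to estimate

$f(\Sigma) = \langle u_1(\Sigma), v \rangle,$

where $\Sigma$ is the unknown covariance matrix, $u_1(\Sigma)$ its top principal component and $v$ a fixed direction; $f$ is a nonlinear but smooth functional of $\Sigma$ (away from eigenvalue ties), and the course discusses bias-reduction and concentration techniques showing that such functionals can be estimated at faster rates than $\Sigma$ itself.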


Nicolas Schreuder (CNRS, Université Gustave Eiffel)



Past sessions

Past sessions from 2023-2024 (see other pages on this website for previous years):


Andrea Celli (Bocconi University)

No-Regret Learning in Bilateral Trade via Global Budget Balance

Bilateral trade revolves around the challenge of facilitating transactions between two strategic agents — a seller and a buyer — both of whom have private valuations for the item. We study the online version of the problem, in which at each time step a new seller and buyer arrive. The learner's task is to set a price for each agent, without any knowledge of their valuations. The sequence of sellers and buyers is chosen by an oblivious adversary. In this setting, known negative results rule out the possibility of designing algorithms with sublinear regret when the learner has to guarantee budget balance at each iteration. We introduce the notion of global budget balance, which requires the agent to be budget balanced only over the entire time horizon. By requiring global budget balance, we provide the first no-regret algorithms for bilateral trade with adversarial inputs under various feedback models (full feedback and one-bit feedback). We complement these results with a nearly matching lower bound.
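
As a rough formalization (our notation, which may differ from the paper's): if at round $t$ the learner charges the buyer a price $p_t$ and pays the seller a price $q_t$ whenever a trade occurs, per-round budget balance requires $p_t \ge q_t$ at every round, whereas global budget balance only requires

$\sum_{t=1}^{T} (p_t - q_t)\,\mathbf{1}\{\text{trade at round } t\} \ge 0,$

so the learner may subsidize individual trades as long as it incurs no overall deficit over the horizon $T$.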


Matey Neykov (Northwestern University)

Some Insights in Nonparametric Conditional Independence Testing

In this talk we discuss some recent developments in nonparametric conditional independence testing. Specifically, if X, Y, Z are three real-valued random variables with bounded support, with Z continuous, we develop a minimax optimal test which determines whether X is independent of Y given Z under certain smoothness assumptions. Unfortunately, the minimax optimal test, which is based on binning Z, depends on unknown constants. In order to calibrate it, we propose a local permutation procedure which permutes samples whose corresponding Z values fall into the same bin. Despite its simplicity and empirical support, the theoretical underpinnings of the local permutation test are unclear. To that end, we establish theoretical foundations of local permutation tests, with a particular focus on binning-based statistics. In particular, we derive sufficient conditions under which the type I error of our test is controlled below a desired level, and we show that in some cases our test, calibrated through local permutation, can achieve minimax optimal power.
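
To fix ideas, here is a minimal sketch of a binning-based local permutation test in Python (the bin choice, test statistic and calibration below are simplistic placeholders, not the minimax-optimal construction from the talk):

import numpy as np

def local_permutation_test(x, y, z, n_bins=10, n_perm=200, seed=0):
    """Test X independent of Y given Z by binning Z and permuting Y within bins.

    Statistic: within-bin absolute correlations between X and Y, averaged with
    bin-size weights (a simple placeholder statistic). Returns a permutation p-value.
    """
    rng = np.random.default_rng(seed)
    edges = np.quantile(z, np.linspace(0, 1, n_bins + 1))
    idx = np.clip(np.digitize(z, edges[1:-1]), 0, n_bins - 1)  # bin index of each Z

    def stat(y_vals):
        s = 0.0
        for b in range(n_bins):
            m = idx == b
            if m.sum() > 2 and x[m].std() > 0 and y_vals[m].std() > 0:
                s += abs(np.corrcoef(x[m], y_vals[m])[0, 1]) * m.sum()
        return s / len(x)

    t_obs = stat(y)
    t_perm = []
    for _ in range(n_perm):
        y_p = y.copy()
        for b in range(n_bins):
            m = np.where(idx == b)[0]
            y_p[m] = y_p[rng.permutation(m)]  # permute Y only within this Z-bin
        t_perm.append(stat(y_p))
    return (1 + sum(t >= t_obs for t in t_perm)) / (n_perm + 1)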


Hannes Leeb (University of Vienna)

Conditional Predictive Inference for Stable Algorithms

We investigate generically applicable and intuitively appealing prediction intervals based on k-fold cross-validation. We focus on the conditional coverage probability of the proposed intervals, given the observations in the training sample (hence, training conditional validity), and show that it is close to the nominal level, in an appropriate sense, provided that the underlying algorithm used for computing point predictions is sufficiently stable when feature-response pairs are omitted. Our results are based on a finite sample analysis of the empirical distribution function of k-fold cross-validation residuals and hold in nonparametric settings with only minimal assumptions on the error distribution. To illustrate our results, we also apply them to high-dimensional linear predictors, where we obtain uniform asymptotic training conditional validity as both sample size and dimension tend to infinity at the same rate and consistent parameter estimation typically fails. These results show that despite the serious problems of resampling procedures for inference on the unknown parameters, cross-validation methods can be successfully applied to obtain reliable predictive inference even in high dimensions and conditionally on the training data.
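
As a simple illustration of the type of interval studied in the talk, here is a sketch of a K-fold cross-validation prediction interval in Python (a simplified symmetric variant; the paper's construction and its training-conditional analysis are more refined):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

def kfold_prediction_interval(X, y, X_new, alpha=0.1, k=10):
    """Prediction interval built from K-fold cross-validation residuals."""
    residuals = []
    for train, test in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
        model = LinearRegression().fit(X[train], y[train])
        residuals.extend(np.abs(y[test] - model.predict(X[test])))
    # Symmetric interval around the full-sample fit, with half-width given by
    # an upper empirical quantile of the held-out absolute residuals.
    q = np.quantile(residuals, 1 - alpha)
    center = LinearRegression().fit(X, y).predict(X_new)
    return center - q, center + q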


Anna Korba (CREST, ENSAE)

Sampling with Mollified Interaction Energy Descent

Sampling from a target measure whose density is only known up to a normalization constant is a fundamental problem in computational statistics and machine learning. We present a new optimization-based method for sampling called mollified interaction energy descent (MIED), which minimizes an energy on probability measures called the mollified interaction energy (MIE). The latter converges to the chi-square divergence with respect to the target measure, and the gradient flow of the MIE agrees with that of the chi-square divergence, as the mollifiers approach Dirac deltas. Optimizing this energy with a proper discretization yields a practical first-order particle-based algorithm for sampling in both unconstrained and constrained domains. We demonstrate the performance of our algorithm on both unconstrained and constrained sampling problems, in comparison to state-of-the-art alternatives.


Zacharie Naulet (Université Paris-Saclay)

Frontiers to the learning (and clustering) of Hidden Markov Models

Hidden Markov models (HMMs) are flexible tools for clustering dependent data coming from unknown populations, allowing nonparametric identifiability of the population densities when the data is "truly" dependent. In the first part of the talk, I will present our result characterizing the frontier between learnable and unlearnable two-state nonparametric HMMs in terms of a suitable notion of "distance" to independence. I will present surprising new phenomena emerging in the nonparametric setting. In particular, it is possible to "borrow strength" from the estimator of the smoothest density to improve the estimation of the other. We conduct a precise analysis of minimax rates, showing a transition that depends on the relative smoothness of the emission densities. In the (short) second part, I will present our result on clustering HMM data. We compute upper and lower bounds on the Bayes risk of clustering and identify the key quantity determining the difficulty of the clustering task.

First part is joint work with Kweku Abraham and Elisabeth Gassiat, second part is joint work with Elisabeth Gassiat and Ibrahim Kaddouri.


Anindya De (University of Pennsylvania)

Testing convex truncation

We study the basic statistical problem of testing whether normally distributed n-dimensional data has been truncated, i.e. altered by only retaining points that lie in some unknown truncation set S. As our main algorithmic results, (1) we give a computationally efficient O(n)-sample algorithm that can distinguish the standard normal distribution from the normal conditioned on an unknown and arbitrary convex set S, and (2) we give a different computationally efficient O(n)-sample algorithm that can distinguish the standard normal distribution from the normal conditioned on an unknown and arbitrary mixture of symmetric convex sets.

These results stand in sharp contrast with known results for learning or testing convex bodies with respect to the normal distribution or learning convex-truncated normal distributions, where state-of-the-art algorithms require essentially n^{O(sqrt{n})} samples. An easy argument shows that no finite number of samples suffices to distinguish the normal from an unknown and arbitrary mixture of general (not necessarily symmetric) convex sets, so no common generalization of results (1) and (2) above is possible. We also prove lower bounds on the sample complexity of distinguishing algorithms (computationally efficient or otherwise) for various classes of convex truncations; in some cases these lower bounds match our algorithms up to logarithmic or even constant factors. Based on joint work with Shivam Nadimpalli and Rocco Servedio.


Eugène Ndiaye (Apple Research)

Finite Sample Confidence Sets with "Minimal" Assumptions on the Distribution 

If you predict a label y of a new object with $\hat{y}$, how confident are you that "$y = \hat{y}$"? The conformal prediction method provides an elegant framework for answering such a question, by establishing a confidence set for an unobserved response of a feature vector based on previous similar observations of responses and features. This is performed without assumptions on the distribution of the data. While providing strong coverage guarantees, computing conformal prediction sets requires fitting a predictive model to an augmented dataset considering all possible values that the unobserved response can take, and then selecting the most likely ones. For a regression problem where y is a continuous variable, this typically requires an infinite number of model fits, which is usually infeasible. By assuming a little more regularity in the underlying prediction models, I will describe some of the techniques that make the calculations feasible. Along similar lines, one can assume that we are working with a parametric model that explains the relation between input and output variables. A natural question then arises as to whether a confidence interval on the ground-truth parameter of the model can be constructed, also without assumptions on the distribution of the data. In this presentation, I will provide a partial answer to this question with some preliminary results. This is mostly an ongoing project, with many open questions that I am eager to discuss and share insights on for future work.
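
To recall the construction alluded to above (standard full conformal prediction, in our notation): for each candidate value $y$, refit the model on the augmented dataset $(x_1, y_1), \dots, (x_n, y_n), (x_{n+1}, y)$ and compute conformity scores $R_1^y, \dots, R_{n+1}^y$ (e.g. absolute residuals); the conformal set at miscoverage level $\alpha$ is then

$C_\alpha(x_{n+1}) = \Big\{ y : \frac{1}{n+1} \sum_{i=1}^{n+1} \mathbf{1}\{ R_i^y \ge R_{n+1}^y \} > \alpha \Big\},$

which satisfies $\mathbb{P}(y_{n+1} \in C_\alpha(x_{n+1})) \ge 1 - \alpha$ whenever the data are exchangeable; the computational difficulty mentioned above is that, for continuous responses, this in principle requires one model fit per candidate $y$.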


Clément Bonet (CREST, ENSAE)

Sliced-Wasserstein Distances on Cartan-Hadamard Manifolds

While many machine learning methods have been developed on or transposed to Riemannian manifolds to handle data with known non-Euclidean geometry, Optimal Transport (OT) methods on such spaces have not received much attention. The main OT tool on these spaces is the Wasserstein distance, which suffers from a heavy computational burden. On Euclidean spaces, a popular alternative is the Sliced-Wasserstein distance, which leverages the closed-form solution of the Wasserstein distance in one dimension, but which is not readily available on manifolds. In this work, we derive general constructions of Sliced-Wasserstein distances on Hadamard manifolds, Riemannian manifolds with non-positive curvature, which include among others hyperbolic spaces and the space of symmetric positive definite matrices. Additionally, we derive non-parametric schemes to minimize these new distances by approximating their Wasserstein gradient flows.
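
For reference, on a Euclidean space the Sliced-Wasserstein distance that the talk generalizes can be estimated by Monte Carlo over random one-dimensional projections; here is a plain Euclidean sketch in Python (not the manifold construction from the talk):

import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, p=2, seed=0):
    """Monte Carlo estimate of SW_p between two empirical measures on R^d.

    Assumes X and Y contain the same number of points, so 1D optimal transport
    reduces to matching sorted projections.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # random directions on the sphere
    proj_X = np.sort(X @ theta.T, axis=0)                  # sorted 1D projections of X
    proj_Y = np.sort(Y @ theta.T, axis=0)                  # sorted 1D projections of Y
    return np.mean(np.abs(proj_X - proj_Y) ** p) ** (1 / p)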


Gabriel Peyré (École Normale Supérieure)

Conservation Laws for Gradient Flows

Understanding the geometric properties of gradient descent dynamics is a key ingredient in deciphering the recent success of very large machine learning models. A striking observation is that trained over-parameterized models retain some properties of their initialization. This "implicit bias" is believed to be responsible for some favorable properties of the trained models and could explain their good generalization. In this talk I will first rigorously present the definition and basic properties of "conservation laws", which are maximal sets of independent quantities conserved during gradient flows of a given model (e.g. of a ReLU network with a given architecture) for any training data and any loss. Then I will explain how to find the exact number of these quantities by performing finite-dimensional algebraic manipulations on the Lie algebra generated by the Jacobian of the model. In the specific case of linear and ReLU networks, this procedure recovers the conservation laws known in the literature and proves that there are no other laws. The associated paper can be found at https://arxiv.org/abs/2307.00144, and the open-source code is available online. This is joint work with Sibylle Marcotte and Rémi Gribonval.
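
A classical instance of such a conserved quantity, recovered by the approach described above: for a two-layer linear model $x \mapsto W_2 W_1 x$ trained by gradient flow on any data and any differentiable loss, the "balancedness" matrix

$W_1 W_1^\top - W_2^\top W_2$

remains constant along the trajectory, as a short computation with the gradient-flow equations shows.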


Anya Katsevich (MIT)

The Laplace approximation in high-dimensional Bayesian inference

Computing integrals against a high-dimensional posterior is the major computational bottleneck in Bayesian inference. A popular technique to reduce this computational burden is to use the Laplace approximation, a Gaussian distribution, in place of the true posterior. Despite its widespread use, the accuracy of the Laplace approximation in high dimensions is not well understood. The body of existing results does not form a cohesive theory, leaving open important questions, e.g. on the dimension dependence of the approximation rate. We address many of these questions through the unified framework of a new, leading-order asymptotic decomposition of high-dimensional Laplace integrals. In particular, we (1) determine the tight dimension dependence of the approximation error, leading to the tightest known Bernstein-von Mises result on the asymptotic normality of the posterior, and (2) derive a simple correction to this Gaussian distribution to obtain a higher-order accurate approximation to the posterior.
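
For concreteness, the Laplace approximation replaces the posterior $\pi(\theta \mid X)$ by the Gaussian

$\hat{\pi}(\theta \mid X) = \mathcal{N}\big(\hat{\theta},\, H^{-1}\big), \qquad \hat{\theta} = \arg\max_{\theta} \log \pi(\theta \mid X), \qquad H = -\nabla^2 \log \pi(\hat{\theta} \mid X),$

i.e. a Gaussian centered at the posterior mode whose covariance is the inverse Hessian of the negative log-posterior at the mode; the results described above quantify how the error of this substitution scales with the dimension.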


Emmanuel Pilliat (Université de Montpellier)

Ranking the Rows of a Permuted Isotonic Matrix Optimally and in Polynomial Time

In the context of crowdsourcing, a natural question is how accurately we can determine the ranking of experts (or workers) who are labelling some data. The same kind of question arises in tournaments, where we might want to rank players based on the outcomes of games between pairs of players. We will consider a ranking problem where we have noisy observations from a matrix with isotonic columns whose rows have been permuted by some permutation π*. This encompasses many models, including crowd-labeling and ranking in tournaments by pairwise comparisons. After introducing the statistical model and the risk measures, we will discuss the ideas behind an optimal, polynomial-time procedure for recovering π*, settling an open problem in Flammarion et al. (2019). The presented approach is based on iterative pairwise comparisons using suitable data-driven weighted means of the columns.
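
In symbols (our notation, which may differ from the talk's), the observation model is of the form

$Y_{ij} = M_{\pi^*(i),\, j} + E_{ij}, \qquad 1 \le i \le n, \ 1 \le j \le d,$

where every column of $M$ is nondecreasing, $\pi^*$ is an unknown permutation of the rows, and $E$ is centered noise; in crowd-labeling, rows index workers, columns index questions, and the goal is to recover the ordering of the workers.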


An Introduction to Conformal Prediction and Distribution-Free Inference

Monday, March 18 and Thursday, March 21, 1:00-4:15pm (in person)

This short course will introduce the framework of distribution-free statistical inference, and will provide an in-depth overview of both theoretical foundations and practical methodologies in this field. We will cover methods including holdout set methods, conformal prediction, cross-validation based methods, calibration procedures, and more, with emphasis on how these methods can be adapted to a range of settings to achieve robust uncertainty quantification without compromising on accuracy. The course will also introduce the theoretical results behind these methods, including the role of exchangeability (and its variants) in establishing the distribution-free validity of these methods, as well as more classical theoretical results establishing how these distribution-free methods relate to the answers we would obtain via parametric models or other classical assumption-based techniques. Our theoretical overview will also cover hardness results that carve out the space of inference questions that are possible or impossible to answer within the distribution-free framework.
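
As a preview of the simplest method on this list, here is a sketch of split (holdout) conformal regression in Python (the underlying model and the absolute-residual score are arbitrary placeholder choices):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

def split_conformal_interval(X_train, y_train, X_cal, y_cal, X_new, alpha=0.1):
    """Split conformal prediction interval with absolute-residual scores."""
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    scores = np.abs(y_cal - model.predict(X_cal))  # calibration (holdout) scores
    n = len(scores)
    # Finite-sample-corrected empirical quantile of the calibration scores.
    q = np.quantile(scores, min(1.0, np.ceil((1 - alpha) * (n + 1)) / n))
    pred = model.predict(X_new)
    # Marginal coverage >= 1 - alpha under exchangeability.
    return pred - q, pred + q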


Vincent Divol (Université Paris Dauphine)

Entropic estimation of optimal transport maps in the semi-discrete case

We study the question of estimating the optimal transport map T between two distributions P and Q, based on i.i.d. samples from the two distributions. This problem has gained a lot of traction since J-C. Hütter and P. Rigollet first proposed an estimator in 2021, based on wavelet decompositions. However, all analyses so far crucially rely on the smoothness of the map T, whereas optimal transport maps T are known to be discontinuous in many practical cases (e.g. when P has a connected support and Q has a disconnected support). As a first step towards developing estimation procedures for discontinuous maps, we consider the important special case where the data distribution Q is a discrete measure supported on a finite number of points. We study a computationally efficient estimator of T based on entropic smoothing, and show that it attains optimal, dimension-free rates of convergence. As a byproduct of our analysis, we give a new stability result for the entropic transport map.
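
Schematically (standard entropic OT notation, ours): given the empirical measures $P_n$ and $Q_n$, one solves the entropy-regularized problem

$\pi_\varepsilon = \arg\min_{\pi \in \Pi(P_n, Q_n)} \int \tfrac{1}{2}\|x - y\|^2 \, d\pi(x, y) + \varepsilon \, \mathrm{KL}(\pi \,\|\, P_n \otimes Q_n),$

e.g. with Sinkhorn's algorithm, and estimates the map by the barycentric projection $T_\varepsilon(x) = \mathbb{E}_{\pi_\varepsilon}[Y \mid X = x]$; the talk analyzes an estimator of this type in the semi-discrete setting.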


Antoine Maillard (ETH Zurich)

From the theory of disordered systems to the mathematics of data science

I will introduce a connection between statistical learning or high-dimensional inference, and models of statistical mechanics in which a large number of elementary units interact with each other in a disordered manner. This relationship is the basis of a vast mathematical program leveraging tools and insights from the physics of disordered systems to improve our understanding of high-dimensional probabilistic models, with applications e.g. in machine learning and constraint satisfaction problems. I will then detail concrete examples of this program by presenting recent advances concerning the phase retrieval problem, as well as a challenging open problem of high-dimensional probability and statistics known as the ellipsoid fitting conjecture.


Ludovic Stephan (EPFL)

Feature learning in two-layer neural networks with large gradient steps

Feature learning is an important mechanism of neural networks, and an integral part of their advantages over simpler (e.g. kernel) learning methods. In this talk, I will present how this phenomenon occurs in two-layer networks trained with large gradient steps, in which both the batch size and the learning rate grow polynomially with the dimension. In particular, we uncover an occurrence of the so-called "staircase" property of learning, where important directions are learned sequentially at each new step.


Baptiste Goujaud (Ecole polytechnique)

Constructive approaches to the analyses of gradient methods for Machine Learning

In the current era marked by an unprecedented surge in available data and computational prowess, the field of machine learning, and more specifically deep learning, has witnessed an extraordinary evolution. Among the important pieces underlying this success, machine learning algorithms heavily rely on optimization techniques to tune their parameters and enhance their predictive capabilities.

Among the myriad of optimization approaches, first-order algorithms have emerged as cornerstones, demonstrating a remarkable balance between efficacy and computational efficiency. Crucially, the development of strong optimization theory is pivotal in unraveling the full potential of first-order optimization to account for the specificities of learning applications. Among the keys to the success of first-order algorithms in learning applications, momentum-based algorithms have proven their effectiveness by significantly accelerating training procedures. The conceptual foundation provided by optimization theory has enabled the formulation of momentum, turning theoretical insights into a powerful and widely adopted practical optimization tool.
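
For reference, the canonical momentum (heavy-ball) update on a differentiable objective $f$ reads

$x_{k+1} = x_k - \gamma \nabla f(x_k) + \beta (x_k - x_{k-1}),$

with step size $\gamma > 0$ and momentum parameter $\beta \in [0, 1)$; the acceleration phenomena mentioned above concern how such two-term recursions can provably improve on plain gradient descent ($\beta = 0$) over suitable problem classes.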

In this presentation, we provide constructive approaches to pursue and accelerate the effort to develop a strong theoretical foundation for first-order algorithms, and we show how an understanding of the general structures underlying performance analyses yields improved algorithms and analyses in a number of situations. Finally, we discuss how this relates to the fundamental goal of machine learning algorithms: obtaining certifiable models with good generalization properties.