The seminar is normally held on Mondays at 14:00 (2:00 PM CET).
Most sessions are held in person, usually in room 3001. Exceptionally, some sessions may be held remotely, in which case Zoom links are circulated on the mailing list.
The seminar is open; in order to be added to the mailing list, please send an email to Leyla Marzuk (first [dot] last [at] ensae [dot] fr).
Anna Korba (CREST, ENSAE)
Karim Lounici (CMAP, École polytechnique)
Jaouad Mourtada (CREST, ENSAE)
Monday, April 28, 2:00pm
Austin Stromme (ENSAE)
Asymptotic log-Sobolev constants and the Polyak-Łojasiewicz gradient domination condition
The Polyak-Łojasiewicz (PL) constant for a given function exactly characterizes the exponential rate of convergence of gradient flow uniformly over initializations, and has been of major recent interest in optimization and machine learning because it is strictly weaker than strong convexity yet implies many of the same results. In the world of sampling, the log-Sobolev inequality plays an analogous role, governing the convergence of Langevin dynamics from arbitrary initialization in Kullback-Leibler divergence. In this talk, we present a new connection between optimization and sampling by showing that the PL constant is exactly the low temperature limit of the re-scaled log-Sobolev constant, under mild assumptions. Based on joint work with Sinho Chewi.
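For orientation, a schematic form of the two conditions and of the limit result (normalization conventions for both constants vary, so the display below is indicative rather than a verbatim statement of the theorem):
$$\tfrac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; C_{\mathrm{PL}}\,\bigl(f(x)-\min f\bigr)\ \ \text{for all } x, \qquad \varepsilon\, C_{\mathrm{LSI}}(\pi_\varepsilon)\ \longrightarrow\ C_{\mathrm{PL}}\ \text{ as } \varepsilon\to 0, \qquad \pi_\varepsilon \propto e^{-f/\varepsilon}.$$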
Monday, May 12, 2:00pm
Yury Polyanskiy (MIT)
Optimal Quantization for LLMs and Matrix Multiplication
The main building block of large language models is matrix multiplication, which is often bottlenecked by the speed of loading these matrices from memory. A number of recent quantization algorithms (SmoothQuant, GPTQ, QuIP, SpinQuant, etc.) address this issue by storing matrices in lower precision. We derive the optimal asymptotic information-theoretic tradeoff between the accuracy of the matrix product and the compression rate (number of bits per matrix entry). We also show that a non-asymptotic version of our construction (based on nested Gosset lattices and Conway-Sloane decoding), which we call NestQuant, reduces perplexity deterioration almost three-fold compared to state-of-the-art algorithms (as measured on Llama-2 and Llama-3 with 8B to 70B parameters). Based on joint work with Or Ordentlich (HUJI), Eitan Porat and Semyon Savkin (MIT EECS).
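Schematically, and in our notation rather than the paper's, the information-theoretic question is the best achievable distortion of the reconstructed product under a per-entry bit budget $R$:
$$D(R)\;=\;\inf_{\text{encoders/decoders using } R \text{ bits per entry}}\ \mathbb{E}\,\bigl\|A B^{\top}-\widehat{A}\,\widehat{B}^{\top}\bigr\|_F^{2},$$
with the asymptotics taken as the matrix dimensions grow.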
Monday, May 26, 2:00pm
Emilie Kaufmann (CNRS & INRIA Lille)
Multi-objective bandits revisited
In multi-objective bandit models, arms are multivariate distributions. Their expectations are thus vectors in $\mathbb{R}^d$, and there is no obvious notion of "best arm", as some arms may be better for some objectives but worse for others. In this talk, I will present algorithms for the adaptive identification of the Pareto set of the arm means, in a fixed-confidence setting. These algorithms adaptively sample the arms and also adaptively stop the data collection process, so as to guarantee that their guess of the Pareto set is wrong with probability at most δ. I will present a first algorithm based on confidence regions that achieves a near-optimal sample complexity bound featuring appropriate notions of "sub-optimality gaps". Then we will discuss asymptotically optimal algorithms, i.e. algorithms whose sample complexity matches a lower bound in the regime where the error probability is small.
This talk is based on joint works with Cyrille Koné, Laura Richert and Marc Jourdan. https://arxiv.org/abs/2307.00424 https://arxiv.org/abs/2411.04939
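As a reminder (our notation): writing $\mu_1,\dots,\mu_K\in\mathbb{R}^d$ for the arm means, the target of identification is the Pareto set
$$\mathcal{S}^{\star}\;=\;\bigl\{\,k\ :\ \text{there is no } j \text{ with } \mu_j\ge\mu_k \text{ coordinatewise and } \mu_j\neq\mu_k\,\bigr\},$$
and a fixed-confidence algorithm must stop and output a guess $\widehat{\mathcal{S}}$ satisfying $\mathbb{P}\bigl(\widehat{\mathcal{S}}\neq\mathcal{S}^{\star}\bigr)\le\delta$ while using as few samples as possible.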
Past sessions
Past sessions from 2024-2025 (see other pages on this website for previous years):
Monday, September 9, 2:00pm (in person)
Etienne Boursier (INRIA)
Early alignment in two-layer networks training is a two-edged sword
Training neural networks with first order optimisation methods is at the core of the empirical success of deep learning. The scale of initialisation is a crucial factor, as small initialisations are generally associated with a feature learning regime, for which gradient descent is implicitly biased towards simple solutions. This work provides a general and quantitative description of the early alignment phase, originally introduced by Maennel et al. (2018). For small initialisations and one-hidden-layer ReLU networks, the early stage of the training dynamics leads to an alignment of the neurons towards key directions. This alignment induces a sparse representation of the network, which is directly related to the implicit bias of gradient flow at convergence. This sparsity-inducing alignment however comes at the expense of difficulties in minimising the training objective: we also provide a simple data example for which overparameterised networks fail to converge towards global minima and only converge to a spurious stationary point instead.
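For concreteness, the networks in question can be written (our notation) as
$$f_\theta(x)\;=\;\sum_{j=1}^{m} a_j\,\max\bigl(\langle w_j,x\rangle,\,0\bigr),$$
trained by gradient flow from an initialization whose scale tends to zero; the early alignment phase is the regime in which the directions $w_j/\|w_j\|$ collapse onto a few key directions while the norms $\|w_j\|$ remain small.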
Monday, September 16, 2:00pm (in person)
Gil Kur (ETH Zurich)
Minimum norm interpolation meets the local theory of Banach spaces
Minimum-norm interpolators have recently gained attention primarily as an analyzable model to shed light on the double descent phenomenon observed for neural networks. The majority of the work has focused on analyzing interpolators in Hilbert spaces, where typically an effectively low-rank structure of the feature covariance prevents a large bias. More recently, tight vanishing bounds have also been shown for isotropic high-dimensional data for l_p-spaces with p∈[1,2), leveraging the sparse structure of the ground truth. This work takes a first step towards establishing a general framework that connects generalization properties of the interpolators to well-known concepts from high-dimensional geometry, specifically, from the local theory of Banach spaces.
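Schematically, the objects of study are interpolators of the form (our notation)
$$\widehat{\theta}\;\in\;\operatorname*{arg\,min}\bigl\{\,\|\theta\|\ :\ \langle x_i,\theta\rangle=y_i,\ i=1,\dots,n\,\bigr\},$$
where $\|\cdot\|$ is the norm of a Hilbert space, an $\ell_p$ space with $p\in[1,2)$, or more generally a Banach space whose local geometry enters the generalization bounds.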
Monday, October 7, 2:00pm (in person)
Damien Garreau (Julius-Maximilians-Universität Würzburg)
Are Ensembles Getting Better all the Time?
Ensemble methods combine the predictions of several base models. We study whether or not including more models always improves their average performance. This question depends on the kind of ensemble considered, as well as the predictive metric chosen. We focus on situations where all members of the ensemble are a priori expected to perform equally well, which is the case for several popular methods such as random forests or deep ensembles. In this setting, we show that ensembles are getting better all the time if, and only if, the considered loss function is convex. When the loss function is nonconvex, we show a series of results that can be summarised as: ensembles of good models keep getting better, and ensembles of bad models keep getting worse.
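One direction of the intuition is Jensen's inequality: when the ensemble aggregates by averaging predictions and the loss $\ell(\cdot,y)$ is convex,
$$\ell\Bigl(\tfrac{1}{n}\sum_{i=1}^{n} f_i(x),\,y\Bigr)\;\le\;\tfrac{1}{n}\sum_{i=1}^{n}\ell\bigl(f_i(x),\,y\bigr),$$
so the aggregate is never worse than the average member; the monotonicity-in-$n$ statements of the talk are finer than this one-line bound, but convexity is the same dividing line.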
Monday, November 4, 2:00pm (in person)
Marc Hoffmann (Université Paris-Dauphine)
On the estimation of a multidimensional diffusion
While the estimation of the coefficients of a scalar diffusion seemed well understood (minimax, adaptive, nonparametric Bayesian) by the early 2000s, and while the statistics of semimartingales had resolutely turned towards statistical finance, recent years have seen the problem of estimating the drift vector field and the diffusion matrix of a multivariate diffusion process reappear, in particular under the influence of questions from machine learning and from Bayesian inverse problems.
In this talk, based on joint work with Chiara Amorino, Claudia Strauch and also Kolyan Ray, we show that if one settles for a classical nonparametric theoretical programme (L^2 loss, Lepski-style adaptive minimax estimation, but not much more), it is possible to obtain relatively general results that improve on what is known, in arbitrary dimension, in several directions: (i) observations over a long time horizon with an arbitrarily slow discretization step; (ii) reflection of the process at the boundary of a domain, but not necessarily; (iii) situations where the diffusion may degenerate, which allows position-velocity models to be included; (iv) in some cases (conductivity, fast schemes), Bayesian contraction rates.
The approach is always roughly the same: for the upper bounds, construct a model equivalence via a martingale regression scheme and decouple the concentration properties of the martingale noise from the "filling speed" of the space by the "design" (often poorly known, or at least difficult to estimate); for the lower bounds, perturbative methods using a bit of Malliavin calculus; and for the finer Bayesian results, small-time expansions of the heat kernel under a "good" geometry.
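In symbols, the problem is the nonparametric estimation of the drift $b$ and the diffusion matrix $\sigma\sigma^{\top}$ in (our notation)
$$dX_t\;=\;b(X_t)\,dt+\sigma(X_t)\,dW_t,\qquad X_t\in\mathbb{R}^d,$$
from discrete observations $X_0,X_\Delta,\dots,X_{n\Delta}$ over a long time horizon $n\Delta\to\infty$.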
Monday, November 18, 2:00pm (in person)
Umut Şimşekli (INRIA)
Implicit Compressibility of Overparametrized Neural Networks Trained with Heavy-Tailed SGD
Neural network compression has been an increasingly important subject, not only due to its practical relevance, but also due to its theoretical implications, as there is an explicit connection between compressibility and generalization error. In this talk, I will present a simple modification of SGD, such that the outputs of the algorithm will be provably compressible without making any nontrivial assumptions. We will consider a one-hidden-layer neural network trained with SGD, and show that if we inject additive heavy-tailed noise into the iterates at each iteration, then for any compression rate there exists a level of overparametrization such that the output of the algorithm will be compressible with high probability. To achieve this result, we make two main technical contributions: (i) we prove a “propagation of chaos” result for a class of heavy-tailed stochastic differential equations, and (ii) we derive error estimates for their Euler discretization. Our experiments suggest that the proposed approach not only achieves increased compressibility with various models and datasets, but also leads to robust test performance under pruning, even in more realistic architectures that lie beyond our theoretical setting. The talk is based on the following article: https://arxiv.org/pdf/2306.08125.pdf
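Schematically, the modified recursion is of the form (our notation; the exact scaling of the noise term may differ from the paper's)
$$\theta_{k+1}\;=\;\theta_k\;-\;\eta\,\widehat{\nabla}\mathcal{L}(\theta_k)\;+\;\eta^{1/\alpha}\,\sigma\,\xi_k,\qquad \xi_k\ \text{i.i.d. symmetric } \alpha\text{-stable},\ \alpha<2,$$
whose continuous-time limit is a heavy-tailed stochastic differential equation of the class treated by the propagation-of-chaos and discretization results.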
Thursday, November 28, 11:00am (amphi 200)
Jose Correa (Universidad de Chile)
Monotone Randomized Apportionment
Apportionment is the act of distributing the seats of a legislature among political parties (or states) in proportion to their vote shares (or populations). A famous impossibility result of Balinski and Young (2001) shows that no apportionment method can be proportional up to one seat (quota) while also responding monotonically to changes in the votes (population monotonicity). Grimmett (2004) proposed to overcome this impossibility by randomizing the apportionment, which can achieve quota as well as perfect proportionality and monotonicity — at least in terms of the expected number of seats awarded to each party. Still, the correlations between the seats awarded to different parties may exhibit bizarre non-monotonicities. When parties or voters care about joint events, such as whether a coalition of parties reaches a majority, these non-monotonicities can cause paradoxes, including incentives for strategic voting.
In this paper, we propose monotonicity axioms ruling out these paradoxes, and study which of them can be satisfied jointly with Grimmett's axioms. Essentially, we require that, if a set of parties all receive more votes, the probability of those parties jointly receiving more seats should increase. Our work draws on a rich literature on unequal probability sampling in statistics (studied as dependent randomized rounding in computer science). Our main result shows that a sampling scheme due to Sampford (1967) satisfies Grimmett's axioms and a notion of higher-order correlation monotonicity. Authors: J. Correa, P. Gölz, U. Schmidt-Kraepelin, J. Tucker-Foltz, V. Verdugo
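In symbols (our notation): with a house of $H$ seats and vote totals $v_1,\dots,v_k$, Grimmett's randomized apportionment gives party $i$ a random number of seats $S_i$ satisfying quota and ex-ante proportionality,
$$S_i\in\bigl\{\lfloor q_i\rfloor,\ \lceil q_i\rceil\bigr\}\quad\text{and}\quad \mathbb{E}[S_i]=q_i,\qquad q_i=H\,\frac{v_i}{\sum_j v_j},$$
and the question here is which monotonicity requirements on the joint law of $(S_1,\dots,S_k)$ can be imposed on top of these marginal properties.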
Monday, December 9, 2:00pm (in person)
Pragya Sur (Harvard)
Spectrum-Aware Debiasing: A Modern Inference Framework with Applications to Principal Components Regression
Debiasing methodologies have emerged as powerful tools for making statistical inferences in high-dimensional problems. Since its original introduction, the methodology has undergone a major development with the introduction of debiasing techniques that adjust for degrees of freedom (Bellec and Zhang, 2019). While overcoming limitations of initial debiasing approaches, this updated method relies on Gaussian/sub-Gaussian tailed designs and independent, identically distributed samples – a key limitation. In this talk, I will propose a novel debiasing formula that breaks this barrier by exploiting the spectrum of the sample covariance matrix. Our formula applies to a much broader class of designs, including some heavy-tailed distributions, as well as certain dependent data settings. Our correction term differs significantly from prior work but recovers the Gaussian-based formula as a special case. Notably, our approach does not require estimating the high-dimensional population covariance matrix, yet can account for certain classes of dependence among both features and samples. We demonstrate the utility of our method for several statistical inference problems. As a by-product, our work also introduces the first debiased principal component regression estimator with formal guarantees in high dimensions.
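For context, and not as the formula of the talk: classical debiased estimators in the linear model $y=X\theta+\varepsilon$ take the form
$$\widehat{\theta}^{\,\mathrm{deb}}\;=\;\widehat{\theta}\;+\;\frac{1}{n}\,M\,X^{\top}\bigl(y-X\widehat{\theta}\bigr),$$
where $\widehat{\theta}$ is a regularized estimator and $M$ a correction matrix (with degrees-of-freedom adjusted variants rescaling the residual term); the contribution here is a different correction built from the spectrum of the sample covariance $\frac{1}{n}X^{\top}X$.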
Monday, January 20, 2:00pm (in person)
Evgenii Chzhen (CNRS & Université Paris-Saclay)
Small total-cost constraints in contextual bandits with knapsacks (with some applications to fairness)
I will talk about some recent developments in the literature on contextual bandit problems with knapsacks (CBwK), a problem where at each round a scalar reward is obtained and vector-valued costs are suffered. The goal is to maximize the cumulative rewards while ensuring that the cumulative costs remain below some predetermined cost constraints. In this setting, total-cost constraints so far had to be at least of order T^{3/4}, where T is the number of rounds, and were even typically assumed to depend linearly on T. Elaborating on the main technical challenges and drawbacks of previous approaches, I will present a dual strategy based on projected-gradient-descent updates that is able to deal with total-cost constraints of the order of T^{1/2}, up to poly-logarithmic terms. This strategy is direct, and it relies on a careful, adaptive tuning of the step size. The approach is inspired by parameter-free-type algorithms from the online convex optimization literature.
The talk is based on joint works with C. Giraud, Z. Li, and G. Stoltz.
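A schematic version of the dual update in question (our notation, omitting the contextual estimation step): nonnegative dual variables $\lambda_t$ attached to the cost constraints are updated by projected gradient steps,
$$\lambda_{t+1}\;=\;\Pi_{\Lambda}\Bigl(\lambda_t+\gamma_t\bigl(c_t-\tfrac{B}{T}\bigr)\Bigr),$$
where $c_t$ is the cost vector suffered at round $t$ and $B$ the total-cost budget, the action at round $t$ being chosen to (approximately) maximize the estimated reward penalized by $\langle\lambda_t,\text{estimated cost}\rangle$; the careful adaptive choice of $\gamma_t$ is what makes budgets of order $T^{1/2}$ reachable.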
Monday, January 27, 2:00pm (in person)
Judith Rousseau (Université Paris-Dauphine)
Convergence of Diffusion Models Under the Manifold Hypothesis in High-Dimensions
Denoising Diffusion Probabilistic Models (DDPMs) are powerful state-of-the-art methods used to generate synthetic data from high-dimensional data distributions, and are widely used for image, audio and video generation, as well as many more applications in science and beyond. The manifold hypothesis states that high-dimensional data often lie on lower-dimensional manifolds within an ambient space of large dimension D, and is widely believed to hold in many examples of interest. While recent results have provided invaluable insight into how diffusion models adapt to the manifold hypothesis, they do not capture the great empirical success of these models.
In this work, we study DDPMs under the manifold hypothesis and prove that they achieve rates independent of the ambient dimension in terms of learning the score. In terms of sampling, we obtain rates independent of the ambient dimension w.r.t. the Kullback-Leibler divergence, and $O(\sqrt{D})$ w.r.t. the Wasserstein distance. We do this by developing a new framework connecting diffusion models to the well-studied theory of extrema of Gaussian Processes.
This is joint work with I. Azangulov and G. Deligiannidis (University of Oxford).
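As a reminder of the setup (our notation), DDPMs noise the data through an Ornstein-Uhlenbeck-type forward process and learn its score,
$$X_t\;=\;e^{-t}X_0+\sqrt{1-e^{-2t}}\,Z,\quad Z\sim\mathcal{N}(0,I_D),\qquad s_\theta(t,\cdot)\;\approx\;\nabla\log p_t,$$
and the rates above quantify how the errors of the learned score and of the generated distribution depend on the intrinsic dimension of the manifold rather than on the ambient dimension $D$, apart from the $O(\sqrt{D})$ Wasserstein term.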
Monday, February 3, 2:00pm (in person)
Joon Kwon (INRAE & Université Paris-Saclay)
A regret minimization approach to fixed point iterations
We present a link between regret bounds and fixed point problems with nonexpansive maps. This allows the definition of many new fixed point iterations based on regret minimizing algorithms with corresponding convergence guarantees. In particular, we transpose the celebrated AdaGrad algorithm to obtain a fixed point iteration with strong adaptive properties.
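For context (a classical scheme, not necessarily the construction of the talk): given a nonexpansive map $T$, i.e. $\|T(x)-T(y)\|\le\|x-y\|$, the prototypical fixed point iteration is the Krasnoselskii-Mann scheme
$$x_{t+1}\;=\;(1-\alpha_t)\,x_t+\alpha_t\,T(x_t),$$
and the link presented here lets regret-minimizing algorithms drive such iterations, so that a regret bound translates into a convergence guarantee for the induced fixed point iteration.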
Monday, March 3; Wednesday, March 5; Monday, March 10; Wednesday, March 12
10:15am-1:15pm, room 3001
PhD course by Alexander Rakhlin (MIT)
On the foundations of interactive decision-making
In this course, we will study general principles for interactive decision making.
Monday, March 17 and Thursday, March 20; 9am-1pm
PhD course by Maxim Raginsky (University of Illinois Urbana-Champaign)
Information theory and statistical physics perspectives on empirical processes
Monday, March 24, 2:00pm
Solenne Gaucher (École polytechnique)
Classification and regression under fairness constraints
Artificial intelligence (AI) is increasingly shaping the decisions that affect our lives—from hiring and education to healthcare and access to social services. While AI promises efficiency and objectivity, it also carries the risk of perpetuating and even amplifying societal biases embedded in the data used to train these systems. Algorithmic fairness aims to design and analyze algorithms capable of providing predictions that are both reliable and equitable.
In this talk, I will introduce one of the main approaches to achieving this goal: statistical fairness. After outlining the basic principles of this approach, I will focus specifically on a fairness criterion known as "demographic parity," which seeks to ensure that the distribution of predictions is identical across different populations. I will then discuss recent results related to regression and classification problems under this fairness constraint, exploring scenarios where differentiated treatment of populations is either permitted or prohibited.
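In symbols (our notation), writing $S$ for the sensitive attribute, demographic parity for a predictor $f$ requires
$$\mathrm{Law}\bigl(f(X)\mid S=s\bigr)\;=\;\mathrm{Law}\bigl(f(X)\mid S=s'\bigr)\quad\text{for all groups } s,s',$$
which in binary classification reduces to $\mathbb{P}(f(X)=1\mid S=s)$ not depending on $s$; the "differentiated treatment" question is, roughly, whether $f$ is allowed to use $S$ as an input.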
Monday, March 31, 2:00pm
Loucas Pillaud-Vivien (École des Ponts)
Learning Gaussian multi-index models via gradient flow
We study gradient flow on the multi-index regression problem for high-dimensional Gaussian data. Multi-index functions consist of a composition of an unknown low-rank linear projection and an arbitrary unknown, low-dimensional link function. As such, they constitute a natural template for feature learning in neural networks. We consider a two-timescale algorithm, whereby the low-dimensional link function is learnt with a non-parametric model infinitely faster than the subspace parametrizing the low-rank projection. By appropriately exploiting the matrix semigroup structure arising over the subspace correlation matrices, we establish global convergence of the resulting Grassmannian population gradient flow dynamics, and provide a quantitative description of its associated 'saddle-to-saddle' dynamics.
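Concretely (our notation), a Gaussian multi-index model is
$$y\;=\;g\bigl(W^{\top}x\bigr)+\text{noise},\qquad x\sim\mathcal{N}(0,I_d),\quad W\in\mathbb{R}^{d\times r},\ r\ll d,$$
with both the projection $W$ and the link function $g$ unknown; the two-timescale scheme fits $g$ nonparametrically infinitely faster than it moves the subspace spanned by $W$, whose evolution is then analyzed as a population gradient flow on the Grassmannian.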