Below are past sessions from 2021-2022. Sessions from 2022-23 are listed on a separate page.
Information on the 2020-2021 sessions can be found on the previous seminar page.
Monday, September 6, 2:00pm (online and room 1001)
Speaker: Stanislav Minsker (University of Southern California)
Title: Towards robust and efficient mean estimation
Abstract: Several constructions of estimators of the mean of a random variable that admit sub-Gaussian deviation guarantees and are robust to adversarial contamination under minimal assumptions have been suggested in the literature. The goal of this talk is to discuss the size of constants appearing in the bounds, both asymptotic and non-asymptotic, satisfied by the median-of-means estimator and its analogues. We will describe a permutation-invariant version of the median-of-means estimator and show that it is asymptotically efficient, unlike its "standard" version. Finally, applications and extensions of these results to robust empirical risk minimization will be discussed.
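As a point of reference (not part of the abstract), here is a minimal sketch of the standard median-of-means construction; the number of blocks k is a tuning parameter, and the permutation-invariant variant discussed in the talk is not implemented here.

```python
import numpy as np

def median_of_means(x, k):
    """Standard median-of-means estimate of the mean of a 1-d sample.

    The sample is split into k disjoint blocks; the estimate is the median
    of the k block means. Larger k gives more robustness to heavy tails and
    outliers at the price of a larger constant.
    """
    x = np.asarray(x, dtype=float)
    n = (len(x) // k) * k          # drop the remainder so all blocks have equal size
    blocks = x[:n].reshape(k, -1)  # k blocks of size n // k
    return np.median(blocks.mean(axis=1))

rng = np.random.default_rng(0)
sample = rng.standard_t(df=2, size=10_000)   # heavy-tailed sample with true mean 0
print(median_of_means(sample, k=30), sample.mean())
```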
Monday, September 13, 2:00pm (online)
Speaker: Cheng Mao (Georgia Tech)
Title: Optimal Sparse Recovery of a Planted Vector in a Subspace
Abstract: We consider the task of recovering a pN-sparse vector planted in an n-dimensional random subspace of R^N, given an arbitrary basis for the subspace. We give an improved analysis of (a slight variant of) a spectral method proposed by Hopkins, Schramm, Shi, and Steurer (STOC 2016), showing that it succeeds when np << sqrt(N). This condition improves upon the best known guarantees of any polynomial-time algorithm. Our analysis also applies to the dense case p=1, provided that the planted vector has entries sufficiently different from Gaussian. Furthermore, we give a matching lower bound, showing that when np >> sqrt(N), a general class of spectral methods fail to detect the planted vector. This yields a tight characterization of the power of this class of spectral methods and may suggest that no polynomial-time algorithm can succeed when np >> sqrt(N).
Monday, September 27, 3:00pm (online)
Speaker: Michael Nussbaum (Cornell University)
Title: Asymptotic Inference in a Class of Quantum Time Series Models
Abstract: We consider a statistical model of an n-mode quantum Gaussian state which is shift invariant and also gauge invariant. Such models can be considered analogs of classical Gaussian stationary time series, parametrized by their spectral density. Defining an appropriate quantum spectral density as the parameter, we establish that the quantum Gaussian time series model is asymptotically equivalent to a classical nonlinear regression model given as a collection of independent geometric random variables. The asymptotic equivalence is established in the sense of the quantum Le Cam distance between statistical models (experiments). The geometric regression model has a further classical approximation as a certain Gaussian white noise model with the quantum spectral density as signal. In this sense, the result is a quantum analog of the classical asymptotic equivalence of spectral density estimation and Gaussian white noise, which is known for Gaussian stationary time series. Our result confirms a conjecture which arises from the form of the quantum Chernoff bound for binary hypothesis testing in the gauge invariant and shift invariant bosonic Gaussian state model.
Monday, October 4, 2:00pm (online)
Speaker: Richard Gill (Leiden University)
Title: Bell experiments, Bell-denialism, and the quantum Randi challenge
Abstract: John S. Bell’s 1964 theorem, as a piece of pure mathematics, states that the predictions of quantum mechanics cannot be reproduced by theories of a classical physical nature, respecting in particular notions of locality and realism. One can also apply the reasoning behind the theorem to the results of experiments, such as the famous experiment of Alain Aspect in 1981, and go a step further: modulo statistical uncertainty due to the finite amount of data, real laboratory results cannot be reproduced by any mathematical framework compatible with what is now called local realism. It’s not just about quantum mechanics. Aspect’s experiment, however, had a number of (at the time unavoidable) technical weaknesses. His data could be “explained” by classical, local and realistic processes. At last, four experiments in 2015 (one in Delft) were performed under such stringent laboratory conditions that only “metaphysical” loopholes (such as superdeterminism or retrocausality) remain viable. These “loophole-free” experiments were the culmination of many years of progress in circumventing, one by one, the technical weaknesses of Aspect’s experiment.
I will give a short overview of this progress, focusing on statistical and probabilistic issues, and also speculating on what will happen next. There is still spirited opposition to Bell’s findings, and lively debate in the foundations of physics. Many believe that the problematic reconciliation of relativity and quantum theories is deeply connected to the issues raised by Bell. I will draw a connection with networked computer simulation models. One can forget about physics and see Bell’s theorem as an impossibility theorem in computer science concerning certain tasks to be performed in distributed fashion by networked classical computers. The task should, however, be achievable by quantum computers connected by quantum internet links!
Monday, October 11, 3:00pm (online)
Speaker: Jonathan Scarlett (National University of Singapore)
Title: Recent Developments in High-Dimensional Estimation with Generative Priors
Abstract: The problem of estimating an unknown vector (or image) from linear or non-linear measurements has a long history in statistics, machine learning, and signal processing. Classical studies focus on the "n >> p" regime (#measurements >> #parameters), and more recent studies handle the "n << p" regime by exploiting low-dimensional structure such as sparsity or low-rankness. Such variants are commonly known as compressive sensing. In this talk, I will overview recent methods that move beyond these explicit notions of structure, and instead assume that the underlying vector is well-modeled by a data-driven generative model (e.g., produced by deep learning methods such as GANs). I will focus primarily on theoretical developments, including upper and lower bounds on the sample complexity in terms of various properties of the generative model, such as its number of latent (input) parameters, its Lipschitz constant, and its width and depth in the special case of neural network models. I will also discuss some developments regarding non-linear models, geometric properties of the relevant optimization landscapes, and methods for fully general probabilistic priors.
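As a toy illustration of the measurement model (not of any specific algorithm from the talk), the sketch below recovers a signal lying in the range of a simple linear "generator" from n << p linear measurements; with a nonlinear generator such as a GAN one would instead run gradient descent over the latent code z. All dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, k = 2000, 50, 5                  # ambient dim, #measurements (n << p), latent dim

# Toy *linear* "generator" G(z) = W z; real generative priors are nonlinear
# (e.g. GANs), in which case one optimizes over z by gradient descent instead.
W = rng.standard_normal((p, k))
z_star = rng.standard_normal(k)
x_star = W @ z_star                    # unknown signal, lies on the range of G

A = rng.standard_normal((n, p)) / np.sqrt(n)
y = A @ x_star                         # far fewer measurements than ambient parameters

# Recovery: search over the latent space, i.e. solve min_z ||A G(z) - y||^2
z_hat, *_ = np.linalg.lstsq(A @ W, y, rcond=None)
x_hat = W @ z_hat

print("relative error:", np.linalg.norm(x_hat - x_star) / np.linalg.norm(x_star))
```

The point of the toy example is that n only needs to scale with the latent dimension k, not with the ambient dimension p, which mirrors the sample-complexity bounds discussed in the talk.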
Monday, November 8, 2:00pm
Speaker: Lorenzo Rosasco (Università di Genova & MIT)
Title: Interpolation and learning with scale dependent kernels
Abstract: We study the learning properties of nonparametric ridgeless least squares. In particular, we consider the common case of estimators defined by scale-dependent (Matérn) kernels, and focus on the role of scale and smoothness. These estimators interpolate the data, and the scale can be shown to control their stability to noise and sampling. Larger scales, corresponding to smoother functions, improve stability with respect to sampling. However, smaller scales, corresponding to more complex functions, improve stability to noise. We will discuss to what extent these results can explain the learning curves observed for large overparameterized models. Our analysis combines probabilistic results with analytic techniques from interpolation theory.
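A minimal sketch of the estimators in question: kernel interpolation (ridgeless least squares) with a Matérn-1/2 kernel at two different scales. The tiny jitter is only for numerical stability, and the specific kernel, data and scales are illustrative.

```python
import numpy as np

def matern12_kernel(x, y, scale):
    # Matern-1/2 (Laplace) kernel k(x, y) = exp(-|x - y| / scale).
    return np.exp(-np.abs(x[:, None] - y[None, :]) / scale)

rng = np.random.default_rng(0)
n = 30
x_train = np.sort(rng.uniform(0, 1, n))
y_train = np.sin(2 * np.pi * x_train) + 0.3 * rng.standard_normal(n)   # noisy labels
x_test = np.linspace(0, 1, 500)

for scale in (0.05, 1.0):
    K = matern12_kernel(x_train, x_train, scale)
    alpha = np.linalg.solve(K + 1e-10 * np.eye(n), y_train)   # (essentially) ridgeless fit
    f_test = matern12_kernel(x_test, x_train, scale) @ alpha   # interpolates the training data
    mse = np.mean((f_test - np.sin(2 * np.pi * x_test)) ** 2)
    print(f"scale = {scale}: test MSE = {mse:.3f}")
```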
Monday, November 15, 3:00pm
Speaker: Daniel Hsu (Columbia University)
Title: Computational Lower Bounds for Tensor PCA
Abstract: Tensor PCA is a model statistical inference problem introduced by Montanari and Richard in 2014 for studying method-of-moments approaches to parameter estimation in latent variable models. Unlike the matrix counterpart of the problem, Tensor PCA exhibits a computational-statistical gap in the sample-size regime where the problem is information-theoretically solvable but no computationally-efficient algorithm is known. I will describe unconditional computational lower bounds on classes of algorithms for solving Tensor PCA that shed light on limitations of commonly-used solution approaches, including gradient descent and power iteration, as well as the role of overparameterization. This talk is based on joint work with Rishabh Dudeja.
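For concreteness (not taken from the talk), here is a toy spiked-tensor instance and plain tensor power iteration; the signal strength beta is deliberately chosen far above the threshold so that the naive method succeeds, whereas the lower bounds discussed in the talk concern how large the signal must be for such methods to work at all.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 50, 1000.0                        # dimension and (deliberately large) signal strength

v = rng.standard_normal(n)
v /= np.linalg.norm(v)                      # planted unit vector
noise = rng.standard_normal((n, n, n))
T = beta * np.einsum('i,j,k->ijk', v, v, v) + noise   # spiked order-3 tensor

# Tensor power iteration: u <- T(., u, u) / ||T(., u, u)||
u = rng.standard_normal(n)
u /= np.linalg.norm(u)
for _ in range(50):
    u = np.einsum('ijk,j,k->i', T, u, u)
    u /= np.linalg.norm(u)

print("correlation |<u, v>| =", abs(u @ v))
```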
Monday, November 29, 2:00pm (online)
Speaker: Julie Josse (INRIA Montpellier)
Title: Causal effect on a target population: a sensitivity analysis to handle missing covariates
Abstract: Randomized controlled trials (RCTs) are considered the gold standard approach for assessing the causal effect of an intervention on an outcome. However, they suffer from a lack of external validity when the population eligible for the RCT is significantly different from the target population of the intervention policy. On the other hand, observational data are often representative of the target population but suffer from confounding bias due to the lack of controlled experimental intervention. In this work, we combine the information gathered from experimental and observational data to generalize the results of an RCT to a target population. In order to identify the average treatment effect, one requires an ignorability assumption that implies that we observe all variables that are treatment effect modifiers and are shifted between the two sets. Standard estimators then use either weighting (IPSW), outcome modeling (G-formula), or combine the two in doubly robust approaches (AIPSW). However, such covariates are often not available in both data sets. Therefore, after completing existing proofs on the complete-case consistency of those three estimators, we compute the expected bias induced by a missing covariate, assuming a semi-parametric linear model for the outcome. This enables sensitivity analysis for each missing covariate pattern (missing in the RCT, in the observational data or in both), giving the sign of the expected bias. We also show that there is no gain in imputing a partially unobserved covariate. We illustrate all these results on simulations, as well as on an example from critical care medicine. This is joint work with Benedicte Colnet, Erwan Scornet and Gael Varoquaux.
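A schematic sketch of the IPSW idea on simulated data (a simplified, self-normalized version, not the exact estimator analyzed in the paper): the sampling score is fitted by logistic regression on the pooled covariates, and trial units are reweighted towards the target population.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_rct, n_obs = 2000, 5000

# Covariate X is a treatment-effect modifier and is shifted between the two samples.
x_rct = rng.normal(0.0, 1.0, n_rct)        # trial-eligible population
x_obs = rng.normal(1.0, 1.0, n_obs)        # target population (observational sample)

a = rng.binomial(1, 0.5, n_rct)            # randomized treatment in the RCT
y = 0.5 * x_rct + a * (1.0 + 2.0 * x_rct) + rng.normal(0, 1, n_rct)
# ATE in the trial population: 1 + 2*0 = 1;  ATE in the target population: 1 + 2*1 = 3.

# Sampling score pi(x) = P(in RCT | X = x), fitted on the pooled covariates.
X_pool = np.concatenate([x_rct, x_obs]).reshape(-1, 1)
s_pool = np.concatenate([np.ones(n_rct), np.zeros(n_obs)])
pi = LogisticRegression().fit(X_pool, s_pool).predict_proba(x_rct.reshape(-1, 1))[:, 1]

# IPSW: reweight trial units by the odds of belonging to the target population
# (self-normalized version; details of the estimator studied in the paper may differ).
w = (1 - pi) / pi
ht = a * y / 0.5 - (1 - a) * y / 0.5       # Horvitz-Thompson transform with e = 0.5
tau_ipsw = np.sum(w * ht) / np.sum(w)

print(f"naive RCT ATE: {y[a == 1].mean() - y[a == 0].mean():.2f}, "
      f"IPSW estimate of target ATE: {tau_ipsw:.2f}")
```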
January 2022: Short course by Helmut Bölcskei (ETH Zürich), online.
4 sessions: Thursday 6 (14:00-16:30); Tuesday 11 (15:00-17:30); Thursday 13 (14:00-16:30); Tuesday 18 (15:00-17:30).
Title: Fundamental Limits of Deep Neural Network Learning
Abstract: This short course develops the fundamental limits of deep neural network learning from first principles by characterizing what is possible if no constraints on the learning algorithm and on the amount of training data are imposed. Concretely, we consider Kolmogorov-optimal approximation through deep neural networks with the guiding theme being a relation between the complexity of the function (class) to be approximated and the complexity of the approximating network in terms of connectivity and memory requirements for storing the network topology and the associated quantized weights. The theory we develop educes remarkable universality properties of deep networks. Specifically, deep networks are optimal approximants for markedly different function classes such as affine (i.e., wavelet-like) systems and Weyl-Heisenberg systems. This universality is afforded by a concurrent invariance property of deep networks to time-shifts, scalings, and frequency-shifts. In addition, deep networks provide exponential approximation accuracy (i.e., the approximation error decays exponentially in the number of non-zero weights in the network) of the multiplication operation, polynomials, sinusoidal functions, certain smooth functions, and even one-dimensional oscillatory textures and fractal functions such as the Weierstrass function, the latter two of which had no previously known methods achieving exponential approximation accuracy. We also show that in the approximation of sufficiently smooth functions, finite-width deep networks require strictly smaller connectivity than finite-depth wide networks.
The mathematical concepts forming the basis of this theory, namely metric entropy, linear and nonlinear approximation theory, best M-term approximation, and the theory of frames, will all be developed in the course.
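As a concrete taste of exponential approximation accuracy (not taken from the course materials), the classical sawtooth construction approximates x^2 on [0,1] with sup-norm error 4^(-(m+1)) using m composed hat functions, each of which is exactly representable by a small ReLU network.

```python
import numpy as np

def hat(x):
    # Tent map g(x) = 2x on [0, 1/2] and 2(1 - x) on [1/2, 1]; expressible with ReLUs as
    # g(x) = 2*relu(x) - 4*relu(x - 1/2) + 2*relu(x - 1) for x in [0, 1].
    return 2 * np.minimum(x, 1 - x)

def square_approx(x, m):
    """Sawtooth approximation of x**2 on [0, 1] using m composed hat functions.

    f_m(x) = x - sum_{k=1}^{m} g_k(x) / 4**k, where g_k is the k-fold composition
    of the hat function; the sup-norm error is at most 4**(-(m + 1)).
    """
    g = x.copy()
    out = x.copy()
    for k in range(1, m + 1):
        g = hat(g)
        out -= g / 4 ** k
    return out

x = np.linspace(0, 1, 10001)
for m in (1, 3, 5, 7):
    print(m, np.max(np.abs(square_approx(x, m) - x ** 2)))
```

The error thus decays exponentially in the number of layers (and hence in the number of non-zero weights), which is the flavor of "exponential approximation accuracy" referred to above.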
Monday, January 24, 2022, 6:00pm (online)
Speaker: Roman Vershynin (University of California, Irvine)
Title: Mathematics of synthetic data and its privacy
Abstract: An emerging way to protect privacy is to replace true data by synthetic data. Medical records of artificial patients, for example, could retain meaningful statistical information while preserving privacy of the true patients. But what is synthetic data, and what is privacy? How do we define these concepts mathematically? Is it possible to make synthetic data that is both useful and private? I will tie these questions to a simple-looking problem in probability theory: how much information about a random vector X is lost when we take conditional expectation of X with respect to some sigma-algebra?
This talk is based on a series of papers joint with March Boedihardjo and Thomas Strohmer, mainly this one: https://arxiv.org/abs/2107.05824
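A tiny numerical illustration of that closing question (not of the paper's construction): conditioning a sample on a finite partition, i.e. replacing each point by its cell mean, preserves the mean exactly while the variance shrinks as the partition (the sigma-algebra) gets coarser. The equal-sized cells below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)          # sample of a random variable X

for n_cells in (1000, 100, 10, 2):
    # Sigma-algebra generated by a partition of the sample into n_cells blocks;
    # the "synthetic" data replaces each point by its block (conditional) mean.
    cells = np.array_split(np.sort(x), n_cells)
    synth = np.concatenate([np.full(len(c), c.mean()) for c in cells])
    print(f"{n_cells:5d} cells: mean {synth.mean():+.4f}, variance {synth.var():.4f}")
```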
Monday, January 31, 2022, 2:00pm (online)
Speaker: Yihong Wu (Yale University)
Title: Self-regularizing property of nonparametric maximum likelihood estimator in mixture models
Abstract: Introduced by Kiefer and Wolfowitz 1956, the nonparametric maximum likelihood estimator (NPMLE) is a widely used methodology for learning mixture models and empirical Bayes estimation. Sidestepping the non-convexity in mixture likelihood, the NPMLE estimates the mixing distribution by maximizing the total likelihood over the space of probability measures, which can be viewed as an extreme form of overparameterization.
In this work we discover a surprising property of the NPMLE solution. Consider, for example, a Gaussian mixture model on the real line with a subgaussian mixing distribution. Leveraging complex-analytic techniques, we show that with high probability the NPMLE based on a sample of size n has O(\log n) atoms (mass points), significantly improving the deterministic upper bound of n due to Lindsay 1983. Notably, any such Gaussian mixture is statistically indistinguishable from a finite one with O(\log n) components (and this is tight for certain mixtures). Thus, absent any explicit form of model selection, NPMLE automatically chooses the right model complexity, a property we term self-regularization. Statistical applications and extensions to other exponential families will be given. Time permitting, we will discuss recent progress and open problems on the optimal regret in empirical Bayes and the role of NPMLE. This is based on joint work with Yury Polyanskiy (MIT): https://arxiv.org/abs/2008.08244 and https://arxiv.org/abs/2109.03943.
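For intuition (not the exact procedure from the paper), a common practical approximation of the Kiefer-Wolfowitz NPMLE fixes a fine grid of candidate atoms and maximizes the likelihood over the mixing weights only, here by plain EM; the grid size, the iteration count and the reporting threshold are all illustrative choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 2000
theta = rng.choice([-2.0, 0.0, 2.0], size=n)       # true mixing distribution: 3 atoms
x = theta + rng.standard_normal(n)                 # Gaussian location mixture sample

# Grid-based approximation of the NPMLE: candidate atoms on a fine grid, mixing
# weights optimized by EM. The log-likelihood is concave in the weights, and EM
# increases it monotonically (though convergence can be slow).
grid = np.linspace(x.min(), x.max(), 300)
L = norm.pdf(x[:, None] - grid[None, :])           # n x grid matrix of component likelihoods
w = np.full(len(grid), 1.0 / len(grid))
for _ in range(500):
    post = L * w                                    # E-step: responsibilities
    post /= post.sum(axis=1, keepdims=True)
    w = post.mean(axis=0)                           # M-step: update mixing weights

print("atoms with weight > 1e-3:", np.sum(w > 1e-3))
print("heaviest atoms:", np.round(grid[np.argsort(w)[-5:]], 2))
```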
Monday, February 14, 2022, 2:00pm (on-site)
Speaker: Francis Bach (INRIA & ENS)
Title: Statistics, Machine Learning, and Optimization with Kernel Sums-of-Squares
Abstract: In this talk, I will present recent work on representing non-negative functions with infinite-dimensional sums of squares, with applications to non-convex optimization, optimal transport, optimal control, Bayesian inference, and shape-constrained optimization.
Monday, March 14, 2022, 2:00pm (on site)
Speaker: Tim Cannings (University of Edinburgh)
Title: Adaptive transfer learning
Abstract: In transfer learning, we wish to make inference about a target population when we have access to data both from the distribution itself, and from a different but related source distribution. We introduce a flexible framework for transfer learning in the context of binary classification, allowing for covariate dependent relationships between the source and target distributions that are not required to preserve the Bayes decision boundary. Our main contributions are to derive the minimax optimal rates of convergence (up to polylogarithmic factors) in this problem, and show that the optimal rate can be achieved by an algorithm that adapts to key aspects of the unknown transfer relationship, as well as the smoothness and tail parameters of our distributional classes. This optimal rate turns out to have several regimes, depending on the interplay between the relative sample sizes and the strength of the transfer relationship, and our algorithm achieves optimality by careful, decision tree-based calibration of local nearest-neighbour procedures. This is joint work with Henry Reeve (Bristol) and Richard Samworth (Cambridge), and the associated paper is available at https://arxiv.org/abs/2106.04455.
Monday, March 21, 2022, 2:00pm (on site)
Speaker: Azadeh Khaleghi (Lancaster University)
Title: Inferring the mixing properties of an ergodic process
Abstract: In this talk I will introduce some of our recent results on the estimation of mixing coefficients from stationary ergodic sample paths. Specifically, we have proposed strongly consistent estimators of the ℓ1 norm of the sequence of α-mixing (respectively β-mixing) coefficients of a stationary ergodic process. We have also provided strongly consistent estimators of individual α-mixing (respectively β-mixing) coefficients for a subclass of stationary α-mixing (respectively β-mixing) processes with summable sequences of mixing coefficients. The estimators are in turn used in the development of hypothesis tests for mixing rates. I will end with a discussion on the potential application of these results to the approximation of the optimal strategy in restless multi-armed bandits.
Monday, March 28, 2022, 2:00pm (on site)
Speaker: Flore Sentenac (ENSAE-CREST)
Title: Robust estimation of discrete distributions under local differential privacy
Abstract: Although robust learning and local differential privacy are both widely studied fields of research, combining the two settings is an almost unexplored topic. We consider the problem of estimating a discrete distribution in total variation from n contaminated data batches under a local differential privacy constraint. A fraction 1−ϵ of the batches contain k i.i.d. samples drawn from a discrete distribution p over d elements. To protect the users' privacy, each of the samples is privatized using an α-locally differentially private mechanism. The remaining ϵn batches are an adversarial contamination. The minimax rate of estimation under contamination alone, with no privacy, is known, up to a √log(1/ϵ) factor. Under the privacy constraint alone, the minimax rate of estimation is also known. We characterize the minimax estimation rate under the two constraints up to a √log(1/ϵ) factor, which is larger than the sum of the two separate rates. We provide a polynomial-time algorithm achieving this bound, as well as a matching information-theoretic lower bound.
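To make the privacy constraint concrete (contamination and the robust aggregation over batches are omitted), here is a sketch of the standard k-ary randomized-response mechanism, which is α-locally differentially private, together with the usual debiased frequency estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_samples, alpha = 10, 200_000, 1.0          # support size, sample size, privacy level

p_true = np.arange(1, d + 1, dtype=float)
p_true /= p_true.sum()
x = rng.choice(d, size=n_samples, p=p_true)     # raw (non-private) samples

# k-ary randomized response: report the true value with probability
# e^alpha / (e^alpha + d - 1), otherwise a uniformly random other value.
# The likelihood ratio between any two inputs is at most e^alpha, so the
# mechanism is alpha-locally differentially private.
p_keep = np.exp(alpha) / (np.exp(alpha) + d - 1)
keep = rng.random(n_samples) < p_keep
noise = (x + rng.integers(1, d, size=n_samples)) % d   # uniform over the other d-1 values
z = np.where(keep, x, noise)                            # privatized reports

# Debiasing: E[freq_j] = q + (p_keep - q) * p_true[j], with q = (1 - p_keep) / (d - 1).
q = (1 - p_keep) / (d - 1)
freq = np.bincount(z, minlength=d) / n_samples
p_hat = (freq - q) / (p_keep - q)

print("total-variation error:", 0.5 * np.abs(p_hat - p_true).sum())
```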
Monday, April 4, 2022, 2:00pm (on site)
Speaker: Mohamed Ndaoud (ESSEC - CREST)
Title: Variable selection, monotone likelihood ratio and group sparsity
Abstract: In the pivotal variable selection problem, we derive the exact non-asymptotic minimax selector over the class of all s-sparse vectors, which is also the Bayes selector with respect to the uniform prior. While this optimal selector is, in general, not realizable in polynomial time, we show that its tractable counterpart (the scan selector) attains the minimax expected Hamming risk to within factor 2, and is also exact minimax with respect to the probability of wrong recovery. As a consequence, we establish explicit lower bounds under the monotone likelihood ratio property and we obtain a tight characterization of the minimax risk in terms of the best separable selector risk. We apply these general results to derive necessary and sufficient conditions of exact and almost full recovery in the location model with light-tailed distributions and in the problem of group variable selection under Gaussian noise. This is joint work with Cristina Butucea, Enno Mammen and Alexandre Tsybakov.
Monday, May 2, 2022, 2:00pm (on site)
Speaker: Guillaume Lecué (ENSAE, CREST)
Title: A geometrical viewpoint on the benign overfitting property of the minimum l_2-norm interpolant estimator
Abstract: Practitioners have observed that some deep learning models generalize well even with a perfect fit to noisy training data. Since then, many theoretical works have revealed some facets of this phenomenon known as benign overfitting. In particular, in the linear regression model, the minimum l_2-norm interpolant estimator \hat\beta has received a lot of attention since it was proved to be consistent even though it perfectly fits noisy data under some condition on the covariance matrix \Sigma of the input vector. Motivated by this phenomenon, we study the generalization property of this estimator from a geometrical viewpoint. Our main results extend and improve the convergence rates as well as the deviation probability from [Tsigler and Bartlett, 2021]. Our proof differs from the classical bias/variance analysis and is based on the self-induced regularization property: \hat\beta can be written as a sum of a ridge estimator \hat\beta_{1:k} and an overfitting component \hat\beta_{k+1:p}, following a decomposition of the feature space R^p = V_{1:k} + V_{k+1:p} into the space V_{1:k} spanned by the top k eigenvectors of \Sigma and the space V_{k+1:p} spanned by the remaining p-k eigenvectors. We also prove a matching lower bound for the expected prediction risk. The two geometrical properties of random Gaussian matrices at the heart of our analysis are the Dvoretzky-Milman theorem and isomorphic and restricted isomorphic properties. In particular, the Dvoretzky dimension appearing naturally in our geometrical viewpoint coincides with the effective rank from previous work and is the key tool to handle the behavior of the design matrix restricted to the subspace V_{k+1:p} where overfitting happens. Joint work with Zong Shang.
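For reference (not from the paper), the estimator itself is one line: among all vectors that interpolate the data, the minimum l_2-norm interpolant is given by the Moore-Penrose pseudoinverse. Whether this perfect fit to noisy data is benign depends on the spectrum of \Sigma, which is what the talk characterizes; the isotropic design below is only illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 500                                   # overparameterized regime: p >> n

X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:5] = 1.0
y = X @ beta_star + 0.5 * rng.standard_normal(n) # noisy labels

# Minimum l_2-norm interpolant: beta_hat = X^T (X X^T)^{-1} y = pinv(X) y.
# Among all beta with X beta = y it has the smallest l_2 norm.
beta_hat = np.linalg.pinv(X) @ y

print("training error:", np.linalg.norm(X @ beta_hat - y))   # ~0: perfect fit to noisy data
print("norm of the interpolant:", np.linalg.norm(beta_hat))
```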
Monday, May 9, 2022, 2:00pm (on-site)
Speaker: Philippe Rigollet (MIT)
Title: Variational inference via Wasserstein gradient flows
Abstract: Bayesian methodology typically generates a high-dimensional posterior distribution that is known only up to normalizing constants, making the computation even of simple summary statistics such as mean and covariance a major computational hurdle. Along with Markov chain Monte Carlo (MCMC), Variational Inference (VI) has emerged as a central computational approach to large-scale Bayesian inference. Rather than sampling from the true posterior, VI aims at producing a simple but good approximation of the target posterior for which summary statistics are easy to compute. However, unlike MCMC theory, which is well-developed and builds on now-classical probabilistic ideas, VI is still poorly understood and dominated by heuristics. In this work, we propose a principled method for VI that builds upon the theory of gradient flows on the Bures-Wasserstein space of Gaussian measures. Akin to MCMC, it comes with theoretical guarantees when the target measure is strongly log-concave. This is joint work with Francis Bach, Silvère Bonnabel, Sinho Chewi, and Marc Lambert.
Monday, May 23, 2022, 2:00pm (online)
Speaker: Maxim Panov (Skoltech)
Title: Optimal estimation in some graph and topic models
Monday, May 30, 2022, 2:00pm (on-site)
Speaker: Subhro Ghosh (National University of Singapore)
Title: The unreasonable effectiveness of determinantal processes
Abstract: In 1960, Wigner published an article famously titled “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”. In this talk we will, in a small way, follow the spirit of Wigner’s coinage, and explore the unreasonable effectiveness of determinantal processes (a.k.a. DPPs) far beyond their context of origin. DPPs originated in quantum and statistical physics, but have emerged in recent years to be a powerful toolbox for many fundamental learning problems. We aim to explore the breadth and depth of these applications. On one hand, we will explore a class of Gaussian DPPs and the novel stochastic geometry of their parameter modulation, and their applications to the study of directionality in data and dimension reduction. At the other end, we will consider the fundamental paradigm of stochastic gradient descent, where we leverage connections with orthogonal polynomials to design a minibatch sampling technique based on data-sensitive DPPs, with provable guarantees for a faster convergence exponent compared to traditional sampling. Based on the following works. (A small numerical illustration of the determinantal property underlying these applications appears after the references.)
[1] Gaussian determinantal processes: A new model for directionality in data, with P. Rigollet, Proceedings of the National Academy of Sciences, vol. 117, no. 24 (2020), pp. 13207--13213
[2] Determinantal point processes based on orthogonal polynomials for sampling minibatches in SGD, with R. Bardenet and M. Lin. Advances in Neural Information Processing Systems 34
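Not taken from the papers above, but a minimal numerical reminder of the defining property behind these applications: for a DPP with marginal kernel K, inclusion probabilities of pairs are 2x2 determinants of K, so similar items are much less likely to co-occur than under independent sampling.

```python
import numpy as np

# Features for 5 items: items 0 and 1 are nearly identical, item 4 is very different.
phi = np.array([[1.00, 0.00],
                [0.99, 0.14],
                [0.70, 0.70],
                [0.14, 0.99],
                [0.00, 1.00]])
L = phi @ phi.T                              # L-ensemble kernel (similarity matrix)
K = L @ np.linalg.inv(L + np.eye(5))         # marginal kernel: K = L (L + I)^{-1}

# For a DPP with marginal kernel K:
#   P(i in S)      = K[i, i]
#   P({i, j} in S) = det(K restricted to {i, j}) = K[i, i] * K[j, j] - K[i, j]**2
def pair_prob(i, j):
    return K[i, i] * K[j, j] - K[i, j] ** 2

for (i, j) in [(0, 1), (0, 4)]:
    print(f"items {i},{j}: independent product = {K[i, i] * K[j, j]:.3f}, "
          f"DPP co-occurrence = {pair_prob(i, j):.3f}")
```

The near-duplicate pair (0, 1) has a co-occurrence probability far below the independent product, while the dissimilar pair (0, 4) is barely penalized; this built-in diversity is what makes DPPs attractive for minibatch sampling and related learning tasks.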