Talks are listed below in alphabetical order by speaker. The schedule is posted here. See also the pdf version.
Solving min-max problems is a central question in optimization, games, learning, and controls. Arguably the most natural algorithm is Gradient-Descent-Ascent (GDA); however, since the 1970s, conventional wisdom has held that it fails to converge even on simple problems. This failure spurred an extensive literature on modifying GDA with extragradients, optimism, momentum, anchoring, etc. In contrast, we show that GDA converges in its original form simply by using a judicious choice of stepsizes.
The key innovation is the proposal of unconventional stepsize schedules that are time-varying, asymmetric, and (most surprisingly) periodically negative. We show that all three properties are necessary for convergence and that, taken together, they enable GDA to converge on the classical counterexamples (e.g., unconstrained convex-concave problems). The core intuition is that although negative stepsizes make backward progress, they de-synchronize the min/max variables (overcoming the cycling issue of GDA) and lead to a slingshot phenomenon in which the forward progress in the other iterations is overwhelmingly larger, resulting in fast overall convergence. Geometrically, the slingshot dynamics leverage the non-reversibility of gradient flow: positive and negative steps cancel to first order, yielding a second-order net movement in a new direction, one in which GDA could not otherwise move and which leads to convergence. Joint work with Henry Shugart.
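For concreteness, a minimal sketch of the setting (not the schedule from the talk): simultaneous GDA on the classical bilinear counterexample f(x, y) = xy, where any constant stepsize makes the iterates spiral away from the saddle point. The "unconventional" schedule below is only an unspecified placeholder; the schedule analyzed in the talk is not reproduced here.

```python
# Simultaneous GDA on the classical bilinear counterexample f(x, y) = x * y.
# With any constant stepsize the iterates spiral outward; the talk's remedy is a
# time-varying, asymmetric, periodically negative schedule (a_k, b_k), which
# appears here only as a placeholder.
import numpy as np

def gda(stepsizes_x, stepsizes_y, x0=1.0, y0=1.0):
    """Simultaneous GDA on f(x, y) = x * y with per-iteration stepsizes."""
    x, y = x0, y0
    for a, b in zip(stepsizes_x, stepsizes_y):
        gx, gy = y, x                      # grad_x f = y, grad_y f = x
        x, y = x - a * gx, y + b * gy      # descent step in x, ascent step in y
    return np.hypot(x, y)                  # distance to the saddle point (0, 0)

# Constant symmetric stepsizes: the distance to the saddle grows every iteration.
print(gda([0.1] * 50, [0.1] * 50))         # much larger than the initial norm

# Placeholder schedule that is time-varying, asymmetric, and periodically
# negative (illustrative only; not the schedule from the talk).
a_k = [-0.2 if k % 3 == 0 else 0.3 for k in range(50)]
b_k = [-0.1 if k % 3 == 1 else 0.5 for k in range(50)]
print(gda(a_k, b_k))
```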
Despite their recent successes, most modern machine learning algorithms lack theoretical guarantees, which are crucial for further development toward delicate tasks. One mysterious phenomenon is that, among infinitely many possible ways to fit the data, these algorithms often find the "good" ones, even when the definition of "good" is not specified by the designers. In this talk I will approach this question from both the microscopic view and the macroscopic view, with an empirical and theoretical study of the connection between the good solutions found by neural networks and the sparse/low-rank solutions of compressed sensing. The key concepts are the implicit bias/regularization of ODEs measured in Bregman divergence, and the neural collapse phenomenon, justified by the block structure of the neural tangent kernel, which can be used for out-of-distribution detection.
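As background for the implicit-bias theme, a textbook illustration (not the talk's results): on an underdetermined least-squares problem, gradient descent initialized at zero converges to the minimum-norm interpolant, one particular "good" solution among the infinitely many that fit the data.

```python
# Textbook illustration of implicit bias: on an underdetermined least-squares
# problem, plain gradient descent started at zero converges to the minimum
# l2-norm interpolant (the pseudoinverse solution), even though infinitely many
# interpolants exist.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 100))        # 20 equations, 100 unknowns
b = rng.standard_normal(20)

x = np.zeros(100)
for _ in range(5000):
    x -= 3e-3 * A.T @ (A @ x - b)         # gradient descent on 0.5 * ||Ax - b||^2

x_min_norm = np.linalg.pinv(A) @ b        # minimum-norm interpolant
print(np.linalg.norm(A @ x - b))          # ~ 0: the data are fit exactly
print(np.linalg.norm(x - x_min_norm))     # ~ 0: GD picked the min-norm solution
```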
In this talk, we revisit the classical theory of variance bounds for unbiased estimators and develop an analogous framework for a different notion of estimator stability, which we term sensitivity. Roughly speaking, the sensitivity of an estimator quantifies how much its output changes when its inputs are perturbed slightly. In the same way that classical inequalities like the Cramér-Rao lower bound on the variance of unbiased estimators can be interpreted geometrically via the Fisher-Rao geometry, our notion of sensitivity admits analogous inequalities and theory, except that they now arise from Wasserstein geometry. We will discuss this Wasserstein-Cramér-Rao lower bound and its associated notion of Wasserstein Fisher information (both closely related to notions introduced by Li and Zhao, 2023), introduce the notion of Wasserstein "exponential" families and their defining properties, and introduce the concept of Wasserstein "MLEs". In particular, we will discuss how the Wasserstein MLE is, generically, asymptotically efficient for sensitivity, in the sense that it achieves the corresponding Wasserstein-Cramér-Rao lower bound. Our theory reveals new optimality properties of existing estimators and, in other cases, yields entirely new estimators.
This is joint work with Adam Quinn Jaffe and Bodhisattva Sen.
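For reference, the classical bound that the Wasserstein analogue parallels; the sensitivity functional and the Wasserstein Fisher information themselves are defined in the talk and are not reproduced here.

```latex
% Classical Cram\'er--Rao bound, stated for reference.  The talk's
% Wasserstein--Cram\'er--Rao bound replaces the variance by a sensitivity
% functional and the Fisher information by a Wasserstein Fisher information.
\[
  \operatorname{Var}_\theta\!\bigl(T(X)\bigr) \;\ge\; \frac{1}{I(\theta)},
  \qquad
  I(\theta) \;=\; \mathbb{E}_\theta\!\left[\bigl(\partial_\theta \log p_\theta(X)\bigr)^2\right],
\]
% valid for any unbiased estimator $T$ of $\theta$ under standard regularity
% conditions; this is the Fisher--Rao-geometric picture the abstract refers to.
```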
Matching and estimating probability distributions are fundamental tasks in generative modeling applications. While optimal transport (OT) provides a natural framework for matching distributions, its large-scale computation remains challenging, especially in high dimensions. Our focus lies on an efficient iterative slice-matching approach initially introduced by Pitie et al. for transferring color statistics. The resulting sliced optimal transport is constructed by solving a sequence of one-dimensional OT problems between the sliced distributions, which are fast to compute and admit closed-form solutions. While these techniques have proven effective in various data science applications, a comprehensive understanding of their convergence properties remains lacking. Our primary aim is to establish almost sure convergence, shedding light on the connections with stochastic gradient descent in the context of the sliced-Wasserstein distance. If time permits, we will also explore the invariance, equivariance, and Lipschitz properties associated with one step of the procedure, yielding recovery and stability results for sliced OT approximations with a single step. This talk is based on joint work with Caroline Moosmueller and Yongzhe Wang.
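A minimal sketch of one slice-matching step of the kind described above, assuming equal sample sizes so the one-dimensional OT map reduces to matching sorted projections; the number of slices per iteration, any stepsize, and the stopping rule used in practice are left unspecified.

```python
# One slice-matching step: project both point clouds onto a random direction,
# solve the 1D OT problem in closed form by sorting, and move the source points
# along that direction toward the target.
import numpy as np

def slice_matching_step(source, target, rng):
    """One sliced-OT update of `source` toward `target` (equal sample sizes)."""
    d = source.shape[1]
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)          # random unit direction (a "slice")
    s_proj = source @ theta                 # 1D projections of the source
    t_proj = target @ theta                 # 1D projections of the target
    # Closed-form 1D OT between empirical measures: match sorted projections.
    order_s = np.argsort(s_proj)
    order_t = np.argsort(t_proj)
    displacement = np.empty_like(s_proj)
    displacement[order_s] = t_proj[order_t] - s_proj[order_s]
    return source + displacement[:, None] * theta   # move along the slice

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))                   # source samples
Y = rng.standard_normal((500, 3)) * 2.0 + 1.0       # target samples
for _ in range(200):
    X = slice_matching_step(X, Y, rng)
```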
Clustering, low-rank approximation, and nonnegative matrix factorization can be phrased as data summarization problems based on selecting a small number of prototypes from a large data set. We provide a unified formulation of these problems and many others. Natural algorithms for solving these problems include adaptive search, which is a greedy deterministic selection rule, and adaptive sampling, which is a randomized selection rule based on distances to existing prototypes. We evaluate the utility of the two approaches theoretically and empirically, pointing out when each approach is preferable.
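A minimal sketch of the two selection rules, using squared Euclidean distances to the existing prototypes; the exact objective of the unified formulation in the talk may differ.

```python
# Prototype selection by adaptive search (greedy farthest-point choice) or
# adaptive sampling (randomized choice with probability proportional to the
# squared distance to the current prototypes).
import numpy as np

def select_prototypes(data, k, rule="search", rng=None):
    """Return k prototype indices chosen by adaptive search or adaptive sampling."""
    rng = np.random.default_rng() if rng is None else rng
    n = data.shape[0]
    chosen = [int(rng.integers(n))]                      # arbitrary first prototype
    dist2 = np.sum((data - data[chosen[0]]) ** 2, axis=1)
    for _ in range(k - 1):
        if rule == "search":                             # greedy: farthest point
            nxt = int(np.argmax(dist2))
        else:                                            # randomized: distance-based
            nxt = int(rng.choice(n, p=dist2 / dist2.sum()))
        chosen.append(nxt)
        dist2 = np.minimum(dist2, np.sum((data - data[nxt]) ** 2, axis=1))
    return chosen

X = np.random.default_rng(1).standard_normal((1000, 5))
print(select_prototypes(X, 10, rule="search"))
print(select_prototypes(X, 10, rule="sampling"))
```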
Dimension reduction algorithms, such as principal component analysis (PCA), multidimensional scaling (MDS), and stochastic neighbor embeddings (SNE and tSNE), are an important tool for data exploration, visualization, and subgroup identification. While these algorithms see broad application across many scientific fields, our theoretical understanding of non-linear dimension reduction algorithms remains limited. This talk will describe new results that identify large data limits for MDS and tSNE using tools from the Calculus of Variations. Along the way, we will showcase situations where standard libraries give outputs that are misleading, and propose new computational algorithms to mitigate these issues and improve efficiency. Connections with the celebrated Gromov-Wasserstein distance and manifold learning will also be highlighted.
Deep neural networks (DNNs) have been successful high-dimensional function approximators in countless applications. However, training DNNs is notoriously challenging, requiring significant time and computational resources. In this talk, we will make training easier by exploiting commonly used network structures and incorporating second-order information. Specifically, we will describe training strategies designed for separable DNNs in which the weights of the final layer are applied linearly. We will leverage this linearity in two ways: using partial optimization in the deterministic setting and iterative sampling in the stochastic setting. We will demonstrate empirically that both approaches yield faster convergence to more accurate DNN models and less sensitivity to training hyperparameters.
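A minimal sketch of the partial-optimization idea for a separable model, in which the final linear layer is eliminated by solving a least-squares problem at each outer iteration; the talk's specific algorithms, second-order updates, and stochastic sampling scheme are not reproduced.

```python
# Separable model y ~ phi(X; Theta) @ W: the final linear weights W are optimal
# in closed form given the hidden features, so each outer iteration solves a
# least-squares problem for W and then updates the inner weights Theta.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))                    # inputs
Y = np.sin(X @ rng.standard_normal((4, 1)))          # targets
Theta = rng.standard_normal((4, 16)) * 0.5           # inner (nonlinear) weights

def features(Theta):
    return np.tanh(X @ Theta)                        # hidden features phi(X; Theta)

for it in range(500):
    Z = features(Theta)
    # Partial optimization: optimal final-layer weights for the current features.
    W, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    residual = Z @ W - Y
    # Simple gradient step on the inner weights (placeholder for the talk's
    # second-order updates): gradient of 0.5 * ||tanh(X Theta) W - Y||^2.
    grad = X.T @ ((1.0 - Z ** 2) * (residual @ W.T))
    Theta -= 1e-3 * grad
    if it % 100 == 0:
        print(f"iter {it:3d}  loss {0.5 * np.sum(residual ** 2):.4f}")
```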
This talk investigates the fundamental differences between low-norm and flat solutions of shallow ReLU networks training problems, particularly in high-dimensional settings. We show that global minima with small weight norms exhibit strong generalization guarantees that are dimension-independent. In contrast, local minima that are “flat” can generalize poorly as the input dimension increases. We attribute this gap to a phenomenon we call neural shattering, where neurons specialize to extremely sparse input regions, resulting in activations that are nearly disjoint across data points. This forces the network to rely on large weight magnitudes, leading to poor generalization. Our theoretical analysis establishes an exponential separation between flat and low-norm minima. In particular, while flatness does imply some degree of generalization, we show that the corresponding convergence rates necessarily deteriorate exponentially with input dimension. These findings suggest that flatness alone does not fully explain the generalization performance of neural networks.
Transportation of measure underlies many contemporary methods in machine learning and statistics. Sampling, a fundamental building block in computational science, can be done efficiently given an appropriate measure-transport map. We ask: what is the effect of using approximate maps in such algorithms? At the core of our analysis are the regularity theory of optimal transport, approximation theory, and an emerging class of inequalities previously studied in the context of uncertainty quantification (UQ).
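In symbols, the measure-transport sampling principle referred to above: a map that pushes a tractable reference forward to the target turns reference samples into target samples, while an approximate map produces samples from its own pushforward, whose discrepancy from the target is the question at hand.

```latex
% Pushforward sampling: exact map T versus an approximate map \widehat{T}.
\[
  T_{\#}\rho = \pi, \quad Z \sim \rho
  \;\;\Longrightarrow\;\;
  X = T(Z) \sim \pi,
  \qquad\text{whereas}\qquad
  \widehat{T}(Z) \sim \widehat{T}_{\#}\rho \approx \pi .
\]
```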