Talks are listed below in alphabetical order by speaker. The schedule is posted here. See also the pdf version.
Solving min-max problems is a central question in optimization, games, learning, and control. Arguably the most natural algorithm is Gradient-Descent-Ascent (GDA); however, since the 1970s, conventional wisdom has held that it fails to converge even on simple problems. This failure spurred an extensive literature on modifying GDA with extragradients, optimism, momentum, anchoring, etc. In contrast, we show that GDA converges in its original form simply by using a judicious choice of stepsizes.
The key innovation is the proposal of unconventional stepsize schedules that are time-varying, asymmetric, and (most surprisingly) periodically negative. We show that all three properties are necessary for convergence and that, taken together, they enable GDA to converge on the classical counterexamples (e.g., unconstrained convex-concave problems). The core intuition is that although negative stepsizes make backward progress, they de-synchronize the min and max variables (overcoming the cycling issue of GDA) and lead to a slingshot phenomenon in which the forward progress of the other iterations is overwhelmingly larger, resulting in fast overall convergence. Geometrically, the slingshot dynamics leverage the non-reversibility of gradient flow: positive and negative steps cancel to first order, yielding a second-order net movement in a new direction, one that drives convergence and is otherwise inaccessible to GDA. Joint work with Henry Shugart.
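To make the failure mode concrete, here is a minimal numerical sketch (not taken from the paper): on the classical bilinear counterexample f(x, y) = xy, GDA with any constant stepsize spirals outward, while the schedule interface below accepts the kind of time-varying, asymmetric, periodically negative schedules the talk proposes (the convergent schedules themselves are the paper's contribution and are not reproduced here).

```python
import numpy as np

# GDA on the classical bilinear counterexample f(x, y) = x * y,
# for which grad_x f = y and grad_y f = x.
def gda(x, y, steps_x, steps_y):
    """Simultaneous GDA with (possibly time-varying, asymmetric) schedules."""
    for ax, ay in zip(steps_x, steps_y):
        x, y = x - ax * y, y + ay * x  # descent in x, ascent in y
    return x, y

T = 100
# Constant symmetric stepsizes: the iterate norm grows every step, since the
# update matrix [[1, -a], [a, 1]] has determinant 1 + a^2 > 1.
x, y = gda(1.0, 1.0, [0.1] * T, [0.1] * T)
print(abs(x), abs(y))  # diverges as T grows

# A schedule in the spirit of the talk would mix positive and negative steps,
# e.g. steps_x = [a1, a2, -a3, a1, a2, -a3, ...] with asymmetric steps_y
# (hypothetical placeholders; see the paper for the actual schedules).
```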
Flow-based models have spurred a revolution in generative modeling, driving astounding advances across diverse domains including high-resolution text-to-image synthesis and de novo drug design. Yet despite their remarkable performance, inference in these models requires solving a differential equation, which is extremely costly for the large-scale neural-network-based models used in practice. In this talk, we introduce a mathematical theory of flow maps, a new class of generative models that directly learn the solution operator of a flow-based model. By learning this operator, flow maps can generate data in 1-4 network evaluations, leading to orders-of-magnitude faster inference compared to standard flow-based models. We discuss several algorithms for efficiently learning flow maps in practice that emerge from our theory, and we show how many popular recent methods for accelerated inference -- including consistency models, shortcut models, align your flow, and mean flow -- can be viewed as particular cases of our formalism. We demonstrate the practical effectiveness of flow maps across several tasks, including image synthesis, geometric data generation, and inference-time guidance of pre-trained text-to-image models.
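A minimal sketch of the inference-cost contrast, with `velocity` and `flow_map` standing in for hypothetical trained networks (names and interfaces are assumptions for illustration, not the talk's API):

```python
import numpy as np

def sample_ode(velocity, x0, n_steps=100):
    """Standard flow-based sampling: integrate dx/dt = velocity(x, t).

    Costs one network evaluation per step, so fine integration is expensive.
    """
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt)  # forward Euler step
    return x

def sample_flow_map(flow_map, x0, times=(0.0, 1.0)):
    """Flow-map sampling: jump directly between times, e.g. 1-4 evaluations.

    `flow_map(x, s, t)` predicts the ODE solution at time t started at (s, x).
    """
    x = x0
    for s, t in zip(times[:-1], times[1:]):
        x = flow_map(x, s, t)
    return x
```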
Despite their recent successes, most modern machine learning algorithms lack theoretical guarantees, which are crucial for their further development toward delicate tasks. One mysterious phenomenon is that, among infinitely many possible ways to fit the data, these algorithms often find the "good" ones, even when the designers never specify what "good" means. In this talk I will approach this question from both the microscopic and the macroscopic view, with empirical and theoretical studies of the connection between the good solutions in neural networks and the sparse/low-rank solutions in compressed sensing. The key concepts are the implicit bias/regularization, measured in Bregman divergence, of the training ODEs, and the neural collapse phenomenon justified by the block structure of the neural tangent kernel, which can be used for out-of-distribution detection.
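For reference, the Bregman divergence appearing in the implicit-bias statement is the standard one: for a strictly convex, differentiable potential h,

```latex
D_h(x, y) \;=\; h(x) - h(y) - \langle \nabla h(y),\, x - y \rangle ,
```

which recovers the squared Euclidean distance for h(x) = \tfrac{1}{2}\|x\|^2; roughly, sparse or low-rank biases correspond to other choices of h.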
Many modern learning tasks require models that can take inputs of varying sizes. Consequently, dimension-independent architectures have been proposed for domains where the inputs are graphs, sets, and point clouds. Recent work on graph neural networks has explored whether a model trained on low-dimensional data can transfer its performance to higher-dimensional inputs. We extend this body of work by introducing a general framework for transferability across dimensions. We show that transferability corresponds precisely to continuity in a limit space formed by identifying small problem instances with equivalent large ones; this identification is driven by the data and the learning task. We instantiate our framework on existing architectures and implement the changes necessary to ensure their transferability. Finally, we offer principles for designing new transferable models. Numerical experiments support our findings.
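A toy illustration of the continuity criterion for set inputs (our own example in the spirit of the framework, not its precise construction): if a small set and its element-wise duplication are identified as equivalent instances, a mean-pooling readout is constant across the identification while a sum-pooling readout is not, so only the former can extend continuously to the limit space.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))    # a small instance: a set of 5 points in R^3
X_big = np.tile(X, (10, 1))    # an equivalent large instance: each point repeated 10x

# Mean pooling agrees on equivalent instances of different sizes ...
print(np.allclose(X.mean(axis=0), X_big.mean(axis=0)))  # True
# ... while sum pooling separates them, obstructing transferability.
print(np.allclose(X.sum(axis=0), X_big.sum(axis=0)))    # False
```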
The quantity of interest in the classical Cramér-Rao theory of unbiased estimation (e.g., the Cramér-Rao lower bound, its exact attainment for exponential families, and the asymptotic efficiency of maximum likelihood estimation) is the variance, which measures the instability of an estimator when its value is compared to its value on an independently sampled data set from the same distribution. In this talk we are interested in a quantity which measures the instability of an estimator when its value is compared to its value on an infinitesimal additive perturbation of the original data set; we refer to this as the "sensitivity" of the estimator. The resulting theory of sensitivity is based on the Wasserstein geometry in the same way that the classical theory of variance is based on the Fisher-Rao (equivalently, Hellinger) geometry, and this insight allows us to derive a collection of results analogous to the classical ones: a Wasserstein-Cramér-Rao lower bound on the sensitivity of any unbiased estimator, a characterization of the models in which there exist unbiased estimators achieving the lower bound exactly, and concrete results showing that the Wasserstein projection estimator achieves the lower bound asymptotically. I'll discuss how these results allow us to treat many statistical examples, sometimes revealing new optimality properties of existing estimators and other times revealing entirely new estimators.
This is joint work with Adam Quinn Jaffe (Columbia) and Bodhisattva Sen (Columbia).
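For orientation, the classical bound that the talk transplants into Wasserstein geometry reads, for an unbiased estimator T of θ with model density p_θ,

```latex
\operatorname{Var}_\theta(T) \;\ge\; \frac{1}{I(\theta)},
\qquad
I(\theta) \;=\; \mathbb{E}_\theta\!\left[ \big( \partial_\theta \log p_\theta(X) \big)^2 \right],
```

where I(θ) is the Fisher information; the Wasserstein-Cramér-Rao bound replaces the variance with the sensitivity and the Fisher information with its Wasserstein-geometric counterpart (whose precise form is given in the talk).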
Matching and estimating probability distributions are fundamental tasks in generative modeling applications. While optimal transport (OT) provides a natural framework for matching distributions, its large-scale computation remains challenging, especially in high dimensions. Our focus is an efficient iterative slice-matching approach initially introduced by Pitié et al. for transferring color statistics. The resulting sliced optimal transport is constructed by solving a sequence of one-dimensional OT problems between the sliced distributions, which are fast to compute and admit closed-form solutions. While these techniques have proven effective in various data science applications, a comprehensive understanding of their convergence properties has been lacking. Our primary aim is to establish almost sure convergence, shedding light on the connections with stochastic gradient descent in the context of the sliced-Wasserstein distance. If time permits, we will also explore invariance, equivariance, and Lipschitz properties of one step of the procedure, yielding recovery and stability results for sliced OT approximations with a single step. This talk is based on joint work with Caroline Moosmueller and Yongzhe Wang.
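A minimal sketch of one slice-matching step for equal-size point clouds (simplified from the procedure the talk analyzes; the stepsize and slice distribution are illustrative choices):

```python
import numpy as np

def slice_matching_step(X, Y, rng, step=1.0):
    """Move X toward Y along one random slice.

    The 1D OT map between the projected point clouds is obtained by
    sorting, the closed-form solution used in every iteration.
    """
    theta = rng.normal(size=X.shape[1])
    theta /= np.linalg.norm(theta)          # random unit direction (slice)
    px, py = X @ theta, Y @ theta           # 1D projections
    rank = np.argsort(np.argsort(px))       # rank of each projected source point
    target = np.sort(py)[rank]              # i-th smallest matched to i-th smallest
    return X + step * (target - px)[:, None] * theta  # shift along the slice

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))               # source point cloud
Y = rng.normal(loc=2.0, size=(500, 3))      # target point cloud
for _ in range(200):
    X = slice_matching_step(X, Y, rng)      # iterate over random slices
```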
Clustering, low-rank approximation, and nonnegative matrix factorization can be phrased as data summarization problems based on selecting a small number of prototypes from a large data set. We provide a unified formulation of these problems and many others. Natural algorithms for solving these problems include adaptive search, which is a greedy deterministic selection rule, and adaptive sampling, which is a randomized selection rule based on distances to existing prototypes. We evaluate the utility of the two approaches theoretically and empirically, pointing out when each approach is preferable.
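One common instantiation of the two selection rules, sketched for Euclidean data (the talk's unified formulation covers far more general summarization costs):

```python
import numpy as np

def select_prototypes(X, k, rule="search", rng=None):
    """Select k prototype rows of X.

    rule="search":   greedy deterministic rule; take the point farthest
                     from the current prototypes (adaptive search).
    rule="sampling": randomized rule; take a point with probability
                     proportional to squared distance (adaptive sampling,
                     as in k-means++ seeding).
    """
    rng = rng or np.random.default_rng()
    chosen = [int(rng.integers(len(X)))]          # arbitrary first prototype
    d2 = np.sum((X - X[chosen[0]]) ** 2, axis=1)  # sq. dist. to nearest prototype
    for _ in range(k - 1):
        if rule == "search":
            i = int(np.argmax(d2))
        else:
            i = int(rng.choice(len(X), p=d2 / d2.sum()))
        chosen.append(i)
        d2 = np.minimum(d2, np.sum((X - X[i]) ** 2, axis=1))
    return X[chosen]
```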
Dimension reduction algorithms, such as principal component analysis (PCA), multidimensional scaling (MDS), and stochastic neighbor embeddings (SNE and tSNE), are an important tool for data exploration, visualization, and subgroup identification. While these algorithms see broad application across many scientific fields, our theoretical understanding of non-linear dimension reduction algorithms remains limited. This talk will describe new results that identify large data limits for MDS and tSNE using tools from the Calculus of Variations. Along the way, we will showcase situations where standard libraries give outputs that are misleading, and propose new computational algorithms to mitigate these issues and improve efficiency. Connections with the celebrated Gromov-Wasserstein distance and manifold learning will also be highlighted.
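For concreteness, classical metric MDS computes an embedding by minimizing the stress functional

```latex
\min_{x_1, \dots, x_n \in \mathbb{R}^m} \; \sum_{i < j} \big( \|x_i - x_j\| - d_{ij} \big)^2 ,
```

where the d_{ij} are the given pairwise dissimilarities; it is large-data limits of variational problems of this kind that the talk analyzes (the exact functionals treated may differ).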
Deep neural networks (DNNs) have been successful high-dimensional function approximators in countless applications. However, training DNNs is notoriously challenging, requiring significant time and computational resources. In this talk, we will make training easier by exploiting commonly used network structures and incorporating second-order information. Specifically, we will describe training strategies designed for separable DNNs in which the weights of the final layer are applied linearly. We will leverage this linearity in two ways: using partial optimization in the deterministic setting and iterative sampling in the stochastic setting. We will demonstrate empirically that both approaches yield faster convergence to more accurate DNN models and less sensitivity to training hyperparameters.
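A minimal sketch of the partial-optimization idea in the deterministic setting, assuming squared loss and a generic feature map (`hidden_features` is an illustrative stand-in, not the talk's architecture): because the last layer enters linearly, it can be eliminated by a least-squares solve, leaving a reduced objective in the hidden weights only.

```python
import numpy as np

def hidden_features(X, theta):
    """Hidden-layer features phi(X; theta); tanh chosen for illustration."""
    return np.tanh(X @ theta)

def reduced_loss(X, Y, theta):
    """Variable projection: eliminate the linear last layer in closed form."""
    Phi = hidden_features(X, theta)
    W, *_ = np.linalg.lstsq(Phi, Y, rcond=None)  # optimal last-layer weights
    r = Phi @ W - Y
    return 0.5 * np.sum(r ** 2)  # objective now depends on theta alone

# An outer optimizer (gradient descent, Gauss-Newton, ...) then minimizes
# reduced_loss over theta; in the stochastic setting the full solve is
# replaced by iterative sampling, per the talk.
```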
This talk investigates the fundamental differences between low-norm and flat solutions of shallow ReLU network training problems, particularly in high-dimensional settings. We show that global minima with small weight norms exhibit strong generalization guarantees that are dimension-independent. In contrast, local minima that are “flat” can generalize poorly as the input dimension increases. We attribute this gap to a phenomenon we call neural shattering, where neurons specialize to extremely sparse input regions, resulting in activations that are nearly disjoint across data points. This forces the network to rely on large weight magnitudes, leading to poor generalization. Our theoretical analysis establishes an exponential separation between flat and low-norm minima. In particular, while flatness does imply some degree of generalization, we show that the corresponding convergence rates necessarily deteriorate exponentially with the input dimension. These findings suggest that flatness alone does not fully explain the generalization performance of neural networks.
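One standard way to make "small weight norm" precise for shallow ReLU networks, offered here as a plausible reading rather than the talk's exact definition, is the path norm of the network

```latex
f(x) \;=\; \sum_{j=1}^{m} a_j \,\big( w_j^\top x \big)_+ ,
\qquad
\|f\|_{\mathrm{path}} \;=\; \sum_{j=1}^{m} |a_j| \, \|w_j\|_2 ,
```

a quantity known to admit dimension-independent generalization bounds in this setting.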
Transportation of measure underlies many contemporary methods in machine learning and statistics. Sampling, a fundamental building block in computational science, can be done efficiently given an appropriate measure-transport map. We ask: what is the effect of using approximate maps in such algorithms? At the core of our analysis are the theory of optimal transport regularity, approximation theory, and an emerging class of inequalities previously studied in the context of uncertainty quantification (UQ).
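The sampling primitive in question, as a sketch: push reference samples through a transport map, where `T_hat` is a hypothetical (possibly only approximate) map and the question is how the approximation error propagates to the samples.

```python
import numpy as np

def sample_via_transport(T_hat, n, d, rng=None):
    """Draw n samples from the pushforward of N(0, I_d) under T_hat.

    If T_hat only approximates the true map T, the output follows a
    perturbed version of the target law.
    """
    rng = rng or np.random.default_rng()
    Z = rng.standard_normal((n, d))           # reference samples
    return np.apply_along_axis(T_hat, 1, Z)   # mapped samples

# Example: in 1D, T(z) = mu + sigma * z transports N(0, 1) to N(mu, sigma^2).
samples = sample_via_transport(lambda z: 2.0 + 0.5 * z, n=1000, d=1)
```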
This talk investigates low-rank structure in the gradients of the training loss for two-layer neural networks while relaxing the usual isotropy assumptions on the training data and parameters. We consider a spiked data model in which the bulk can be anisotropic and ill-conditioned; we do not require independent data and weight matrices, and we analyze both the mean-field and neural-tangent-kernel scalings. We show that the gradient with respect to the input weights is approximately low-rank and is dominated by two rank-one terms: one aligned with the bulk data residue, and another aligned with the rank-one spike in the input data. We characterize how properties of the training data, the scaling regime, and the activation function govern the balance between these two components. We also demonstrate that standard regularizers, such as weight decay, input noise, and Jacobian penalties, selectively modulate these components.
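A numerical sketch of the kind of structure described, on a synthetic spiked data model with a two-layer ReLU network (sizes, scalings, and the data model are illustrative assumptions): the singular values of the input-weight gradient decay rapidly, consistent with domination by a few rank-one terms.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 2000, 256, 128
spike = rng.normal(size=d)
spike /= np.linalg.norm(spike)
X = rng.normal(size=(n, d)) + 3.0 * rng.normal(size=(n, 1)) * spike  # bulk + spike
y = rng.normal(size=n)

W = rng.normal(size=(d, m)) / np.sqrt(d)  # input weights
a = rng.normal(size=m) / np.sqrt(m)       # output weights

# Squared-loss gradient w.r.t. W for f(x) = a . relu(W^T x):
Z = X @ W
res = Z.clip(0) @ a - y                       # residuals, shape (n,)
G = X.T @ ((Z > 0) * (res[:, None] * a)) / n  # gradient, shape (d, m)

s = np.linalg.svd(G, compute_uv=False)
print(s[:5] / s[0])  # rapid decay indicates an approximately low-rank gradient
```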