Julia Ackermann
(with Arnulf Jentzen, Thomas Kruse, Benno Kuckuck, and Joshua Lee Padgett)
Title: Deep neural networks overcome the curse of dimensionality for space-time solutions of semilinear PDEs
Approximating high-dimensional partial differential equations (PDEs) is a challenging task due to the curse of dimensionality (COD). In the last few years, it has been shown that deep neural networks (DNNs) have the expressive power to overcome the COD in such PDE approximation tasks. This means that there exists an approximating sequence of DNNs such that the number of parameters grows at most polynomially in the PDE dimension and the reciprocal of the prescribed approximation accuracy. The general proof strategy is of a probabilistic nature and employs approximation results of the PDE solution by multi-level Picard approximations, based on which the DNNs are constructed. In this talk, I present how this approach can be used to establish that DNNs with the ReLU, leaky ReLU or softplus activation function have the power to approximate solutions of semilinear heat PDEs with Lipschitz-continuous nonlinearities in the $L^p$-sense in space-time without the COD.
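A prototypical member of the PDE class considered here (stated as an illustration; the talk's precise setting may differ in details) is the semilinear heat equation
\[
  \tfrac{\partial u}{\partial t}(t,x) \;=\; \Delta_x u(t,x) + f\big(u(t,x)\big), \qquad u(0,x) = g(x), \qquad (t,x) \in [0,T] \times \mathbb{R}^d,
\]
with a Lipschitz-continuous nonlinearity $f$ and a suitable initial condition $g$. Overcoming the COD then means that DNNs achieving $L^p$-accuracy $\varepsilon$ for the space-time solution $(t,x) \mapsto u(t,x)$ can be chosen with a number of parameters growing at most polynomially in $d$ and $\varepsilon^{-1}$.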
Steffen Dereich
(with Arnulf Jentzen)
Title: Asymptotic analysis of the Adam algorithm
In this talk, we will analyze the Adam algorithm, introduced by Kingma and Ba in 2014, for fixed parameters $\alpha$ and $\beta$ and step sizes that decay to zero. I will present new error estimates for the Adam algorithm, developed jointly with Arnulf Jentzen. Specifically, we show that the algorithm’s effective behavior is closely related to a particular vector field, which we refer to as the Adam field. If this field satisfies a local coercivity condition around one of its zeros, we can prove convergence of order $\sqrt{\gamma_n}$ (with $\gamma_n$ denoting the step sizes of the algorithm) toward that zero, provided the iterates remain within a suitable neighborhood. Furthermore, we establish that the averaged scheme satisfies a central limit theorem under an additional appropriate non-degeneracy condition.
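For reference, a minimal sketch of the Adam iteration with step sizes decaying to zero is given below; the decay schedule $\gamma_n = c/\sqrt{n}$, the constants, and the test objective are illustrative assumptions, not the choices of the talk.

    import numpy as np

    def adam(grad, x0, beta1=0.9, beta2=0.999, c=0.1, eps=1e-8, n_steps=10_000):
        # Adam with fixed momentum parameters and step sizes gamma_n decaying to zero;
        # grad(x) may return a noisy gradient estimate.
        x = np.asarray(x0, dtype=float)
        m = np.zeros_like(x)   # first-moment (momentum) estimate
        v = np.zeros_like(x)   # second-moment estimate
        for n in range(1, n_steps + 1):
            g = grad(x)
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g**2
            m_hat = m / (1 - beta1**n)      # bias corrections as in Kingma and Ba
            v_hat = v / (1 - beta2**n)
            gamma_n = c / np.sqrt(n)        # illustrative decaying step-size schedule
            x = x - gamma_n * m_hat / (np.sqrt(v_hat) + eps)
        return x

    # Example: noisy quadratic objective f(x) = 0.5 * ||x||^2.
    rng = np.random.default_rng(0)
    noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
    print(adam(noisy_grad, np.ones(3)))

Heuristically, averaging the update direction over the gradient noise at a fixed point yields a deterministic vector field; suitably formalized, this is the kind of object the talk refers to as the Adam field, and convergence is established toward its zeros.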
Aymeric Dieuleveut
(with Baptiste Goujaud, Adrien Taylor)
Title: Provable non-accelerations of the heavy-ball method
In this work, we show that the heavy-ball (HB) method provably does not reach an accelerated convergence rate on smooth strongly convex problems. More specifically, we show that for any condition number and any choice of algorithmic parameters, either the worst-case convergence rate of HB on the class of $L$-smooth and $\mu$-strongly convex quadratic functions is not accelerated (that is, slower than the accelerated rate $1-\Theta(1/\sqrt{\kappa})$, where $\kappa = L/\mu$ denotes the condition number), or there exists an $L$-smooth $\mu$-strongly convex function and an initialization such that the method does not converge. To the best of our knowledge, this result closes a simple yet open question on one of the most widely used and iconic first-order optimization techniques. Our approach builds on finding functions for which HB fails to converge and instead cycles over finitely many iterates. We analytically describe all parametrizations of HB that exhibit this cycling behavior on a particular cycle shape, whose choice is supported by a systematic and constructive approach to the study of cycling behaviors of first-order methods. We show the robustness of our results to perturbations of the cycle, and extend them to classes of functions that also satisfy higher-order regularity conditions.
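For reference, the heavy-ball iteration in question can be sketched as below; the quadratic test function and parameter values are illustrative and are not the cycling counterexamples constructed in the work.

    import numpy as np

    def heavy_ball(grad, x0, step, momentum, n_steps=200):
        # Heavy-ball (Polyak momentum) iteration:
        #   x_{k+1} = x_k - step * grad(x_k) + momentum * (x_k - x_{k-1})
        x_prev = np.asarray(x0, dtype=float)
        x = x_prev.copy()
        for _ in range(n_steps):
            x_next = x - step * grad(x) + momentum * (x - x_prev)
            x_prev, x = x, x_next
        return x

    # Illustrative L-smooth, mu-strongly convex quadratic f(x) = 0.5 * x^T diag(mu, L) x.
    mu, L = 1.0, 100.0
    grad = lambda x: np.array([mu, L]) * x
    print(heavy_ball(grad, x0=[1.0, 1.0], step=1.0 / L, momentum=0.9))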
Aymeric Dieuleveut
(with Daniel Berg Thomsen, Adrien Taylor)
Title: Tight analyses of first-order methods with error feedback
Communication between agents often constitutes a major computational bottleneck in distributed learning. One of the most common mitigation strategies is to compress the information exchanged, thereby reducing communication overhead. To counteract the degradation in convergence associated with compressed communication, error feedback schemes -- most notably EF and EF21 -- were introduced. In this work, we provide a tight analysis of both of these methods. Specifically, we find the Lyapunov function that yields the best possible convergence rate for each method -- with matching lower bounds. This principled approach yields sharp performance guarantees and enables a rigorous, apples-to-apples comparison between EF, EF21, and compressed gradient descent. Our analysis is carried out in a simplified yet representative setting, which allows for clean theoretical insights and fair comparison of the underlying mechanisms.
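As an illustration of the mechanism, here is a minimal sketch of compressed gradient descent with classical error feedback (EF); the top-$k$ compressor, the step size, and the quadratic objective are illustrative assumptions, and EF21 would instead compress gradient differences.

    import numpy as np

    def top_k(v, k):
        # Illustrative biased compressor: keep only the k largest-magnitude coordinates.
        out = np.zeros_like(v)
        idx = np.argsort(np.abs(v))[-k:]
        out[idx] = v[idx]
        return out

    def compressed_gd_with_ef(grad, x0, step=0.1, k=1, n_steps=500):
        # Compressed gradient descent with classical error feedback: the part of the
        # gradient lost to compression is stored and added back at the next iteration.
        x = np.asarray(x0, dtype=float)
        e = np.zeros_like(x)                 # accumulated compression error
        for _ in range(n_steps):
            g = grad(x) + e                  # error-corrected gradient
            c = top_k(g, k)                  # only c would be communicated
            e = g - c                        # remember what was lost
            x = x - step * c
        return x

    grad = lambda x: np.array([1.0, 10.0]) * x   # simple separable quadratic
    print(compressed_gd_with_ef(grad, x0=[1.0, 1.0]))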
Benjamin Gess
(with Vitalii Konarovskyi, Sebastian Kassing)
Title: Effective fluctuating continuum models for SGD
In this talk, we present recent results on the derivation of effective models for the training dynamics of (Riemannian) stochastic gradient descent (SGD) in limits of small learning rates or large, shallow networks. The focus lies on developing effective limiting models that also capture the fluctuations inherent in SGD. This will lead to novel concepts of stochastic modified flows and distribution-dependent modified flows. The advantage of these limiting models is that they match the SGD dynamics to higher order and recover the correct multi-point distributions.
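As background (not part of the new results above), recall the classical first-order diffusion approximation of SGD with learning rate $\eta$,
\[
  \theta_{k+1} \;=\; \theta_k - \eta\,\nabla f(\theta_k,\xi_{k+1})
  \qquad\leadsto\qquad
  dX_t \;=\; -\nabla F(X_t)\,dt + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,dW_t,
\]
where $F(\theta) = \mathbb{E}[f(\theta,\xi)]$ and $\Sigma(\theta)$ is the covariance of the gradient noise. The stochastic modified flows and distribution-dependent modified flows of the talk refine this picture so that the SGD dynamics are matched to higher order in $\eta$ and the multi-point distributions are recovered correctly.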
Sebastian Kassing
(with Simon Weissmann, Steffen Dereich)
Title: Beyond Strong Convexity: Geometry and Optimization under the Polyak–Łojasiewicz Condition
Many theoretical results in (stochastic) optimization have been derived under strong convexity assumptions or even for quadratic objective functions. However, such assumptions often fail to hold in modern machine learning applications, where objectives are typically non-convex. This talk explores a recent line of research that extends classical results in stochastic gradient-based optimization to broader classes of functions satisfying the Polyak–Łojasiewicz (PL) inequality, a condition that is significantly more relevant for practical deep learning models. We consider typical acceleration techniques such as Polyak’s Heavy Ball and Ruppert-Polyak averaging and use a geometric interpretation of the PL-inequality to show that many algorithmic properties extend to a more general and realistic class of objectives.
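For reference, the PL inequality requires that, for some $\mu > 0$,
\[
  \|\nabla f(x)\|^2 \;\ge\; 2\mu\,\big(f(x) - \inf f\big) \qquad \text{for all } x,
\]
so the gradient norm controls the suboptimality gap without requiring convexity or a unique minimizer; every $\mu$-strongly convex function satisfies it, but so do many non-convex objectives arising in deep learning.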
Vitalii Konarovskyi
(with Benjamin Gess and Rishabh S. Gvalani)
Title: Fluctuation Analysis of Mean-Field Limits in Overparameterized SGD
We study the mean-field limit of stochastic gradient descent (SGD) dynamics in overparameterized neural networks via their connection to nonlinear stochastic partial differential equations (SPDEs). It is well known that, in the mean-field scaling, these dynamics converge to a deterministic PDE. Our main contribution is to introduce a corresponding nonlinear SPDE and analyze the fluctuations of its solutions, showing that they are governed by the same Gaussian process that characterizes SGD fluctuations. This correspondence enables a sharper comparison between SGD and its SPDE approximation, showing that incorporating stochastic noise into the limiting description improves convergence rates and captures the fluctuation behavior of SGD.
Sophie Langer
(with Adam Krzyzak, Michael Kohler, Alina Braun)
Title: Deep Learning Theory: Statistics, Optimization and the Space Between
Johannes Schmidt-Hieber
(with Jiaqi Li, Wei-Biao Wu)
Title: CLTs for noisy SGD via geometric moment contraction
To analyse the behaviour of noisy SGD, a powerful alternative to the martingale CLT is to interpret the problem as a time series and to apply the machinery of functional dependence measures and geometric moment contraction. In the talk, this machinery is described and applied to (S)GD with dropout.
Joint work with Jiaqi Li (leading author in these projects) and Wei-Biao Wu (both U Chicago).
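As a minimal illustration of the "GD with dropout" dynamics (the linear-regression setting and all constants below are assumptions made for this sketch), each iteration applies an independent Bernoulli mask to the coordinates, so the iterates form a time series driven by i.i.d. innovations:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, p = 200, 5, 0.8                  # samples, dimension, keep-probability
    X = rng.standard_normal((n, d))
    y = X @ np.arange(1.0, d + 1)          # noiseless linear model, for illustration

    w = np.zeros(d)
    step = 0.01
    for _ in range(5000):
        mask = rng.binomial(1, p, size=d)  # fresh Bernoulli dropout mask each step
        resid = X @ (mask * w) - y         # forward pass with dropped coordinates
        grad = mask * (X.T @ resid) / n    # gradient of the dropout least-squares loss
        w = w - step * grad

    print(w)   # hovers around the minimizer of the dropout-averaged loss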
Johannes Schmidt-Hieber II
Title: Two elementary open problems in the theory of SGD
The talk will describe two open problems in the theory of SGD.
Mariia Seleznova
Title: The Probabilistic Effects of Depth in Deep Learning
Depth is central to modern neural networks, but its probabilistic consequences are subtle and not fully captured by classical theories. In this talk I will discuss how increasing depth alters signal propagation statistics in two settings: fully-connected networks with independent weights and linear recurrent networks with weight sharing. For fully-connected networks, mean-field initialization stabilizes second moments, but higher-order fluctuations grow exponentially with depth-to-width ratio, breaking convergence to the Neural Tangent Kernel (NTK) regime and changing predictions about training and generalization. In contrast, for linear recurrent networks, modern random matrix theory shows that forward propagation already becomes unstable at sequence lengths on the order of the square root of width, making mean-field initialization fail; the role of higher moments and the NTK picture remain open. Together these results show that depth stresses architectures in qualitatively different ways, and that infinite-width approximations are generally insufficient to capture the statistics of realistic neural networks.
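A small numerical sketch of the first setting (this only reproduces the well-known stabilization of second moments under mean-field/He initialization for a single weight draw; it does not capture the higher-order, depth-to-width fluctuation effects that are the subject of the talk):

    import numpy as np

    rng = np.random.default_rng(0)
    width, depth = 256, 64
    x = rng.standard_normal(width)

    second_moments = []
    for _ in range(depth):
        # mean-field / He initialization for ReLU: Var(W_ij) = 2 / width
        W = rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
        x = np.maximum(W @ x, 0.0)         # ReLU forward propagation
        second_moments.append(np.mean(x**2))

    print(second_moments[::8])             # stays of order one across depth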
Anna Shalova
(with Mark Peletier, André Schlichting)
Title: Singular-limit analysis of gradient descent with noise injection
We study the limiting dynamics of noisy gradient descent systems in the overparameterized regime. In this regime the set of global minimizers of the loss is large, and when initialized in a neighbourhood of this zero-loss set, a noisy gradient descent algorithm slowly evolves along it. In some cases this slow evolution has been related to better generalisation properties. We give an explicit characterization of this evolution for a broad class of noisy gradient descent systems. Our results show that the structure of the noise affects not just the form of the limiting process, but also the time scale on which the evolution takes place. We apply our theory to dropout, label noise and classical SGD (minibatching) noise. We show that the dropout and label noise models evolve on two different time scales, while classical SGD yields a trivial evolution on both of these time scales, implying that additional noise is required for regularization.
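A generic template for the noisy gradient descent systems considered here, with label-noise injection in an overparameterized least-squares problem as an illustrative example (all modelling choices below are assumptions made for this sketch):

    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_gd(grad, noise, w0, step=1e-3, n_steps=50_000):
        # Noisy gradient descent: w_{k+1} = w_k - step * (grad(w_k) + noise(w_k)).
        # Different noise structures (dropout-style, label-noise-style, minibatch-style)
        # lead to different limiting dynamics and time scales along the zero-loss set.
        w = np.asarray(w0, dtype=float)
        for _ in range(n_steps):
            w = w - step * (grad(w) + noise(w))
        return w

    # Overparameterized least squares: many interpolating solutions (zero-loss set).
    n, d, sigma = 20, 100, 0.1
    X = rng.standard_normal((n, d))
    y = X @ rng.standard_normal(d)
    grad = lambda w: X.T @ (X @ w - y) / n
    label_noise = lambda w: -sigma * (X.T @ rng.standard_normal(n)) / n  # noisy labels

    w = noisy_gd(grad, label_noise, np.zeros(d))
    print(np.linalg.norm(X @ w - y))       # the iterate stays close to the zero-loss set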
Lukas Trottner
Title: Adaptive denoising diffusion modelling via random time reversal
We introduce a new class of generative diffusion models that, unlike conventional denoising diffusion models, achieve a time-homogeneous structure for both the noising and denoising processes, allowing the number of steps to adaptively adjust based on the noise level. This is accomplished by conditioning the forward process using Doob’s h-transform, which terminates the process at a suitable sampling distribution at a random time. The model is particularly well suited for generating data with lower intrinsic dimension, as the termination criterion simplifies to a first hitting rule. A key feature of the model is its adaptability to the target data, enabling a variety of downstream tasks using a pre-trained unconditional generative model. We highlight this point by demonstrating how our generative model may be used as an unsupervised learning algorithm: in high dimensions the model outputs, with high probability, the metric projection of a noisy observation $y$ of some latent data point $x$ onto the lower-dimensional support of the data, which we do not assume to be analytically accessible but only to be represented by the unlabeled training data set of the generative model.
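For readers unfamiliar with the conditioning device used above: given a Markov process with transition density $p$ and a space-time harmonic function $h > 0$, Doob's $h$-transform reweights the transitions as
\[
  p^h(s,x;t,y) \;=\; p(s,x;t,y)\,\frac{h(t,y)}{h(s,x)},
\]
which again defines a Markov process and can be interpreted as the original process conditioned on an event encoded by $h$; in the construction above, the conditioning is chosen so that the forward noising process terminates in a suitable sampling distribution at a random time.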
Simon Weissmann
Title: Convergence analysis of stochastic gradient methods under gradient domination
In this talk, we present recent advances in establishing almost sure convergence rates for stochastic gradient methods, which are among the most important algorithms for training machine learning models. While classical assumptions such as strong convexity allow for a simple analysis, they are rarely satisfied in applications. In recent years, global and local gradient domination properties have been shown to be a more realistic replacement for strong convexity. They have been proved to hold in diverse settings such as (simple) policy gradient methods in reinforcement learning and the training of deep neural networks with analytic activation functions. We prove almost sure convergence rates $f(X_n)-f^*\in o\big( n^{-\frac{1}{4\beta-1}+\epsilon}\big)$ of the last iterate for stochastic gradient descent (with and without momentum) under global and local $\beta$-gradient domination assumptions. The almost sure rates come arbitrarily close to recent rates in expectation.
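For orientation, one common formulation of $\beta$-gradient domination (stated here as an assumption on the precise definition used in the talk) requires
\[
  \|\nabla f(x)\| \;\ge\; c\,\big(f(x) - f^*\big)^{\beta} \qquad \text{for some } c > 0 \text{ and } \beta \in [\tfrac12, 1],
\]
either globally or locally around a reference point. For $\beta = \tfrac12$ this is, up to constants, the Polyak–Łojasiewicz inequality, and plugging $\beta = \tfrac12$ into the rate above gives last-iterate convergence of order $o\big(n^{-1+\epsilon}\big)$ almost surely.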