In many machine learning applications, especially in healthcare and the natural sciences, interpretability is as important as prediction accuracy. Random forests offer not only state-of-the-art prediction accuracy but also interpretability through feature and interaction importance scores. However, when features are correlated, interpreting these scores can be problematic.
In the first part of this talk, we present consistency results for the recovery of feature interactions from random forests, assuming feature independence. We highlight the challenges posed by feature correlation, which can lead to misleading importance scores.
In the second part, we discuss a method to mitigate the effects of feature correlation, namely local sample weighting. We demonstrate how this approach improves interpretability in random forests. Moreover, we discuss how this approach can also be applied in minibatch training, e.g., of neural networks.
Physics-informed machine learning combines the expressiveness of data-based approaches with the interpretability of physical models. In this context, we consider a general regression problem where the empirical risk is regularized by a partial differential equation that quantifies the physical inconsistency. Practitioners often resort to physics-informed neural networks (PINNs) to solve this kind of problem. After discussing some strengths and limitations of PINNs, we prove that for linear differential priors, the problem can be formulated directly as a kernel regression task, giving a rigorous framework to analyze physics-informed ML. In particular, the physical prior can help boost the convergence of the estimator. We also propose the PIKL algorithm (physics-informed kernel learning) as a numerical strategy to implement this kernel method.
Joint work with Nathan Doumèche, Gérard Biau, and Francis Bach.
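To make the setting above concrete, here is a minimal sketch of an empirical risk penalized by a linear differential prior. It is not the PIKL algorithm: the random tanh features stand in for a richer model class, and the prior u' + u = 0, the collocation grid, and the penalty weight `lam` are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy observations of u(x) = exp(-x) on [0, 1].
n = 50
x_obs = rng.uniform(0.0, 1.0, size=n)
y_obs = np.exp(-x_obs) + 0.05 * rng.normal(size=n)

# Random tanh features (a stand-in for a richer model class).
m = 200
w = rng.normal(scale=3.0, size=m)
b = rng.uniform(-3.0, 3.0, size=m)

def feats(x):
    return np.tanh(np.outer(x, w) + b)            # shape (len(x), m)

def dfeats(x):
    z = np.tanh(np.outer(x, w) + b)
    return (1.0 - z ** 2) * w                     # d/dx tanh(w x + b)

# Linear differential prior u'(x) + u(x) = 0, enforced at collocation points.
x_col = np.linspace(0.0, 1.0, 100)
lam = 1.0                                         # weight of the physical penalty

# Empirical risk + PDE penalty is quadratic in the coefficients,
# so the minimizer solves a stacked linear least-squares problem.
A = np.vstack([feats(x_obs) / np.sqrt(n),
               np.sqrt(lam) * (dfeats(x_col) + feats(x_col)) / np.sqrt(len(x_col))])
rhs = np.concatenate([y_obs / np.sqrt(n), np.zeros(len(x_col))])
c, *_ = np.linalg.lstsq(A, rhs, rcond=None)

u_hat = feats(x_col) @ c                          # physics-regularized estimate
```

The point of the sketch is structural: for a linear differential prior the penalized objective is quadratic in the coefficients, which is the property the kernel formulation exploits.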
In nonparametric random design regression, we consider ReLU neural networks with weight parameters randomly sampled from a well-chosen heavy-tailed distribution. The architecture of the network is taken to be deterministic and 'overfitting', in the sense that the width of the network can be chosen much larger than the oracle width (the one we would choose if we knew the regularity or structure parameters of the regression function).
We show that the corresponding Bayesian posterior distribution automatically achieves optimal adaptive rates (up to logarithmic factors) in terms of 'smoothness' and 'structure' parameters in a number of settings, including compositional structures, anisotropic Besov spaces, and design points sitting on a low-dimensional manifold. We also derive contraction rates for mean-field variational Bayes counterparts.
We discuss connections with the setting of random series expansions on a basis, as well as work in progress for other distributions on the network parameters.
(Based on joint works with Paul Egels and Sergios Agapiou)
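As an illustration of the prior described above, here is a minimal sketch that draws one overparametrized ReLU network with heavy-tailed weights; the Student-t law with 2.5 degrees of freedom is only a placeholder for the well-chosen distribution of the talk, and no posterior or variational computation is shown.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_relu_network(width, depth, d_in, df=2.5):
    """Draw one network from a heavy-tailed (here: Student-t) prior on the weights."""
    dims = [d_in] + [width] * depth + [1]
    return [(rng.standard_t(df, size=(dims[i], dims[i + 1])),
             rng.standard_t(df, size=dims[i + 1]))
            for i in range(len(dims) - 1)]

def forward(params, x):
    h = x
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:          # ReLU on hidden layers only
            h = np.maximum(h, 0.0)
    return h

# The width may be taken much larger than the 'oracle' width.
params = sample_relu_network(width=512, depth=3, d_in=5)
x = rng.normal(size=(10, 5))
print(forward(params, x).shape)          # (10, 1)
```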
This paper studies the phenomenon of benign overfitting in the setting of nonparametric regression. We propose using local polynomials to construct an estimator of the regression function with the following two properties. First, this estimator is minimax-optimal over Hölder classes. Second, it is a continuous function that interpolates the set of observations with high probability. The key element of the construction is the use of singular kernels. Moreover, we demonstrate that adaptation to unknown smoothness is compatible with benign overfitting: indeed, we propose another interpolating estimator that achieves minimax optimality adaptively to the unknown Hölder smoothness. Our results highlight that in the nonparametric regression model, interpolation can be fundamentally decoupled from the bias-variance tradeoff.
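For intuition on how a singular kernel forces interpolation, here is a minimal sketch in the simpler Nadaraya-Watson form (the construction in the paper uses local polynomials); the kernel K(u) = ||u||^(-a) 1{||u|| <= h} puts diverging weight on nearby observations, and the bandwidth and exponent below are illustrative.

```python
import numpy as np

def singular_kernel_estimate(x0, X, Y, h=0.2, a=0.5):
    """Nadaraya-Watson estimate with the singular kernel K(u) = ||u||^(-a) 1{||u|| <= h};
    it interpolates because the weight diverges as x0 approaches an observation."""
    d = np.linalg.norm(X - x0, axis=1)
    if np.any(d == 0.0):                 # exactly at a data point: return its response
        return Y[np.argmin(d)]
    w = np.where(d <= h, (d / h) ** (-a), 0.0)
    if w.sum() == 0.0:                   # no observation within the bandwidth
        return Y[np.argmin(d)]
    return float(np.sum(w * Y) / np.sum(w))

rng = np.random.default_rng(2)
X = rng.uniform(size=(200, 1))
Y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=200)
print(singular_kernel_estimate(X[0], X, Y), Y[0])   # identical: the estimator interpolates
```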
We analyze the landscape and training dynamics of diagonal linear networks in a linear regression task, with the network parameters being perturbed by small isotropic normal noise. The addition of such noise may be interpreted as a stochastic form of sharpness-aware minimization (SAM) and we prove several results that relate its action on the underlying landscape and training dynamics to the sharpness of the loss. In particular, the noise changes the expected gradient to force balancing of the weight matrices at a fast rate along the descent trajectory. In the diagonal linear model, we show that this equates to minimizing the average sharpness, as well as the trace of the Hessian matrix, among all possible factorizations of the same matrix. Further, the noise forces the gradient descent iterates towards a shrinkage-thresholding of the underlying true parameter, with the noise level explicitly regulating both the shrinkage factor and the threshold. Joint work with Sophie Langer and Johannes Schmidt-Hieber.
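A minimal sketch of the perturbed dynamics on a diagonal linear network f(x) = (u * v)' x: each gradient is averaged over a few isotropic Gaussian perturbations of the parameters, a crude stand-in for the expectation analyzed above; the noise level, averaging budget, and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Linear regression data with a sparse ground truth.
n, d = 200, 20
beta_star = np.zeros(d)
beta_star[:3] = [2.0, -1.0, 0.5]
X = rng.normal(size=(n, d))
y = X @ beta_star + 0.1 * rng.normal(size=n)

def grad(u, v):
    """Gradient of the squared loss for the diagonal parametrization beta = u * v."""
    g_beta = X.T @ (X @ (u * v) - y) / n
    return g_beta * v, g_beta * u                 # d/du, d/dv

u = np.full(d, 0.1)
v = np.full(d, 0.1)
lr, sigma, n_noise = 0.05, 0.05, 8
for _ in range(2000):
    # Average the gradient over parameters perturbed by isotropic Gaussian noise.
    gu, gv = np.zeros(d), np.zeros(d)
    for _ in range(n_noise):
        du, dv = grad(u + sigma * rng.normal(size=d),
                      v + sigma * rng.normal(size=d))
        gu += du / n_noise
        gv += dv / n_noise
    u -= lr * gu
    v -= lr * gv

print(np.round(u * v, 2))   # shrunken/thresholded estimate of beta_star
```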
Forward gradient descent (FGD) has been proposed as a biologically more plausible alternative to gradient descent, as it can be computed without a backward pass. Considering the linear model with d parameters, previous work has found that the prediction error of FGD is, however, slower by a factor d than the prediction error of stochastic gradient descent (SGD). In this paper we show that by computing ℓ FGD steps based on each training sample, this suboptimality factor becomes d/(ℓ ∧ d), and thus the suboptimality of the rate disappears if ℓ ≳ d. We also show that FGD with repeated sampling can adapt to low-dimensional structure in the input distribution. The main mathematical challenge lies in controlling the dependencies arising from the repeated sampling process.
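A minimal sketch of FGD with ℓ repeated steps per training sample in the linear model; since the squared-loss gradient is available in closed form, the directional derivative along the random direction v is written out explicitly rather than obtained via forward-mode differentiation, and the constant step size is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 10
theta_star = rng.normal(size=d)

theta = np.zeros(d)
ell = d                    # number of FGD steps per sample; ell >= d removes the gap
lr = 0.01                  # illustrative constant step size

for _ in range(20000 // ell):
    x = rng.normal(size=d)                       # one training sample
    y = x @ theta_star + 0.1 * rng.normal()
    for _ in range(ell):                         # repeated FGD steps on the same sample
        v = rng.normal(size=d)                   # random direction
        g = (x @ theta - y) * x                  # gradient of 0.5 * (x' theta - y)^2
        theta -= lr * (g @ v) * v                # forward-gradient update: (g . v) v
print(np.linalg.norm(theta - theta_star))        # small after enough samples
```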
Hebbian learning is one of the most fundamental learning rules for biological neural networks and postulates that synaptic changes occur locally, depending on the activities of pre- and postsynaptic neurons. While Hebbian learning based on neuronal firing rates is well explored, much less is known about learning rules that account for precise spike timing. We relate a Hebbian spike-time dependent plasticity rule to stochastic gradient descent with respect to a natural loss function. This connection allows us to show that the learning rule successfully identifies the presynaptic neuron with the highest activity.
This is joint work with Niklas Dexheimer and Johannes Schmidt-Hieber.
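For readers unfamiliar with spike-timing dependent plasticity, here is a minimal sketch of a standard pair-based STDP update with exponential windows; it is the generic textbook rule, not necessarily the specific rule analyzed in the talk, and all constants are illustrative.

```python
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012,
                tau_plus=20.0, tau_minus=20.0):
    """Pair-based STDP: potentiate if the presynaptic spike precedes the
    postsynaptic spike, depress otherwise (spike times in ms)."""
    dt = t_post - t_pre
    if dt > 0:
        return w + a_plus * np.exp(-dt / tau_plus)
    return w - a_minus * np.exp(dt / tau_minus)

w = 0.5
w = stdp_update(w, t_pre=10.0, t_post=15.0)   # pre before post -> potentiation
w = stdp_update(w, t_pre=30.0, t_post=22.0)   # post before pre -> depression
print(w)
```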
Learning in the brain is widely considered to be due to the plasticity of synapses between neurons. While synaptic plasticity between pairs of neurons in the brain is well studied, how these local changes at synapses enable the learning of global function at the neural network level is unclear. Various neural network models with bio-inspired / bio-plausible learning rules for memory, learning to compute and predict, reinforcement learning, and so on have been proposed. In this tutorial, we will consider: a) models for synaptic plasticity and their effects at the network level from a bottom-up perspective in the first part (1 hour), and b) the design of networks that learn function via bio-inspired learning rules from a top-down perspective in the second part (1 hour).
Physics-Informed Neural Networks (PINNs) provide a flexible, mesh-free approach to solving differential equations. While their implementation is straightforward and they hold great potential for high-dimensional, inverse, and nonlinear problems, training remains challenging due to their sensitivity to weight initialization, hyperparameter selection, scalability issues, spectral bias, and ill-conditioning.
This talk explores how overlapping domain decomposition (DD) techniques can improve the convergence and efficiency of PINNs. Different strategies for integrating DD with PINNs are investigated. First, classical PINNs are enhanced using multilevel DD-based architectures to improve performance. Additionally, this strategy is combined with multifidelity stacking PINNs for time-dependent problems, demonstrating clear improvements over reference results without DD. Second, randomized neural networks are considered, where hidden layer weights are randomly initialized and fixed, reducing the problem to a linear least-squares formulation for linear differential operators. In this setting, DD-based architectures, combined with overlapping Schwarz preconditioning, accelerate convergence. Finally, improvements in physics-informed neural operators through DD-based architectures are explored, enabling the efficient approximation of solution operators for parameterized problems.
Numerical experiments on multiscale and wave phenomena demonstrate the effectiveness of DD techniques in PINNs. These results highlight the potential of DD methods to significantly enhance computational efficiency and accuracy in physics-informed machine learning.
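A minimal sketch of the randomized-network ingredient in isolation: for a linear differential operator, freezing randomly drawn hidden weights reduces training to a linear least-squares collocation problem, shown here for a 1D Poisson problem with Dirichlet boundary conditions. Domain decomposition and Schwarz preconditioning are not included, and the feature count, weight ranges, and boundary weighting are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

# Solve -u''(x) = pi^2 sin(pi x) on (0, 1) with u(0) = u(1) = 0 (exact: u = sin(pi x)).
m = 300                                    # number of random features
w = rng.uniform(-10, 10, size=m)           # frozen, randomly drawn hidden weights
b = rng.uniform(-10, 10, size=m)

def phi(x):
    return np.tanh(np.outer(x, w) + b)

def phi_xx(x):
    z = np.tanh(np.outer(x, w) + b)
    return -2.0 * w ** 2 * z * (1.0 - z ** 2)    # d^2/dx^2 tanh(w x + b)

x_int = np.linspace(0, 1, 200)[1:-1]       # interior collocation points
x_bc = np.array([0.0, 1.0])                # boundary points
f = np.pi ** 2 * np.sin(np.pi * x_int)

# Stack PDE-residual rows and (weighted) boundary rows into one least-squares system.
A = np.vstack([-phi_xx(x_int), 100.0 * phi(x_bc)])
rhs = np.concatenate([f, np.zeros(2)])
c, *_ = np.linalg.lstsq(A, rhs, rcond=None)

u_hat = phi(x_int) @ c
print(np.max(np.abs(u_hat - np.sin(np.pi * x_int))))   # should be small
```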
In recent years, statistical methodology based on optimal transport (OT) witnessed a considerable increase in practical and theoretical interest. A central reason for this trend is the ability of optimal transport to effectively compare data in a geometrically meaningful way. This development was further amplified by computational advances spurred by the introduction of entropy regularized optimal transport (EOT). In applications, the OT or EOT cost is often estimated through an empirical plug-in approach, raising questions about the statistical performance of these estimators. Remarkably, under distinct population measures with different intrinsic dimensions, we show that the convergence rate for the empirical OT cost is determined by the population measure with lower intrinsic dimension -- a novel phenomenon we term "lower complexity adaptation". For the empirical EOT cost, we establish a similar phenomenon and show that the dependency on the entropy regularization parameter in the convergence rate is determined by the minimum intrinsic dimension of the population measures. This phenomenon also has implications for the estimation of the OT map. As an outlook we will discuss statistical aspects of neural OT procedures.
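A minimal sketch of the empirical plug-in estimator of the EOT cost between two samples, computed with plain Sinkhorn iterations for the squared Euclidean cost; the regularization parameter and iteration count are illustrative, and a log-domain implementation would be preferable for small regularization.

```python
import numpy as np

def entropic_ot_cost(X, Y, eps=0.5, n_iter=500):
    """Plug-in EOT cost between empirical measures on the samples X and Y."""
    n, m = len(X), len(Y)
    a, b = np.full(n, 1 / n), np.full(m, 1 / m)
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared Euclidean cost
    K = np.exp(-C / eps)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iter):                               # Sinkhorn fixed-point updates
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]                       # entropic transport plan
    return float((P * C).sum())

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 2))
Y = rng.normal(size=(300, 2)) + np.array([1.0, 0.0])
# Population squared Wasserstein distance is 1; the estimate adds sampling and entropic bias.
print(entropic_ot_cost(X, Y))
```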
Stochastic gradient descent (SGD) optimization methods are nowadays the method of choice for the training of deep neural networks (DNNs) in artificial intelligence systems. In practically relevant training problems, the employed optimization scheme is usually not the plain vanilla SGD method but a suitably accelerated and adaptive SGD optimization method such as the famous Adam optimizer. In this work we establish optimal convergence rates for the Adam optimizer for a large class of stochastic optimization problems, covering strongly convex stochastic optimization problems. The key ingredient of our convergence analysis is a new vector field function which we propose to refer to as the Adam vector field. This Adam vector field accurately describes the macroscopic behaviour of the Adam optimization process but differs from the negative gradient of the objective function (the function we intend to minimize) of the considered stochastic optimization problem. In particular, our convergence analysis proves that Adam typically does not converge to critical points of the objective function (zeros of the gradient of the objective function) of the considered optimization problem but converges with rates to zeros of this Adam vector field. The talk is based on joint works with Steffen Dereich, Thang Do, Robin Graeber, and Adrian Riekert.
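For reference, a minimal sketch of the standard Adam update applied to a strongly convex quadratic; the Adam vector field discussed above describes the macroscopic limit of such iterates and is not computed here.

```python
import numpy as np

def adam_step(theta, grad, state, lr=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One step of the standard Adam update."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad                 # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2            # second-moment estimate
    m_hat = m / (1 - b1 ** t)                    # bias corrections
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), (m, v, t)

# Minimize the strongly convex objective f(theta) = 0.5 * ||theta - 1||^2.
theta, state = np.zeros(5), (np.zeros(5), np.zeros(5), 0)
for _ in range(5000):
    theta, state = adam_step(theta, theta - 1.0, state)
print(theta)   # close to the minimizer (1, ..., 1)
```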
A theoretically well-founded deep learning algorithm for nonparametric regression is presented. It uses over-parametrized deep neural networks with logistic activation function, which are fitted to the given data via gradient descent. Motivated by a theoretical analysis of deep learning that simultaneously takes into account optimization, generalization, and approximation, a special topology of these networks, a special random initialization of the weights, and a data-dependent choice of the learning rate and the number of gradient descent steps are proposed. A theoretical bound on the expected L_2 error of this estimate is presented, and its finite sample size performance is illustrated by applying it to simulated data.
We show that deep Heaviside networks (DHNs) have limited expressiveness but that this can be overcome by including either skip connections or neurons with linear activation. We provide lower and upper bounds for the Vapnik-Chervonenkis (VC) dimensions and approximation rates of these network classes. As an application, we derive statistical convergence rates for DHN fits in the nonparametric regression model.
This is joint work with Juntong Chen, Sophie Langer and Johannes Schmidt-Hieber.
Recently, flow matching, introduced by Lipman et al. (2022), has attracted increasing interest in generative modelling. Using a solution to an ODE leads to a generation process that is much simpler than that of diffusions, the current state-of-the-art generative models. This idea has been further developed, and several adaptations, especially regarding the choice of conditional probability paths, have been presented in the literature. Exploiting the connection to kernel density estimation, we analyze flow matching from a statistical perspective. We derive reasonable conditions for the choice of conditional probability paths and study the rate of convergence.
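A minimal sketch of flow matching with linear conditional paths x_t = (1 - t) x_0 + t x_1 and conditional target velocity x_1 - x_0, in one dimension; the velocity field is fitted by random-feature ridge regression as a stand-in for the neural network used in practice, and this particular path choice is only one of the options discussed.

```python
import numpy as np

rng = np.random.default_rng(7)

# Target: 1-d mixture of two Gaussians at +-2; base distribution: standard normal.
n = 5000
x1 = np.where(rng.uniform(size=n) < 0.5, -2.0, 2.0) + 0.3 * rng.normal(size=n)
x0 = rng.normal(size=n)
t = rng.uniform(size=n)
xt = (1 - t) * x0 + t * x1                     # linear conditional path
target_v = x1 - x0                             # conditional velocity target

# Velocity model v(x, t): random Fourier features + ridge regression.
m = 400
W = rng.normal(scale=2.0, size=(2, m))
b = rng.uniform(0, 2 * np.pi, size=m)

def feat(x, s):
    return np.cos(np.column_stack([x, s]) @ W + b)

Phi = feat(xt, t)
coef = np.linalg.solve(Phi.T @ Phi + 1e-3 * np.eye(m), Phi.T @ target_v)

# Generation: integrate dx/dt = v(x, t) from t = 0 to 1 with Euler steps.
x = rng.normal(size=2000)
n_steps = 100
for k in range(n_steps):
    s = np.full_like(x, k / n_steps)
    x = x + (feat(x, s) @ coef) / n_steps
print(np.round(np.quantile(x, [0.25, 0.75]), 2))   # should land near the two modes +-2
```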
We consider neural networks with polynomial and rational activation functions. The choice of activation function in deep learning architectures is crucial for practical tasks and largely impacts the performance of a neural network. Leveraging tools from numerical algebraic geometry, we establish precise measures for the expressive power of neural networks with polynomial activation functions, by studying the image of the parametrization map from weights to functions, which forms an irreducible algebraic variety upon taking closure. In addition, we study the presence or absence of spurious valleys in the loss surface and contrast the topological properties of the loss landscape when activation coefficients are fixed versus trainable.
Surprisingly, many optimisation phenomena observed in complex neural networks also appear in so-called 2-layer diagonal linear networks. This rudimentary architecture—a two-layer feedforward linear network with a diagonal inner weight matrix—has the advantage of revealing key training characteristics while keeping the theoretical analysis clean and insightful.
In this talk, I’ll provide an overview of various theoretical results for this architecture, while drawing connections to experimental observations from practical neural networks. Specifically, we’ll examine how hyperparameters such as the initialisation scale, step size, and batch size impact the optimisation trajectory and influence the generalisation performance of the recovered solution.
Classification and regression trees find widespread applications in statistics, but their theoretical analysis is notoriously difficult. A standard procedure for determining the tree builds a large tree and then uses pruning algorithms to reduce the tree and avoid overfitting. We propose a different approach based on early stopping, where the tree is only grown until the residual norm drops below a certain threshold. As basic growing algorithms we consider the classical global approach and a new semi-global approach. We prove an oracle-type inequality for the early stopping prediction error compared to the (random, parameter-dependent) oracle tree along the tree-growing procedure. The error due to early stopping is usually negligible when the tree is built from an independent sample, and it is at most of the same order as the standard risk bound without sample splitting. Numerical examples confirm good statistical performance and high computational gains compared to classical methods. (joint work with Ratmir Miftachov, Berlin)
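A minimal sketch of the global early-stopping idea: a regression tree is grown by always executing the split with the largest gain among the current leaves, and growth stops as soon as the squared residual norm falls below a threshold; the semi-global variant and the theoretical calibration of the threshold are not reproduced here.

```python
import numpy as np

def leaf_sse(y, idx):
    return np.sum((y[idx] - y[idx].mean()) ** 2)

def best_split(X, y, idx):
    """Best axis-aligned split (smallest children SSE) of the cell given by idx."""
    best = None
    for j in range(X.shape[1]):
        for s in np.unique(X[idx, j])[:-1]:
            left, right = idx[X[idx, j] <= s], idx[X[idx, j] > s]
            sse = leaf_sse(y, left) + leaf_sse(y, right)
            if best is None or sse < best[0]:
                best = (sse, left, right)
    return best

def grow_until(X, y, kappa):
    """Grow until the global residual norm ||Y - tree||^2 drops below kappa."""
    leaves = [np.arange(len(y))]
    while sum(leaf_sse(y, l) for l in leaves) > kappa:
        splittable = [(i, best_split(X, y, l)) for i, l in enumerate(leaves)
                      if len(np.unique(X[l], axis=0)) > 1]
        if not splittable:
            break
        # Execute the split with the largest gain (parent SSE minus children SSE).
        i, (sse, left, right) = max(splittable,
                                    key=lambda c: leaf_sse(y, leaves[c[0]]) - c[1][0])
        leaves.pop(i)
        leaves += [left, right]
    return leaves

rng = np.random.default_rng(8)
X = rng.uniform(size=(300, 2))
y = np.sin(6 * X[:, 0]) + 0.2 * rng.normal(size=300)
leaves = grow_until(X, y, kappa=300 * 0.2 ** 2)   # threshold of the order n * sigma^2
print(len(leaves))                                # number of leaves at the stopping time
```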
Transformers are, arguably, the state-of-the-art models in natural language processing and various other applications. To build a theoretical understanding of transformers, a toy transformer model, allowing for a wide range of analytical tools, was recently introduced by Geshkovski et al. (2024). In this talk I will discuss the aforementioned toy transformer model and its noisy counterpart, a McKean-Vlasov PDE with exponential kernel. The talk is mostly based on our recent work with André Schlichting (arXiv:2412.14813), in which we study stationary solutions of the McKean-Vlasov equation on Riemannian manifolds. Our analysis extends the equivalence of the energetic problem formulation to the manifold setting and allows us to characterize critical points of the corresponding free energy functional. On a sphere, we employ the properties of spherical convolution to study the bifurcation branches around the uniform state and characterize the phase transition of the model. I will give an overview of the results and show how our findings apply to noisy transformers.
Deep Gaussian processes are becoming increasingly popular, but due to their nested structure their computation is highly challenging and scales poorly with the sample size. A popular approach to speeding up GP computations is the Vecchia approximation. We derive a new Vecchia approximation for deep GPs and provide theoretical guarantees for the method. Based on ongoing joint work with Ismael Castillo (Sorbonne, Paris), Thibault Randrianarisoa (Vector Institute, Toronto) and Yichen Zhou (Bocconi, Milano).
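As background, a minimal sketch of the classical Vecchia approximation for an ordinary (non-deep) GP log-likelihood: the joint density is replaced by a product of univariate conditionals, each conditioning only on a few previously ordered points; the kernel, ordering, and neighbour count are illustrative, and the extension to deep GPs is the subject of the talk.

```python
import numpy as np

def k_rbf(a, b, ls=0.3):
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

def vecchia_loglik(x, y, m=5, noise=1e-2):
    """Vecchia approximation log p(y) ~ sum_i log p(y_i | y_{c(i)}),
    where c(i) holds the m points preceding i in the ordering."""
    order = np.argsort(x)                       # simple coordinate ordering in 1-d
    x, y = x[order], y[order]
    ll = 0.0
    for i in range(len(y)):
        c = np.arange(max(0, i - m), i)         # conditioning set
        if len(c) == 0:
            mean, var = 0.0, k_rbf(x[[i]], x[[i]])[0, 0] + noise
        else:
            Kcc = k_rbf(x[c], x[c]) + noise * np.eye(len(c))
            kic = k_rbf(x[[i]], x[c])[0]
            wts = np.linalg.solve(Kcc, kic)
            mean = wts @ y[c]
            var = k_rbf(x[[i]], x[[i]])[0, 0] + noise - wts @ kic
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mean) ** 2 / var)
    return ll

rng = np.random.default_rng(9)
x = rng.uniform(size=400)
y = np.sin(6 * x) + 0.1 * rng.normal(size=400)
print(vecchia_loglik(x, y))
```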
We aim for statistical inference for random forests. To this end, we focus on centered purely random forests, which are theoretically simpler to analyse: the partition is independent of the observations and the splits are always centered in a randomly chosen direction. Taking into account that each tree in the forest is constructed only from a subsample, we exploit U-process theory and a Gaussian approximation of the supremum of empirical processes. As a main result we construct uniform confidence bands for centered purely random forests.
The talk is based on joint work with Natalie Neumeyer and Jan Rabe.
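A minimal sketch of a single centered purely random tree: each split picks a coordinate uniformly at random and cuts the current cell at its midpoint, independently of the data; the subsampling and the averaging over many such trees that define the forest are omitted.

```python
import numpy as np

def centered_partition(cell, depth, rng):
    """Recursively split a hyperrectangle at the midpoint of a random coordinate."""
    if depth == 0:
        return [cell]
    lo, hi = cell
    j = rng.integers(len(lo))                   # randomly chosen split direction
    mid = 0.5 * (lo[j] + hi[j])                 # centered split
    hi_left, lo_right = hi.copy(), lo.copy()
    hi_left[j] = mid
    lo_right[j] = mid
    return (centered_partition((lo, hi_left), depth - 1, rng)
            + centered_partition((lo_right, hi), depth - 1, rng))

def tree_predict(x, X, Y, cells):
    """Average the responses of the training points falling into x's cell."""
    for lo, hi in cells:
        if np.all(x >= lo) and np.all(x <= hi):
            mask = np.all((X >= lo) & (X <= hi), axis=1)
            return Y[mask].mean() if mask.any() else Y.mean()

rng = np.random.default_rng(10)
d = 2
cells = centered_partition((np.zeros(d), np.ones(d)), depth=4, rng=rng)
X = rng.uniform(size=(500, d))
Y = np.sin(5 * X[:, 0]) + 0.1 * rng.normal(size=500)
print(tree_predict(np.array([0.3, 0.7]), X, Y, cells))
```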
We consider the problem of identifying predictive feature interactions that are locally relevant to a specific prediction. I present local LSSFind, a statistical framework for extracting localized interaction structures from tree ensemble models, with a focus on random forests. The method aggregates subsets of predictors that frequently co-occur along decision paths traversed by a given observation and quantifies their contribution to the prediction, enabling a form of local, interpretable inference. We demonstrate the utility of this approach using data from genome-wide association studies (GWAS), where the goal is to prioritize target genes based on commonly used genomic features.
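A minimal sketch of the raw ingredient that such path-based methods build on, assuming a scikit-learn forest: the features encountered along the decision path of a single observation are collected per tree, and the most frequent subsets are counted; the actual aggregation and scoring used by local LSSFind are not reproduced here.

```python
import numpy as np
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(11)
X = rng.normal(size=(500, 10))
y = ((X[:, 0] * X[:, 1] > 0) ^ (X[:, 2] > 0)).astype(int)   # label driven by interactions

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

x0 = X[[0]]                                                 # observation to explain
counts = Counter()
for tree in forest.estimators_:
    node_ids = tree.decision_path(x0).indices               # nodes visited by x0
    feats = tree.tree_.feature[node_ids]
    counts[frozenset(f for f in feats if f >= 0)] += 1      # drop leaf markers (-2)

# Feature subsets that co-occur most often along x0's decision paths.
for subset, c in counts.most_common(3):
    print(sorted(subset), c)
```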
The benign overfitting phenomenon for the linear regression model explores regimes in which the classical least squares estimator and variants such as ridge regression or gradient descent are consistent in prediction error, despite overfitting. In this talk, we show how the rates for benign overfitting compare to the statistically optimal rates.
Based on joint discussions with Johannes Schmidt-Hieber.
We consider the problem of nonparametric inference in multi-dimensional diffusion models from low-frequency data. Due to the computational intractability of the likelihood, implementation of likelihood-based procedures in such settings is a notoriously difficult task. Exploiting the underlying (parabolic) PDE structure of the transition densities, we derive computable formulas for the likelihood function and its gradients. We then construct a Metropolis-Hastings Crank-Nicolson-type algorithm for Bayesian inference with Gaussian priors, as well as gradient-based methods for computing the MLE and Langevin-type MCMC. The performance of the algorithms is illustrated via numerical experiments.
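A minimal sketch of a preconditioned Crank-Nicolson (pCN) Metropolis-Hastings step under a standard Gaussian prior, with a toy Gaussian log-likelihood standing in for the diffusion likelihood computed via the PDE formulas; the step-size parameter beta and the chain length are illustrative.

```python
import numpy as np

rng = np.random.default_rng(12)

def log_lik(theta):
    """Toy log-likelihood standing in for the (expensive) diffusion likelihood."""
    return -0.5 * np.sum((theta - 1.0) ** 2) / 0.5 ** 2

def pcn_step(theta, beta=0.2):
    """pCN proposal for a standard Gaussian prior; the acceptance ratio involves
    only the likelihood because the prior is absorbed by the proposal."""
    prop = np.sqrt(1 - beta ** 2) * theta + beta * rng.normal(size=theta.shape)
    if np.log(rng.uniform()) < log_lik(prop) - log_lik(theta):
        return prop
    return theta

theta = np.zeros(10)
samples = []
for i in range(20000):
    theta = pcn_step(theta)
    if i > 5000:
        samples.append(theta.copy())
# Posterior mean lies between the prior mean 0 and the data value 1 (about 0.8 here,
# up to Monte Carlo error).
print(np.round(np.mean(samples, axis=0), 2))
```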
In this talk, we present new results on the sample size required to learn surrogates of nonlinear mappings between infinite-dimensional Hilbert spaces. Such surrogate models have a wide range of applications and can be used in uncertainty quantification and parameter estimation problems in fields such as classical mechanics, fluid mechanics, electrodynamics, and the earth sciences. Our analysis shows that, for certain neural network architectures, empirical risk minimization based on noisy input-output pairs can overcome the curse of dimensionality in this setting. Additionally, we provide a numerical comparison to other approaches, including constructive methods not based on learning from random data.