Title: Learning Space Group Invariant Functions
Abstract: The plane and space groups specify how to tile two- or three-dimensional Euclidean space with a shape: they enumerate all possible ways in which a shape can be isometrically replicated across the space. I will describe how to explicitly compute approximate eigenfunctions of the Laplace-Beltrami operator on the orbifold defined by any such group. These eigenfunctions provide a complete L2 basis of all functions on two- or three-dimensional space that are (i) continuous and (ii) periodic with respect to the group. The basis allows us to represent functions that arise as quantum observables of crystalline solids or in mechanical meta-materials, to generate random functions respecting the group symmetry, and to compute a Fourier transform defined by the group. I will also explain how to construct an approximation to the orbifold in a higher-dimensional space and a map onto this embedding. Composing this map with function representations used in machine learning (say, a neural network or kernel function) results in machine learning models that respect the group symmetry. This is joint work with Peter Orbanz.
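As a toy illustration of the final composition step (and not the orbifold embedding from the talk), the sketch below assumes only invariance under integer lattice translations of the plane: a translation-periodic sin/cos feature map is composed with an arbitrary downstream model, making the composite model invariant by construction.

```python
# Toy sketch (not the orbifold embedding from the talk): make a model on R^2
# invariant under integer lattice translations by composing a periodic
# feature map with an arbitrary learned function.
import numpy as np

def periodic_embedding(x, n_freq=3):
    """Map points in R^2 to sin/cos features that are invariant under
    translations by the integer lattice Z^2 (a crude stand-in for the
    group-defined embedding described in the abstract)."""
    feats = []
    for kx in range(n_freq):
        for ky in range(n_freq):
            phase = 2 * np.pi * (kx * x[..., 0] + ky * x[..., 1])
            feats.append(np.cos(phase))
            feats.append(np.sin(phase))
    return np.stack(feats, axis=-1)

rng = np.random.default_rng(0)
W = rng.normal(size=(18, 1))           # any downstream model works; here a linear head

def model(x):
    return periodic_embedding(x) @ W   # model(x) == model(x + integer shift)

x = rng.normal(size=(5, 2))
print(np.allclose(model(x), model(x + np.array([2.0, -1.0]))))  # True
```

The talk's construction plays the analogous role for any plane or space group, with the Laplace-Beltrami eigenfunctions replacing the hand-picked sinusoids above.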
Title: Inferential Engines
Title: Asymptotic scaling of chips and clusters for large model training
Title: Symbolic Distillation of Neural Networks
Title: Harnessing Insights from Neuroscience for More Powerful and Efficient Machine Learning
Abstract: Despite the success of artificial neural networks (ANNs) based on the error backpropagation algorithm, it is uncertain whether they can achieve AGI simply by scaling up current approaches. Additionally, ANNs only resemble the human brain superficially, suggesting another path to AGI: reverse engineering biological neurons and formalizing their operation in mathematical algorithms that can be implemented in ANNs and scaled up. Motivated by the non-trivial temporal dynamics of biological neurons and by local learning rules, we are developing such an alternative NeuroAI framework. The algorithms we have developed solve important unsupervised learning tasks such as dimensionality reduction, clustering, manifold learning, and canonical correlation analysis.
Title: Can a computer judge interestingness?
Abstract: Creating new mathematics is one of the original AI challenge problems. Mathematics is made up of interesting provable statements about numbers, geometry, and other structures. For roughly a century we have had a framework, mathematical logic, which turns proof into a formal process. By contrast, interestingness remains mysterious, a matter of intuition. There is very little work in cognitive science or AI on the topic. In this talk we discuss ways to formalize interestingness and implement it on a computer.
Title: Bayesian Interpolation with Deep Linear Networks
Abstract: This talk gives new exact formulas for Bayesian posteriors in deep linear networks. The results, joint with Alexander Zlokapa (MIT Physics), reveal precisely how the input dimension, number of training datapoints, network width and network depth affect the structure of posterior predictions and (Bayesian) model selection. After providing some general motivation, I will focus on explaining results of two kinds. First, I will state precise theorems proving that infinitely deep linear networks do optimal feature learning. Specifically, these results show that when equipped with universal, data-agnostic priors the resulting Bayesian posterior is exactly the same as the posterior obtained by using finite depth networks but with data-dependent priors that maximize the Bayesian model evidence. Second, I will discuss a new rigorous scaling law for L_2 generalization error, giving power-law relations between network depth, network width, and the number of training datapoints.
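For orientation only, the sketch below works out the depth-1 special case, ordinary Bayesian linear regression with a Gaussian prior and Gaussian noise, whose posterior predictive is available in closed form; the talk's formulas give analogous exact expressions for deep linear networks.

```python
# Depth-1 warm-up only: exact Bayesian posterior for a single linear layer
# with an isotropic Gaussian prior and Gaussian observation noise. The talk's
# results give exact posteriors for deep linear networks of arbitrary depth.
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 50                        # input dimension, number of training points
sigma_prior, sigma_noise = 1.0, 0.1

X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + sigma_noise * rng.normal(size=n)

# Posterior over weights: N(mu, Sigma) via the standard conjugate formulas.
A = X.T @ X / sigma_noise**2 + np.eye(d) / sigma_prior**2
Sigma = np.linalg.inv(A)
mu = Sigma @ X.T @ y / sigma_noise**2

# Posterior predictive at a test point: mean and variance.
x_test = rng.normal(size=d)
pred_mean = x_test @ mu
pred_var = x_test @ Sigma @ x_test + sigma_noise**2
print(pred_mean, pred_var)
```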
Title: Deep learning as a last resort
Abstract: For the last 10 years, we have seen rapid adoption of deep learning techniques across many disciplines, ranging from self-driving vehicles and credit rating to biomedicine. Along with this wave, we have seen rapid adoption and rejection in the nascent field of Machine Learning and the Sciences. While more and more people are working in this area, there are also quite a number of skeptics (sometimes for very good reasons). Some of us are believers in using deep learning as a last resort, and I will showcase a few of these scientific challenges, ranging from understanding our Universe, the Milky Way, and the Solar System to our genome.
Title: Implicit Bias of Large Depth Networks: the Bottleneck Rank
Abstract: Several neural network models are known to be biased towards some notion of sparsity: minimizing rank in linear networks or minimizing the effective number of neurons in the hidden layer of a shallow neural network. I will argue that the correct notion of sparsity for large-depth DNNs is the so-called Bottleneck (BN) rank of a (piecewise linear) function $f$, which is the smallest integer $k$ such that there is a factorization $f=g\circ h$ with inner dimension $k$. First, as the depth goes to infinity, the representation cost of DNNs converges to the BN rank over a large family of functions. Second, for sufficiently large depths, the global minima of the $L_{2}$-regularized loss of DNNs are approximately BN-rank 1, in the sense that there is a hidden layer whose representation of the data is approximately one dimensional. When fitting a true function with BN-rank $k$, the global minimizers recover the true rank if $k=1$; if $k>1$, the results suggest that the true rank is recovered for intermediate depths. BN-rank minimization leads autoencoders to be naturally denoising, and classifiers to feature specific topological properties. Both of these phenomena are observed empirically in large-depth networks.
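A minimal sketch of what the Bottleneck-rank notion means concretely: a piecewise-linear function written as $f = g \circ h$ with inner dimension $k$, realized here by a toy random network with a width-$k$ hidden representation (values are illustrative, not from the talk).

```python
# Illustration of the Bottleneck-rank notion: a function f = g . h whose
# inner (bottleneck) dimension is k, realized by a random piecewise-linear
# network with one hidden representation of width k.
import numpy as np

rng = np.random.default_rng(0)
d_in, k, d_out = 8, 2, 5            # k bounds the BN rank of the composite map

relu = lambda z: np.maximum(z, 0.0)

# h: R^d_in -> R^k  (any piecewise-linear map works for the illustration)
W1, W2 = rng.normal(size=(d_in, 16)), rng.normal(size=(16, k))
h = lambda x: relu(x @ W1) @ W2

# g: R^k -> R^d_out
W3, W4 = rng.normal(size=(k, 16)), rng.normal(size=(16, d_out))
g = lambda z: relu(z @ W3) @ W4

f = lambda x: g(h(x))               # BN-rank of f is at most k = 2

X = rng.normal(size=(100, d_in))
Z = h(X)                            # the hidden representation of the data is k-dimensional
print(Z.shape)                      # (100, 2)
```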
Title: AI Safety and Self-Supervision
Title: Towards Understanding Adversarial Robustness
Title: Modern Hopfield Networks for Novel Transformer Architectures
Abstract: Modern Hopfield Networks or Dense Associative Memories are recurrent neural networks with fixed point attractor states that are described by an energy function. In contrast to conventional Hopfield Networks, which were popular in the 1980s, their modern versions have a very large memory storage capacity, which makes them appealing tools for many problems in machine learning, cognitive science, and neuroscience. In this talk I will introduce an intuition and a mathematical formulation of this class of models, and will give examples of problems in AI that can be tackled using these new ideas. In particular, I will introduce an architecture called Energy Transformer, which replaces the conventional attention mechanism with a recurrent Dense Associative Memory model. I will explain the theoretical principles behind this architectural choice and show promising empirical results on challenging computer vision and graph network tasks.
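For concreteness, here is a minimal numpy sketch of the standard modern-Hopfield energy and its softmax retrieval update on random stored patterns; it is only meant to convey the basic mechanism, not the Energy Transformer architecture discussed in the talk.

```python
# Minimal sketch of a modern Hopfield (dense associative memory) retrieval
# step: stored patterns are rows of X, and a query q is pulled toward the
# closest stored pattern via a softmax over similarities.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def energy(X, q, beta=8.0):
    # E(q) = -(1/beta) * logsumexp(beta * X q) + 0.5 * ||q||^2  (up to constants)
    s = beta * X @ q
    return -(np.log(np.sum(np.exp(s - s.max()))) + s.max()) / beta + 0.5 * q @ q

def update(X, q, beta=8.0):
    # One retrieval step: q <- X^T softmax(beta * X q); it decreases the energy.
    return X.T @ softmax(beta * X @ q)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 64))               # 50 stored patterns in R^64
q = X[7] + 0.3 * rng.normal(size=64)        # noisy query near pattern 7
for _ in range(3):
    q = update(X, q)
print(np.argmax(X @ q), energy(X, q))       # typically retrieves pattern 7
```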
Title: Calibration in Deep Learning: Theory and Practice
Title: Adaptive Kernel Approaches to Feature Learning in Deep Neural Networks
Abstract: Given their ever-increasing role in our world, a better understanding of Deep Neural Networks (DNNs) is clearly desirable. Progress in this direction has been seen lately in the realm of infinitely overparameterized DNNs. The outputs of such trained DNNs behave essentially as multivariate Gaussians governed by a certain covariance matrix called the kernel. While such infinite DNNs share many similarities with the finite ones used in practice, various important discrepancies exist. Most notably, the fixed kernels of such DNNs stand in contrast to the feature learning effects observed in finite DNNs. To accommodate such effects within the Gaussian/kernel viewpoint, various ideas have been put forward. Here I will give a short overview of those efforts and then discuss a general Gaussian framework for feature learning in fully trained/equilibrated CNNs and FCNs. Interestingly, DNNs accommodate strong feature learning via mean-field effects while having decoupled layers and decoupled neurons within a layer. Furthermore, learning is not about the compression of information but about amplifying neuron variance along label-relevant directions. Lastly, this viewpoint suggests new ways of reverse engineering features in the wild.
https://www.nature.com/articles/s41467-023-36361-y
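A small sketch of the fixed-kernel starting point referred to above, assuming a toy one-hidden-layer ReLU network: across many random initializations, the network outputs at a pair of inputs are approximately jointly Gaussian, and their empirical covariance estimates the architecture-determined kernel.

```python
# Sketch of the "kernel" picture for wide networks: the outputs of many
# randomly initialized one-hidden-layer ReLU networks are approximately
# jointly Gaussian, with a covariance fixed by the architecture. Estimating
# that covariance empirically illustrates the fixed-kernel baseline that the
# talk's feature-learning corrections modify.
import numpy as np

rng = np.random.default_rng(0)
d, width, n_nets = 5, 1000, 2000

x1 = rng.normal(size=d); x1 /= np.linalg.norm(x1)
x2 = rng.normal(size=d); x2 /= np.linalg.norm(x2)

outs = np.zeros((n_nets, 2))
for i in range(n_nets):
    W = rng.normal(size=(width, d)) / np.sqrt(d)   # first layer, 1/sqrt(fan_in) scaling
    a = rng.normal(size=width) / np.sqrt(width)    # readout, 1/sqrt(width) scaling
    outs[i, 0] = a @ np.maximum(W @ x1, 0.0)
    outs[i, 1] = a @ np.maximum(W @ x2, 0.0)

print(np.cov(outs.T))   # ~ the infinite-width kernel evaluated on {x1, x2}
```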
Title: Thermodynamic Description of Feature Learning in Deep Neural Networks
Abstract: Deep neural networks (DNNs) are powerful tools for compressing and distilling information. Their scale and complexity, often involving billions of inter-dependent parameters, render direct microscopic analysis difficult. Under such circumstances, a common strategy is to identify slow variables that average out the erratic behaviour of the fast microscopic variables. In this talk, I will present a novel mean-field theory for finite fully trained non-linear DNNs. The theory allows us to identify such a separation of scales occurring in deep convolutional neural networks (CNNs) and fully connected networks (FCNs). Specifically, we show that DNN layers couple only through the second cumulant (kernels) of their activations and pre-activations. Moreover, the latter fluctuates in a nearly Gaussian manner. For infinite-width DNNs, these kernels are inert, while for finite width they adapt to the data and yield a tractable data-aware Gaussian Process. The resulting thermodynamic theory of deep learning yields accurate predictions in various settings. In addition, it provides new ways of analysing and understanding DNNs in general. This is joint work with Gadi Naveh and Zohar Ringel; for more information see https://arxiv.org/pdf/2112.15383.pdf.
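As a toy illustration of kernels acting as slow variables (not the talk's mean-field theory itself), the sketch below estimates how much the empirical second cumulant of a random ReLU layer's activations fluctuates from sample to sample, and how those fluctuations shrink as the width grows.

```python
# Sketch of the "kernel as slow variable" picture: the empirical second
# cumulant K = A A^T / width of a random layer's activations concentrates as
# the width grows (its sample-to-sample fluctuations shrink roughly like
# 1/sqrt(width)); finite-width corrections to K are where the data-dependence
# described in the talk enters.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))                      # 20 inputs in R^10

def layer_kernel(width):
    W = rng.normal(size=(10, width)) / np.sqrt(10)
    A = np.maximum(X @ W, 0.0)                     # ReLU activations
    return A @ A.T / width                         # empirical kernel (20 x 20)

for width in (50, 500, 5000):
    Ks = np.stack([layer_kernel(width) for _ in range(50)])
    print(width, Ks.std(axis=0).mean())            # fluctuation shrinks with width
```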
Title: Energy-Conserving Hamiltonian Dynamics for Predictably Improved Optimization and Sampling
Abstract: In physics terms, standard gradient-based optimization and sampling methods employ frictional and/or thermal dynamics on a potential landscape. We find predictable advantages over this in a distinct class of algorithms based instead on energy-conserving chaotic Hamiltonian dynamics. Appropriate frictionless Hamiltonians, such as relativistic speed-limited motion or Newtonian dynamics with loss-dependent mass, proceed unobstructed until slowing naturally near the minimal loss, which dominates the phase space volume of the system. In new work, we engineer a systematic enhancement of this effect with a simple family of Hamiltonians and illustrate its practical improvements over traditional methods. For sampling, we reverse engineer Hamiltonians whose phase space measure gives the desired target distribution, given ergodicity introduced by random bounces spaced according to the dimension-dependent extent of the typical set of the distribution. The resulting algorithm, Microcanonical Hamiltonian Monte Carlo (MCHMC), consistently outperforms the state of the art (NUTS HMC) on several standard benchmark problems, in some cases by more than an order of magnitude. Based on follow-ups of https://proceedings.mlr.press/v162/de-luca22a.html with De Luca and Gatti and https://arxiv.org/abs/2212.08549 with Robnik, De Luca and Seljak.
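For intuition only, here is a sketch of plain frictionless, energy-conserving dynamics on a toy quadratic loss via a leapfrog integrator; the talk's algorithms use more structured Hamiltonians (speed-limited or loss-dependent-mass kinetic terms) and, for sampling, the MCHMC construction, none of which appear in this toy.

```python
# Generic sketch of frictionless, energy-conserving dynamics on a loss surface:
# leapfrog integration of H(x, p) = |p|^2 / 2 + L(x), to contrast with
# gradient descent (which adds friction). The loss oscillates while the total
# energy stays approximately constant.
import numpy as np

scales = np.array([1.0, 30.0])                 # anisotropic toy quadratic loss

def loss(x):
    return 0.5 * np.sum(scales * x**2)

def grad(x):
    return scales * x

x = np.array([2.0, 1.0])
p = np.zeros(2)
dt = 0.02
for step in range(501):
    p -= 0.5 * dt * grad(x)                    # leapfrog half-kick
    x += dt * p                                # drift
    p -= 0.5 * dt * grad(x)                    # half-kick
    if step % 100 == 0:
        energy = 0.5 * p @ p + loss(x)
        print(step, loss(x), energy)           # energy ~ constant, loss oscillates
```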
Title: Can numerical analysis explain the generalization benefit of SGD?
Abstract: SGD is one of the key tools behind recent progress in deep learning. A number of authors have observed empirically that small batch sizes and large learning rates enhance generalization on popular computer vision tasks, yet this phenomenon remains poorly understood. To address this problem, I will introduce Backward Error Analysis, a powerful tool for analyzing approximation error in numerical integrators, and apply it to prove that finite learning rates introduce a bias between the paths taken by gradient descent and gradient flow. Remarkably, we can extend this analysis to capture the bias introduced by mini-batch SGD with finite learning rates. Our analysis predicts that SGD introduces an implicit regularization term proportional to the trace of the covariance matrix of the per-example gradients, whose scale is set by the ratio of the learning rate to the batch size. Numerical experiments verify that explicitly regularizing the trace of the covariance matrix of the per-example gradients significantly enhances generalization on popular computer vision tasks, enabling us to close the generalization gap between small and large learning rate SGD.
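A sketch of the regularizer named above, computed explicitly for a toy linear regression problem: the trace of the covariance of the per-example gradients, with its overall scale set by the learning rate divided by the batch size (constant prefactors from the full backward-error analysis are omitted).

```python
# Sketch of the implicit regularizer highlighted in the abstract: the trace of
# the covariance matrix of the per-example gradients, scaled by
# learning_rate / batch_size, evaluated here for a toy linear model.
import numpy as np

rng = np.random.default_rng(0)
n, d = 256, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
w = rng.normal(size=d)

lr, batch_size = 0.1, 32

# Per-example gradients of the squared loss 0.5 * (x_i . w - y_i)^2.
residuals = X @ w - y                        # shape (n,)
per_example_grads = residuals[:, None] * X   # shape (n, d)

mean_grad = per_example_grads.mean(axis=0)
centered = per_example_grads - mean_grad
trace_cov = np.sum(centered**2) / n          # trace of the per-example gradient covariance

implicit_reg = (lr / batch_size) * trace_cov
print(trace_cov, implicit_reg)
```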
Title: A Solvable Model of Neural Scaling Laws
Abstract: Large language models have been empirically shown to obey neural scaling laws that predict their performance as a function of parameter count and dataset size. I will discuss the properties that allow scaling laws to arise and propose a statistical model that captures this scaling phenomenology. By solving this model in the limit of large training set size and large number of parameters, I will give insight into (i) the statistical structure of datasets and tasks that lead to scaling laws, (ii) the way nonlinear feature maps, such as those provided by neural networks, enable scaling laws when trained on these datasets, and (iii) how such scaling laws can break down.
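For reference, a minimal sketch of the empirical object such theories aim to explain: fitting a power law L(D) ~ a * D^(-alpha) + L_inf to loss-versus-dataset-size measurements (the data below are synthetic).

```python
# Sketch of the empirical object being modeled: a power-law fit
# L(D) ~ a * D^(-alpha) + L_inf to loss-versus-dataset-size measurements.
# Synthetic data; the talk explains when and why such fits arise.
import numpy as np

rng = np.random.default_rng(0)
D = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5])
true_a, true_alpha, L_inf = 5.0, 0.35, 0.1
L = true_a * D**(-true_alpha) + L_inf + 0.005 * rng.normal(size=D.size)

# Fit a and alpha by least squares in log-log space, assuming the irreducible
# loss L_inf is known and subtracted off.
logD, logE = np.log(D), np.log(L - L_inf)
slope, intercept = np.polyfit(logD, logE, 1)
print("fitted alpha:", -slope, "fitted a:", np.exp(intercept))
```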
Title: Towards Understanding the Implicit Biases of Adaptive Optimization
Abstract: This talk covers two recent works from our group in Apple Machine Learning Research that investigate adaptive optimizers for training neural networks. In the first part of the talk, we present empirical work that uncovers an optimization anomaly plaguing adaptive optimizers at extremely late stages of training, referred to as the Slingshot Mechanism. A prominent artifact of the Slingshot Mechanism is cyclic phase transitions between stable and unstable training regimes, which can be easily monitored through the cyclic behavior of the norm of the last layer's weights. These slingshot effects on the model weights are often accompanied by a jump in generalization accuracy, suggesting a surprising and useful inductive bias of adaptive gradient optimizers. In the second part of the talk, we present theoretical tools for understanding adaptive optimizers in the infinite-width limit. Concretely, we generalize the Tensor Programs framework and derive all infinite-width limits of neural networks trained with adaptive optimizers, along with their corresponding parameterizations. Specifically, we show that while the “NTK” limit is not a kernel limit in the traditional sense, it still fails to learn features. This, however, is fixed by an alternative parameterization that allows feature learning in the infinite-width limit.
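A minimal monitoring sketch for the signal mentioned above, the norm of the last layer's weights during adaptive-optimizer training, using a hand-rolled Adam on a toy two-layer network; actually observing the Slingshot requires the very late stages of training studied in the paper.

```python
# Monitoring sketch: log the norm of the last layer's weights while training a
# tiny two-layer ReLU network with a hand-rolled Adam. This only shows how to
# track the signal, not the Slingshot regimes themselves.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 20))
y = rng.normal(size=(128, 1))

W1 = rng.normal(size=(20, 64)) * 0.1
W2 = rng.normal(size=(64, 1)) * 0.1
params = [W1, W2]
m = [np.zeros_like(p) for p in params]
v = [np.zeros_like(p) for p in params]
lr, b1, b2, eps = 1e-3, 0.9, 0.999, 1e-8

for t in range(1, 2001):
    H = np.maximum(X @ W1, 0.0)                 # forward pass
    pred = H @ W2
    err = pred - y
    gW2 = H.T @ err / len(X)                    # backward pass (mean squared error)
    gH = err @ W2.T * (H > 0)
    gW1 = X.T @ gH / len(X)
    for p, g, mi, vi in zip(params, [gW1, gW2], m, v):
        mi[:] = b1 * mi + (1 - b1) * g          # Adam first and second moments
        vi[:] = b2 * vi + (1 - b2) * g**2
        p -= lr * (mi / (1 - b1**t)) / (np.sqrt(vi / (1 - b2**t)) + eps)
    if t % 500 == 0:
        print(t, np.linalg.norm(W2))            # the last-layer weight norm signal
```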
Title: Learning Uncertainties the Frequentist Way
Abstract: Uncertainty quantification is a hot topic in machine learning research, but the precise type of "uncertainty" being quantified is not always clear. In this talk, I highlight different types of uncertainties that arise in the context of particle physics, and I explain why special care must be taken when using machine learning for frequentist statistics. Focusing on the task of simulation-based calibration, I then introduce a new machine-learning-based method to quantify the "resolution" of a detector. Using public collider simulations from the CMS experiment, I demonstrate how this technique can achieve improved jet energy resolutions compared to traditional methods, with minimal additional computational overhead.
Title: From Minerva to Autoformalization: A Path Towards Math Intelligence
Title: The unreasonable effectiveness of mathematics in large scale deep learning
Abstract: Recently, the theory of infinite-width neural networks led to the first technology, muTransfer, for tuning enormous neural networks that are too expensive to train more than once. For example, this allowed us to tune the 6.7-billion-parameter version of GPT-3 using only 7% of its pretraining compute budget, and, with some asterisks, obtain performance comparable to the original GPT-3 model with twice the parameter count. In this talk, I will explain the core insight behind this theory. In fact, this is an instance of what I call the *Optimal Scaling Thesis*, which connects infinite-size limits for general notions of “size” to the optimal design of large models in practice, illustrating a way for theory to reliably guide the future of AI. I'll end with several concrete key mathematical research questions whose resolutions will have incredible impact on how practitioners scale up their NNs.