Friday, July 18: Meeting Rooms 118-120
All times are in Vancouver, Canada (local time)
Poster size for workshops:
Portrait orientation, 24" w x 36" h (61 cm w x 91.5 cm h).
*Note: this differs from the poster size for the main conference
Abstract: Computational capabilities in large AI systems emerge from the collective dynamics and interactions of basic neural elements under simple optimization rules. What are the fundamental principles driving this emergence? How do interactions, dynamics, and data structure conspire to yield these abilities? We’ll revisit a classic setting: that of learning word embeddings (such as word2vec) from natural language. We’ll show how the dynamics of such algorithms can be distilled into a simpler, solvable theory. Such algorithms give rise to a well-known linear computational structure in their learned representations that enables analogical reasoning, one of the simplest examples of an “emergent” capability. We propose a solvable microscopic theory of how word correlations in language can arise from factorization in latent space and drive the emergence of this linear computational structure.
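As a toy illustration of the linear structure mentioned above, the sketch below performs a word analogy by vector arithmetic on hand-made embeddings; the vocabulary and vectors are invented for illustration and are not trained word2vec outputs.

```python
# Toy illustration of the linear "analogy" structure in word embeddings:
# king - man + woman should land closest to queen. The 3-d vectors below
# are hand-made stand-ins, not trained word2vec embeddings.
import numpy as np

emb = {
    "man":   np.array([1.0, 0.0, 0.2]),
    "woman": np.array([1.0, 1.0, 0.2]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
}

def nearest(query, exclude):
    """Return the word whose embedding has the highest cosine similarity to query."""
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in exclude:
            continue
        sim = query @ v / (np.linalg.norm(query) * np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

analogy = emb["king"] - emb["man"] + emb["woman"]
print(nearest(analogy, exclude={"king", "man", "woman"}))  # -> "queen"
```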
Abstract: This paper analyzes the generalization error of minimum-norm interpolating solutions in linear regression using spiked covariance data models. The paper characterizes how varying spike strengths and target-spike alignments affect risk, especially in overparameterized settings. The study presents an exact expression for the generalization error, leading to a comprehensive classification of benign, tempered, and catastrophic overfitting regimes based on spike strength, the aspect ratio $c=d/n$ (particularly as $c \to \infty$), and target alignment. Notably, in well-specified aligned problems, increasing spike strength can surprisingly induce catastrophic overfitting before achieving benign overfitting. The paper also reveals that target-spike alignment is not always advantageous, identifying specific, sometimes counterintuitive, conditions under which alignment helps or hurts. The detrimental effect of alignment with the spike is empirically shown to persist in nonlinear models.
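For a concrete handle on the setup, here is a small numerical sketch, under illustrative parameter choices (not the paper's exact model or scalings), of the minimum-norm interpolator on spiked-covariance data with a target aligned to the spike.

```python
# Minimal numerical sketch (illustrative parameters, not the paper's exact setup):
# generalization error of the minimum-norm interpolator under a spiked covariance
# Sigma = I + theta * v v^T, with the target direction aligned with the spike v.
import numpy as np

rng = np.random.default_rng(0)
n, d, theta, noise = 100, 1000, 10.0, 0.1          # overparameterized: d >> n
v = np.zeros(d); v[0] = 1.0                         # spike direction
beta = v.copy()                                     # well-specified, aligned target

def sample(m):
    z = rng.standard_normal((m, d))
    X = z + (np.sqrt(1.0 + theta) - 1.0) * np.outer(z @ v, v)  # cov = I + theta v v^T
    y = X @ beta + noise * rng.standard_normal(m)
    return X, y

X, y = sample(n)
beta_hat = np.linalg.pinv(X) @ y                    # minimum-norm interpolating solution

X_test, y_test = sample(5000)
print("test risk:", np.mean((X_test @ beta_hat - y_test) ** 2))
```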
Abstract: Recent works have developed a comprehensive picture of gradient-based learning of isotropic Gaussian single-index models by establishing computational lower bounds along with optimal algorithms. In this work, we demonstrate that the picture can change significantly when the data covariance is structured and contains some degree of information about the target. Through studying a spiked covariance model, we show that for the class of Correlational Statistical Query (CSQ) learners, a simple preconditioning of online SGD already achieves an almost optimal sample complexity. Unlike the isotropic case, further smoothing the landscape does not improve this complexity. We prove similar lower bounds in the Statistical Query (SQ) class, where we demonstrate a gap between the SQ lower bound and the performance of the algorithms that are known to be optimal in the isotropic setting. Finally, we show a stark contrast in the information-theoretic limit: depending on the covariance structure, the tight lower bound undergoes a sudden phase transition from $d$ (the input dimension) to $1$. Overall, our analysis provides a clear characterization of when and how the spike simplifies learning compared to the isotropic case.
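The sketch below shows what preconditioned online SGD on a spiked-covariance single-index model can look like; the link function, step size, and preconditioner are assumptions chosen for illustration, not the exact algorithm analyzed in the work.

```python
# Rough sketch of preconditioned online SGD on a single-index model with a spiked
# covariance (illustrative link function, step size, and preconditioner; not the
# exact algorithm analyzed in the paper).
import numpy as np

rng = np.random.default_rng(1)
d, theta, steps, lr = 200, 5.0, 20000, 0.01
v = np.zeros(d); v[0] = 1.0                          # spike direction = target direction
# (I + theta v v^T)^{-1} via Sherman-Morrison, used as the preconditioner
Sigma_inv = np.eye(d) - (theta / (1.0 + theta)) * np.outer(v, v)
sqrt_boost = np.sqrt(1.0 + theta) - 1.0              # to sample x ~ N(0, I + theta v v^T)

link = np.tanh                                       # illustrative link function
w = rng.standard_normal(d) / np.sqrt(d)              # small random initialization

for _ in range(steps):
    z = rng.standard_normal(d)
    x = z + sqrt_boost * (z @ v) * v                 # one fresh spiked-covariance sample
    y = link(x @ v)                                  # noiseless single-index label
    err = link(x @ w) - y
    grad = err * (1.0 - np.tanh(x @ w) ** 2) * x     # gradient of 0.5 * err^2
    w -= lr * (Sigma_inv @ grad)                     # preconditioned online SGD step

print("cosine overlap with target:", (w @ v) / np.linalg.norm(w))
```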
Abstract: We probe the local geometry of loss landscapes arising from high-dimensional Gaussian mixture classification tasks, as seen from a typical training trajectory. We do this by proving limits for the empirical spectral distribution and the locations of outliers of the empirical Hessian at all points in parameter space. Notably, this limiting spectral theory depends on the point in parameter space only through $O(1)$-many summary statistics of the parameter; these summary statistics also follow an autonomous evolution under training by stochastic gradient descent. This allows us to investigate the limiting spectral theory along the limiting training trajectory and to find interesting phenomena such as the splitting and emergence of outliers over the course of training. Based on joint works with G. Ben Arous, J. Huang, and A. Jagannath.
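As a much simplified stand-in for the setting of the talk, the snippet below computes the empirical Hessian spectrum of logistic regression on a two-component Gaussian mixture at a random point in parameter space; the model, loss, and dimensions are illustrative assumptions.

```python
# Toy look at the empirical Hessian spectrum for logistic regression on a
# two-component Gaussian mixture (a much simpler stand-in for the talk's setting).
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 400
mu = np.zeros(d); mu[0] = 2.0                        # mixture mean direction
labels = rng.integers(0, 2, size=n) * 2 - 1          # +/- 1 class labels
X = rng.standard_normal((n, d)) + labels[:, None] * mu

w = 0.1 * rng.standard_normal(d)                     # a (random) point in parameter space
p = 1.0 / (1.0 + np.exp(-X @ w))                     # sigmoid of the margins
# Hessian of the average logistic loss: (1/n) X^T diag(p(1-p)) X
# (note p(1-p) = sigma'(x.w) = sigma'(y x.w), so the labels drop out of the Hessian)
H = (X * (p * (1.0 - p))[:, None]).T @ X / n

eig = np.linalg.eigvalsh(H)
print("five largest Hessian eigenvalues:", eig[-5:])
print("median (bulk) eigenvalue:", np.median(eig))
```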
Abstract: I will present recent advances in the asymptotic analysis of empirical risk minimization in overparameterized neural networks trained on synthetic Gaussian data, focusing on two key extensions of classical models: depth and nonlinearity. The aim is to understand what different architectures can learn and to characterize their inductive biases.
In the first part, I derive sharp asymptotic formulas for two-layer networks with quadratic activations and show that they exhibit a bias toward sparse representations, such as multi-index models. By mapping the learning dynamics to a convex matrix sensing problem with nuclear norm regularization, we obtain precise generalization thresholds and identify the fundamental limits of recovery.
In the second part, I analyze deep architectures trained on hierarchical Gaussian targets and show that depth enables effective dimensionality reduction. This leads to significantly improved sample complexity compared to kernel or shallow methods. A key requirement is that the target function be compositional and robust to noise at each level.
These results draw on tools from random matrix theory, convex optimization, and statistical physics, and help delineate the regimes in which deep learning offers provable advantages in high-dimensional settings.
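The first part of this abstract maps learning with quadratic activations to convex matrix sensing with nuclear-norm regularization. Below is a minimal proximal-gradient (singular-value thresholding) sketch of that convex problem with quadratic measurements; the planted rank-one target, step size, and regularization strength are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch of nuclear-norm-regularized matrix sensing with quadratic
# measurements y_i = x_i^T M* x_i, the convex problem that two-layer quadratic
# networks map onto. Planted rank-1 target and hyperparameters are illustrative.
import numpy as np

rng = np.random.default_rng(3)
d, n, lam, step, iters = 30, 600, 0.05, 0.01, 1000

u = rng.standard_normal(d); u /= np.linalg.norm(u)
M_star = np.outer(u, u)                              # planted rank-1 target
X = rng.standard_normal((n, d))
y = np.einsum("ij,jk,ik->i", X, M_star, X)           # quadratic measurements x^T M* x

def svt(M, tau):
    """Prox of tau * nuclear norm for a symmetric matrix: soft-threshold eigenvalues."""
    vals, vecs = np.linalg.eigh(M)
    vals = np.sign(vals) * np.maximum(np.abs(vals) - tau, 0.0)
    return (vecs * vals) @ vecs.T

M = np.zeros((d, d))
for _ in range(iters):
    resid = np.einsum("ij,jk,ik->i", X, M, X) - y
    grad = (X.T * resid) @ X / n                     # (1/n) sum_i resid_i * x_i x_i^T
    M = svt(M - step * grad, step * lam)             # proximal gradient step

print("relative recovery error:", np.linalg.norm(M - M_star) / np.linalg.norm(M_star))
```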
Abstract: What functions representable by neural nets are tractably learnable? Complexity results tell us that not all of them are, and we have been on a quest to understand which subclass of functions is learnable. In this talk I will revisit and question this view, putting the emphasis on the input distribution rather than the target function, and arguing that perhaps all functions are easy; it’s just a matter of the input distribution. This also leads to understanding much of the current success of deep learning in terms of “Positive Distribution Shift”.
Abstract: We introduce the attention-indexed model (AIM), a theoretical framework for analyzing learning in deep attention layers. Inspired by multi-index models, AIM captures how token-level outputs emerge from layered bilinear interactions over high-dimensional embeddings. Unlike prior tractable attention models, AIM allows full-width key and query matrices, aligning more closely with practical transformers. Using tools from statistical mechanics and random matrix theory, we derive closed-form predictions for Bayes-optimal generalization error and identify sharp phase transitions as a function of sample complexity, model width, and sequence length. We propose a matching approximate message passing algorithm and show that gradient descent can reach optimal performance. AIM offers a solvable playground for understanding learning in modern attention architectures.
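For orientation, here is a plain single-layer softmax attention forward pass with full-width key and query matrices; this is generic attention, shown only as a reminder of the bilinear token-token interactions that AIM analyzes, not the AIM model itself.

```python
# Generic single-layer softmax attention with full-width key/query matrices,
# as a concrete reminder of bilinear token-token interactions (not the AIM model).
import numpy as np

rng = np.random.default_rng(4)
L, d = 16, 64                                        # sequence length, embedding width
X = rng.standard_normal((L, d))                      # token embeddings
Q = rng.standard_normal((d, d)) / np.sqrt(d)         # full-width query matrix
K = rng.standard_normal((d, d)) / np.sqrt(d)         # full-width key matrix

scores = (X @ Q) @ (X @ K).T / np.sqrt(d)            # bilinear interactions between tokens
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)                    # row-wise softmax
out = A @ X                                          # token-level outputs

print(out.shape)                                     # (L, d)
```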
Abstract: Theoretical efforts to prove advantages of Transformers in comparison with classical architectures such as feedforward and recurrent neural networks have mostly focused on representational power. In this work, we take an alternative perspective and prove that even with infinite compute, feedforward and recurrent networks may suffer from larger sample complexity compared to Transformers, as the latter can adapt to a form of dynamic sparsity. Specifically, we consider a sequence-to-sequence data generating model on sequences of length $N$, where the output at each position only depends on $q \ll N$ relevant tokens, and the positions of these tokens are described in the input prompt. We prove that a single-layer Transformer can learn this model if and only if its number of attention heads is at least $q$, in which case it achieves a sample complexity almost independent of $N$, while recurrent networks require $N^{\Omega(1)}$ samples on the same problem. If we simplify this model, recurrent networks may achieve a complexity almost independent of $N$, while feedforward networks still require $N$ samples. Our proposed sparse retrieval model illustrates a natural hierarchy in sample complexity across these architectures.
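One plausible instantiation of such a sparse retrieval data-generating model is sketched below, where each output position comes with q pointer tokens naming the positions it depends on; the specific encoding and output rule are assumptions for illustration, not the paper's exact construction.

```python
# One plausible instantiation (not the paper's exact construction) of the sparse
# retrieval data model: each output position t comes with q pointer tokens that
# name the positions whose values determine the output at t.
import numpy as np

rng = np.random.default_rng(5)
N, q, vocab = 32, 2, 10

def sample_sequence():
    values = rng.integers(0, vocab, size=N)                      # content tokens
    pointers = np.array([rng.choice(N, size=q, replace=False) for _ in range(N)])
    # output at position t depends only on the q referenced tokens
    # (here: their sum mod vocab, as an arbitrary illustrative rule)
    outputs = np.array([values[pointers[t]].sum() % vocab for t in range(N)])
    # input prompt = content tokens plus the pointer positions for every output slot
    prompt = {"values": values, "pointers": pointers}
    return prompt, outputs

prompt, outputs = sample_sequence()
print(prompt["values"][:5], prompt["pointers"][0], outputs[0])
```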
Abstract: Deep neural networks learn structured features by exploiting higher-order correlations in their inputs. How neural networks achieve this computationally hard task remains an open question. Here, we study feature learning in independent component analysis (ICA), a simple unsupervised method that learns filters resembling those of deep convolutional neural networks. We first prove that FastICA, the most popular ICA algorithm, requires quartic sample complexity to recover a non-Gaussian direction from high-dimensional inputs. In contrast, we show that online SGD outperforms FastICA and even reaches the optimal quadratic sample complexity when smoothing the loss landscape. We then demonstrate the existence of a search phase for FastICA on ImageNet and show that FastICA recovers a non-Gaussian direction if we first project down to a finite number of principal components. We rationalise these experimental results in a synthetic “subspace model”, where we prove that ICA recovers the non-Gaussian direction embedded in the principal subspace at linear sample complexity. We conclude by discussing how this picture extends to deep convolutional networks trained on ImageNet.
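For reference, this is the textbook one-unit FastICA fixed-point iteration (cubic nonlinearity) on approximately white data containing a single non-Gaussian direction; the dimensions, source distribution, and warm start are illustrative choices, with the warm start reflecting the search-phase difficulty the abstract alludes to.

```python
# Textbook one-unit FastICA fixed-point iteration with the cubic nonlinearity on
# (approximately white) data with a single non-Gaussian direction. Dimensions,
# source distribution, and the warm start are illustrative choices; recovery from
# a cold start can require far more samples, as discussed in the talk.
import numpy as np

rng = np.random.default_rng(6)
n, d = 100000, 20
v = np.zeros(d); v[0] = 1.0                          # hidden non-Gaussian direction

X = rng.standard_normal((n, d))
X[:, 0] = rng.uniform(-np.sqrt(3), np.sqrt(3), size=n)   # unit-variance uniform source

w = v + 0.3 * rng.standard_normal(d)                 # warm start with partial overlap
w /= np.linalg.norm(w)
for _ in range(100):
    u = X @ w
    # fixed-point update: w <- E[x g(u)] - E[g'(u)] w, with g(u) = u^3
    w_new = X.T @ (u ** 3) / n - 3.0 * np.mean(u ** 2) * w
    w_new /= np.linalg.norm(w_new)
    if abs(w_new @ w) > 1 - 1e-6:                    # converged (up to sign)
        w = w_new
        break
    w = w_new

print("overlap with non-Gaussian direction:", abs(w @ v))
```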
Abstract: In many applications of statistical estimation via sampling, one may wish to sample from a high-dimensional target distribution that is adaptively evolving to the samples already seen. We study an example of such dynamics, given by a Langevin diffusion for posterior sampling in a Bayesian linear regression model with i.i.d. regression design, whose prior continuously adapts to the Langevin trajectory via a maximum marginal-likelihood scheme. Using techniques of dynamical mean-field theory (DMFT), we provide a precise characterization of a high-dimensional asymptotic limit for the joint evolution of the prior parameter and law of the Langevin sample. We then carry out an analysis of the equations that describe this DMFT limit, under conditions of approximate time-translation-invariance which include, in particular, settings where the posterior law satisfies a log-Sobolev inequality. In such settings, we show that this adaptive Langevin trajectory converges on a dimension-independent time horizon to an equilibrium state that is characterized by a system of replica-symmetric fixed-point equations, and the associated prior parameter converges to a critical point of a replica-symmetric limit for the model free energy. We explore the nature of the free energy landscape and its critical points in a few simple examples, where such critical points may or may not be unique.
This is joint work with Justin Ko, Bruno Loureiro, Yue M. Lu, and Yandi Shen.
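As a rough discretized caricature (not the exact scheme analyzed in the talk), the sketch below runs an unadjusted Langevin iteration on the posterior of a Bayesian linear regression coefficient while adapting the Gaussian prior variance by a stochastic-approximation step in the direction of the marginal-likelihood gradient; all parameter choices are illustrative.

```python
# Rough discretized sketch of a Langevin diffusion for posterior sampling in
# Bayesian linear regression, with a Gaussian prior N(0, tau^2 I) whose scale is
# adapted along the trajectory by a stochastic-approximation (maximum marginal-
# likelihood) step. Illustrative parameters; not the talk's exact scheme.
import numpy as np

rng = np.random.default_rng(7)
n, d, sigma2 = 300, 100, 0.25
beta_true = 0.5 * rng.standard_normal(d)
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = X @ beta_true + np.sqrt(sigma2) * rng.standard_normal(n)

dt, steps, a_tau = 1e-3, 20000, 0.5
beta = np.zeros(d)
tau2 = 1.0                                           # prior variance, to be adapted

for _ in range(steps):
    # gradient of the negative log posterior at the current prior parameter
    grad = X.T @ (X @ beta - y) / sigma2 + beta / tau2
    beta += -dt * grad + np.sqrt(2 * dt) * rng.standard_normal(d)
    # stochastic-approximation step: the marginal-likelihood gradient in tau^2 is
    # proportional to E_posterior[||beta||^2]/d - tau^2, estimated by the current sample
    tau2 += a_tau * dt * (np.mean(beta ** 2) - tau2)
    tau2 = max(tau2, 1e-4)

print("adapted prior variance tau^2:", tau2)
print("true signal variance:", np.mean(beta_true ** 2))
```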