Friday, July 28: Room 315
All times are in Honolulu, Hawaii (local time).
Poster size (for workshops): Portrait orientation, 24"w x 36"h (61 cm w x 91.5 cm h).
*Note that this differs from the poster size for the main conference.
Abstract: We study the effect of gradient-based optimization on feature learning in two-layer neural networks. We consider a setting where the number of samples is of the same order as the input dimension and show that, when the input data is isotropic, gradient descent always improves upon the initial random features model in terms of prediction risk, for a certain class of targets. Further leveraging the practical observation that data often contains additional structure, i.e., the input covariance has non-trivial alignment with the target, we prove that the class of learnable targets can be significantly extended, demonstrating a clear separation between kernel methods and two-layer neural networks in this regime.
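As an illustration only (not taken from the talk), the following is a minimal sketch of the kind of comparison described above: a two-layer ReLU network at random initialization, one full-batch gradient step on the first-layer weights, and ridge regression on the resulting features, compared against the untouched random-features model. The tanh single-index target, all constants, and the step size are assumptions made for the sketch.

    # Illustrative sketch: random features vs. features after one gradient step on W
    import numpy as np

    rng = np.random.default_rng(0)
    d, n, N, lam, eta = 200, 400, 300, 1e-2, 2.0     # dimension, samples, neurons, ridge, step size

    X = rng.standard_normal((n, d))                  # isotropic inputs
    beta = rng.standard_normal(d) / np.sqrt(d)
    y = np.tanh(X @ beta)                            # an assumed nonlinear target

    W = rng.standard_normal((N, d)) / np.sqrt(d)     # first-layer weights at initialization
    a = rng.choice([-1.0, 1.0], size=N) / np.sqrt(N) # fixed second-layer weights

    def ridge_risk(W):
        """Fit the second layer by ridge regression on the features relu(X W^T)."""
        Phi = np.maximum(X @ W.T, 0.0)
        coef = np.linalg.solve(Phi.T @ Phi + lam * np.eye(N), Phi.T @ y)
        return np.mean((Phi @ coef - y) ** 2)

    # One full-batch gradient step on W for the squared loss of a^T relu(W x)
    pre = X @ W.T
    err = np.maximum(pre, 0.0) @ a - y
    grad_W = ((err[:, None] * (pre > 0) * a[None, :]).T @ X) / n
    W_one_step = W - eta * grad_W

    print("random-features risk:", ridge_risk(W))
    print("after one GD step   :", ridge_risk(W_one_step))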
Mengqi Lou (Georgia Tech), Sharp predictions for mini-batched prox-linear iterations in rank one matrix sensing
Zhichao Wang (UC San Diego), Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective
Abstract: We consider the problem of estimating the factors of a rank-1 matrix from i.i.d. Gaussian, rank-1 measurements that are corrupted by noise. We derive a deterministic, low-dimensional recursion which predicts the error of a mini-batched prox-linear iteration for the nonconvex least-squares loss associated with this statistical model. Our guarantees are fully non-asymptotic and are accurate for any batch size m ≥ 1 over a large range of step sizes. Moreover, and in contrast with previous work, we show that the magnitude of the fluctuations of the empirical iterates around our deterministic predictions adapts to the problem error. Finally, we demonstrate how to use our deterministic predictions to perform hyperparameter tuning (e.g., step-size and mini-batch-size selection) without ever running the method.
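To make the iteration concrete, here is a hedged sketch (not the speakers' code) of one mini-batched prox-linear step for the model y_i = <a_i, u><b_i, v> + noise: the residual is linearized at the current iterate and the resulting ridge-regularized least-squares problem is solved in closed form. Dimensions, step size, and the random initialization are assumptions; convergence from a random start is not guaranteed in general.

    # Illustrative mini-batched prox-linear iteration for rank-one matrix sensing
    import numpy as np

    rng = np.random.default_rng(1)
    d, n, m, eta, noise = 50, 4000, 32, 0.5, 0.1
    u_star, v_star = rng.standard_normal(d), rng.standard_normal(d)
    A, B = rng.standard_normal((n, d)), rng.standard_normal((n, d))
    y = (A @ u_star) * (B @ v_star) + noise * rng.standard_normal(n)

    def prox_linear_step(u, v, idx):
        """Linearize the residual r_i(u, v) = <a_i, u><b_i, v> - y_i on the batch idx
        at the current iterate, then solve the ridge-regularized least-squares
        problem for the update direction (delta_u, delta_v)."""
        Ab, Bb, yb = A[idx], B[idx], y[idx]
        au, bv = Ab @ u, Bb @ v
        r = au * bv - yb
        J = np.hstack([bv[:, None] * Ab, au[:, None] * Bb])   # Jacobian of the residuals
        mb = len(idx)
        H = J.T @ J / mb + np.eye(2 * d) / eta
        delta = np.linalg.solve(H, -J.T @ r / mb)
        return u + delta[:d], v + delta[d:]

    u, v = rng.standard_normal(d), rng.standard_normal(d)      # random initialization
    for _ in range(200):
        idx = rng.choice(n, size=m, replace=False)
        u, v = prox_linear_step(u, v, idx)

    rel_err = np.linalg.norm(np.outer(u, v) - np.outer(u_star, v_star)) / np.linalg.norm(np.outer(u_star, v_star))
    print("relative error of u v^T:", rel_err)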
Abstract: We consider the learning of a single-index target function $f_*: \mathbb{R}^d\to\mathbb{R}$ under spiked covariance data: $f_*(\boldsymbol{x}) = \textstyle\sigma_*(\frac{1}{\sqrt{1+\theta}}\langle\boldsymbol{x},\boldsymbol{\mu}\rangle)$, $\boldsymbol{x}\overset{\small\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,\boldsymbol{I}_d + \theta\boldsymbol{\mu}\boldsymbol{\mu}^\top),$ where the link function $\sigma_*:\mathbb{R}\to\mathbb{R}$ is a degree-$p$ polynomial with information exponent $k$ (defined as the lowest degree in the Hermite expansion of $\sigma_*$), and the target depends on the projection of the input $\boldsymbol{x}$ onto the spike (signal) direction $\boldsymbol{\mu}\in\mathbb{R}^d$. In the proportional asymptotic limit where the number of training examples $n$ and the dimensionality $d$ jointly diverge, $n,d\to\infty$, $d/n\to\gamma\in(0,\infty)$, we ask the following question: how large should the spike magnitude $\theta$ (i.e., the strength of the low-dimensional component) be in order for $(i)$ kernel methods and $(ii)$ neural networks trained with gradient descent to learn $f_*$? We show that for kernel ridge regression, $\theta = \Omega\big(d^{1-\frac{1}{p}}\big)$ is both sufficient and necessary, whereas for a GD-trained two-layer neural network, $\theta=\Omega\big(d^{1-\frac{1}{k}}\big)$ suffices. Our result demonstrates that both kernel methods and neural networks benefit from low-dimensional structure in the data; moreover, since $k\le p$ by definition, neural networks can adapt to such structure more effectively.
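For concreteness, here is a small assumed sketch of the data model in the abstract: inputs with covariance $\boldsymbol{I}_d + \theta\boldsymbol{\mu}\boldsymbol{\mu}^\top$ and a single-index label that depends only on the projection onto the spike. The particular link (degree p = 4 with information exponent k = 1) and all constants are illustrative choices, not taken from the paper.

    # Illustrative sampler for the spiked-covariance single-index model
    import numpy as np

    rng = np.random.default_rng(2)
    d, n, theta = 500, 1000, 25.0
    mu = rng.standard_normal(d); mu /= np.linalg.norm(mu)   # unit spike direction

    # x ~ N(0, I_d + theta * mu mu^T): isotropic noise plus an inflated component along mu
    Z = rng.standard_normal((n, d))
    X = Z + (np.sqrt(1.0 + theta) - 1.0) * np.outer(Z @ mu, mu)

    def sigma_star(t):
        # assumed link: degree p = 4, information exponent k = 1 (lowest nonzero Hermite degree)
        return t + t ** 4 - 3.0

    y = sigma_star((X @ mu) / np.sqrt(1.0 + theta))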
Abstract: The first half of the talk surveys recent results in analyzing optimization methods in deep learning, specifically how different update methods affect the quality of the solution produced. This study of the "implicit bias of training" has led to some quantification of the effects of normalization, weight decay, and learning rates in diverse architectures, as well as of the efficacy of local updates in a distributed environment ("Local SGD").
The second half of the talk surveys the nascent field of applying optimization ideas to large AI models. The nature and size of these models lead to new phenomena which motivate new research directions, especially related to fine-tuning and in-context learning. Recent results highlight the power of a forward pass of today's large models. The first result shows that, in the context of fine-tuning, zeroth-order optimization (doable with forward passes only) can be competitive with the usual first-order optimization. The second result shows that in-context learning (i.e., learning during a single forward pass) can be more powerful than hitherto believed, since it allows the LLM's forward pass to fine-tune even a fairly large "baby transformer" hidden inside.
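As a pointer to the kind of method referenced in the first result, here is a minimal, assumed sketch of a two-point zeroth-order update: the gradient along a random direction is estimated from two loss evaluations, so only forward passes are needed. It is a generic illustration, not the implementation from the cited work.

    # Two-point zeroth-order (forward-pass-only) parameter update
    import numpy as np

    def zeroth_order_step(params, loss_fn, lr=1e-2, eps=1e-3, rng=np.random.default_rng(0)):
        """Perturb the parameters along a random direction z, query the loss twice,
        and move along z scaled by the estimated directional derivative."""
        z = rng.standard_normal(params.shape)
        g = (loss_fn(params + eps * z) - loss_fn(params - eps * z)) / (2 * eps)
        return params - lr * g * z

    # Toy usage on a quadratic objective standing in for a fine-tuning loss
    w = np.ones(10)
    loss = lambda p: float(np.sum((p - 0.5) ** 2))
    for _ in range(1000):
        w = zeroth_order_step(w, loss)
    print("final loss:", loss(w))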
Abstract: A central goal in neuroscience is to understand how orchestrated computations in the brain arise from the properties of single neurons and networks of such neurons. Answering this question requires theoretical advances that shine a light on the ‘black box’ of representations in neural circuits. In this talk, we will demonstrate theoretical approaches that help describe how cognitive task implementations emerge from the structure in neural populations and from biologically plausible neural networks.
We will introduce a new statistical mechanical theory that connects the geometric structures arising from neural responses (i.e., neural manifolds) to a neural representation’s efficiency in implementing a task. In particular, this theory describes how many neural manifolds can be represented (or ‘packed’) in the neural activity space while remaining linearly decodable by a downstream readout neuron. The intuition from this theory is remarkably simple: as in a sphere packing problem in physical space, we can encode many “neural manifolds” into the neural activity space if these manifolds are small and low-dimensional, and vice versa.
Next, we will describe how such an approach can, in fact, open the ‘black box’ of distributed neuronal circuits in a range of settings, such as experimental neural datasets and artificial neural networks. In particular, our method overcomes the limitations of traditional dimensionality reduction techniques, as it operates directly on the high-dimensional representations rather than relying on low-dimensionality assumptions for visualization. Furthermore, this method allows for simultaneous multi-level analysis, by measuring geometric properties in neural population data and estimating the amount of task information embedded in the same population. This geometric approach is general and can be used across different brain areas and task modalities, as demonstrated in our works and others.
Finally, we will discuss our recent efforts to fully extend this multi-level description of neural populations by (1) investigating how single neuron properties shape the representation geometry in early sensory areas and by (2) understanding how task-implementing neural manifolds emerge in downstream areas in biological and artificial networks. By extending our mathematical toolkit for analyzing representations underlying complex neuronal networks, we hope to contribute to the long-term challenge of understanding the neuronal basis of tasks and behaviors.
Bio: SueYeon Chung is an Assistant Professor in the Center for Neural Science at NYU, with a joint appointment in the Center for Computational Neuroscience at the Flatiron Institute. She is also an affiliated faculty member at the Center for Data Science and in the Cognition and Perception Program at NYU. Prior to joining NYU, she was a Postdoctoral Fellow in the Center for Theoretical Neuroscience at Columbia University, and a BCS Fellow in Computation at MIT. Before that, she received a Ph.D. in applied physics from Harvard University and a B.A. in mathematics and physics from Cornell University.
Simon Du (Washington), Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron
Wei Huang (RIKEN, AIP), Graph Neural Networks Provably Benefit from Structural Information: A Feature Learning Perspective
Yuandong Tian (Meta), Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
Abstract: We revisit the problem of learning a single neuron with ReLU activation under Gaussian input with square loss. We particularly focus on the over-parameterization setting where the student network has $n\ge 2$ neurons. We prove the global convergence of randomly initialized gradient descent with an $O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$), in which gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterization setting. These two bounds jointly give an exact characterization of the convergence rate and imply, for the first time, that over-parameterization can exponentially slow down the convergence rate. To prove the global convergence, we need to tackle the interactions among student neurons in the gradient descent dynamics, which are not present in the exact-parameterization case. We use a three-phase structure to analyze GD's dynamics. Along the way, we prove that gradient descent automatically balances the student neurons, and we use this property to deal with the non-smoothness of the objective function. To prove the convergence-rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (which cannot be done in the exact-parameterization case). We show that this potential function converges slowly, which implies the slow convergence rate of the loss function.
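As a purely illustrative companion to the setting above (not the paper's experiments), the following sketch fits a single ReLU teacher neuron with a student of the form sum_j relu(<w_j, x>), once exactly parameterized (n = 1) and once over-parameterized (n = 4); the data size, step size, and iteration count are assumptions.

    # Gradient descent on an over-parameterized student fitting one ReLU neuron
    import numpy as np

    rng = np.random.default_rng(3)
    d, n_samples, lr, T = 20, 20000, 0.05, 3000
    w_star = rng.standard_normal(d); w_star /= np.linalg.norm(w_star)
    X = rng.standard_normal((n_samples, d))
    y = np.maximum(X @ w_star, 0.0)                       # teacher: a single ReLU neuron

    def train(n_student):
        W = 0.1 * rng.standard_normal((n_student, d)) / np.sqrt(n_student)
        for _ in range(T):
            pre = X @ W.T                                 # (n_samples, n_student)
            err = np.maximum(pre, 0.0).sum(axis=1) - y
            W -= lr * ((err[:, None] * (pre > 0)).T @ X) / n_samples
        return 0.5 * np.mean((np.maximum(X @ W.T, 0.0).sum(axis=1) - y) ** 2)

    print("final loss, exact parameterization (n = 1):", train(1))
    print("final loss, over-parameterized     (n = 4):", train(4))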
Abstract: Graph neural networks (GNNs) have pioneered advancements in graph representation learning, exhibiting superior feature learning and performance over multilayer perceptrons (MLPs) when handling graph inputs. However, understanding the feature learning aspect of GNNs is still in its initial stage. This study aims to bridge this gap by investigating the role of graph convolution within the context of feature-learning theory for neural networks trained by gradient descent. We provide a distinct characterization of signal learning and noise memorization in two-layer graph convolutional networks (GCNs), contrasting them with two-layer convolutional neural networks (CNNs). Our findings reveal that graph convolution significantly enlarges the benign overfitting regime, in which signal learning surpasses noise memorization, relative to the counterpart CNNs, by a factor of approximately $\sqrt{D}^{q-2}$, where $D$ denotes a node's expected degree and $q > 2$ is the power of the ReLU activation function. These findings highlight a substantial discrepancy between GNNs and MLPs in terms of feature learning and generalization capacity after gradient descent training, a conclusion further substantiated by our empirical simulations.
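The following minimal sketch (an assumption for illustration, not the paper's exact model) shows the graph convolution the abstract refers to: node features are averaged over their neighbourhood before a two-layer network with a ReLU-power activation is applied.

    # Two-layer GCN forward pass with neighbourhood averaging and ReLU^q activation
    import numpy as np

    def gcn_two_layer(A, X, W1, w2, q=3):
        """A: (n, n) adjacency with self-loops; X: (n, d) node features; q > 2."""
        deg = A.sum(axis=1, keepdims=True)
        H = (A @ X) / deg                                  # graph convolution: neighbourhood averaging
        return (np.maximum(H @ W1, 0.0) ** q) @ w2         # two-layer network with ReLU^q activation

    rng = np.random.default_rng(4)
    n, d, m = 8, 5, 16
    A = (rng.random((n, n)) < 0.3).astype(float)
    A = np.maximum(A, A.T)                                 # undirected edges
    np.fill_diagonal(A, 1.0)                               # self-loops
    X = rng.standard_normal((n, d))
    out = gcn_two_layer(A, X, rng.standard_normal((d, m)), rng.standard_normal(m))
    print(out.shape)                                       # one output per node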
Abstract: Transformer architectures have shown impressive performance in multiple research domains and have become the backbone of many neural network models. However, there is limited understanding of how Transformers work. In particular, with a simple predictive loss, how the representation emerges from the gradient training dynamics remains a mystery. In this paper, we analyze the SGD training dynamics of a 1-layer transformer, consisting of one self-attention layer plus one decoder layer, for the task of next-token prediction in a mathematically rigorous manner. We open the black box of the dynamic process by which the self-attention layer combines input tokens, and we reveal the nature of the underlying inductive bias. More specifically, under the assumptions of (a) no positional encoding, (b) a long input sequence, and (c) a decoder layer that learns faster than the self-attention layer, we prove that self-attention acts as a discriminative scanning algorithm: starting from uniform attention, it gradually attends more to key tokens that are distinct for a specific next token to be predicted, and pays less attention to common key tokens that occur across different next tokens. Among distinct tokens, it progressively drops attention weights, following the order of low to high co-occurrence between the key and the query token in the training set. Interestingly, this procedure does not lead to winner-takes-all; instead it decelerates due to a phase transition that is controllable by the learning rates of the two layers, leaving an (almost) fixed token combination. We verify these scan-and-snap dynamics on synthetic and real-world data (WikiText).
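To fix notation, here is a compact, assumed sketch of the kind of 1-layer model analyzed above: a single self-attention layer with no positional encoding, whose attention-weighted token mix is fed to a linear decoder producing next-token logits. Dimensions and parameterization are illustrative.

    # 1-layer transformer: self-attention (no positional encoding) + linear decoder
    import numpy as np

    rng = np.random.default_rng(5)
    V, d_model, T = 50, 32, 12                    # vocabulary size, embedding dim, sequence length

    E = rng.standard_normal((V, d_model)) / np.sqrt(d_model)       # token embeddings
    Wq = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    Wk = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    decoder = rng.standard_normal((d_model, V)) / np.sqrt(d_model)

    def next_token_logits(tokens):
        """tokens: int array of length T; the last token serves as the query."""
        X = E[tokens]                             # (T, d_model), no positional encoding
        q = X[-1] @ Wq                            # query from the last token
        scores = (X @ Wk) @ q / np.sqrt(d_model)  # attention scores over all key tokens
        attn = np.exp(scores - scores.max()); attn /= attn.sum()
        mixed = attn @ X                          # attention-weighted token combination
        return mixed @ decoder                    # decoder maps the mix to next-token logits

    logits = next_token_logits(rng.integers(0, V, size=T))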
Abstract: We reveal a strong implicit bias of stochastic gradient descent (SGD) that drives initially expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters and improving generalization. To reveal this bias, we identify invariant sets, or subsets of parameter space that trap SGD dynamics once entered. We further establish sufficient conditions for stochastic attractivity to these simpler invariant sets based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of SGD noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle points or local maxima of the training loss. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Our analysis also mechanistically explains why early training with large learning rates for extended periods benefits subsequent generalization by promoting stochastic collapse. Finally, we empirically demonstrate the strong effect of stochastic collapse in benchmark architectures and datasets, revealing surprisingly large groups of redundant neurons with identical incoming and outgoing weights after training, due to attractive invariant sets associated with permutation symmetry.
Joint work with Feng Chen*, Daniel Kunin*, and Atsushi Yamamura* (* equal authors): https://arxiv.org/abs/2306.04251
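The permutation-symmetry invariant set mentioned in the abstract admits a simple numerical check, sketched below under assumed dimensions and data: if two hidden neurons start with identical incoming and outgoing weights, every (stochastic) gradient step preserves that identity, so SGD is trapped on the corresponding subnetwork.

    # Numerical check: identical neurons stay identical under mini-batch SGD
    import numpy as np

    rng = np.random.default_rng(6)
    d, h, n, lr = 10, 6, 64, 0.05
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    W = rng.standard_normal((h, d)); a = rng.standard_normal(h) / np.sqrt(h)
    W[1] = W[0]; a[1] = a[0]                      # place neurons 0 and 1 on the invariant set

    for _ in range(500):                          # mini-batch SGD on the squared loss
        idx = rng.choice(n, size=8, replace=False)
        Xb, yb = X[idx], y[idx]
        pre = Xb @ W.T
        err = np.maximum(pre, 0.0) @ a - yb
        W -= lr * ((err[:, None] * (pre > 0) * a[None, :]).T @ Xb) / len(idx)
        a -= lr * (np.maximum(pre, 0.0).T @ err) / len(idx)

    print("neurons 0 and 1 still identical:", np.allclose(W[0], W[1]) and np.allclose(a[0], a[1]))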
Abstract: Modern machine learning models are often overparametrized: they are rich enough to perfectly interpolate the training data, even when the data are pure noise. The optimization landscape of such problems is still poorly understood: while it is expected that sufficiently overparametrized systems are easy to optimize, it has proven challenging to formalize this intuition. I will introduce a simple toy model for these optimization landscapes and present characterizations of several algorithms for this model. I will then discuss differences and analogies between this model and more realistic neural networks.
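A tiny numerical illustration of the first sentence (an assumed toy example, not the speaker's model): with more parameters than samples, even a linear model interpolates pure-noise labels exactly.

    # Over-parameterized linear model interpolating pure-noise labels
    import numpy as np

    rng = np.random.default_rng(7)
    n, p = 50, 200                                # fewer samples than parameters
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)                    # labels are pure noise

    theta = np.linalg.pinv(X) @ y                 # minimum-norm interpolator
    print("training error:", np.mean((X @ theta - y) ** 2))   # ~ 0 up to round-off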