The Schedule
The schedule is also available on the NeurIPS virtual platform.
Accepted papers are available on OpenReview.
8:50 a.m. - 9:00 a.m.
Opening Remarks
9:00 a.m. - 9:45 a.m.
Invited Talk: From algorithms to neural networks and back
Speaker: Andrej Risteski
Abstract: An increasingly common design and analysis paradigm for neural networks is thinking of them as parametrizing (implicitly or explicitly) some algorithm. In images, score-based generative models can be thought of as parametrizing a learned sampler (a stochastic differential equation or a Markov Chain). In scientific applications, PDE solvers are trained as neural analogues of numerical solvers. In language, we probe to understand whether transformers can solve simple algorithmic tasks like parsing. In this talk, I’ll share several vignettes illustrating the value of an algorithmic lens in these settings: namely, understanding the performance of “natural” algorithms will allow us to understand the performance of neural methods, as well as explore and elucidate the architectural design space.
9:45 a.m. - 10:30 a.m.
Invited Talk: How do two-layer neural networks learn complex functions from data over time?
Speaker: Florent Krzakala
Abstract: How do two-layer neural networks learn complex functions from data over time? In this talk, we shall delve into the interaction between batch size, number of iterations, and task complexity, shedding light on neural network adaptation to data features. I will particularly highlight three key findings:
The significant impact of a single gradient step on the feature learning, emphasizing the relationship between batch size and the target's information exponent (or complexity).
The enhancement of the network's approximation ability over multiple gradient steps, enabling the learning of more intricate functions over time.
The improvement in generalization compared to the basic random feature/kernel regime.
Our theoretical approach combines techniques from statistical physics, concentration of measure, projection-based conditioning, and Gaussian equivalence, which we believe holds standalone significance.
Based on joint work with Yatin Dandi, Bruno Loureiro, Luca Pesce, and Ludovic Stephan (https://arxiv.org/pdf/2305.18270.pdf)
10:30 a.m. - 10:40 a.m.
Oral: Feature Learning in Infinite-Depth Neural Networks
Greg Yang · Dingli Yu · Chen Zhu · Soufiane Hayou
10:40 a.m. - 10:50 a.m.
Oral: Fit Like You Sample: Sample-Efficient Score Matching From Fast Mixing Diffusions
Yilong Qin · Andrej Risteski
10:50 a.m. - 11:00 a.m.
Oral: Deep Networks as Denoising Algorithms: Sample-Efficient Learning of Diffusion Models in High-Dimensional Graphical Models
Song Mei · Yuchen Wu
11:00 a.m. - 11:10 a.m.
Oral: Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data
Zhiwei Xu · Yutong Wang · Spencer Frei · Gal Vardi · Wei Hu
11:10 a.m. - 12:10 p.m.
Poster Session 1
12:10 p.m. - 1:15 p.m.
Lunch Break
1:15 p.m. - 2:00 p.m.
Invited Talk: Benefits of learning with symmetries: eigenvectors, graph representations and sample complexity
Speaker: Stefanie Jegelka
Abstract: In many applications, especially in the sciences, data and tasks have known invariances. Encoding such invariances directly into a machine learning model can improve learning outcomes, while it also poses challenges on efficient model design.
In the first part of the talk, we will focus on the invariances relevant to eigenvectors and eigenspaces being inputs to a neural network. Such inputs are important, for instance, for graph representation learning or orthogonally equivariant learning. We will discuss targeted architectures that can universally express functions with the relevant invariances or equivariances - sign flips and changes of basis - and their theoretical and empirical benefits.
Second, we will take a broader theoretical perspective. Empirically, it is known that encoding invariances into the machine learning model can reduce sample complexity. For the simplified setting of kernel ridge regression or random features, we will discuss new bounds that illustrate two ways in which invariances can reduce sample complexity. Our results hold for learning on manifolds and for invariances to a wide range of group actions.
This talk is based on joint work with Joshua Robinson, Derek Lim, Behrooz Tahmasebi, Lingxiao Zhao, Tess Smidt, Suvrit Sra and Haggai Maron.
2:00 p.m. - 2:15 p.m.
Break
2:15 p.m. - 3:00 p.m.
Invited Talk: Adaptivity in Domain Adaptation and Friends
Speaker: Samory Kpotufe
Abstract: Domain adaptation, transfer, multitask, meta, few-shot, or lifelong learning … these are all important recent directions in ML that touch at the core of what we might mean by ‘AI’. As these directions all concern learning in heterogeneous and ever-changing environments, they share a central question: what information a 'source' distribution may have about a 'target' distribution, or put differently, which measures of discrepancy between distributions properly model such information.
Our understanding of this central question is still rather fledgling, with both positive and negative results. On one hand, we show that traditional notions of distance and divergence between distributions (e.g., Wasserstein, TV, KL, Rényi) are in fact too conservative: a source may be 'far' from a target under such traditional notions, yet still admit much useful information about the target distribution. We then turn to the existence of 'adaptive' procedures, i.e., procedures which can optimally leverage such information in the source data without any prior distributional knowledge. Here the picture is quite nuanced: while various existing approaches turn out to be adaptive in usual settings with a single source and hypothesis class, no procedure can guarantee optimal rates adaptively in more general settings, e.g., settings with multiple source datasets (as in multitask learning), or settings with multiple hypothesis classes (as in model selection or hyper-parameter tuning).
Such negative results raise new questions, as they suggest that domain adaptation and related problems may benefit from more structure in practice than captured by current formalisms.
The talk is based on joint work with collaborators over the last few years, namely, G. Martinet, S. Hanneke, J. Suk, Y. Mahdaviyeh, N. Galbraith.
3:00 p.m. - 3:10 p.m.
Oral: Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit
Blake Bordelon · Lorenzo Noci · Mufan Li · Boris Hanin · Cengiz Pehlevan
3:10 p.m. - 3:20 p.m.
Oral: Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP
Zixiang Chen · Yihe Deng · Yuanzhi Li · Quanquan Gu
3:20 p.m. - 3:30 p.m.
Oral: In-Context Convergence of Transformers
Yu Huang · Yuan Cheng · Yingbin Liang
3:30 p.m. - 3:40 p.m.
Oral: Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study
Prin Phunyaphibarn · Junghyun Lee · Bohan Wang · Huishuai Zhang · Chulhee Yun
3:40 p.m. - 3:50 p.m.
Oral: Linear attention is (maybe) all you need (to understand transformer optimization)
Kwangjun Ahn · Xiang Cheng · Minhak Song · Chulhee Yun · Ali Jadbabaie · Suvrit Sra
3:50 p.m. - 4:00 p.m.
Closing Remarks
4:00 p.m. - 5:00 p.m.
Poster Session 2
List of Papers in Poster Session 1
A PAC-Bayesian Perspective on the Interpolating Information Criterion
Graph Neural Networks Benefit from Structural Information Provably: A Feature Learning Perspective
Linear attention is (maybe) all you need (to understand transformer optimization)
Large Catapults in Momentum Gradient Descent with Warmup: An Empirical Study
Feature Learning in Infinite-Depth Neural Networks
Variational Classification
Implicit biases in multitask and continual learning from a backward error analysis perspective
Spectrum Extraction and Clipping for Implicitly Linear Layers
The Noise Geometry of Stochastic Gradient Descent: A Quantitative and Analytical Characterization
Curvature-Dimension Tradeoff for Generalization in Hyperbolic Space
Complexity Matters: Dynamics of Feature Learning in the Presence of Spurious Correlations
Unveiling the Hessian's Connection to the Decision Boundary
Nonparametric Classification on Low Dimensional Manifolds using Overparameterized Convolutional Residual Networks
Large Learning Rates Improve Generalization: But How Large Are We Talking About?
Understanding the Role of Noisy Statistics in the Regularization Effect of Batch Normalization
Generalization Guarantees of Deep ResNets in the Mean-Field Regime
Theoretical Explanation for Generalization from Adversarial Perturbations
In-Context Convergence of Transformers
How Two-Layer Neural Networks Learn, One (Giant) Step at a Time
Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States
Unraveling the Complexities of Simplicity Bias: Mitigating and Amplifying Factors
Transformers as Support Vector Machines
Symmetric Mean-field Langevin Dynamics for Distributional Minimax Problems
A Theoretical Study of Dataset Distillation
Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models
Introducing an Improved Information-Theoretic Measure of Predictive Uncertainty
In-Context Learning on Unstructured Data: Softmax Attention as a Mixture of Experts
Attention-Only Transformers and Implementing MLPs with Attention Heads
Privacy at Interpolation: Precise Analysis for Random and NTK Features
Denoising Low-Rank Data Under Distribution Shift: Double Descent and Data Augmentation
A Theory of Non-Linear Feature Learning with One Gradient Step in Two-Layer Neural Networks
Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data
How does Gradient Descent Learn Features --- A Local Analysis for Regularized Two-Layer Neural Networks
Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP
Provably Efficient CVaR RL in Low-rank MDPs
Analysis of Task Transferability in Large Pre-trained Classifiers
On Scale-Invariant Sharpness Measures
Gibbs-Based Information Criteria and the Over-Parameterized Regime
Grokking modular arithmetic can be explained by margin maximization
List of Papers in Poster Session 2
Over-parameterised Shallow Neural Networks with Asymmetrical Node Scaling: Global Convergence Guarantees and Feature Learning
On the Computational Complexity of Inverting Generative Models
Flow-Based High-Dimensionally Distributional Robust Optimization
Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining
How Do Transformers Learn In-Context Beyond Simple Functions? A Case Study on Learning with Representations
A Theoretical Explanation of Deep RL Performance in Stochastic Environments
Deep Networks as Denoising Algorithms: Sample-Efficient Learning of Diffusion Models in High-Dimensional Graphical Models
Under-Parameterized Double Descent for Ridge Regularized Least Squares Denoising of Data on a Line
Continual Learning for Long-Tailed Recognition: Bridging the Gap in Theory and Practice
SimVAE: Narrowing the gap between Discriminative & Generative Representation Learning
Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks
Benign Oscillation of Stochastic Gradient Descent with Large Learning Rate
On Compositionality and Emergence in Physical Systems Generative Modeling
Escaping Random Teacher Initialization Enhances Signal Propagation and Representations
The Expressive Power of Transformers with Chain of Thought
Transformers as Multi-Task Feature Selectors: Generalization Analysis of In-Context Learning
Fit Like You Sample: Sample-Efficient Score Matching From Fast Mixing Diffusions
Towards the Fundamental Limits of Knowledge Transfer over Finite Domains
Outliers with Opposing Signals Have an Outsized Effect on Neural Network Optimization
MoXCo: How I learned to stop exploring and love my local minima?
First-order ANIL provably learns representations despite overparametrisation
A Data-Driven Measure of Relative Uncertainty for Misclassification Detection
Non-Vacuous Generalization Bounds for Large Language Models
Learning from setbacks: the impact of adversarial initialization on generalization performance
Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit
Estimating optimal PAC-Bayes bounds with Hamiltonian Monte Carlo
Divergence at the Interpolation Threshold: Identifying, Interpreting & Ablating the Sources of a Deep Learning Puzzle
Good regularity creates large learning rate implicit biases: edge of stability, balancing, and catapult
Toward Student-oriented Teacher Network Training for Knowledge Distillation
Adaptive Sharpness-Aware Pruning for Robust Sparse Networks
Invariant Low-Dimensional Subspaces in Gradient Descent for Learning Deep Matrix Factorizations
How Structured Data Guides Feature Learning: A Case Study of the Parity Problem
The Next Symbol Prediction Problem: PAC-learning and its relation to Language Models
Why Do We Need Weight Decay for Overparameterized Deep Networks?
The Double-Edged Sword: Perception and Uncertainty in Inverse Problems
Near-Interpolators: Fast Norm Growth and Tempered Near-Overfitting
On robust overfitting: adversarial training induced distribution matters
Are Graph Neural Networks Optimal Approximation Algorithms?
JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention