8:50 a.m. - 9:00 a.m.
Opening Remarks
9:00 a.m. - 9:45 a.m.
Invited Talk: Flat Minima and Generalization: from Matrix Sensing to Neural Networks
Speaker: Maryam Fazel
Abstract: When do overparameterized neural networks avoid overfitting and generalize to unseen data? Empirical evidence suggests that the shape of the training loss function near the solution matters---the minima where the loss is “flatter” tend to lead to better generalization. Yet quantifying flatness and analyzing it rigorously, even in simple models, has remained elusive.
In this talk, we examine overparameterized nonconvex models such as low-rank matrix recovery, matrix completion, robust PCA, and a 2-layer neural network as test cases. We show that under standard statistical assumptions, "flat" minima (minima with the smallest local average curvature, measured by the trace of the Hessian matrix) provably generalize in all these cases. These algorithm-agnostic results suggest a theoretical basis for favoring methods that bias iterates towards flat solutions, and help inform the design of better training algorithms.
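The flatness measure in this talk is the trace of the Hessian of the training loss. As a rough, generic illustration of how that quantity can be estimated without forming the Hessian (this is not code from the speaker's work), the PyTorch sketch below uses Hutchinson's estimator with Hessian-vector products; loss_fn and params are hypothetical placeholders.

import torch

def hessian_trace_estimate(loss_fn, params, n_samples=100):
    # Hutchinson's estimator: E[v^T H v] = tr(H) for v with i.i.d. Rademacher (+/-1) entries.
    loss = loss_fn(params)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    estimate = 0.0
    for _ in range(n_samples):
        vs = [((torch.rand_like(p) < 0.5).to(p.dtype) * 2 - 1) for p in params]
        # Hessian-vector product: differentiate <grad, v> with respect to the parameters.
        hv = torch.autograd.grad(
            sum((g * v).sum() for g, v in zip(grads, vs)),
            params, retain_graph=True)
        estimate += sum((h * v).sum() for h, v in zip(hv, vs)).item()
    return estimate / n_samples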
9:45 a.m. - 10:30 a.m.
Invited Talk: A Theoretical Perspective on Hardness of Sampling and Learning from Samples in High Dimensions
Speaker: Lenka Zdeborová
Abstract: Recent advancements in generative modelling, including flow-based, diffusion-based, and autoregressive networks, have achieved remarkable success in data generation. However, understanding their performance and limitations, particularly in high-dimensional settings, remains an open challenge. This talk explores the intersection of generative models and statistical physics, leveraging insights from spin-glass theory and denoising frameworks.
We first examine the efficiency of generative models compared to classical methods like Monte Carlo and Langevin dynamics in sampling from complex distributions, focusing on phase transitions that impact sampling performance. Next, we analyze denoising autoencoders in high dimensions, providing closed-form results that reveal their advantage over simpler architectures. Finally, we analyze the training of flow-based generative models on limited samples, presenting sharp theoretical characterizations of their learning dynamics.
Talk based on:
Sampling with flows, diffusion and autoregressive neural networks: A spin-glass perspective, arXiv:2308.14085, PNAS’24, [Ghio, Dandi, Krzakala, LZ]
High-dimensional Asymptotics of Denoising Autoencoders, arXiv:2305.11041, NeurIPS’23 spotlight [Cui, LZ]
Analysis of learning a flow-based generative model from limited sample complexity, arXiv:2310.03575, ICLR’24 [Cui, Vanden-Eijnden, Krzakala, LZ]
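As a point of reference for the classical samplers mentioned in the abstract above, unadjusted Langevin dynamics updates a sample using only the score of the target density plus Gaussian noise. The NumPy sketch below is a generic illustration of that baseline, not code from the cited papers.

import numpy as np

def langevin_sample(score, x0, step=1e-2, n_steps=5000, rng=None):
    # Unadjusted Langevin dynamics: x <- x + step * score(x) + sqrt(2 * step) * noise.
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + step * score(x) + np.sqrt(2.0 * step) * rng.standard_normal(x.shape)
    return x

# Toy usage: standard Gaussian target, whose score is score(x) = -x.
sample = langevin_sample(lambda x: -x, x0=np.ones(2))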
10:30 a.m. - 10:45 a.m.
Oral: Classifier-Free Guidance is a Predictor-Corrector
Arwen Bradley, Preetum Nakkiran
Abstract: We investigate the theoretical foundations of classifier-free guidance (CFG). CFG is the dominant method of conditional sampling for text-to-image diffusion models, yet unlike other aspects of diffusion, it remains on shaky theoretical footing. In this paper, we disprove common misconceptions by showing that CFG interacts differently with DDPM and DDIM, and neither sampler with CFG generates the gamma-powered distribution $p(x|c)^\gamma p(x)^{1-\gamma}$. Then, we clarify the behavior of CFG by showing that it is a kind of predictor-corrector method (Song et al., 2020) that alternates between denoising and sharpening, which we call predictor-corrector guidance (PCG). We prove that in the SDE limit, CFG is actually equivalent to combining a DDIM predictor for the conditional distribution together with a Langevin dynamics corrector for a gamma-powered distribution (with a carefully chosen gamma). Our work thus provides a lens to theoretically understand CFG by embedding it in a broader design space of principled sampling methods.
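For context, classifier-free guidance combines the conditional and unconditional noise predictions at each denoising step; the combined score is the one that would correspond to the gamma-powered distribution above, even though, as the abstract explains, neither DDPM nor DDIM with CFG actually samples that distribution. The sketch below is the generic CFG combination rule, not the authors' implementation; eps_model is a hypothetical noise-prediction network.

import torch

def cfg_noise_prediction(eps_model, x_t, t, cond, gamma):
    # Classifier-free guidance: eps_cfg = (1 - gamma) * eps_uncond + gamma * eps_cond,
    # equivalently eps_cond + (gamma - 1) * (eps_cond - eps_uncond).
    eps_uncond = eps_model(x_t, t, cond=None)  # unconditional branch (null conditioning)
    eps_cond = eps_model(x_t, t, cond=cond)    # conditional branch
    return (1 - gamma) * eps_uncond + gamma * eps_cond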
10:45 a.m. - 11:00 a.m.
Oral: Towards Characterizing the Value of Edge Embeddings in Graph Neural Networks
Dhruv Rohatgi, Tanya Marwah, Zachary Chase Lipton, Jianfeng Lu, Ankur Moitra, Andrej Risteski
Abstract: Graph neural networks (GNNs) are the dominant approach to solving machine learning problems defined over graphs. Despite much theoretical and empirical work in recent years, our understanding of finer-grained aspects of architectural design for GNNs remains impoverished. In this paper, we consider the benefits of architectures that maintain and update edge embeddings. On the theoretical front, under a suitable computational abstraction for a layer in the model, as well as memory constraints on the embeddings, we show that there are natural tasks on graphical models for which architectures leveraging edge embeddings can be much shallower. Our techniques are inspired by results on time-space tradeoffs in theoretical computer science. Empirically, we show architectures that maintain edge embeddings almost always improve on their node-based counterparts---frequently significantly so in topologies that have "hub" nodes.
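As a generic illustration of the architectural choice studied here (not the paper's construction), a message-passing layer that maintains edge embeddings updates each edge state from its endpoints and each node state from its incident edges. The PyTorch sketch below uses hypothetical module names and simple sum aggregation.

import torch
import torch.nn as nn

class EdgeEmbeddingLayer(nn.Module):
    # One layer that updates edge embeddings e alongside node embeddings h.
    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h, e, edge_index):
        src, dst = edge_index  # each of shape (num_edges,)
        # Update each edge from its current state and its two endpoint nodes.
        e = self.edge_mlp(torch.cat([e, h[src], h[dst]], dim=-1))
        # Aggregate the updated incident-edge states into the destination nodes.
        agg = torch.zeros_like(h).index_add_(0, dst, e)
        h = self.node_mlp(torch.cat([h, agg], dim=-1))
        return h, e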
11:00 a.m. - 11:15 a.m.
Oral: Can Neural Networks Achieve Optimal Computational-statistical Tradeoff? An Analysis on Single-Index Model
Siyu Chen, Beining Wu, Miao Lu, Zhuoran Yang, Tianhao Wang
Abstract: In this work, we tackle the following question: Can neural networks trained with gradient-based methods achieve the optimal statistical-computational tradeoff in learning Gaussian single-index models?
Prior research has shown that any polynomial-time algorithm under the statistical query (SQ) framework requires $\Omega(d^{s^\star/2}\lor d)$ samples, where $s^\star$ is the generative exponent representing the intrinsic difficulty of learning the underlying model. However, it remains unknown whether neural networks can achieve this sample complexity. Inspired by prior techniques such as label transformation and landscape smoothing for learning single-index models, we propose a unified gradient-based algorithm for training a two-layer neural network in polynomial time. Our method is adaptable to a variety of loss and activation functions, covering a broad class of existing approaches. We show that our algorithm learns a feature representation that strongly aligns with the unknown signal $\theta^\star$, with sample complexity $\tilde O (d^{s^\star/2} \lor d)$, matching the SQ lower bound up to a polylogarithmic factor for all generative exponents $s^\star\geq 1$. Furthermore, we extend our approach to the setting where $\theta^\star$ is $k$-sparse for $k = o(\sqrt{d})$ by introducing a novel weight perturbation technique that leverages the sparsity structure. We derive a corresponding SQ lower bound of order $\tilde\Omega(k^{s^\star})$, matched by our method up to a polylogarithmic factor. Our framework, especially the weight perturbation technique, is of independent interest, and suggests potential gradient-based solutions to other problems such as sparse tensor PCA.
11:15 a.m. - 12:15 p.m.
Poster Session 1
12:15 p.m. - 1:30 p.m.
Lunch Break
1:30 p.m. - 2:15 p.m.
Invited Talk: Scaling Deep Learning Optimization: Insights into Efficiency, Preconditioning, and Critical Batch Sizes
Speaker: Sham Kakade
Abstract: Optimizing large-scale language models efficiently is critical as model sizes grow. This talk synthesizes insights from recent work on optimizer design, preconditioning, and critical batch size scaling. We compare widely used optimizers, revealing that practical considerations often outweigh performance differences, and we highlight specific directions for improvement. Additionally, we establish new theoretical connections for Shampoo’s preconditioner and introduce SOAP, a hybrid method combining Shampoo's efficiency with Adam's simplicity, reducing wall-clock time significantly. Finally, we investigate how critical batch size scales with data, providing actionable insights for parallelism in large-scale training.
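For reference, the Shampoo preconditioner mentioned above maintains Kronecker-factored second-moment statistics of a matrix-shaped gradient and preconditions with their inverse fourth roots. The NumPy sketch below is a heavily simplified, generic version (no exponential averaging, grafting, or blocking), not the authors' implementation.

import numpy as np

def sym_matrix_power(a, p, eps=1e-6):
    # Power of a symmetric PSD matrix via eigendecomposition, with damping for stability.
    w, v = np.linalg.eigh(a + eps * np.eye(a.shape[0]))
    return (v * np.maximum(w, eps) ** p) @ v.T

def shampoo_step(W, G, L, R, lr=1e-2):
    # One simplified Shampoo update for a matrix parameter W with gradient G:
    # accumulate L += G G^T and R += G^T G, then step along L^{-1/4} G R^{-1/4}.
    L += G @ G.T
    R += G.T @ G
    W -= lr * sym_matrix_power(L, -0.25) @ G @ sym_matrix_power(R, -0.25)
    return W, L, R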
2:15 p.m. - 3:00 p.m.
Invited Talk: Open problems in LLM Theory, DL theory, and the role of theory.
Speaker: Matus Telgarsky
Abstract: This talk will first address basic cultural difficulties faced by the theory community (and especially by junior theorists), and then discuss 4 areas of progress: small models, conditional theory, frontier algorithms, and classical analysis.
3:00 p.m. - 3:15 p.m.
Oral: Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues
Riccardo Grazzi, Julien Siems, Jörg K.H. Franke, Arber Zela, Frank Hutter, Massimiliano Pontil
Abstract: Linear Recurrent Neural Networks (LRNNs) such as Mamba, RWKV, GLA, mLSTM, and DeltaNet have emerged as efficient alternatives to Transformers in large language modeling, offering linear scaling with sequence length and improved training efficiency. However, LRNNs struggle to perform state-tracking, which may impair performance in tasks such as code evaluation or tracking a chess game. Even parity, the simplest state-tracking task, which non-linear RNNs like LSTM handle effectively, cannot be solved by current LRNNs. Recently, Sarrof et al. (2024) demonstrated that the failure of LRNNs like Mamba to solve parity stems from restricting the value range of their diagonal state-transition matrices to $[0, 1]$ and that incorporating negative values can resolve this issue. We extend this result to non-diagonal LRNNs, which have recently shown promise in models such as DeltaNet. We prove that finite precision LRNNs with state-transition matrices having only positive eigenvalues cannot solve parity, while complex eigenvalues are needed to count modulo $3$. Notably, we also prove that LRNNs can learn any regular language when their state-transition matrices are products of identity minus vector outer product matrices, each with eigenvalues in the range $[-1, 1]$. Our empirical results confirm that extending the eigenvalue range of models like Mamba and DeltaNet to include negative values not only enables them to solve parity but consistently improves their performance on state-tracking tasks. Furthermore, pre-training LRNNs with an extended eigenvalue range for language modeling achieves comparable performance and stability while showing promise on code and math data. Our work enhances the expressivity of modern LRNNs, broadening their applicability without changing the cost of training or inference.
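To make the eigenvalue-range point concrete: an input-dependent diagonal transition of -1 flips the hidden state on every 1-bit, which computes parity, whereas transitions restricted to [0, 1] can only shrink or preserve the state. The snippet below is a minimal illustration of this observation, not one of the models studied in the paper.

def linear_rnn_parity(bits):
    # Diagonal linear RNN with input-dependent transition a(x) = 1 - 2x in {-1, +1}:
    # h_t = a(x_t) * h_{t-1}, h_0 = 1, so h_T = (-1)^(number of ones), i.e. parity.
    h = 1.0
    for x in bits:
        h = (1.0 - 2.0 * x) * h  # eigenvalue -1 when x = 1, +1 when x = 0
    return int(h < 0)            # 1 for an odd number of ones, 0 otherwise

assert linear_rnn_parity([1, 0, 1, 1]) == 1  # three ones -> odd parity
assert linear_rnn_parity([1, 1, 0, 0]) == 0  # two ones -> even parity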
3:15 p.m. - 3:30 p.m.
Oral: Understanding Factual Recall in Transformers via Associative Memories
Eshaan Nichani, Jason D. Lee, Alberto Bietti
Abstract: Large language models have demonstrated an impressive ability to perform factual recall. Prior work has found that transformers trained on factual recall tasks can store information at a rate proportional to their parameter count. In our work, we show that shallow transformers can use a combination of associative memories to obtain such near optimal storage capacity. We begin by proving that the storage capacities of both linear and MLP associative memories scale linearly with parameter count. We next introduce a synthetic factual recall task, and prove that a transformer with a single layer of self-attention followed by an MLP can obtain 100% accuracy on the task whenever either the total number of self-attention parameters or MLP parameters scales (up to log factors) linearly with the number of facts. In particular, the transformer can trade off between using the value matrices or the MLP as an associative memory to store the dataset of facts. We complement these expressivity results with an analysis of the gradient flow trajectory of a simplified linear attention model trained on our factual recall task, where we show that the model exhibits sequential learning behavior.
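One building block referenced here, a linear associative memory, stores key-value pairs as a sum of outer products of nearly orthogonal random embeddings, and recall is a single matrix-vector product. The NumPy sketch below is a generic illustration of that idea, not the paper's construction.

import numpy as np

rng = np.random.default_rng(0)
d, n_facts = 512, 50

# Random high-dimensional keys and values are nearly orthogonal to one another.
keys = rng.standard_normal((n_facts, d)) / np.sqrt(d)
values = rng.standard_normal((n_facts, d)) / np.sqrt(d)

# Store every fact in a single matrix: W = sum_i values_i keys_i^T.
W = values.T @ keys

# Recall: W @ keys[i] ~= values[i], since cross-terms <keys[i], keys[j]> are O(1/sqrt(d)).
recalled = W @ keys[7]
print(int(np.argmax(values @ recalled)))  # expected: 7 (with high probability)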
3:30 p.m. - 3:45 p.m.
Oral: Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael Jordan, Song Mei
Abstract: We investigate the mechanisms behind three puzzling phenomena observed in transformer-based large language models (LLMs): attention sinks, value-state drains, and residual-state peaks, collectively referred to as the extreme-token phenomena. First, we demonstrate that these phenomena also arise in simpler architectures—transformers with one to three layers—trained on a toy model, the Bigram-Backcopy (BB) task. In this setting, we identify an active-dormant mechanism that causes attention heads to become attention sinks for certain domain-specific inputs while remaining non-sinks for others. We further develop a precise theoretical characterization of the training dynamics that lead to these phenomena, revealing that they are driven by a mutual reinforcement mechanism. Through small interventions, we demonstrate ways to avoid extreme-token phenomena during pre-training. Next, we extend our analysis to pre-trained LLMs, including Llama and OLMo, revealing that many attention heads are governed by a similar active-dormant mechanism as in the BB task. We further show that the same mutual reinforcement mechanism drives the emergence of extreme-token phenomena during LLM pre-training. Our results shed light on the mechanisms behind extreme-token phenomena in both synthetic and real settings and offer potential mitigation strategies.
3:45 p.m. - 4:00 p.m.
Oral: Mixture of Parrots: Mixtures of Experts Improve Memorization More Than Reasoning
Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham M. Kakade, Eran Malach
Abstract: The Mixture-of-Experts (MoE) architecture enables a significant increase in the total number of model parameters with minimal computational overhead. However, it is not clear what performance tradeoffs, if any, exist between MoEs and standard dense transformers. In this paper, we show that as we increase the number of experts (while fixing the number of active parameters), the memorization performance consistently increases while the reasoning capabilities saturate. We begin by analyzing the theoretical limitations of MoEs at reasoning. We prove that there exist graph problems that cannot be solved by any number of experts of a certain width; however, the same task can be solved by a dense model with a slightly larger width. On the other hand, we find that on memory-intensive tasks, MoEs can effectively leverage a small number of active parameters with a large number of experts to memorize the data. To empirically validate our findings, we pre-train a series of MoEs and dense transformers and evaluate them on commonly used benchmarks in math and natural language. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
4:00 p.m. - 5:00 p.m.
Poster Session 2
Classifier-Free Guidance is a Predictor-Corrector. Arwen Bradley, Preetum Nakkiran.
Diffusion Models With Learned Adaptive Noise Processes. Subham Sekhar Sahoo, Aaron Gokaslan, Christopher De Sa, Volodymyr Kuleshov.
Diffusion Model Learns Low-Dimensional Distributions via Subspace Clustering. Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, Qing Qu.
Simple and Effective Masked Diffusion Language Models. Subham Sekhar Sahoo, Marianne Arriola, Aaron Gokaslan, Yair Schiff, Edgar Mariano Marroquin, Justin T Chiu, Alexander M Rush, Volodymyr Kuleshov.
How Discrete and Continuous Diffusion Meet: Comprehensive Analysis of Discrete Diffusion Models via a Stochastic Integral Framework. Yinuo Ren, Haoxuan Chen, Grant M. Rotskoff, Lexing Ying.
Understanding Diffusion-based Representation Learning via Low-Dimensional Modeling. Xiao Li, Zekai Zhang, Xiang Li, Siyi Chen, Zhihui Zhu, Peng Wang, Qing Qu.
Comparing Implicit and Denoising Score-Matching Objectives. Artem Artemev, Ayan Das, Farhang Nabiei, Alberto Bernacchia.
Complexity of Vector-valued Prediction: From Linear Models to Stochastic Convex Optimization. Matan Schliserman, Tomer Koren.
Information-Theoretic Foundations for Neural Scaling Laws. Hong Jun Jeon, Benjamin Van Roy.
Harnessing the Power of Vicinity-Informed Analysis for Classification under Covariate Shift. Mitsuhiro Fujikawa, Youhei Akimoto, Jun Sakuma, Kazuto Fukuchi.
Optimal Protocols for Continual Learning via Statistical Physics and Control Theory. Francesco Mori, Stefano Sarao Mannelli, Francesca Mignacco.
The GAN is dead; long live the GAN! A Modern GAN Baseline. Nick Huang, Aaron Gokaslan, Volodymyr Kuleshov, James Tompkin.
Label Noise: Ignorance Is Bliss. Yilun Zhu, Jianxin Zhang, Aditya Gangrade, Clayton Scott.
Improving the Gaussian Approximation in Neural Networks: Para-Gaussians and Edgeworth Expansions. Mihai Nica, Janosch Ortmann.
A Theoretical Framework for Federated Domain Generalization with Gradient Alignment. Mahdiyar Molahasani, Milad Soltany, Farhad Pourpanah, Michael Greenspan, Ali Etemad.
Sample compression unleashed: New generalization bounds for real-valued losses. Mathieu Bazinet, Valentina Zantedeschi, Pascal Germain.
Algorithmic Stability of Minimum-Norm Interpolating Deep Neural Networks. Ouns El Harzli, Yoonsoo Nam, Ilja Kuzborskij, Bernardo Cuenca Grau, Ard A. Louis.
On Your Mark, Get Set, Warmup!. Dayal Singh Kalra, Maissam Barkeshli.
Benign Overfitting in Out-of-Distribution Generalization of Linear Models. Shange Tang, Jiayun Wu, Jianqing Fan, Chi Jin.
Benign Overfitting in Single-Head Attention. Roey Magen, Shuning Shang, Zhiwei Xu, Spencer Frei, Wei Hu, Gal Vardi.
A Theory of Initialisation's Impact on Specialisation. Devon Jarvis, Sebastian Lee, Clémentine Carla Juliette Dominé, Andrew M Saxe, Stefano Sarao Mannelli.
Leveraging Intermediate Neural Collapse with Simplex ETFs for Efficient Deep Neural Networks. Emily Liu.
Can Bayesian Neural Networks Make Confident Predictions?. Katharine Fisher.
Bayesian Treatment of the Spectrum of the Empirical Kernel in (Sub)Linear-Width Neural Networks. Ouns El Harzli, Bernardo Cuenca Grau.
Does Machine Bring in Extra Bias in Learning? Approximating Discrimination Within Models Quickly. Yijun Bian, Yujie Luo, Ping Xu.
Increasing Fairness via Combination with Learning Guarantees. Yijun Bian, Kun Zhang.
Optimizing Fine-Tuning Efficiency: Gradient Subspace Tracking on Grassmann Manifolds for Large Language Models. Sahar Rajabi, Sirisha Rambhatla.
Accumulating Data Avoids Model Collapse. Joshua Kazdan, Apratim Dey, Rylan Schaeffer, Matthias Gerstgrasser, Rafael Rafailov, David L. Donoho, Sanmi Koyejo.
Adversarial Attacks as Near-Zero Eigenvalues in the Empirical Kernel of Neural Networks. Ouns El Harzli, Bernardo Cuenca Grau.
Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data. Binghui Li, Yuanzhi Li.
Continuous-Time Analysis of Adaptive Optimization and Normalization. Rhys Gould, Hidenori Tanaka.
On the Implicit Relation between Low-Rank Adaptation and Differential Privacy. Saber Malekmohammadi, Golnoosh Farnadi.
Convergence Properties of Hyperbolic Neural Networks on Riemannian Manifolds. Nico Alvarado, Sebastian Burgos.
Geometric Deep Learning with Quasiconformal Neural Networks: An Introduction. Nico Alvarado, Hans Lobel.
HERTA: A High-Efficiency and Rigorous Training Algorithm for Unfolded Graph Neural Networks. Yongyi Yang, Jiaming Yang, Wei Hu, Michal Derezinski.
Towards Principled Graph Transformers. Luis Müller, Daniel Kusuma, Blai Bonet, Christopher Morris.
Towards characterizing the value of edge embeddings in Graph Neural Networks. Dhruv Rohatgi, Tanya Marwah, Zachary Chase Lipton, Jianfeng Lu, Ankur Moitra, Andrej Risteski.
Commute Your Domains: Trajectory Optimality Criterion for Multi-Domain Learning. Alexey Rukhovich, Alexander Podolskiy, Irina Piontkovskaya.
Exploring Task Affinities through NTK Alignment and Early Training Dynamics in Multi-Task Learning. Yoann Morello, Emilie Gregoire, Sam Verboven.
Information-Theoretic Generalization Bounds for Batch Reinforcement Learning. Xingtu Liu.
Misspecified Q-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error. Ally Yalei Du, Lin Yang, Ruosong Wang.
A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers. William Merrill, Ashish Sabharwal.
Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization. Noam Razin, Sadhika Malladi, Adithya Bhaskar, Danqi Chen, Sanjeev Arora, Boris Hanin.
A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules. Kairong Luo, Haodong Wen, Shengding Hu, Zhenbo Sun, Zhiyuan Liu, Maosong Sun, Kaifeng Lyu, Wenguang Chen.
Depth Extrapolation of Decoders Trained on Nested Structures. Emile R Richard.
Transformers Provably Solve Parity Efficiently with Chain of Thought. Juno Kim, Taiji Suzuki.
From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency. Kaiyue Wen, Huaqing Zhang, Hongzhou Lin, Jingzhao Zhang.
Composing Global Optimizers to Reasoning Tasks via Algebraic Objects in Neural Nets. Yuandong Tian.
Transformers are Efficient Compilers, Provably. Xiyu Zhai, Runlong Zhou, Liao Zhang, Simon Shaolei Du.
Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues. Riccardo Grazzi, Julien Siems, Jörg K.H. Franke, Arber Zela, Frank Hutter, Massimiliano Pontil.
Provable unlearning in topic modeling and downstream tasks. Stanley Wei, Sadhika Malladi, Sanjeev Arora, Amartya Sanyal.
Dynamics of Concept Learning and Compositional Generalization. Yongyi Yang, Core Francisco Park, Ekdeep Singh Lubana, Maya Okawa, Wei Hu, Hidenori Tanaka.
Provable weak-to-strong generalization via benign overfitting. David Xing Wu, Anant Sahai.
Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models. Yuda Song, Hanlin Zhang, Udaya Ghai, Carson Eisenach, Sham M. Kakade, Dean Foster.
Self-Improvement in Language Models: The Sharpening Mechanism. Audrey Huang, Adam Block, Dylan J Foster, Dhruv Rohatgi, Cyril Zhang, Max Simchowitz, Jordan T. Ash, Akshay Krishnamurthy.
Progressive distillation induces an implicit curriculum. Abhishek Panigrahi, Bingbin Liu, Sadhika Malladi, Andrej Risteski, Surbhi Goel.
SGD and Weight Decay Secretly Minimize the Rank of Your Neural Network. Tomer Galanti, Zachary S Siegel, Aparna Gupte, Tomaso A Poggio.
From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks. Clémentine Carla Juliette Dominé, Nicolas Anguita, Alexandra Maria Proca, Lukas Braun, Daniel Kunin, Pedro A. M. Mediano, Andrew M Saxe.
Bias in Motion: Theoretical Insights into the Dynamics of Bias in SGD Training. Anchit Jain, Rozhin Nobahari, Aristide Baratin, Stefano Sarao Mannelli.
Parameter Symmetry and Emergence of Noise Equilibrium in Stochastic Training. Liu Ziyin, Mingze Wang, Hongchao Li, Lei Wu.
Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection. Aaron Alvarado Kristanto Julistiono, Davoud Ataee Tarzanagh, Navid Azizan.
Universal Sharpness Dynamics in Neural Network Training: Fixed Point Analysis, Edge of Stability, and Route to Chaos. Dayal Singh Kalra, Tianyu He, Maissam Barkeshli.
Robust Feature Learning for Multi-Index Models in High Dimensions. Alireza Mousavi-Hosseini, Adel Javanmard, Murat A Erdogdu.
Convergence of Distributed Adaptive Optimization with Local Updates. Ziheng Cheng, Margalit Glasgow.
How do students become teachers: A dynamical analysis for two-layer neural networks. Zhenyu Zhu, Fanghui Liu, Volkan Cevher.
Optimality and Adaptivity of Deep Neural Features for Instrumental Variable Regression. Juno Kim, Dimitri Meunier, Arthur Gretton, Taiji Suzuki, Zhu Li.
Can Neural Networks Achieve Optimal Computational-statistical Tradeoff? An Analysis on Single-Index Model. Siyu Chen, Beining Wu, Miao Lu, Zhuoran Yang, Tianhao Wang.
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs. Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael Jordan, Song Mei.
Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks. Nikolaos Tsilivis, Gal Vardi, Julia Kempe.
Heavy-Tailed Class Imbalance and Why Adam Outperforms Gradient Descent on Language Models. Frederik Kunstner, Robin Yadav, Alan Milligan, Mark Schmidt, Alberto Bietti.
Implicit Bias of Adam versus Gradient Descent in One-Hidden-Layer Neural Networks. Bhavya Vasudeva, Vatsal Sharan, Mahdi Soltanolkotabi.
Emergence in non-neural models: grokking modular arithmetic via average gradient outer product. Neil Rohit Mallinar, Daniel Beaglehole, Libin Zhu, Adityanarayanan Radhakrishnan, Parthe Pandit, Mikhail Belkin.
The Crucial Role of Samplers in Online Direct Preference Optimization. Ruizhe Shi, Runlong Zhou, Simon Shaolei Du.
Declarative characterizations of direct preference alignment algorithms. Kyle Richardson, Vivek Srikumar, Ashish Sabharwal.
Efficiently Learning at Test-Time: Active Fine-Tuning of LLMs. Jonas Hübotter, Sascha Bongni, Ido Hakimi, Andreas Krause.
Towards the Effect of Examples on In-Context Learning: A Theoretical Case Study. Pengfei He, Yingqian Cui, Han Xu, Hui Liu, Makoto Yamada, Jiliang Tang, Yue Xing.
In-Context Learning by Linear Attention: Exact Asymptotics and Experiments. Yue Lu, Mary Letey, Jacob A Zavatone-Veth, Anindita Maiti, Cengiz Pehlevan.
An empirical study of the (L0,L1)-smoothness condition. Y Cooper.
Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers. Yibo Jiang, Goutham Rajendran, Pradeep Kumar Ravikumar, Bryon Aragam.
Mixture of Parrots: Mixtures of experts improve memorization more than reasoning. Samy Jelassi, Clara Mohri, David Brandfonbrener, Alex Gu, Nikhil Vyas, Nikhil Anand, David Alvarez-Melis, Yuanzhi Li, Sham M. Kakade, Eran Malach.
Understanding Factual Recall in Transformers via Associative Memories. Eshaan Nichani, Jason D. Lee, Alberto Bietti.