Schedule

Friday, July 25: Straus 2

All times in Vienna, Austria

Poster size (for workshops): Portrait orientation, 24"w x 36"h (61 cm w x 91.5 cm h).

*Note: this differs from the poster size for the main conference.

Image credit: https://losslandscape.com/gallery/

Morning Session (9:00 am - 12:00 pm)

9:00 am: Opening Remarks

9:00 am - 9:30 am: Aukosh Jagannath (University of Waterloo), Spectral alignment for high-dimensional SGD

Abstract: Over the last decade, a rich body of predictions has been made about the spectra of empirical Hessian and information matrices over the course of training (via SGD) in overparametrized networks. I'll present recent work, in collaboration with G. Ben Arous (NYU Courant), R. Ghessari (Northwestern U.), and J. Huang (U. Penn), in which we rigorously establish some of these predictions. We prove that in two canonical classification tasks for multi-class high-dimensional mixtures and either one- or two-layer neural networks, the SGD trajectory rapidly aligns with emerging low-rank outlier eigenspaces of the Hessian and gradient matrices. Moreover, in multi-layer settings this alignment occurs per layer, with the final layer's outlier eigenspace evolving over the course of training and exhibiting rank deficiency when SGD converges to sub-optimal classifiers.
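As a rough schematic of the alignment statement (illustrative notation, not taken from the paper): if the empirical Hessian at step $t$ decomposes into a low-rank outlier part plus a bulk, alignment means the SGD iterate concentrates in the span of the outlier eigenvectors,
\[
\nabla^2 \hat{L}(\theta_t) \;=\; \underbrace{\sum_{i=1}^{k} \lambda_i(t)\, v_i(t)\, v_i(t)^{\top}}_{\text{outlier eigenspace}} \;+\; \text{bulk},
\qquad
\frac{\big\| P_{V_k(t)}\, \theta_t \big\|}{\|\theta_t\|} \;\to\; 1,
\]
where $P_{V_k(t)}$ is the orthogonal projection onto $V_k(t) = \mathrm{span}\{v_1(t),\dots,v_k(t)\}$.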

9:30 am - 10:00 am: Angelica Chen (New York University), Misleading Endpoints: Lessons from LLM Training Dynamics

Abstract: Many machine learning methods focus on metrics acquired at the end of training; however, interpreting only these metrics can be misleading. In this talk, we focus on two examples of how analyzing training dynamics can yield deeper insights about LLM behavior than interpreting the endpoints alone. In the first, we demonstrate how a common interpretability artifact may appear to be uncorrelated with model performance at the end of training, but in fact exhibits a causal relationship with key learning strategies at the beginning of training. In the second, we study an example where the theoretical properties of the optimal policy differ dramatically from those of the fully trained model. We then show how the model's learning dynamics on different partitions of the training dataset offer an explanation that reconciles this difference. In both cases, solely interpreting the endpoint of training (either theoretical or empirical) may misrepresent what the model actually learns during training.


Bio: Angelica Chen is a PhD student at NYU, advised by Kyunghyun Cho. She is broadly interested in understanding LLM training and using these insights to improve how LLMs learn from feedback. She has previously interned at Google DeepMind and Google Research, and completed her undergrad at Princeton, where her work earned an Outstanding Computer Science Thesis award.

10:00 am - 11:00 am: Poster session (in-person) / Break

Best Papers of HiLD Awards

Behrooz Tahmasebi, Ashkan Soleymani, Dara Bahri, Stefanie Jegelka, Patrick Jaillet, A Universal Class of Sharpness-Aware Minimization Algorithms

Derek Lim, Theo Putterman, Robin Walters, Haggai Maron, Stefanie Jegelka, The Empirical Impact of Neural Parameter Symmetries, or Lack Thereof 

George Wang, Matthew Farrugia-Roberts, Jesse Hoogland, Lian Carroll, Susan Wei, Daniel Murfet, Loss landscape geometry reveals stagewise development of transformers

Abstract: Recently, there has been a surge in interest in developing optimization algorithms for overparameterized models as achieving generalization is believed to require algorithms with suitable biases. This interest centers on minimizing sharpness of the original loss function; the Sharpness-Aware Minimization (SAM) algorithm has proven effective. However, existing literature focuses on only a few sharpness measures (such as the maximum eigenvalue/trace of the training loss Hessian), which may not necessarily yield meaningful insights for non-convex optimization scenarios (e.g., neural networks). Moreover, many sharpness measures show sensitivity to parameter invariances in neural networks, e.g., they magnify significantly under rescaling parameters. Hence, here we introduce a new class of sharpness measures leading to sharpness-aware objective functions. We prove that these measures are universally expressive, allowing any function of the training loss Hessian matrix to be represented by choosing appropriate hyperparameters. Furthermore, we show that the proposed objective functions explicitly bias towards minimizing their corresponding sharpness measures. Finally, as an example of our proposed general framework, we present Frob-SAM and Det-SAM, which are specifically designed to minimize the Frobenius norm and the determinant of the Hessian of the training loss, respectively. We also demonstrate the advantages of our general framework through an extensive series of experiments.
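For orientation (a standard formulation, not specific to this paper), the original SAM objective and its usual first-order approximation are
\[
\min_{\theta}\; \max_{\|\epsilon\|_2 \le \rho} L(\theta + \epsilon)
\;\approx\;
\min_{\theta}\; \Big[ L(\theta) + \rho\, \|\nabla L(\theta)\|_2 \Big],
\]
which implicitly penalizes one particular notion of sharpness. The measures mentioned above, such as $\lambda_{\max}\!\big(\nabla^2 L(\theta)\big)$, $\operatorname{tr}\!\big(\nabla^2 L(\theta)\big)$, $\|\nabla^2 L(\theta)\|_F$ (the target of Frob-SAM), and $\det\!\big(\nabla^2 L(\theta)\big)$ (the target of Det-SAM), are all functions of the training-loss Hessian, which is the class the proposed framework is designed to cover.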

Abstract: Many algorithms and observed phenomena in deep learning appear to be affected by parameter symmetries --- transformations of neural network parameters that do not change the underlying neural network function. These include linear mode connectivity, model merging, Bayesian neural network inference, metanetworks, and several other characteristics of optimization or loss-landscapes. In this work, we empirically investigate the impact of neural parameter symmetries by introducing new neural network architectures that have reduced parameter space symmetries. We develop two methods, with some provable guarantees, of modifying standard neural networks to reduce parameter space symmetries. With these new methods, we conduct a comprehensive experimental study consisting of multiple tasks aimed at assessing the effect of removing parameter symmetries. Our experiments reveal several interesting observations on the empirical impact of parameter symmetries; for instance, we observe linear mode connectivity and monotonic linear interpolation in our networks, without any alignment of weight spaces.
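Two textbook examples of such symmetries in a ReLU network, included for orientation (the paper's contribution is architectures that reduce symmetries like these):
\[
a\,\sigma\big(w^{\top}x\big) \;=\; \tfrac{a}{c}\,\sigma\big(c\,w^{\top}x\big) \quad \text{for any } c>0,
\qquad
W_2\,\sigma(W_1 x) \;=\; \big(W_2 P^{\top}\big)\,\sigma\big(P W_1 x\big) \quad \text{for any permutation matrix } P,
\]
where $\sigma$ is the (positively homogeneous) ReLU applied elementwise: rescaling a hidden unit's incoming and outgoing weights, or permuting hidden units, leaves the network function unchanged.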

Abstract: The development of the internal structure of neural networks throughout training occurs in tandem with changes in the local geometry of the population loss. By quantifying the degeneracy of this geometry using the recently proposed Local Learning Coefficient, we show that the training process for a transformer language model can be decomposed into discrete developmental stages. We connect these stages to interpretable shifts in input–output behavior and developments in internal structure. These findings offer new insights into transformer development and underscore the crucial role of loss landscape geometry in understanding the dynamics of deep learning.
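One common way to state the degeneracy measure referenced here (a sketch in the language of singular learning theory, not the paper's exact presentation): the Local Learning Coefficient $\lambda(\theta^{*})$ describes how the volume of near-optimal parameters in a small neighborhood $B(\theta^{*})$ scales as the loss tolerance shrinks,
\[
\operatorname{Vol}\big\{ \theta \in B(\theta^{*}) : L(\theta) - L(\theta^{*}) < \epsilon \big\} \;\asymp\; c\,\epsilon^{\lambda(\theta^{*})} \big(\log \tfrac{1}{\epsilon}\big)^{m-1}
\quad \text{as } \epsilon \to 0,
\]
for some multiplicity $m \ge 1$, so a smaller $\lambda$ indicates more degenerate local geometry.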

11:00 am - 11:30 am: Jason Lee (Princeton), Learning Representations and Associations with Gradient Descent

Abstract: Machine Learning has undergone a paradigm shift with the success of pretrained models. Pretraining models via gradient descent learns transferable representations that adapt to a wide swath of downstream tasks. However, significant prior theoretical work has demonstrated that in many regimes, overparametrized neural networks trained by gradient descent behave like kernel methods, and do not learn transferable representations. In this talk, we close this gap by demonstrating that there is a large class of functions which cannot be efficiently learned by kernel methods but can be easily learned with gradient descent on a neural network by learning representations that are relevant to the target task. We also demonstrate that these representations allow for efficient transfer learning, which is impossible in the kernel regime.
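The "kernel regime" referenced here is usually formalized via the linearization of the network around its initialization (standard background, included for context):
\[
f(x;\theta) \;\approx\; f(x;\theta_0) \;+\; \nabla_{\theta} f(x;\theta_0)^{\top}(\theta - \theta_0),
\]
under which gradient descent behaves like kernel regression with the neural tangent kernel $K(x,x') = \nabla_{\theta} f(x;\theta_0)^{\top} \nabla_{\theta} f(x';\theta_0)$; the talk concerns function classes that escape this regime by learning task-relevant representations.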


Finally, I will demonstrate how pretraining learns associations for in-context learning with transformers. This leads to a systematic and mechanistic understanding of learning causal structures including the celebrated induction head identified by Anthropic.

11:30 am - 12:00 pm: Stella Biderman (EleutherAI)

Abstract: A commonly cited motivation for doing theoretical work in interpretability and learning dynamics is a desire to empower people who train models, so they can train models better. It's very common for work to fall short of this goal, not because it's bad research but because of the way it is scoped, framed, and designed. Drawing on her experience as both a theorist and an LLM trainer, Stella will discuss which common pitfalls she sees preventing high-quality research from having real-world impact and detail how she designs theoretical research programs with an eye towards building tools that will be practically useful when training models.


Bio: Stella Biderman is the executive director of EleutherAI. Her research focuses on understanding how large language models and other large-scale AI systems behave, with an eye towards empowering model trainers and model deployers to build systems that behave more desirably. She's also an advocate for free and open source AI technologies and works to ensure that there are public and transparent options for the entire technology stack.


12:00 pm - 2:00 pm: Lunch/Break

Afternoon Session (2:00 pm - 5:00 pm)

2:00 pm - 2:30 pm: Lenka Zdeborová (EPFL), Phase transition in high-dimensional learning

Abstract: Emergence in LLMs is surrounded by a plethora of open questions. Emergence in physics is linked to phase transitions. We will describe recent progress in characterizing phase transitions in the performance of neural networks and their consequences for algorithmic hardness. In particular, we will discuss how the staircase picture changes when batches are reused. We also unveil a phase transition between semantic and positional learning in a toy model of dot-product attention.
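For reference, the dot-product attention underlying the toy model is the standard operation (generic background, not a description of the specific model):
\[
\operatorname{Attn}(Q, K, V) \;=\; \operatorname{softmax}\!\Big(\tfrac{QK^{\top}}{\sqrt{d_k}}\Big) V,
\]
and the semantic-versus-positional question is whether the learned $QK^{\top}$ scores end up depending on token content or on token position.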

2:30 pm - 3:00 pm: Pragya Sur (Harvard), Generalization error of min-norm interpolators in transfer learning

Abstract: Min-norm interpolators naturally emerge as implicit regularized limits of modern machine learning algorithms. Recently, their out-of-distribution risk was studied when test samples are unavailable during training. However, in many applications, a limited amount of test data is typically available during training. Properties of min-norm interpolation in this setting are not well understood. In this talk, I will present a characterization of the bias and variance of pooled min-L2-norm interpolation under covariate and model shifts. I will show that the pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications. For example, under model shift, adding data always hurts prediction when the signal-to-noise ratio is low. However, for higher signal-to-noise ratios, transfer learning helps as long as the shift-to-signal ratio lies below a threshold that I will define. I will further present data-driven methods to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes generalization error. Our results also show that under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between domains improves the risk. Time permitting, I will introduce a novel anisotropic local law that helps achieve some of these characterizations and may be of independent interest in random matrix theory. This is based on joint work with Yanke Song and Sohom Bhattacharya.
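As background (a standard definition, not specific to the talk), the min-$\ell_2$-norm interpolator for data $(X, y)$ with $X \in \mathbb{R}^{n \times p}$ and $p > n$ is
\[
\hat{\beta} \;=\; \arg\min_{\beta} \|\beta\|_2 \quad \text{subject to } X\beta = y,
\qquad
\hat{\beta} \;=\; X^{\top}\big(XX^{\top}\big)^{-1} y,
\]
and the pooled interpolator discussed in the talk is this object computed on the source and target samples stacked together.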

3:00 pm - 3:30 pm: Kanaka Rajan (Harvard), Brain-Wide Compositionality and Learning Dynamics in Biological Agents

Abstract: Biological agents continually reconcile the internal states of their brain circuits with incoming sensory and environmental evidence to evaluate when and how to act. The brains of biological agents, including animals and humans, exploit many evolutionary innovations, chiefly modularity—observable at the level of anatomically-defined brain regions, cortical layers, and cell types among others—that can be repurposed in a compositional manner to endow the animal with a highly flexible behavioral repertoire. Accordingly, their behaviors show their own modularity, yet such behavioral modules seldom correspond directly to traditional notions of modularity in brains. It remains unclear how to link neural and behavioral modularity in a compositional manner. We propose a comprehensive framework—compositional modes—to identify overarching compositionality spanning specialized submodules, such as brain regions. Our framework directly links the behavioral repertoire with distributed patterns of population activity, brain-wide, at multiple concurrent spatial and temporal scales.

Using whole-brain recordings of zebrafish brains, we introduce an unsupervised pipeline based on neural network models, constrained by biological data, to reveal highly conserved compositional modes across individuals despite the naturalistic (spontaneous or task-independent) nature of their behaviors. These modes provided a scaffolding for other modes that account for the idiosyncratic behavior of each fish. We then demonstrate experimentally that compositional modes can be manipulated in a consistent manner by behavioral and pharmacological perturbations. Our results demonstrate that even natural behavior in different individuals can be decomposed and understood using a relatively small number of neurobehavioral modules—the compositional modes—and elucidate a compositional neural basis of behavior. This approach aligns with recent progress in understanding how reasoning capabilities and internal representational structures develop over the course of learning or training, offering insights into the modularity and flexibility in artificial and biological agents.

Bio: Kanaka Rajan, PhD, is a computational neuroscientist, Associate Professor of Neurobiology at Harvard Medical School, and a founding faculty member of the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University. Her research seeks to understand how important cognitive functions — such as learning, remembering, and deciding — emerge from the cooperative activity of multi-scale neural processes. Using data from neuroscience experiments, Dr. Rajan applies computational frameworks derived from machine learning and statistical physics to uncover integrative theories about the brain that bridge neurobiology and artificial intelligence.

Dr. Rajan’s work has been recognized with several awards (CIFAR Azrieli Global Scholars Program, Allen Institute’s Next Generation Leaders Council, The Harold and Golden Lamport Basic Science Research Award, McKnight Scholars Award, Young Investigator Award from the Brain and Behavior Foundation, Understanding Human Cognition Scholar Award from the James S McDonnell Foundation, Sloan Research Fellowship), and her work is supported by the NIH BRAIN Initiative and the NSF. For more information about Dr. Rajan and her lab, please visit www.rajanlab.com.


3:30 pm: Closing Remarks

3:30 pm - 4:30 pm: Poster Session/Break

Best Papers of HiLD Awards (see the morning poster session above for the award-winning papers and abstracts)

5:00 pm: Workshop Concludes