A public ELLIS reading group exploring the interplay between the mathematical foundations of deep learning and the practical challenge of making ML efficient — from optimization theory to hardware-aware training.
30. March 2026 @ 5pm CEST / 11am EST / 8am PST [timezone converter]
s1: Simple test-time scaling
Niklas Muennighoff, Stanford University, Allen Institute for AI, Contextual AI, USA
Abstract: Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24.
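The budget-forcing idea in the abstract can be sketched as a decoding loop. This is a minimal illustration, not the s1 implementation: `budget_force`, `toy_model`, and the token names are hypothetical stand-ins for a real language model's generate/stop interface.

```python
# Sketch of budget forcing: cap the thinking budget, and when the model
# tries to stop before a minimum number of reasoning steps, append "Wait"
# to lengthen its generation. `model` is any callable text -> next chunk.

def budget_force(model, prompt, min_thinking_steps, max_thinking_steps,
                 end_token="</think>", nudge=" Wait"):
    """Generate reasoning chunks, forcing the step count into [min, max]."""
    text = prompt
    steps = 0
    while steps < max_thinking_steps:
        chunk = model(text)
        if chunk == end_token:
            if steps >= min_thinking_steps:
                break                 # budget satisfied: allow the stop
            text += nudge             # too early: suppress the stop
        else:
            text += chunk
            steps += 1
    return text

# Toy stand-in "model": emits one reasoning step, then tries to stop.
def toy_model(text):
    return "</think>" if text.endswith(" step") else " step"

out = budget_force(toy_model, "Q:", min_thinking_steps=3, max_thinking_steps=8)
```

With the toy model above, the loop injects “ Wait” twice before allowing the model to stop after its third reasoning step.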
13. April 2026 @ 5pm CEST / 11am EST / 8am PST [timezone converter]
Sustainable Development and Energy Efficiency in Deep Learning
Raphael Fischer, TU Dortmund and Lamarr Institute, Germany
Abstract: With the growing environmental impact of modern deep learning, researchers need to establish reporting standards that go beyond predictive performance and explicitly account for sustainability. However, quantifying and reporting the energy efficiency of models and systems remains hard. The talk explores methods and experimental insights for understanding and balancing model performance along multiple dimensions, paving the way for sustainable development in the field.
27. April 2026 @ 5pm CEST / 11am EST / 8am PST [timezone converter]
It's not a Lottery, it's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task
Hannah Pinson, Eindhoven University of Technology, Netherlands
Abstract: Our theoretical understanding of neural networks lags behind their empirical success. One important unexplained phenomenon is why and how, during training with gradient descent, the theoretical capacity of neural networks is reduced to an effective capacity that fits the task. Here we investigate the mechanism by which gradient descent achieves this by analyzing the learning dynamics at the level of individual neurons in single-hidden-layer ReLU networks. We identify three dynamical principles -- mutual alignment, unlocking, and racing -- that together explain why we can often successfully reduce capacity after training by merging equivalent neurons or pruning low-norm weights. We specifically explain the mechanism behind the lottery ticket conjecture, i.e. why the specific, beneficial initial conditions of some neurons lead them to obtain higher weight norms.
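The post-training compression step the abstract mentions (merging equivalent neurons, pruning low-norm weights) can be illustrated for a single-hidden-layer ReLU network. This is a hedged sketch, not the paper's method: the thresholds and the `compress` helper are illustrative choices, using the fact that for ReLU, scaling an incoming weight vector by c > 0 scales the neuron's output by c.

```python
# Sketch: compress a single-hidden-layer ReLU net f(x) = sum_i a_i * relu(w_i @ x)
# by (1) pruning neurons with near-zero incoming weight norm and (2) merging
# neurons whose incoming weights point in the same direction, folding the
# scale difference into the outgoing weight. Thresholds are illustrative.
import numpy as np

def compress(W_in, w_out, align_thresh=0.99, norm_thresh=1e-3):
    """W_in: (hidden, inputs) incoming weights; w_out: (hidden,) outgoing."""
    keep_W, keep_out = [], []
    for w, a in zip(W_in, w_out):
        norm = np.linalg.norm(w)
        if norm < norm_thresh:
            continue                          # prune low-norm neuron
        merged = False
        for i, kw in enumerate(keep_W):
            cos = w @ kw / (norm * np.linalg.norm(kw))
            if cos > align_thresh:            # equivalent direction: merge
                keep_out[i] += a * norm / np.linalg.norm(kw)
                merged = True
                break
        if not merged:
            keep_W.append(w)
            keep_out.append(a)
    return np.array(keep_W), np.array(keep_out)

# Three neurons: two aligned (directions [1,0] and [2,0]) plus one tiny-norm.
W_in = np.array([[1.0, 0.0], [2.0, 0.0], [1e-4, 0.0]])
w_out = np.array([1.0, 0.5, 3.0])
W_c, out_c = compress(W_in, w_out)
```

Here the three hidden neurons collapse to one, with outgoing weight 1.0 + 0.5 * 2 = 2.0, leaving the network function essentially unchanged.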
11. May 2026 @ 5pm CEST / 11am EST / 8am PST [timezone converter]
Finite-Time Lyapunov Exponents of Deep Neural Networks
Bernhard Mehlig, Department of Physics, University of Gothenburg, Sweden
Abstract: We compute how small input perturbations affect the output of deep neural networks, exploring an analogy between deep feed-forward networks and dynamical systems, where the growth or decay of local perturbations is characterized by finite-time Lyapunov exponents. We show that the maximal exponent forms geometrical structures in input space, akin to coherent structures in dynamical systems. Ridges of large positive exponents divide input space into different regions that the network associates with different classes. These ridges visualize the geometry that deep networks construct in input space, shedding light on the fundamental mechanisms underlying their learning capabilities.
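The dynamical-systems analogy in the abstract can be sketched numerically: treat each layer as one time step and estimate the maximal finite-time Lyapunov exponent from the growth of a small input perturbation. The architecture, random weights, and probe-based estimator below are illustrative assumptions, not the talk's setup.

```python
# Sketch: maximal finite-time Lyapunov exponent of a deep tanh network,
# estimated as lambda = (1/L) * log(||f(x + eps*u) - f(x)|| / eps),
# maximized over random perturbation directions u. Weights are random.
import numpy as np

rng = np.random.default_rng(0)
L, d = 10, 20                                    # depth and width
Ws = [rng.normal(0, 1.5 / np.sqrt(d), (d, d)) for _ in range(L)]

def forward(x):
    for W in Ws:
        x = np.tanh(W @ x)
    return x

def max_ftle(x, eps=1e-6, n_probes=50):
    """Finite-difference estimate of the maximal exponent at input x."""
    best = -np.inf
    for _ in range(n_probes):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)                   # unit perturbation direction
        growth = np.linalg.norm(forward(x + eps * u) - forward(x)) / eps
        best = max(best, np.log(growth) / L)
    return best

lam = max_ftle(rng.normal(size=d))
```

A positive `lam` means the network locally expands input perturbations at that point; mapping `lam` over a grid of inputs reveals the ridge structures the abstract describes.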
16. March 2026 @ 5pm CET — ▶️ YouTube
Procedural Pretraining: Warming Up Language Models with Abstract Data
Liangze Jiang, EPFL and Idiap Research Institute, Switzerland
Zachary Shinnick, Australian Institute for Machine Learning (AIML), Adelaide University, Australia
arXiv: https://arxiv.org/pdf/2601.21725
9. March 2026 @ 5pm CET — ▶️ YouTube
How Does Sharpness-Aware Minimization Minimize Sharpness?
Kaiyue Wen, Stanford University, USA
arXiv: https://arxiv.org/abs/2211.05729
2. March 2026 @ 5pm CET — ▶️ YouTube
When Flatness Does (Not) Guarantee Adversarial Robustness
Nils Philipp Walter, CISPA Helmholtz Center for Information Security, Germany
arXiv: https://arxiv.org/pdf/2510.14231
9. February 2026 @ 5pm CET — ▶️ YouTube
Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures
Yedi Zhang, Gatsby Computational Neuroscience Unit, University College London, UK
arXiv: https://arxiv.org/pdf/2512.20607
The paper on Muon that Yedi mentioned in the talk is now on arXiv: https://arxiv.org/abs/2603.00742
19. January 2026 @ 5pm CET — ▶️ YouTube
Fast Video Generation (multiple papers)
Rahim Entezari, Wayve.ai
12. January 2026 @ 5pm CET — ▶️ YouTube
Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking
Ting Han, Lamarr Institute, TU Dortmund and Institute for AI in Medicine, University Hospital Essen (UK Essen), Germany
OpenReview: https://openreview.net/pdf?id=lbtOctHDQ3
Contact us with questions or suggestions at efficientml@gmail.com.
Self-nominations to present your published work in the reading group are welcome.