A public ELLIS reading group exploring the interplay between the mathematical foundations of deep learning and the practical challenge of making ML efficient — from optimization theory to hardware-aware training.
30. March 2026 @ 5pm CEST / 11am EST / 8am PST [timezone converter]
s1: Simple test-time scaling
Niklas Muennighoff, Stanford University, Allen Institute for AI, Contextual AI, USA
Abstract: Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24.
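The budget-forcing logic from the abstract can be sketched over a fake token stream. This is our own toy illustration, not the paper's code: `END` and the string "Wait" stand in for the model's end-of-thinking token and the appended text, and the real method intervenes in an LLM's decoding loop rather than iterating over a list.

```python
# Toy sketch of budget forcing (illustration only; END and "Wait" are
# stand-ins for the model's stop token and the appended string).

END = "<end_think>"

def budget_force(stream, min_tokens, max_tokens, wait="Wait"):
    """Apply budget forcing to an iterator of 'thinking' tokens.

    - Early stop attempt (END before min_tokens): suppress END and append
      `wait`, nudging the model to keep reasoning (lengthening).
    - Budget exhausted (max_tokens reached): force-terminate with END.
    """
    out = []
    for tok in stream:
        if len(out) >= max_tokens:        # budget spent: cut thinking short
            out.append(END)
            break
        if tok == END and len(out) < min_tokens:
            out.append(wait)              # block the stop, emit "Wait" instead
            continue
        out.append(tok)
        if tok == END:                    # stop allowed within the budget
            break
    return out
```

For example, `budget_force(iter(["a", "b", END]), min_tokens=4, max_tokens=10)` suppresses the premature stop and appends "Wait", whereas a long stream is cut off with `END` once `max_tokens` tokens have been emitted.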
11. May 2026 @ 5pm CEST / 11am EST / 8am PST [timezone converter]
Finite-Time Lyapunov Exponents of Deep Neural Networks
Bernhard Mehlig, Department of Physics, University of Gothenburg, Sweden
Abstract: We compute how small input perturbations affect the output of deep neural networks, exploring an analogy between deep feed-forward networks and dynamical systems, where the growth or decay of local perturbations is characterized by finite-time Lyapunov exponents. We show that the maximal exponent forms geometrical structures in input space, akin to coherent structures in dynamical systems. Ridges of large positive exponents divide input space into different regions that the network associates with different classes. These ridges visualize the geometry that deep networks construct in input space, shedding light on the fundamental mechanisms underlying their learning capabilities.
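The quantity in the abstract can be made concrete with a small numerical sketch (our own toy example, not the speaker's code): treating each layer of a tanh network as one time step, the maximal finite-time Lyapunov exponent at input x is the log of the largest singular value of the input-output Jacobian, divided by the depth.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 8                                   # toy depth and width
Ws = [rng.normal(0, 1 / np.sqrt(d), (d, d)) for _ in range(L)]

def forward_jacobian(x):
    """Propagate x through L tanh layers, accumulating the Jacobian."""
    J = np.eye(d)
    for W in Ws:
        x = np.tanh(W @ x)
        D = np.diag(1.0 - x**2)               # elementwise tanh derivative
        J = D @ W @ J                         # chain rule, layer by layer
    return x, J

x0 = rng.normal(size=d)
_, J = forward_jacobian(x0)
# Maximal finite-time Lyapunov exponent: growth rate of the fastest-
# stretching perturbation direction, averaged over the L "time steps".
lam_max = np.log(np.linalg.svd(J, compute_uv=False)[0]) / L
```

Mapping `lam_max` over a grid of inputs is what reveals the ridge structures the talk describes: regions separated by ridges of large positive exponents correspond to different predicted classes.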
16. March 2026 @ 5pm CET — ▶️ YouTube
Procedural Pretraining: Warming Up Language Models with Abstract Data
Liangze Jiang, EPFL and Idiap Research Institute, Switzerland
Zachary Shinnick, Australian Institute for Machine Learning (AIML), Adelaide University, Australia
arXiv: https://arxiv.org/pdf/2601.21725
9. March 2026 @ 5pm CET — ▶️ YouTube
How Does Sharpness-Aware Minimization Minimize Sharpness?
Kaiyue Wen, Stanford University, USA
arXiv: https://arxiv.org/abs/2211.05729
2. March 2026 @ 5pm CET — ▶️ YouTube
When Flatness Does (Not) Guarantee Adversarial Robustness
Nils Philipp Walter, CISPA Helmholtz Center for Information Security, Germany
arXiv: https://arxiv.org/pdf/2510.14231
9. February 2026 @ 5pm CET — ▶️ YouTube
Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures
Yedi Zhang, Gatsby Computational Neuroscience Unit, University College London, UK
arXiv: https://arxiv.org/pdf/2512.20607
The paper on Muon that Yedi mentioned in the talk is now on arXiv: https://arxiv.org/abs/2603.00742
19. January 2026 @ 5pm CET — ▶️ YouTube
Fast Video Generation (multiple papers)
Rahim Entezari, Wayve.ai
12. January 2026 @ 5pm CET — ▶️ YouTube
Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking
Ting Han, Lamarr Institute, TU Dortmund, and Institute for AI in Medicine, University Hospital Essen, Germany
OpenReview: https://openreview.net/pdf?id=lbtOctHDQ3
Contact us with questions or suggestions at efficientml@gmail.com.
Self-nominations to present your published work in the reading group are welcome.