A public ELLIS reading group exploring the interplay between the mathematical foundations of deep learning and the practical challenge of making ML efficient — from optimization theory to hardware-aware training. Learn more about our topics and scope →
Not everything we find interesting makes it into a session. For the rest (papers, talks, and ideas worth sharing), see Writeups →
15. June 2026 @ 5pm CEST / 11am EDT / 8am PDT [timezone converter]
Generalization at the Edge of Stability
Mario Tuci, INRIA, CNRS, PSL, France and Imperial College London, UK
Abstract: Training modern neural networks often relies on large learning rates, operating at the edge of stability, where the optimization dynamics exhibit oscillatory and chaotic behavior. Empirically, this regime often yields improved generalization performance, yet the underlying mechanism remains poorly understood. In this work, we represent stochastic optimizers as random dynamical systems, which often converge to a fractal attractor set (rather than a point) with a smaller intrinsic dimension. Building on this connection and inspired by Lyapunov dimension theory, we introduce a novel notion of dimension, coined the ‘sharpness dimension’, and prove a generalization bound based on this dimension. Our results show that generalization in the chaotic regime depends on the complete Hessian spectrum and the structure of its partial determinants, highlighting a complexity that cannot be captured by the trace or spectral norm considered in prior work. Experiments across various MLPs and transformers validate our theory while also providing new insights into the recently observed phenomenon of grokking.
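For orientation, the classical Lyapunov (Kaplan-Yorke) dimension that the abstract cites as inspiration can be computed from a Lyapunov spectrum as sketched below. This is not the paper's sharpness dimension, whose definition is not given in the abstract; the function name and the example exponents (commonly quoted approximate values for the Lorenz system) are illustrative assumptions.

```python
def kaplan_yorke_dimension(exponents):
    """Classical Kaplan-Yorke dimension of a Lyapunov spectrum.

    With exponents sorted in decreasing order, let j be the largest index
    whose partial sum lambda_1 + ... + lambda_j is non-negative. Then
    D = j + (lambda_1 + ... + lambda_j) / |lambda_{j+1}|.
    """
    lam = sorted(exponents, reverse=True)
    partial = 0.0
    for j, value in enumerate(lam):
        if partial + value < 0:
            # Adding this exponent makes the sum negative: interpolate here.
            return j + partial / abs(value)
        partial += value
    # The partial sums never become negative: the dimension is the full one.
    return float(len(lam))


# Approximate Lorenz-system exponents, used only as an illustration.
lorenz_like = [0.906, 0.0, -14.572]
dimension = kaplan_yorke_dimension(lorenz_like)  # slightly above 2
```

A spectrum with only negative exponents gives dimension 0 (a point attractor), while a fractional value such as the one above indicates a fractal attractor set.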
22. June 2026 @ 5pm CEST / 11am EDT / 8am PDT [timezone converter]
WK, WV is (Linearly) All You Need: On the Necessity of the QKV Weight Triplet in Self-Attention Transformers
Marko Karbevski, In Simplicity Technologies, Skopje, Macedonia
Antonij Mijoski, Institut de Recherche Mathématique Avancée (IRMA), Université de Strasbourg, France
Abstract: Multi-head attention is invariant under the joint $GL(d)$ action $(X, W_Q, W_K, W_V) \mapsto (X\Theta, \Theta^{-1}W_Q, \Theta^{-1}W_K, \Theta^{-1}W_V)$ with two consequences for the QKV triplet: First, any one of $W_Q$, $W_K$, $W_V$ can be fixed to $I_d$ without loss of expressivity if $XW$ is precomputed; under mild structural conditions the precomputation folds into the preceding MLP (or, in the first layer, the embedding) at no parameter cost, removing 25% of attention parameters per layer. We prove the multi-layer reduction and analyse the practical obstructions. Second, every linear $W_Q$ already lies on the orbit of $I_d$, so a learned linear query is redundant: expressive gains in the QKV pathway require at least one of the three to be nonlinear, a branch we realise with the residual query $Q(X)=\frac{1}{2}(X + f_\theta(X))$ at parity of parameters. This research also led us to examine residual skip connections: MLPs with and without a skip form generically disjoint function classes for modern activations. We validate both halves on GPT-style models trained from scratch under batch-matched comparisons: the reduced 117M model matches the dense 124M baseline; reallocating the saved parameters to the feed-forward sublayer strictly improves on it; and the residual nonlinear query yields a 2.40% relative logloss improvement, while a model with 12.5% more non-embedding parameters achieves 0.94%: a 2.5x relative gain.
OpenReview: https://openreview.net/forum?id=9DE8ISZNak
arXiv: https://arxiv.org/pdf/2510.23912
arXiv: https://arxiv.org/abs/2603.13381
arXiv: https://arxiv.org/pdf/2604.23705
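The invariance stated in the abstract above can be checked numerically. The following is a minimal sketch, assuming a single attention head, the row-vector convention Q = X W_Q, and random Gaussian weights; the function names and sizes are illustrative and not taken from the paper. It also checks the first consequence by choosing Θ = W_Q, which turns the query weight into the identity (this assumes W_Q is invertible, which holds for the random matrix used here).

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(X, W_Q, W_K, W_V):
    # Single-head scaled dot-product self-attention.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V


rng = np.random.default_rng(0)
n, d = 5, 8  # sequence length and model width (arbitrary choices)
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

# A well-conditioned invertible Theta acting on the input side.
Theta = rng.normal(size=(d, d)) + d * np.eye(d)
Theta_inv = np.linalg.inv(Theta)

out_ref = attention(X, W_Q, W_K, W_V)
out_transformed = attention(
    X @ Theta, Theta_inv @ W_Q, Theta_inv @ W_K, Theta_inv @ W_V
)
max_abs_diff = np.max(np.abs(out_ref - out_transformed))

# Choosing Theta = W_Q fixes the query weight to the identity,
# with X @ W_Q playing the role of the precomputed input.
W_Q_inv = np.linalg.inv(W_Q)
out_reduced = attention(X @ W_Q, np.eye(d), W_Q_inv @ W_K, W_Q_inv @ W_V)
max_abs_diff_reduced = np.max(np.abs(out_ref - out_reduced))
```

Both differences should be at the level of floating-point round-off, since each transformed product, for example (XΘ)(Θ^{-1}W_Q), equals the original X W_Q.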
29. June 2026 @ 5pm CEST / 11am EDT / 8am PDT [timezone converter]
Let Features Decide Their Own Solvers: Hybrid Feature Caching for Diffusion Transformers
Shikang Zheng, Shanghai Jiao Tong University and South China University of Technology, China
Abstract: Diffusion Transformers offer state-of-the-art fidelity in image and video synthesis, but their iterative sampling process remains a major bottleneck due to the high cost of transformer forward passes at each timestep. To mitigate this, feature caching has emerged as a training-free acceleration technique that reuses or forecasts hidden representations. However, existing methods often apply a uniform caching strategy across all feature dimensions, ignoring their heterogeneous dynamic behaviors. Therefore, we adopt a new perspective by modeling hidden feature evolution as a mixture of ODEs across dimensions, and introduce HyCa, a Hybrid ODE solver inspired caching framework that applies dimension-wise caching strategies. HyCa achieves near-lossless acceleration across diverse domains and models, including 5.55× speedup on FLUX, 5.56× speedup on HunyuanVideo, 6.24× speedup on Qwen-Image and Qwen-Image-Edit without retraining.
arXiv: https://arxiv.org/abs/2510.04188
Website: https://darrenzheng303.github.io/HyCa.github.io
13. July 2026 @ 5pm CEST / 11am EDT / 8am PDT [timezone converter]
Geometry-aware similarity metrics for neural representations on Riemannian and statistical manifolds
N Alex Cayco Gajic, École Normale Supérieure Paris, France
Arthur Pellegrino, University College London, UK and Ecole Normale Supérieure Paris, France
Abstract: Similarity measures are widely used to interpret the representational geometries used by neural networks to solve tasks. Yet, because existing methods compare the extrinsic geometry of representations in state space, rather than their intrinsic geometry, they may fail to capture subtle yet crucial distinctions between fundamentally different neural network solutions. Here, we introduce metric similarity analysis (MSA), a novel method which leverages tools from Riemannian geometry to compare the intrinsic geometry of neural representations under the manifold hypothesis. We show that MSA can be used to i) disentangle features of neural computations in deep networks with different learning regimes, ii) compare nonlinear dynamics, and iii) investigate diffusion models. Hence, we introduce a mathematically grounded and broadly applicable framework to understand the mechanisms behind neural computations by comparing their intrinsic geometries.
20. July 2026 @ 5pm CEST / 11am EDT / 8am PDT [timezone converter]
Temporal Sampling Frequency Matters: A Capacity-Aware Study of End-to-End Driving Trajectory Prediction
Yumao Liu, The Hong Kong University of Science and Technology, China
Abstract: End-to-end (E2E) autonomous driving trajectory prediction is often trained with camera frames sampled at the highest available temporal frequency, assuming that denser sampling improves performance. We question this assumption by treating temporal sampling frequency as an explicit training-set design variable. Starting from high-frequency E2E driving datasets, we construct frequency-sweep training sets by temporally subsampling camera frames along each trajectory. For each model-dataset pair, we train and evaluate the same model under a fixed protocol, so the frequency response reflects how prediction performance changes with sampling frequency. We analyze this response from a capacity-aware perspective. Sparse sampling may miss driving-relevant cues, while dense sampling may add redundant visual content and off-manifold noise. For finite-capacity models, this can create a driving-irrelevant capacity burden. We evaluate three smaller E2E models and a larger VLA-style AutoVLA model on Waymo, nuScenes, and PAVE. Results show model- and dataset-dependent frequency responses. Smaller E2E models often show non-monotonic or near-plateau trends and achieve their best 3-second ADE at lower or intermediate frequencies. In contrast, AutoVLA achieves its best 3-second ADE and FDE at the highest evaluated frequency on all three datasets. Iteration-matched controls suggest that the advantage of lower or intermediate frequencies for smaller models is not explained only by unequal training update counts. These findings show that temporal sampling frequency should be reported and tuned, rather than fixed to the highest available value.
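The two ingredients named in the abstract above, temporal subsampling and the ADE/FDE metrics, can be sketched as follows. This is a minimal illustration, assuming fixed-stride subsampling and the usual definitions of average and final displacement error; the function names, the 10 Hz source rate, and the toy trajectories are hypothetical, and the paper's exact protocol may differ.

```python
import numpy as np


def subsample_frames(frames, source_hz, target_hz):
    """Keep every k-th frame, with k = source_hz / target_hz."""
    if source_hz % target_hz != 0:
        raise ValueError("target_hz must divide source_hz")
    stride = source_hz // target_hz
    return frames[::stride]


def ade(pred, gt):
    # Average displacement error: mean Euclidean distance over all waypoints.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def fde(pred, gt):
    # Final displacement error: Euclidean distance at the last waypoint.
    return float(np.linalg.norm(pred[-1] - gt[-1]))


# Stand-in for 2 seconds of camera frames recorded at 10 Hz.
frames = list(range(20))
sweep = {hz: subsample_frames(frames, 10, hz) for hz in (10, 5, 2, 1)}

# Toy ground-truth and predicted waypoints in the (x, y) plane.
gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
```

In a frequency sweep of this kind, each entry of `sweep` would define one training set, and the same model would be trained and scored with `ade` and `fde` on each.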
🌴☀️ The reading group is on holiday throughout August and will return in September! ☀️🌴
28. September 2026 @ 5pm CEST / 11am EDT / 8am PDT [timezone converter]
ThinkingViT: Matryoshka Thinking Vision Transformer for Elastic Inference
Ali Hojjat, Kiel University and Hamburg University of Technology (TUHH), Germany
Abstract: ViTs deliver SOTA performance, yet their fixed computational budget prevents scalable deployment across heterogeneous hardware. Recent Matryoshka-style Transformer architectures mitigate this by embedding nested subnetworks within a single model to enable scalable inference. However, these models allocate the same amount of compute to all inputs, regardless of their complexity, which leads to inefficiencies. To address this, we introduce ThinkingViT, a nested ViT architecture that employs progressive thinking stages to dynamically adjust inference computation based on input difficulty. ThinkingViT first activates a small subset of the most important attention heads to produce an initial prediction. If the prediction confidence exceeds a predefined threshold, inference terminates early. Otherwise, within the same backbone, it activates a larger subset of attention heads and conducts a new forward pass. This process continues iteratively until the model reaches the predefined confidence level or exhausts its maximum capacity. To boost the performance of subsequent rounds, we introduce a Token Recycling approach that fuses the input embeddings with the embeddings from the previous stage. Experiments show that ThinkingViT surpasses nested baselines by up to 2.0 percentage points (p.p.) in accuracy at the same throughput and by up to 2.9 p.p. at equal GMACs on ImageNet-1K. We show that the backbone-preserving design of ThinkingViT allows it to serve as a plug-in upgrade for ViTs in downstream tasks such as semantic segmentation. We also demonstrate that ThinkingViT transfers effectively to other architectures such as Swin Transformers.
arXiv: https://arxiv.org/abs/2507.10800
Web: https://ds-kiel.github.io/ThinkingViT-project-page/
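The control flow described in the abstract above (predict with a small subnetwork, stop if confident, otherwise rerun with a larger one while reusing the previous embedding) can be sketched as a simple loop. This is a toy illustration under stated assumptions, not the ThinkingViT implementation: the stages are hypothetical stand-in functions rather than nested attention-head subsets, and the averaging used to fuse embeddings is an arbitrary placeholder for Token Recycling.

```python
import numpy as np


def softmax(logits):
    # Numerically stable softmax over a 1-D logit vector.
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def progressive_inference(x, stages, threshold):
    """Run stages from cheapest to largest; stop once confidence is high."""
    prev_embedding = None
    for used, stage in enumerate(stages, start=1):
        logits, embedding = stage(x, prev_embedding)
        probs = softmax(logits)
        confident = probs.max() > threshold
        # Exit early when confident, or when the largest stage has run.
        if confident or used == len(stages):
            return int(probs.argmax()), float(probs.max()), used
        prev_embedding = embedding


def make_toy_stage(sharpness):
    # Hypothetical stand-in for one nested subnetwork: larger `sharpness`
    # mimics a larger stage that produces more confident logits.
    def stage(x, prev_embedding):
        # Placeholder fusion of the input with the previous-stage embedding.
        fused = x if prev_embedding is None else 0.5 * (x + prev_embedding)
        return sharpness * fused, fused

    return stage


x = np.array([1.0, 0.5, 0.0])
stages = [make_toy_stage(s) for s in (1.0, 8.0, 16.0)]
label, confidence, stages_used = progressive_inference(x, stages, threshold=0.9)
```

With this toy input, the first stage is not confident enough at a threshold of 0.9, so the loop proceeds to the second stage and stops there; lowering the threshold to 0.5 makes it exit after the first stage.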
8. June 2026 @ 5pm CEST — ▶️ YouTube
How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences
Mariia Seleznova, Ludwig Maximilian University of Munich, Germany
arXiv: https://arxiv.org/pdf/2505.19827
1. June 2026 @ 5pm CEST — ▶️ YouTube
Panza: Design and Analysis of a Fully-Local Personalized Text Writing Assistant
Eugenia Iofinova, Institute of Science and Technology Austria
Andrej Jovanovic, University of Cambridge, UK
arXiv: https://arxiv.org/abs/2407.10994
11. May 2026 @ 5pm CEST — ▶️ YouTube
Finite-Time Lyapunov Exponents of Deep Neural Networks
Bernhard Mehlig, Department of Physics, University of Gothenburg, Sweden
DOI: 10.1103/PhysRevLett.132.057301
27. April 2026 @ 5pm CEST — ▶️ YouTube
It's not a Lottery, it's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task
Hannah Pinson, Eindhoven University of Technology, Netherlands
arXiv: https://arxiv.org/abs/2602.04832
13. April 2026 @ 5pm CEST — ▶️ YouTube
Sustainable Development and Energy Efficiency in Deep Learning
Raphael Fischer, TU Dortmund and Lamarr Institute, Germany
arXiv: https://arxiv.org/abs/2509.22092
30. March 2026 @ 5pm CEST — ▶️ YouTube
s1: Simple test-time scaling
Niklas Muennighoff, Stanford University, Allen Institute for AI, Contextual AI, USA
arXiv: https://arxiv.org/abs/2501.19393
16. March 2026 @ 5pm CET — ▶️ YouTube
Procedural Pretraining: Warming Up Language Models with Abstract Data
Liangze Jiang, EPFL and Idiap Research Institute, Switzerland
Zachary Shinnick, Australian Institute for Machine Learning (AIML), Adelaide University, Australia
arXiv: https://arxiv.org/pdf/2601.21725
9. March 2026 @ 5pm CET — ▶️ YouTube
How Does Sharpness-Aware Minimization Minimize Sharpness?
Kaiyue Wen, Stanford University, USA
arXiv: https://arxiv.org/abs/2211.05729
2. March 2026 @ 5pm CET — ▶️ YouTube
When Flatness Does (Not) Guarantee Adversarial Robustness
Nils Philipp Walter, CISPA Helmholtz Center for Information Security, Germany
arXiv: https://arxiv.org/pdf/2510.14231
9. February 2026 @ 5pm CET — ▶️ YouTube
Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures
Yedi Zhang, Gatsby Computational Neuroscience Unit, University College London, UK
arXiv: https://arxiv.org/pdf/2512.20607
The paper on Muon Yedi mentioned in the talk is now on arXiv: https://arxiv.org/abs/2603.00742
19. January 2026 @ 5pm CET — ▶️ YouTube
Fast Video Generation (multiple papers)
Rahim Entezari, Wayve.ai
12. January 2026 @ 5pm CET — ▶️ YouTube
Flatness is Necessary, Neural Collapse is Not: Rethinking Generalization via Grokking
Ting Han, Lamarr Institute, TU Dortmund, Germany and Institute for AI in Medicine, UK Essen
OpenReview: https://openreview.net/pdf?id=lbtOctHDQ3
Contact us for questions or suggestions via efficientml@gmail.com.
Self-nominations to present your published work in the reading group are welcome.