Date & Time: Every Friday 16:00 ~ 17:00 (CEST), Summer Semester 2026
Place: University of Bonn, Mathematics Building, Room 0.008.
Seminar overview: Large language models (such as ChatGPT, Gemini, and Claude) have become a tool for many research mathematicians. It seems plausible that this trend will continue, especially in light of impressive performances on benchmarks such as the First proof challenge.
The goal of this seminar is to learn about how these systems work. We will particularly focus on the mathematics behind these systems, which appears to be partly but not fully understood. We will (hopefully) also have some talks about engineering aspects.
Participants are not expected to know anything about AI! The first few talks will be distributed at the introductory meeting, and the subsequent talks will be determined based on participants' interests.
This talk will introduce the transformer architecture, assuming no background in machine learning. We will also distribute the talks for the subsequent weeks.
A Mathematical Perspective on Transformers, Geshkovski, B., Letrouit, C., Polyanskiy, Y., & Rigollet, P. arXiv:2312.10794
The Mathematics of Artificial Intelligence, Peyré, G., https://arxiv.org/pdf/2501.10465
Large Language Models: A Mathematical Formulation, Baptista, R., Stuart, A., Tran, S., https://arxiv.org/pdf/2601.22170
8 May (Fri) Transformers as a dynamical system. Speaker: Maximilian Schimpf (MPIM)
Abstract: We start by noting that the repeated filtering of information through the attention layers of a transformer can be interpreted as a discrete dynamical system. The continuous limit of this system satisfies a differential equation, which can be interpreted under appropriate assumptions as a gradient flow. After explaining this, we will see how this gradient flow has a tendency to form (often metastable) clusters.
This perhaps offers a partial explanation of why a chatbot can extract "meaning" from input text.
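To make the clustering phenomenon concrete, here is a minimal numerical sketch in Python. It rests on simplifying assumptions that are not part of the abstract: tokens live on the unit sphere, the query, key, and value matrices are all the identity, layer normalisation is replaced by a projection back to the sphere, and the continuous-time flow is discretised by an explicit Euler scheme. The names attention_step, simulate, and mean_similarity, as well as the parameter values, are chosen purely for illustration.

```python
import numpy as np

def attention_step(X, beta=1.0, dt=0.1):
    """One explicit Euler step of simplified self-attention dynamics on the unit sphere."""
    G = X @ X.T                                   # inner products <x_i, x_j>
    W = np.exp(beta * (G - G.max(axis=1, keepdims=True)))  # numerically stable softmax numerators
    W /= W.sum(axis=1, keepdims=True)             # row-stochastic attention matrix
    V = W @ X                                     # each token sees a weighted average of all tokens
    V -= np.sum(V * X, axis=1, keepdims=True) * X # project onto the tangent space at each x_i
    X_new = X + dt * V                            # Euler step (one "layer")
    return X_new / np.linalg.norm(X_new, axis=1, keepdims=True)  # return to the sphere

def simulate(n=16, d=3, beta=1.0, dt=0.1, steps=2000, seed=0):
    """Run the discrete dynamics from random initial tokens; return (initial, final)."""
    rng = np.random.default_rng(seed)
    X0 = rng.standard_normal((n, d))
    X0 /= np.linalg.norm(X0, axis=1, keepdims=True)
    X = X0.copy()
    for _ in range(steps):
        X = attention_step(X, beta, dt)
    return X0, X

def mean_similarity(X):
    """Average inner product over all token pairs (equals 1 for a single cluster)."""
    return float(np.mean(X @ X.T))

X0, X = simulate()
print("mean similarity before:", mean_similarity(X0))
print("mean similarity after: ", mean_similarity(X))
```

In this toy setting the average similarity between tokens grows as the number of steps increases, which is the clustering tendency described above. Larger values of beta may make the tokens linger in several clusters for a long time before merging, which is one way to observe metastability experimentally.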
The Mean-Field Dynamics of Transformers, Rigollet, P.; arXiv:2512.01868v4, see especially Sections 2, 3, & 4.
A Mathematical Perspective on Transformers, Geshkovski, B., Letrouit, C., Polyanskiy, Y., & Rigollet, P. arXiv:2312.10794
Dynamic metastability in the self-attention model: Geshkovski, Koubbi, Polyanskiy, Rigollet. arXiv:2410.06833 — This proves a conjecture from the previous paper
Sinkformers: Transformers with doubly stochastic attention: Sander, M. E., Ablin, P., Blondel, M., & Peyré, G. https://arxiv.org/abs/2110.11773
Do Residual Neural Networks discretize Neural ODEs?: Sander, M., Ablin, P., & Peyré, G. arXiv:2205.14612
A Mathematical Theory of Attention: Vuckovic, J., Baratin, A., Tachet des Combes, R. arXiv:2007.02876 (2020)
15 May (Fri) No talk due to public holiday on May 14
5 June (Fri) Expressivity of transformers. Speaker: TBA
Let us imagine that there exists an optimal "next word prediction" function \Gamma. For example, if we entered "Is the Riemann hypothesis true?", then \Gamma would output the answer.
Assuming that \Gamma exists, we can ask whether it is possible to approximate it with a transformer. In other words, does there exist a choice of weights such that, given any input, the transformer outputs a word which is "close" to the output of \Gamma?
A version of this question is formalised and studied in
Transformers are Universal In-context Learners, Furuya, T., de Hoop, M., Peyré, G.; https://arxiv.org/pdf/2408.01367
The goal of this talk is to explain this circle of ideas, focusing perhaps on the aforementioned paper.
12 Jun (Fri) Engineering: how does one actually train an LLM in practice? Speaker: Mehdi Ali
TBA
21 Jun (Fri) In-context learning. Speaker: Illia Karabash
Transformers have the remarkable ability to generalize from examples -- in the literature, this is typically called "in-context learning". For example, if you enter "France: Paris. Spain: Madrid. Germany: " into your favorite chatbot, then the chatbot will presumably output "Berlin". (Strictly speaking, most consumer-oriented models will probably output something closer to "Great question! The capital of Germany is Berlin. Would you like me to tell you about other capital cities in Europe?" because they want to keep users hooked, but this sort of model behavior is, apparently, added during post-training and rather easy to modify.)
This talk will discuss the mathematics of how transformers achieve in-context learning. In particular, it appears that transformers are able to implement some form of a gradient descent algorithm *during inference*.
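As a toy illustration of the last point, the following Python sketch checks numerically that a single linear attention layer (attention without the softmax) with hand-picked weights reproduces the prediction obtained from one gradient-descent step on a least-squares problem. This is a simplified variant of the kind of construction studied in the gradient-descent papers listed under the suggested sources, not a reproduction of any of them. The token layout (x_j, y_j), the weight matrices P_x and W_V, the step size eta, and all function names are assumptions made for this example.

```python
import numpy as np

def gd_prediction(X, y, x_query, eta):
    """Prediction at x_query after one gradient step from w = 0 on least squares."""
    n = len(y)
    grad_at_zero = -(X.T @ y) / n        # gradient of (1/2n) * sum_j (w.x_j - y_j)^2 at w = 0
    w1 = -eta * grad_at_zero             # one gradient-descent step
    return float(w1 @ x_query)

def linear_attention_prediction(X, y, x_query, eta):
    """Same prediction, read off the label slot of the query token after one
    linear attention layer with a residual connection (hand-picked weights)."""
    n, d = X.shape
    tokens = np.hstack([X, y[:, None]])           # context tokens e_j = (x_j, y_j)
    query = np.append(x_query, 0.0)               # query token (x_q, 0): label unknown
    P_x = np.diag(np.append(np.ones(d), 0.0))     # key and query maps: keep only the x-part
    W_V = np.zeros((d + 1, d + 1))
    W_V[d, d] = eta / n                           # value map: scaled label in the last slot
    scores = (tokens @ P_x.T) @ (P_x @ query)     # attention scores <x_j, x_q>, no softmax
    update = (tokens @ W_V.T).T @ scores          # sum_j scores_j * (W_V e_j)
    return float((query + update)[d])             # residual connection, then read the label slot

rng = np.random.default_rng(0)
w_true = rng.standard_normal(4)                   # hidden linear rule generating the examples
X = rng.standard_normal((32, 4))
y = X @ w_true
x_query = rng.standard_normal(4)

pred_gd = gd_prediction(X, y, x_query, eta=0.5)
pred_attn = linear_attention_prediction(X, y, x_query, eta=0.5)
print(pred_gd, pred_attn)
```

Both functions compute (eta / n) * sum_j y_j <x_j, x_q>, so the two printed numbers agree up to floating-point error. No weights are trained here: the "learning" happens entirely in the forward pass over the context examples, which is the sense in which the gradient step takes place during inference.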
Suggested sources
Towards Understanding the Universality of Transformers for Next-Token Prediction, Sander, M., Peyré, G.; https://arxiv.org/abs/2410.03011 (it might make sense to present the main theorem of this paper and the causal descent method).
How do Transformers Perform In-Context Autoregressive Learning?, Sander, M., Giryes, R., Suzuki, T., Blondel, M., Peyré, G.; https://arxiv.org/pdf/2402.05787 (a precursor of the paper above).
Transformers implement functional gradient descent to learn non-linear functions in context, Cheng, X., Chen, Y., Sra, S.; In Forty-first International Conference on Machine Learning, 2024.
Transformers learn in-context by gradient descent, Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., Vladymyrov, M.; In International Conference on Machine Learning, pp. 35151–35174. PMLR, 2023. https://arxiv.org/pdf/2212.07677
Subsequent talks to be announced based on participant interests.