Informal learning seminar on mathematics and AI

Date & Time: Every Friday 16:00 ~ 17:00 (CEST), Summer Semester 2026
Place: University of Bonn, Mathematics Building, Room 0.008.

Seminar overview: Large language models (such as ChatGPT, Gemini, Claude, etc.) have become a tool for many research mathematicians. It seems plausible that this trend will continue, especially in light of their impressive performance on benchmarks such as the First proof challenge.

The goal of this seminar is to learn about how these systems work. We will particularly focus on the mathematics behind these systems, which appears to be partly but not fully understood. We will (hopefully) also have some talks about engineering aspects.

Participants are not expected to know anything about AI! The first few talks will be distributed at the introductory meeting, and the subsequent talks will be determined based on participants' interests.

24 April (Fri) Introductory meeting. Speaker: Laurent Côté (Notes)

This talk will introduce the transformer architecture, assuming no background in machine learning. We will also distribute the talks for the subsequent weeks.

Some sources

A Mathematical Perspective on Transformers, Geshkovski, B., Letrouit, C., Polyanskiy, Y., & Rigollet, P. arXiv:2312.10794
The Mathematics of Artificial Intelligence, Peyré, G., https://arxiv.org/pdf/2501.10465
Large Language Models: A Mathematical Formulation, Baptista, R., Stuart, A., Tran, S., https://arxiv.org/pdf/2601.22170

8 May (Fri) Transformers as a dynamical system. Speaker: Maximilian Schimpf (MPIM)

Abstract: We start by noting that the repeated filtering of information through the attention layers of a transformer can be interpreted as a discrete dynamical system. The continuous limit of this system satisfies a differential equation, which can be interpreted under appropriate assumptions as a gradient flow. After explaining this, we will see how this gradient flow has a tendency to form (often metastable) clusters.

This perhaps offers a partial explanation of why a chatbot can extract "meaning" from input text.
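As a rough illustration of the clustering described in the abstract, here is a minimal sketch of a simplified self-attention dynamics on the unit circle. It assumes that the query, key and value matrices are all the identity, that the inverse temperature `beta` and step size `dt` take the arbitrary values shown, and that the particles start inside a common half-circle. All function names are made up for this example, and it is a toy model rather than the exact setting of the papers listed below.

```python
import math
import random


def dot(x, y):
    """Euclidean inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(x, y))


def attention_step(points, beta=1.0, dt=0.1):
    """One explicit Euler step of the toy self-attention dynamics on the sphere."""
    new_points = []
    for x in points:
        # Softmax attention weights of x against all particles (Q = K = identity).
        scores = [math.exp(beta * dot(x, y)) for y in points]
        total = sum(scores)
        # Attention output: weighted average of the particles (V = identity).
        avg = [sum(w * y[k] for w, y in zip(scores, points)) / total
               for k in range(len(x))]
        # Keep only the component of avg tangent to the sphere at x, take a step,
        # then renormalize (this plays the role of layer normalization).
        radial = dot(x, avg)
        moved = [a + dt * (m - radial * a) for a, m in zip(x, avg)]
        norm = math.sqrt(dot(moved, moved))
        new_points.append([c / norm for c in moved])
    return new_points


def min_pairwise_inner_product(points):
    """Smallest inner product between two particles; 1.0 means a single cluster."""
    return min(dot(x, y) for x in points for y in points)


def simulate(n=8, steps=500, beta=1.0, dt=0.1, seed=0):
    """Run the dynamics and return the cluster tightness before and after."""
    rng = random.Random(seed)
    # Assumption: initial angles lie in (-1, 1) radians, i.e. in one half-circle.
    angles = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    points = [[math.cos(t), math.sin(t)] for t in angles]
    start = min_pairwise_inner_product(points)
    for _ in range(steps):
        points = attention_step(points, beta, dt)
    return start, min_pairwise_inner_product(points)
```

Calling `simulate()` returns the smallest pairwise inner product before and after the iteration; a final value close to 1 means that the particles have merged into a single cluster. This toy run is not meant to exhibit the metastable intermediate clusters mentioned in the abstract.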

Suggested sources

The Mean-Field Dynamics of Transformers, Rigollet, P.; arXiv:2512.01868v4, see especially Sections 2, 3, & 4.
A Mathematical Perspective on Transformers, Geshkovski, B., Letrouit, C., Polyanskiy, Y., & Rigollet, P. arXiv:2312.10794

Other references

Dynamic metastability in the self-attention model: Geshkovski, Koubbi, Polyanskiy, Rigollet. arXiv:2410.06833 — This proves a conjecture from the previous paper
Sinkformers: Transformers with doubly stochastic attention: Sander, M. E., Ablin, P., Blondel, M., & Peyré, G. https://arxiv.org/abs/2110.11773
Do Residual Neural Networks discretize Neural ODEs?: Sander, M., Ablin, P., & Peyré, G. arXiv:2205.14612
A Mathematical Theory of Attention: Vuckovic, J., Baratin, A., Tachet des Combes, R. arXiv:2007.02876 (2020)

15 May (Fri) No talk due to public holiday on May 14

12 Jun (Fri) In-context learning. Speaker: Illia Karabash (Uni Bonn)

Transformers have the remarkable ability to generalize from examples; in the literature, this is typically called "in-context learning". For example, if you enter "France: Paris. Spain: Madrid. Germany: " into your favorite chatbot, then the chatbot will presumably output "Berlin". (Strictly speaking, most consumer-oriented models will probably output something closer to "Great question! The capital of Germany is Berlin. Would you like me to tell you about other capital cities in Europe?" because they want to keep users hooked, but this sort of model behavior is, apparently, added during post-training and rather easy to modify.)

This talk will discuss the mathematics of how transformers achieve in-context learning. In particular, it appears that transformers are able to implement some form of a gradient descent algorithm *during inference*.
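To make the link between attention and gradient descent concrete, here is a minimal sketch for in-context linear regression. It uses the softmax-free ("linear") attention simplification, in the spirit of the Von Oswald et al. paper listed below rather than the construction of the suggested source, and it assumes a zero initial weight vector and a hypothetical learning rate `eta`. Under these assumptions, the prediction after one gradient step on the in-context examples coincides with a linear attention readout in which the query is the new input, the keys are the example inputs and the values are the example outputs. The function names are invented for this illustration.

```python
def dot(x, y):
    """Euclidean inner product of two vectors given as lists."""
    return sum(a * b for a, b in zip(x, y))


def gd_step_prediction(xs, ys, x_query, eta):
    """Predict at x_query after one gradient step from w = 0 on the squared loss."""
    n, d = len(xs), len(x_query)
    # The gradient of (1 / 2n) * sum_i (w . x_i - y_i)^2 at w = 0
    # equals -(1 / n) * sum_i y_i * x_i.
    grad = [-sum(y * x[k] for x, y in zip(xs, ys)) / n for k in range(d)]
    w = [-eta * g for g in grad]
    return dot(w, x_query)


def linear_attention_prediction(xs, ys, x_query, eta):
    """Softmax-free attention readout: query x_query, keys x_i, values y_i."""
    n = len(xs)
    scores = [dot(x, x_query) for x in xs]
    return eta / n * sum(s * y for s, y in zip(scores, ys))


# In-context examples generated by the (hidden) linear rule y = 2 * x[0] - x[1].
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
x_query = [3.0, 0.5]

by_gradient_step = gd_step_prediction(xs, ys, x_query, eta=0.5)
by_attention = linear_attention_prediction(xs, ys, x_query, eta=0.5)
```

The two numbers agree up to rounding, because both equal (eta / n) times the sum of y_i <x_i, x_query>. Real transformers use softmax attention and learned weights, so this identity should be read as motivation for the results discussed in the talk, not as a description of a trained model.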

Suggested sources

Towards Understanding the Universality of Transformers for Next-Token Prediction, Sander, M., Peyré, G.; https://arxiv.org/abs/2410.03011 It might make sense to present the main theorem of this paper and the causal descent method.

Other references

How do Transformers perform In-Context Autoregressive Learning?, Sander, M., Giryes, R., Suzuki, T., Blondel, M., Peyré, G.; https://arxiv.org/pdf/2402.05787 (a precursor of the paper above).
Transformers implement functional gradient descent to learn non-linear functions in context, Cheng, X., Chen, Y., Sra, S.; In Forty-first International Conference on Machine Learning, 2024.
Transformers learn in-context by gradient descent, Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., Vladymyrov, M.; In International Conference on Machine Learning, pp. 35151–35174. PMLR, 2023. https://arxiv.org/pdf/2212.07677

19 Jun (Fri) Engineering: how does one actually train an LLM in practice? Speaker: Mehdi Ali (Uni Bonn -- Lamarr-Institut)

Large language models (LLMs) have demonstrated disruptive potential across research and industry, yet their development remains concentrated among a small number of largely non-European organizations, raising concerns about access, competitiveness, and technological sovereignty. This talk provides an end-to-end overview of what it actually takes to train a competitive LLM, drawing on the experience of the OpenGPT-X and Soofi projects. We examine the full training pipeline, from data curation through distributed training and compute-optimal scaling to few-shot evaluation, and present results from our recent multilingual models.

3 July (Fri) Why Transformers can (in theory) do anything. Speaker: Maximilian Schimpf (MPIM)

Let us imagine that there exists an optimal "next word prediction" function \Gamma. For example, if we entered "Is the Riemann hypothesis true?", then \Gamma would output the answer. Assuming that \Gamma exists, we can ask whether it can be approximated by a transformer. This talk will present a positive answer to this question, given in the paper "Transformers are Universal In-context Learners" by Furuya, de Hoop and Peyré.

Subsequent talks to be announced based on participant interests.
