Session of June 10, 2024

Session organized by Victor-Emmanuel Brunel and Jaouad Mourtada

Venue: IHP, Hermite amphitheater


14.00: Borjan Geshkovski (Sorbonne Université and INRIA, LJLL)

Title: Dynamic metastability in the self-attention model

Abstract: The pure self-attention model is a simplification of the celebrated Transformer architecture, which neglects multi-layer perceptron layers and includes only a single inverse temperature parameter. Despite its apparent simplicity, the model exhibits qualitative behavior across layers that is remarkably similar to that observed empirically in a pre-trained Transformer. Viewing layers as a time variable, the self-attention model can be interpreted as an interacting particle system on the unit sphere. We show that when the temperature is sufficiently high, all particles collapse into a single cluster exponentially fast. On the other hand, when the temperature falls below a certain threshold, we show that although the particles eventually collapse into a single cluster, the required time is at least exponentially long. This is a manifestation of dynamic metastability: particles remain trapped in a "slow manifold" consisting of several clusters for exponentially long periods of time. Our proofs make use of the fact that the self-attention model can be written as the gradient flow of a specific interaction energy functional previously found in combinatorics.
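
For concreteness, here is a sketch of the dynamics in question, in notation borrowed from the related literature (the abstract does not spell it out, and the exact normalization may differ in the talk): $n$ particles $x_1(t), \dots, x_n(t)$ on the unit sphere $\mathbb{S}^{d-1}$ evolve as $\dot{x}_i(t) = \mathrm{P}_{x_i(t)}\big( \tfrac{1}{n} \sum_{j=1}^{n} e^{\beta \langle x_i(t), x_j(t) \rangle} \, x_j(t) \big)$ for $i = 1, \dots, n$, where $\mathrm{P}_x y = y - \langle x, y \rangle x$ is the projection onto the tangent space of the sphere at $x$ and $\beta \geq 0$ is the inverse temperature. Up to constants, this is the spherical gradient (ascent) flow of the interaction energy $\mathsf{E}_\beta[x] = \tfrac{1}{2\beta n^2} \sum_{i,j=1}^{n} e^{\beta \langle x_i, x_j \rangle}$, which is the energy viewpoint invoked in the last sentence of the abstract.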


15.00: Antoine Godichon-Baggioni (Sorbonne Université, LPSM)

Title: Stochastic Newton algorithms with $O(Nd)$ operations

Abstract: The majority of machine learning methods can be regarded as the minimization of a risk function that is not directly available. To optimize this function using samples provided in an online fashion, stochastic gradient descent is a common tool; however, it can be highly sensitive to ill-conditioned problems. To address this issue, we focus on stochastic Newton methods. We first examine a version based on the Riccati (or Sherman-Morrison) formula, which allows recursive estimation of the inverse Hessian with reduced computational time. Specifically, we show that this method leads to asymptotically efficient estimates and requires $O(Nd^2)$ operations (where $N$ is the sample size and $d$ is the dimension). Finally, we explore how to adapt the stochastic Newton algorithm to a streaming context, where data arrive in blocks, and demonstrate that this approach can reduce the computational requirement to $O(Nd)$ operations.
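
For illustration, here is a minimal sketch of such a stochastic Newton iteration, written for online logistic regression (a case where the per-sample Hessian estimate is rank one, so the Sherman-Morrison formula applies directly). This is an assumed instantiation, not necessarily the speaker's exact algorithm, and all names are ours:

import numpy as np

def sherman_morrison(S_inv, u, c):
    # Return inv(S + c * u u^T) given inv(S), in O(d^2) operations.
    Su = S_inv @ u
    return S_inv - (c / (1.0 + c * (u @ Su))) * np.outer(Su, Su)

def stochastic_newton_logistic(X, y, s0=1.0):
    # Online stochastic Newton for logistic regression: each sample updates
    # both the parameter and the inverse of the cumulative Hessian estimate.
    N, d = X.shape
    theta = np.zeros(d)
    S_inv = np.eye(d) / s0                     # inverse of S_0 = s0 * I
    for x, t in zip(X, y):                     # t is a 0/1 label
        p = 1.0 / (1.0 + np.exp(-(x @ theta)))
        grad = (p - t) * x                     # per-sample gradient of the log-loss
        c = p * (1.0 - p)                      # weight of the rank-one Hessian x x^T
        S_inv = sherman_morrison(S_inv, x, c)  # O(d^2) Riccati-type inverse update
        theta = theta - S_inv @ grad           # step decays like 1/n automatically,
                                               # since S_inv shrinks as samples accrue
    return theta

Each iteration costs $O(d^2)$, hence $O(Nd^2)$ over $N$ samples; the streaming variant mentioned at the end of the abstract processes data in blocks to bring the total cost down to $O(Nd)$.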


16.00: Mohamed Ndaoud (ESSEC)

Title: TBA

Abstract: TBA