[Figure from Vaswani et al., 2017]

Theory of Transformer Neural Networks

We show that linear Transformers can perform in-context learning (ICL) by implementing preconditioned gradient descent in their forward pass. Beyond providing a construction, we prove global/local optimality for one-layer/multi-layer linear Transformers on the linear-regression ICL task. Our theoretical predictions align exactly with experimental observations. [NeurIPS 2023] [link to talk]
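
A minimal numerical sketch of this mechanism (a toy construction under our own assumptions, not the paper's code: the preconditioner A, the prompt format, and the weight matrices W_KQ, W_V below are illustrative): a single linear-attention layer with hand-picked weights reproduces the prediction of one step of preconditioned gradient descent, started from w = 0, on an in-context linear-regression prompt.

```python
# Sketch: one linear-attention layer vs. one preconditioned GD step on an
# in-context linear-regression prompt.  All names here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20                                  # feature dimension, context length

# Prompt: context tokens z_i = (x_i, y_i) and a query token (x_q, 0).
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star
x_q = rng.normal(size=d)
Z = np.hstack([X, y[:, None]])                # context tokens, shape (n, d+1)
z_q = np.concatenate([x_q, [0.0]])            # query token with unknown label

# Preconditioned GD step from w0 = 0 on L(w) = 1/(2n) sum_i (w.x_i - y_i)^2:
# w1 = w0 - A grad L(w0) = (1/n) A X^T y, and the prediction is w1 . x_q.
A = np.diag(rng.uniform(0.5, 1.5, size=d))    # a symmetric preconditioner
pred_gd = (A @ X.T @ y / n) @ x_q

# One linear attention layer (no softmax): keys/queries read the x-part of each
# token through A, values read the y-part, and the output lands in the query's
# label coordinate.
W_KQ = np.zeros((d + 1, d + 1)); W_KQ[:d, :d] = A     # combined W_K^T W_Q
W_V = np.zeros((d + 1, d + 1)); W_V[d, d] = 1.0       # pass labels through
attn_out = (W_V @ Z.T) @ (Z @ W_KQ @ z_q) / n         # linear attention on z_q
pred_attn = attn_out[d]                               # predicted label

assert np.allclose(pred_gd, pred_attn)
print(pred_gd, pred_attn)                     # the two predictions coincide
```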

In a follow-up paper, we extend our analysis to learning non-linear functions using non-linear Transformers. [arXiv preprint]

We also observe that common algorithms such as SGD and Adam behave on linear Transformers much as they do on full-fledged Transformers. Our observations suggest that linear Transformers may serve as a useful model for understanding the optimization of real Transformers. [preprint]

Geometric Diffusion on Riemannian Manifolds

We present an algorithm for sampling on a general Riemannian manifold, based on a geometric Euler discretization of the Riemannian Langevin Diffusion. The algorithm is efficiently implementable, requiring only the computation of an exponential map at each step. Our assumptions are general and encompass non-log-concave densities on negatively curved manifolds.
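
To make the update concrete, here is a minimal sketch of the geometric Euler step on the unit sphere (an illustrative choice of manifold; the potential, step size, and helper names such as exp_map and project_tangent are our own assumptions, not the paper's implementation): draw Gaussian noise in the tangent space, add the negative Riemannian gradient scaled by the step size, and move along the exponential map.

```python
# Sketch: geometric Euler discretization of Riemannian Langevin on the sphere
# S^{d-1}, targeting p(x) ∝ exp(-f(x)) with f(x) = -kappa <mu, x> (a toy choice).
import numpy as np

rng = np.random.default_rng(1)
d = 3
mu = np.array([0.0, 0.0, 1.0])
kappa = 4.0

def grad_f_ambient(x):
    # Euclidean gradient of f(x) = -kappa <mu, x>
    return -kappa * mu

def project_tangent(x, v):
    # Orthogonal projection of v onto the tangent space T_x S^{d-1}
    return v - (v @ x) * x

def exp_map(x, v):
    # Exponential map on the sphere: follow the geodesic from x in direction v
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * v / nv

def riemannian_langevin(x0, step, n_steps):
    x = x0
    for _ in range(n_steps):
        g = project_tangent(x, grad_f_ambient(x))      # Riemannian gradient
        xi = project_tangent(x, rng.normal(size=d))    # tangent-space Gaussian
        x = exp_map(x, -step * g + np.sqrt(2 * step) * xi)
    return x

samples = np.array([riemannian_langevin(np.array([1.0, 0.0, 0.0]), 0.01, 500)
                    for _ in range(200)])
print(samples.mean(axis=0))   # mass concentrates around mu, as the target dictates
```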

We show that the algorithm achieves ε Wasserstein error from the target distribution in O(1/ε^2) steps, matching the rate for Euclidean Langevin MCMC under similar assumptions. [NeurIPS 2022]

We also prove a generalized Central Limit Theorem, in which non-Gaussian semi-martingales on a Riemannian manifold are approximated by a Riemannian diffusion process. This helps us study properties of stochastic algorithms on manifolds, such as generalization error (see the section below). [preprint]

Diffusion Approximation and Learning

Stochastic Gradient Descent (SGD) is widely used for training neural networks. Beyond its computational benefits, it has been observed that SGD noise improves the generalization of the trained network.

We explain this by showing that SGD iterates with step size δ are approximated by a diffusion process whose diffusion matrix is given by the covariance of the SGD noise. Mixing of the diffusion process then implies a generalization guarantee for SGD via algorithmic stability.
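
As a toy illustration of this approximation (the quadratic loss, the noise covariance, the step size, and all helper names below are our own illustrative choices, not the paper's setup), the sketch below compares the distribution of SGD iterates with an Euler-Maruyama simulation of the corresponding diffusion over the same total time; for small δ the two distributions nearly coincide.

```python
# Sketch: SGD with step size delta on a quadratic loss L(x) = 0.5 x^T H x and
# Gaussian gradient noise of covariance Sigma, vs. the diffusion
#     dX_t = -grad L(X_t) dt + sqrt(delta * Sigma) dB_t
# simulated on a finer Euler-Maruyama grid over the same total time.
import numpy as np

rng = np.random.default_rng(2)
d, delta, n_steps, n_runs = 2, 0.05, 200, 2000
H = np.array([[2.0, 0.3], [0.3, 1.0]])        # Hessian of the quadratic loss
Sigma = np.array([[0.5, 0.1], [0.1, 0.2]])    # covariance of the gradient noise
L_noise = np.linalg.cholesky(Sigma)

def run_sgd(x0):
    x = x0.copy()
    for _ in range(n_steps):
        stoch_grad = H @ x + L_noise @ rng.normal(size=d)   # noisy gradient
        x = x - delta * stoch_grad
    return x

def run_diffusion(x0, substeps=10):
    # Euler-Maruyama for dX = -H X dt + sqrt(delta) L_noise dB on a finer grid,
    # run for the same total time T = n_steps * delta as the SGD trajectory.
    x, dt = x0.copy(), delta / substeps
    for _ in range(n_steps * substeps):
        dB = np.sqrt(dt) * rng.normal(size=d)
        x = x - dt * (H @ x) + np.sqrt(delta) * (L_noise @ dB)
    return x

x0 = np.array([1.0, -1.0])
sgd = np.array([run_sgd(x0) for _ in range(n_runs)])
sde = np.array([run_diffusion(x0) for _ in range(n_runs)])
print(np.cov(sgd.T))   # endpoint covariances of the two processes are close
print(np.cov(sde.T))
```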

We prove an approximation bound of O(δ^{1/8}) between SGD and the diffusion process. Our theory suggests that injecting suitably chosen noise improves generalization, which we verify in practice. [ICML 2020]

Using tools from Riemannian diffusion, we improve this approximation error to O(δ^{1/6}). Additionally, our new theory applies when the diffusion lies on a sub-manifold of the ambient Euclidean space. [preprint]

Sampling as Optimization in Wasserstein Space

I have designed and analyzed a number of Markov Chain Monte Carlo algorithms on Euclidean space based on the Langevin Diffusion: