[Figure: Transformer architecture. Source: Vaswani et al., 2017]
Theory of Transformer Neural Networks
We show that linear Transformers can perform in-context learning (ICL) by implementing preconditioned gradient descent in their forward pass. Beyond providing a construction, we prove global/local optimality for one-layer/multi-layer linear Transformers on the linear-regression ICL task. Our theoretical predictions align exactly with experimental observations. [NeurIPS 2023] [link to talk]
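As a minimal sketch of the mechanism (not the paper's exact construction), the following compares one step of preconditioned gradient descent on the in-context least-squares loss with a softmax-free linear-attention read-out; the ridge-style preconditioner `P` is an assumption chosen for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 20

# In-context linear regression prompt: (x_i, y_i) pairs plus a query x_q
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
y = X @ w_star
x_q = rng.normal(size=d)

# One step of preconditioned gradient descent on the in-context loss
# L(w) = 0.5 * sum_i (y_i - <x_i, w>)^2, starting from w0 = 0
P = np.linalg.inv(X.T @ X + 0.1 * np.eye(d))  # ridge-style preconditioner (demo choice)
grad0 = -X.T @ y                              # gradient of L at w0 = 0
w1 = -P @ grad0                               # w1 = w0 - P @ grad0 = P @ X.T @ y
pred_gd = x_q @ w1

# The same prediction computed as a single linear-attention read-out:
# raw inner-product scores (no softmax), with values carrying the labels y_i
scores = X @ P.T @ x_q                        # score_i = <x_i, P^T x_q>
pred_attn = scores @ y                        # sum_i score_i * y_i

print(np.allclose(pred_gd, pred_attn))        # -> True
```

With w0 = 0, one preconditioned gradient step gives w1 = P Xᵀy, so the attention read-out computes exactly x_qᵀ P Xᵀ y.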
In a follow-up paper, we extend our analysis to learning non-linear functions using non-linear Transformers. [arXiv preprint]
We also observe that common algorithms such as SGD and Adam behave very similarly on linear Transformers and on full-fledged Transformers. Our observations suggest that linear Transformers may serve as a useful model for understanding the optimization of real Transformers. [preprint]
Geometric Diffusion on Riemannian Manifolds
We present an algorithm for sampling on a general Riemannian manifold. Our algorithm is based on the geometric Euler discretization of the Riemannian Langevin Diffusion. It is efficiently implementable, requiring only the computation of an exponential map at each step. Our assumptions are general and encompass non-log-concave densities on negatively-curved manifolds.
We show that the algorithm achieves ε Wasserstein error from the target distribution in O(1/ε²) steps, matching the rate for Euclidean Langevin MCMC under similar assumptions. [NeurIPS 2022]
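To make the geometric Euler step concrete, here is a hedged toy instance on the unit sphere S² (where the exponential map is available in closed form), targeting a von Mises-Fisher-style density exp(κ⟨μ, x⟩); the target, step size, and iteration count are assumptions for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
mu = np.array([0.0, 0.0, 1.0])   # vMF-style target density exp(kappa * <mu, x>) on S^2 (demo choice)
kappa, step = 4.0, 0.01

def tangent(x, v):
    """Project v onto the tangent space of the unit sphere at x."""
    return v - (v @ x) * x

def exp_map(x, v):
    """Exponential map on the unit sphere: follow the geodesic from x along tangent v."""
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return x
    return np.cos(norm) * x + np.sin(norm) * (v / norm)

# Geometric Euler discretization of the Riemannian Langevin Diffusion:
# gradient drift plus tangent-space Gaussian noise, retracted via the exponential map
x = np.array([1.0, 0.0, 0.0])
for _ in range(5000):
    grad = tangent(x, kappa * mu)              # Riemannian gradient of the log-density
    noise = tangent(x, rng.normal(size=d))     # Gaussian noise projected to the tangent space
    x = exp_map(x, step * grad + np.sqrt(2 * step) * noise)

print(abs(np.linalg.norm(x) - 1.0) < 1e-8)     # iterates stay on the manifold
```

The exponential map keeps every iterate exactly on the manifold, which is the point of the geometric (rather than ambient-space) discretization.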
We also prove a generalized Central Limit Theorem, in which non-Gaussian semi-martingales on a Riemannian manifold are approximated by a Riemannian diffusion process. This helps us study properties, such as generalization error (see the section below), of stochastic algorithms on manifolds. [preprint]
Diffusion Approximation and Learning
Stochastic Gradient Descent (SGD) is often used in training neural networks. In addition to its computational benefits, it has been observed that the presence of SGD noise improves the generalization error of the trained network.
We explain this by showing that SGD iterates with step-size δ are approximated by a diffusion process whose diffusion matrix is given by the SGD covariance matrix. Mixing of the diffusion process implies a generalization guarantee for SGD via algorithmic stability.
We prove an approximation bound of O(δ^(1/8)) between SGD and the diffusion process. Our theory suggests that injection of suitably chosen noise improves generalization, which we verify in practice. [ICML 2020]
Using tools from Riemannian diffusion, we improve this approximation error to O(δ^(1/6)). Additionally, our new theory applies when the diffusion lies on a sub-manifold. [preprint]
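A hedged one-dimensional illustration of the diffusion approximation: for SGD on a quadratic loss with additive gradient noise (a stand-in for minibatch noise, with covariance σ²), the approximating diffusion is an Ornstein-Uhlenbeck process whose stationary variance δσ²/2 predicts the long-run spread of the SGD iterates. The loss, noise model, and step size are demo assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
delta, sigma, n_steps = 0.1, 1.0, 200_000

# SGD on L(x) = x^2 / 2 with additive gradient noise of variance sigma^2
x_sgd, samples = 0.0, []
for _ in range(n_steps):
    g = x_sgd + sigma * rng.normal()       # stochastic gradient at x_sgd
    x_sgd -= delta * g
    samples.append(x_sgd)

var_sgd = np.var(samples[n_steps // 2:])   # empirical stationary variance (burn-in discarded)

# The approximating diffusion dX = -X dt + sqrt(delta) * sigma dB is an
# Ornstein-Uhlenbeck process with stationary variance delta * sigma^2 / 2
var_diffusion = delta * sigma ** 2 / 2
print(var_sgd, var_diffusion)              # the two variances agree up to O(delta) error
```

The small-δ mismatch between the two variances is exactly the kind of discrepancy the O(δ^(1/6)) approximation bound controls.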
Sampling as Optimization in Wasserstein Space
I have designed and analyzed a number of Markov Chain Monte Carlo algorithms on Euclidean space based on the Langevin Diffusion:
In our [ALT 2018] paper, we show that Langevin MCMC converges to ε error in KL-divergence from the target distribution in O(1/ε) steps. Our analysis crucially uses the view of Langevin Diffusion as the gradient flow of KL-divergence in probability space.
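A minimal sketch of the Langevin MCMC iteration analyzed here (unadjusted Langevin: a gradient step on the potential plus Gaussian noise), targeting a standard Gaussian; the target, step size, and iteration count are demo assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
grad_U = lambda x: x                 # potential U(x) = x^2 / 2, i.e. standard Gaussian target
delta = 0.05                         # step size
x, samples = 5.0, []
for _ in range(100_000):
    # Langevin MCMC step: move along -grad U, then add sqrt(2 * delta) Gaussian noise
    x = x - delta * grad_U(x) + np.sqrt(2 * delta) * rng.normal()
    samples.append(x)
var_hat = np.var(samples[50_000:])   # empirical variance after burn-in
print(var_hat)                       # close to the target variance 1, up to O(delta) bias
```

The O(δ) bias visible here (the chain's stationary distribution is only approximately the target) is what the KL-divergence analysis quantifies.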
We add momentum to Langevin MCMC to obtain the Underdamped Langevin MCMC algorithm. This leads to a quadratic speed-up in the Wasserstein convergence rate. [COLT 2018] Subsequently, the quadratic speedup was extended to the KL convergence rate as well, using the gradient flow analysis. [Bernoulli 2021]
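Underdamped Langevin couples a velocity variable to the position; as a hedged sketch (a naive Euler discretization of the underdamped dynamics on a standard Gaussian target, whereas the paper's analysis uses a sharper integrator; friction, step size, and target are demo assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
grad_U = lambda x: x                 # potential U(x) = x^2 / 2, i.e. standard Gaussian target
gamma, delta = 2.0, 0.01             # friction coefficient and step size
x, v, samples = 3.0, 0.0, []
for _ in range(200_000):
    # Euler step for dv = -gamma v dt - grad U(x) dt + sqrt(2 gamma) dB, dx = v dt
    v += -gamma * v * delta - grad_U(x) * delta + np.sqrt(2 * gamma * delta) * rng.normal()
    x += v * delta
    samples.append(x)
var_hat = np.var(samples[100_000:])  # empirical variance of x after burn-in
print(var_hat)                       # close to the target variance 1
```

Note the noise enters only through the velocity; this smoothing of the sample paths underlies the quadratic speed-up over overdamped Langevin.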
We analyze the convergence of both Langevin MCMC and Underdamped Langevin MCMC for non-log-concave target densities. The number of steps to obtain ε Wasserstein error is O(1/ε²) and O(1/ε), respectively, matching the log-concave setting. However, there is an additional dependence on the inverse mixing rate, and the mixing rate may (unavoidably) be exponentially small in the dimension. [preprint]