Stochastic Approximation beyond Gradient for Machine Learning:
Taming Noise with Guarantees.
Stochastic Approximation (SA) is a classical iterative algorithm with a history of over 70 years. The goal of an SA scheme is to find the roots of a nonlinear system when the mean field cannot be computed explicitly but a random oracle is available. Recently, the range of applications of SA schemes has widened considerably with applications to statistical machine learning, driven by the need to handle large amounts of data observed under uncertainty.
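To fix ideas, the root-finding setting above can be sketched with the classical Robbins-Monro recursion. This is a minimal illustrative example, not material from the course: the mean field h, the noise level, and the function names are assumptions chosen for the toy problem.

```python
import random

def sa_root_find(oracle, theta0=0.0, n_iter=5000):
    """Robbins-Monro recursion: theta_{k+1} = theta_k - gamma_k * H_{k+1},
    where H_{k+1} is a noisy observation of the mean field h(theta_k)."""
    theta = theta0
    for k in range(1, n_iter + 1):
        gamma = 1.0 / k  # decreasing steps: sum gamma = inf, sum gamma^2 < inf
        theta -= gamma * oracle(theta)
    return theta

# Toy problem (illustrative): mean field h(theta) = theta - 2, root at 2,
# observed only through additive Gaussian noise.
random.seed(0)
root = sa_root_find(lambda t: (t - 2.0) + random.gauss(0.0, 0.1))
```

With these decreasing step sizes, the iterate averages out the oracle noise and approaches the root theta = 2.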
An extensive literature on stochastic optimization is devoted to the stochastic (sub)gradient (SG) algorithm, by far the most popular instance of the SA scheme. Stochastic gradient algorithms are characterized by an update recursion whose mean field is the gradient of a loss function to be minimized, observed through an unbiased oracle. A lesser-known fact, however, is that the SA scheme also covers non-gradient stochastic algorithms, whose expected oracles are not the gradient of any function and whose oracles may be biased. The first part of this course presents non-gradient perspectives of SA to a statistical machine learning audience: compressed stochastic gradient, stochastic Expectation-Maximization and, more generally, stochastic Majorize-Minimization, and Temporal-Difference learning for Reinforcement Learning.
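Temporal-Difference learning is a standard example of a non-gradient SA scheme: its mean field is affine in the value function and, in general, is not the gradient of any objective. A minimal sketch, with an assumed toy Markov chain (the two-state chain, rewards, and parameter values are illustrative choices, not taken from the course):

```python
import random

def td0(next_state, reward, n_states, gamma=0.9, alpha=0.01, n_iter=20000):
    """TD(0) policy evaluation: an SA recursion whose mean field is
    affine in V and, in general, not the gradient of any loss."""
    V = [0.0] * n_states
    s = 0
    for _ in range(n_iter):
        s2 = next_state(s)
        td_error = reward(s, s2) + gamma * V[s2] - V[s]
        V[s] += alpha * td_error  # update only the visited state
        s = s2
    return V

# Toy 2-state chain with uniform transitions and unit rewards:
# the true value is 1 / (1 - gamma) = 10 in every state.
random.seed(1)
V = td0(lambda s: random.randrange(2), lambda s, s2: 1.0, n_states=2)
```

The recursion fits the SA template with the TD error playing the role of the random oracle, even though no loss function is being minimized.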
Most existing overviews of convergence for SA schemes are restricted to the gradient case, and are sometimes not comprehensive when it comes to results for non-gradient algorithms. In addition, they are essentially limited to asymptotic convergence for algorithms with decreasing step sizes. Possible reasons for this are the lack of a proper Lyapunov function on which to base the convergence analysis, and the potential bias that may destabilize the SA recursion. The second part of the course presents design guidelines for SA algorithms backed by theory: we propose a general framework that unifies existing theories of SA and applies to non-gradient SA, and we develop convergence results with an emphasis on non-asymptotic analysis.
Finally, the last part of the course reviews recent advances in SA, such as variance reduction: we will discuss the Stochastic Path-Integrated Differential EstimatoR (SPIDER) method, originally introduced for stochastic gradient algorithms and later extended to Expectation-Maximization, providing an algorithmic description and a non-asymptotic convergence analysis for a general SA scheme.
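The SPIDER estimator can be sketched as follows for a finite-sum problem: a full gradient refresh every q steps, and in between an accumulation of sampled gradient differences along the iterate path. This is a simplified one-dimensional sketch under assumed parameter choices, not the general SA version analyzed in the course.

```python
import random

def spider(grad_i, n, theta0=0.0, eta=0.05, q=10, n_outer=200):
    """SPIDER for a finite sum (1/n) * sum_i f_i(theta): refresh the
    estimator v with a full gradient every q steps, and in between
    accumulate sampled gradient differences along the iterate path."""
    theta, prev, v = theta0, theta0, 0.0
    for k in range(n_outer * q):
        if k % q == 0:
            v = sum(grad_i(i, theta) for i in range(n)) / n  # full refresh
        else:
            i = random.randrange(n)
            v += grad_i(i, theta) - grad_i(i, prev)  # path-integrated correction
        prev = theta
        theta -= eta * v
    return theta

# Toy least squares (illustrative): f_i(theta) = (theta - a_i)^2 / 2,
# whose minimizer is mean(a) = 2.
a = [1.0, 2.0, 3.0]
random.seed(0)
theta_hat = spider(lambda i, t: t - a[i], n=len(a))
```

Because the correction term reuses the same sample i at theta and at the previous iterate, the estimator's variance shrinks as consecutive iterates get close, which is what enables the improved non-asymptotic rates.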
The course is based on joint works with Aymeric Dieuleveut (CMAP, Ecole Polytechnique, France), Eric Moulines (ML Dpt, MBZUAI, UAE), and Hoi-To Wai (Dpt of SEEM, Chinese University of Hong Kong).
All the slides are here:
Slides for Lesson 1
Slides for Lesson 2
Slides for Lesson 3
Slides for Lesson 4