Mean-field Optimization for Two-layer Neural Networks

The optimization theory of two-layer neural networks in the mean-field regime (mean-field neural networks) was developed by [Nitanda and Suzuki (2017)], [Chizat and Bach (NeurIPS2018)], and [Mei, Montanari, and Nguyen (PNAS2018)]. This theory shows that the gradient-descent dynamics of mean-field neural networks can be described as a Wasserstein gradient flow in the space of probability distributions. Subsequently, global convergence theories for mean-field neural networks were provided by several research groups. However, neural networks in the mean-field regime are generally difficult to analyze, and additional conditions or regularizations are required to guarantee efficient convergence rates.
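To fix ideas, a mean-field two-layer network can be viewed as an average (rather than a sum) over neurons, i.e., an integral against an empirical measure over neuron parameters. The following minimal sketch (names and scaling choices are illustrative assumptions, not taken from the cited papers) shows this finite-particle approximation:

```python
# Hypothetical sketch of a two-layer network in the mean-field scaling:
# the output is the AVERAGE over M neurons, so the network is determined
# by the empirical distribution of the neuron parameters (w_j, a_j).
import numpy as np

def mean_field_net(x, W, a):
    """f(x) = (1/M) * sum_j a_j * tanh(w_j . x)."""
    M = W.shape[0]
    return (a @ np.tanh(W @ x)) / M

rng = np.random.default_rng(0)
M, d = 1000, 5
W = rng.standard_normal((M, d))   # neuron weights, sampled from an initial distribution
a = rng.standard_normal(M)        # output weights
x = rng.standard_normal(d)
out = mean_field_net(x, W, a)     # scalar network output
```

As M grows, the empirical measure over `(w_j, a_j)` approximates a continuous distribution, and training the particles corresponds to evolving that distribution.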

Mean-field Langevin Dynamics

The noisy (particle) gradient descent for mean-field neural networks can be described as the mean-field Langevin dynamics [Hu et al. (2019)], whose distribution evolves according to a nonlinear Fokker–Planck equation. Its convergence in both continuous- and discrete-time settings was shown in [Nitanda et al. (AISTATS2022)] with a simple proof that mirrors classical convex optimization theory. Independently, [Chizat (TMLR2022)] showed the same convergence result for the continuous dynamics. The mean-field Langevin dynamics is essentially an extension of the (standard) Langevin dynamics to nonlinear functionals and mean-field settings, and our convergence analysis likewise extends that of the Langevin dynamics. The key object in the theory is the proximal Gibbs distribution, which makes the connection with convex optimization theory.

(Figure: cvx_mf_langevin.pdf)
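A minimal simulation can illustrate the noisy particle gradient descent. The setup below is a toy assumption of ours, not the papers' exact problem: particles minimize a simple mean-field objective, the squared error of an averaged feature, with an added ℓ2 penalty and entropy regularization, and each step adds Gaussian noise scaled by the entropy-regularization strength:

```python
# Minimal toy sketch (illustrative, not the cited papers' setting):
# mean-field Langevin dynamics with N interacting particles minimizing
#   F(mu) = 0.5 * (E_mu[tanh(w)] - y)^2 + (lam2/2) * E_mu[w^2] + lam1 * Ent(mu).
# The noise scale sqrt(2 * eta * lam1) corresponds to the entropy term.
import numpy as np

rng = np.random.default_rng(1)
N, eta, lam1, lam2, y = 2000, 0.1, 0.01, 0.1, 0.3
w = rng.standard_normal(N)                 # particles ~ initial distribution

for _ in range(1000):
    residual = np.tanh(w).mean() - y       # particles interact only through this mean
    grad = residual * (1.0 - np.tanh(w) ** 2) + lam2 * w   # gradient of the first variation
    w = w - eta * grad + np.sqrt(2 * eta * lam1) * rng.standard_normal(N)
```

After many steps the empirical distribution of the particles approximates the minimizer of the entropy-regularized objective, so the averaged feature `np.tanh(w).mean()` settles near the target `y` up to a regularization-induced bias.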

Efficient optimization with PDA and P-SDCA

[Nitanda et al. (NeurIPS2021)] proposed the particle dual averaging (PDA) method and showed that a two-layer mean-field neural network can be optimized in polynomial time under relative entropy (Kullback–Leibler divergence) regularization. This was the first study to provide quantitative convergence guarantees for mean-field neural networks. This result was later refined by a subsequent study [Oko et al. (ICLR2022)], which proposed particle stochastic dual coordinate ascent (P-SDCA) with faster convergence guarantees for empirical risk minimization problems. The PDA and P-SDCA methods extend the dual averaging method and stochastic dual coordinate ascent, originally developed for optimization over finite-dimensional spaces, to the space of probability distributions. More generally, these methods can minimize convex functionals with negative entropy regularization, and hence they can also be regarded as nonlinear extensions of the Langevin algorithm, which minimizes linear functionals with negative entropy regularization.

(Figure: pda.pdf)
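The dual-averaging idea can be sketched on a toy problem. The example below is our own illustrative assumption, not the PDA algorithm itself: the parameter space is discretized to a finite grid (so the entropy-regularized subproblem has an exact closed-form Gibbs solution instead of requiring particle sampling), and the iterate is the Gibbs distribution of the running average of first variations:

```python
# Toy sketch of the dual-averaging principle behind PDA (illustrative only):
# minimize F(q) + lam * KL(q || uniform) over distributions q on a finite grid,
# where F(q) = 0.5 * (E_q[tanh(w)] - y)^2. Each step averages the first
# variations seen so far and takes the corresponding Gibbs distribution.
import numpy as np

grid = np.linspace(-3.0, 3.0, 200)        # discretized parameter space
feat = np.tanh(grid)
y, lam = 0.3, 0.05
g_sum = np.zeros_like(grid)
q = np.full(grid.size, 1.0 / grid.size)   # start from the uniform distribution

for t in range(1, 201):
    residual = q @ feat - y               # E_q[tanh(w)] - y
    g = residual * feat                   # first variation of F at q
    g_sum += g
    logits = -(g_sum / t) / lam           # Gibbs distribution of the averaged gradients
    q = np.exp(logits - logits.max())
    q /= q.sum()
```

On the grid, the entropy-regularized linear subproblem is solved exactly in one line; in PDA proper, the corresponding Gibbs distribution over a continuous space is approximated by sampling with an inner Langevin loop, which is where the particles enter.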