Preconditioned stochastic gradient descent (PSGD)
PSGD is a second-order stochastic optimization method. It differs from most methods in its inherent ability to handle nonconvexity and gradient noise. Among the many possible choices, the affine group and the low-rank approximation (LRA) preconditioners are the two most useful ones. Please check my PyTorch implementation for details. The TensorFlow and NumPy implementations are not actively maintained.
Overview of PSGD
PSGD estimates a preconditioner that scales the gradient in a way closely comparable to Newton's method. It applies to both convex and non-convex optimization, with exact or noisy gradients. Let us define the following variables: x is the parameter vector to be optimized; g is the exact or stochastic gradient at x; dx is an arbitrarily small random perturbation around x; dg is the associated gradient perturbation due to dx; P is the estimated preconditioner around x. PSGD updates x as
x[new] = x[old] - u P g
where 0 < u < 1 is a normalized step size, and the preconditioner is updated along with x by minimizing the following criterion
E[ dg^T P dg + dx^T P^-1 dx ] .
Here operator E takes expectation. Under mild conditions, there exists a unique positive definite P to minimize the above criterion such that
P E[ dg dg^T ] P = E[ dx dx^T ] .
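As a sanity check, this minimizer can be verified numerically in the simplest setting. The sketch below (an illustration, not the reference implementation) considers a diagonal preconditioner on a toy quadratic, where the criterion E[dg^T P dg + dx^T P^-1 dx] decouples per coordinate and its minimizer has the closed form p_i = sqrt(E[dx_i^2] / E[dg_i^2]); all variable names are hypothetical.

```python
import numpy as np

# Toy check of the PSGD criterion minimizer in the diagonal case.
# For diagonal P, the criterion decouples per coordinate and is minimized by
# p_i = sqrt(E[dx_i^2] / E[dg_i^2]).
rng = np.random.default_rng(0)

n, samples = 5, 10000
h = np.array([0.1, 0.5, 1.0, 5.0, 20.0])  # diagonal Hessian of a toy quadratic
dx = rng.normal(size=(samples, n))        # small random perturbations
dg = dx * h                               # exact gradient perturbations: dg = H dx

p = np.sqrt(np.mean(dx**2, axis=0) / np.mean(dg**2, axis=0))

# Verify the relation P E[dg dg^T] P = E[dx dx^T] on the diagonal,
# and note that p recovers the inverse Hessian 1/h in this exact-gradient case.
lhs = p * np.mean(dg**2, axis=0) * p
rhs = np.mean(dx**2, axis=0)
print(np.allclose(lhs, rhs), np.allclose(p, 1.0 / h))
```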
Now it is clear that P is comparable to the inverse of the Hessian: when g is exact, dg = H dx, and hence H^-1 dg dg^T H^-1 = dx dx^T, where H is the Hessian at x. In most of my PSGD implementations, we let P = Q^T Q and fit Q instead of P, mainly for one reason: efficiency and numerical stability whenever Q can form a Lie group. There are numerous choices for the detailed form of Q. For large-scale problems, we can significantly accelerate convergence even when Q has a simple form with a limited number of parameters. PSGD differs considerably from quasi-Newton methods in how it estimates the preconditioner. Section IV.C of the paper https://arxiv.org/abs/1512.04202 explains why these traditional methods may not be good for stochastic optimization (even for convex problems).
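To make the whole procedure concrete, here is a simplified sketch of PSGD on a toy quadratic with a dense Q. The Q update descends the criterion E[dg^T P dg + dx^T P^-1 dx] along its relative (Lie-group) gradient a a^T - b b^T, where a = Q dg and b = Q^-T dx. The step sizes, max-abs normalization, and loop structure are illustrative assumptions, not the reference PyTorch implementation.

```python
import numpy as np

# Simplified PSGD sketch on f(x) = 0.5 x^T H x with a dense preconditioner
# P = Q^T Q.  Step sizes and normalization are illustrative choices.
rng = np.random.default_rng(1)

n = 10
B = rng.normal(size=(n, n))
B = B @ B.T + 0.1 * np.eye(n)
H = B / np.linalg.norm(B, 2)        # ill-conditioned Hessian, spectral norm 1

x = rng.normal(size=n)
loss0 = 0.5 * x @ H @ x
Q = np.eye(n)
mu, lr_q = 0.5, 0.02                # normalized step sizes for x and Q

for _ in range(2000):
    g = H @ x                       # exact gradient of the quadratic
    dx = 1e-4 * rng.normal(size=n)  # small random perturbation around x
    dg = H @ dx                     # associated gradient perturbation

    a = Q @ dg
    b = np.linalg.solve(Q.T, dx)    # b = Q^-T dx
    grad = np.outer(a, a) - np.outer(b, b)
    Q -= lr_q / (np.abs(grad).max() + 1e-30) * grad @ Q

    x -= mu * (Q.T @ (Q @ g))       # x <- x - u P g with P = Q^T Q

print("loss before/after:", loss0, 0.5 * x @ H @ x)
```

Once Q settles, P = Q^T Q approximates H^-1 and the x update behaves like a damped Newton step, so the loss drops far faster than plain gradient descent would on this ill-conditioned problem.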
Very old (pre-TensorFlow!) Matlab implementation (no longer maintained)
RNN training examples written in Matlab: Report on solving pathological RNN problems; Supplement; Matlab code; Why the delayed XOR problem is hard
The development of PSGD
Preconditioned stochastic gradient descent, arXiv:1512.04202, 2015. (The general theory of PSGD.)
Preconditioner on matrix Lie group for SGD, arXiv:1809.10232, 2018. (Focus on preconditioners on the affine Lie group.)
Black box Lie group preconditioners for SGD, arXiv:2211.04422, 2022. (Mainly about the low-rank approximation (LRA) preconditioner. I have also prepared these supplementary materials for readers without a Lie group background.)