Preconditioned stochastic gradient descent (PSGD)

PSGD is a second-order stochastic optimization method. It differs from most methods in its inherent ability to handle nonconvexity and gradient noise. The affine group and the low-rank approximation (LRA) preconditioners are the two most useful ones among many possible choices. Please check my PyTorch implementation for details. The TensorFlow and NumPy implementations are not actively maintained.

Overview of PSGD

PSGD estimates a preconditioner to scale the gradient in a way closely comparable to the Newton method. It applies to both convex and non-convex optimization problems, with exact or noisy gradients. Let us define the following variables: x is the parameter vector to be optimized; g is the exact or stochastic gradient at x; dx is an arbitrarily small random perturbation around x; dg is the associated gradient perturbation due to dx; P is the estimated preconditioner around x. PSGD updates x as

x[new] = x[old] - u P g

where 0 < u < 1 is a normalized step size, and the preconditioner is updated along with x by minimizing the following criterion

E[ dg^T P dg + dx^T P^{-1} dx ] .

Here the operator E takes expectation. Under mild conditions, there exists a unique positive definite P that minimizes the above criterion, and it satisfies

P E[ dg dg^T ] P = E[ dx dx^T ] .
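
The following NumPy snippet is a small, purely illustrative check of this claim (it is not part of any PSGD implementation). Writing A = E[ dg dg^T ] and B = E[ dx dx^T ], the criterion equals trace(P A) + trace(P^{-1} B), and its unique symmetric positive definite minimizer has the closed form P = A^{-1/2} (A^{1/2} B A^{1/2})^{1/2} A^{-1/2}; the snippet verifies that this P satisfies P A P = B and attains a smaller criterion value than randomly drawn alternatives. The random matrices A and B are stand-ins for the two expectations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

def random_spd(n):
    # random symmetric positive definite matrix
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

def spd_sqrt(M):
    # principal square root of a symmetric positive definite matrix
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(w)) @ V.T

A, B = random_spd(n), random_spd(n)   # stand-ins for E[dg dg^T] and E[dx dx^T]

# closed-form SPD solution of P A P = B
A_half = spd_sqrt(A)
A_half_inv = np.linalg.inv(A_half)
P = A_half_inv @ spd_sqrt(A_half @ B @ A_half) @ A_half_inv

print(np.allclose(P @ A @ P, B))      # the fixed-point relation holds: True

def criterion(M):
    # trace form of E[dg^T M dg + dx^T M^{-1} dx]
    return np.trace(M @ A) + np.trace(np.linalg.inv(M) @ B)

# P attains a strictly smaller criterion value than random SPD alternatives
print(all(criterion(P) < criterion(random_spd(n)) for _ in range(100)))   # True
```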

Now it is clear that P is comparable to the inverse of the Hessian: when g is exact, dg = H dx, and thus H^{-1} dg dg^T H^{-1} = dx dx^T, where H is the Hessian at x. In most of my PSGD implementations, we let P = Q^T Q and fit Q instead of P directly, mainly for efficiency and numerical stability whenever Q forms a Lie group. There are numerous choices for the detailed form of Q. For large-scale problems, convergence can be accelerated significantly even when Q takes simple forms with a limited number of parameters. PSGD differs considerably from quasi-Newton methods in how it estimates the preconditioner. Section IV.C of the paper https://arxiv.org/abs/1512.04202 explains why these traditional methods may not work well for stochastic optimization (even for convex problems).
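
Below is a minimal sketch of the overall PSGD loop, assuming the simplest possible choice of Q: a diagonal matrix with positive entries, so that P = Q^T Q is just q**2 elementwise. The toy objective, the step sizes, the finite-difference probe for dg, and the normalized gradient step on the criterion are all illustrative assumptions of this sketch; the actual PyTorch implementation uses richer preconditioners (e.g., the affine group and LRA forms) and multiplicative updates of Q on its Lie group.

```python
import torch

def loss(x):
    # a small non-convex toy objective (a mild Rosenbrock-like function)
    return torch.sum((x[1:] - x[:-1] ** 2) ** 2) + torch.sum((1.0 - x) ** 2)

x = torch.zeros(10, requires_grad=True)
q = torch.ones(10)            # diagonal of Q; the preconditioner is P = diag(q**2)
u, lr_q = 0.5, 0.1            # normalized step size for x; step size for fitting Q

for step in range(2000):
    g = torch.autograd.grad(loss(x), x)[0]      # gradient at x (could be stochastic)

    # small random perturbation dx and the resulting gradient perturbation dg
    dx = 1e-4 * torch.randn_like(x)
    xp = (x + dx).detach().requires_grad_(True)
    dg = torch.autograd.grad(loss(xp), xp)[0] - g

    # one normalized gradient step on the criterion dg^T P dg + dx^T P^{-1} dx,
    # whose gradient with respect to q is 2*(q*dg**2 - dx**2/q**3)
    crit_grad = 2.0 * (q * dg ** 2 - dx ** 2 / q ** 3)
    q = (q - lr_q * crit_grad / (crit_grad.abs().max() + 1e-30)).clamp(min=1e-8)

    # preconditioned parameter update: x <- x - u * P * g
    with torch.no_grad():
        x -= u * q ** 2 * g

    if step % 500 == 0:
        print(step, loss(x).item())
```

With exact gradients and a nearly diagonal Hessian, q**2 settles close to the inverse of the Hessian diagonal, recovering the Newton-like scaling described above.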

Very old (pre-TensorFlow!) MATLAB implementation (no longer maintained)

RNN training examples written in MATLAB: Report on solving pathological RNN problems; Supplement; MATLAB code; Why the delayed XOR problem is hard

The development of PSGD