Optimizer

Mini-batch gradient descent updates the parameters using a mini-batch, i.e. a subset of the training examples. The direction of each update has some variance, so the path taken by mini-batch gradient descent “oscillates” toward convergence.

During backward propagation, we use dW and db to update our parameters W and b as follows:

W = W – learning rate * dW

b = b – learning rate * db
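As a minimal Python/NumPy sketch of this vanilla update (the function name gradient_descent_update and the default learning rate are illustrative, not from any particular library; dW and db are assumed to have already been computed during backward propagation):

import numpy as np

def gradient_descent_update(W, b, dW, db, learning_rate=0.01):
    # Vanilla update: step directly along the current mini-batch gradients.
    W = W - learning_rate * dW
    b = b - learning_rate * db
    return W, b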

Below we discuss some other methods that can be used to update the parameters W and b, and then explain the reasoning behind them.

Gradient Descent with Momentum

Gradient descent with momentum considers the past gradients to smooth out the update. It computes an exponentially weighted moving average of the gradients and then uses that average to update the weights instead. It typically converges faster than the standard gradient descent algorithm.

Implementing Gradient Descent with Momentum

In momentum, instead of using dW and db directly on each iteration, we take exponentially weighted averages of dW and db.

  • Initialize VdW and Vdb to zero.

  • On iteration t, compute the derivatives dW and db using the current mini-batch.

  • Update VdW and Vdb

VdW = β * VdW + (1 – β) * dW

Vdb = β * Vdb + (1 – β) * db

  • Update parameters W and b

W = W – learning rate * VdW

b = b – learning rate * Vdb

Where beta ‘β’ is another hyper-parameter, called momentum, that ranges from 0 to 1. It sets the weight given to the average of the previous gradients versus the current gradient when calculating the new weighted average.
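A minimal Python/NumPy sketch of this update follows; the names (initialize_momentum, momentum_update, VdW, Vdb) and default values are illustrative rather than taken from any specific library:

import numpy as np

def initialize_momentum(W, b):
    # VdW and Vdb start at zero, with the same shapes as W and b.
    return np.zeros_like(W), np.zeros_like(b)

def momentum_update(W, b, dW, db, VdW, Vdb, learning_rate=0.01, beta=0.9):
    # Exponentially weighted averages of the gradients.
    VdW = beta * VdW + (1 - beta) * dW
    Vdb = beta * Vdb + (1 - beta) * db
    # Update the parameters with the averaged gradients instead of dW and db.
    W = W - learning_rate * VdW
    b = b - learning_rate * Vdb
    return W, b, VdW, Vdb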

Selecting Momentum

  • A higher momentum (beta) smooths out the update more, because it gives more weight to the past gradients.

  • It is recommended to use the default value β = 0.9, but if required it can be tuned between 0.8 and 0.999.

  • Momentum takes past gradients into account to smooth out the steps of gradient descent. It can be applied with batch gradient descent, mini-batch gradient descent or stochastic gradient descent.

RMSprop

Root mean square prop, or RMSprop, uses the same concept of an exponentially weighted average of the gradients as gradient descent with momentum, but it differs in how the parameters are updated.

Implementing RMSprop

In RMSprop, instead of using dW and db directly on each iteration, we take exponentially weighted averages of the squares of dW and db.

  • Initialize SdW and Sdb to zero

  • On iteration t, compute the derivatives dW and db using the current mini-batch.

  • Update SdW and Sdb

SdW = β * SdW + (1 – β) * dW²

Sdb = β * Sdb + (1 – β) * db²

  • Update parameters W and b

W = W – learning rate * dW / sqrt(SdW)

b = b – learning rate * db / sqrt(Sdb)

Where beta ‘β’ is another hyper-parameter that takes values from 0 to 1. It sets the weight given to the average of the previous squared gradients versus the square of the current gradient when calculating the new weighted average.

In the contour example discussed below, the horizontal direction corresponds to W and the vertical direction to b. SdW is relatively small, so dW is divided by a relatively small number and the horizontal updates stay large, whereas Sdb is relatively large, so db is divided by a relatively large number, which slows down the updates in the vertical direction.
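A corresponding Python/NumPy sketch of the RMSprop update is shown below; the names are illustrative, and a small eps term is added to the denominator purely to avoid division by zero (an implementation detail rather than part of the formulas above):

import numpy as np

def rmsprop_update(W, b, dW, db, SdW, Sdb, learning_rate=0.001, beta=0.9, eps=1e-8):
    # Exponentially weighted averages of the squared gradients.
    SdW = beta * SdW + (1 - beta) * dW ** 2
    Sdb = beta * Sdb + (1 - beta) * db ** 2
    # Scale each gradient by the root of its running average;
    # eps only guards against division by zero.
    W = W - learning_rate * dW / (np.sqrt(SdW) + eps)
    b = b - learning_rate * db / (np.sqrt(Sdb) + eps)
    return W, b, SdW, Sdb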

Adam Optimization Algorithm

The Adaptive Moment Estimation or Adam optimization algorithm is one of those algorithms that work well across a wide range of deep learning architectures. It is recommended by many well-known neural network algorithm experts. The Adam optimization algorithm is a combination of gradient descent with momentum and RMSprop algorithms.

Some advantages of Adam include:

  • Relatively low memory requirements (though higher than gradient descent and gradient descent with momentum).

  • Usually works well even with little tuning of hyper-parameters (except the learning rate alpha).

Implementing Adam

  1. First, it calculates an exponentially weighted average of past gradients, and stores it in variables VdW & Vdb (before bias correction) and VdWcorrected & Vdbcorrected (with bias correction).

  2. Then it calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables SdW & Sdb (before bias correction) and SdWcorrected & Sdbcorrected (with bias correction).

  3. Finally, it updates the parameters in a direction based on combining information from steps 1 and 2.

Steps to implement:

  • Initialize VdW, SdW, Vdb and Sdb to zero.

  • On iteration t, compute the derivatives dW and db using the current mini-batch.

  • Update VdW and Vdb like momentum.

VdW = β1 * VdW + (1 – β1) * dW

Vdb = β1 * Vdb + (1 – β1) * db

  • Update SdW and Sdb like RMSprop.

SdW = β2 * SdW + (1 – β2) * dW²

Sdb = β2 * Sdb + (1 – β2) * db²

  • In the Adam implementation, we also apply bias correction.

VdWcorrected = VdW / (1 – β1^t)

Vdbcorrected = Vdb / (1 – β1^t)

SdWcorrected = SdW / (1 – β2^t)

Sdbcorrected = Sdb / (1 – β2^t)

  • Update parameters W and b.

W = W – learning rate * (VdWcorrected / sqrt(SdWcorrected + ε))

b = b – learning rate * (Vdbcorrected / sqrt(Sdbcorrected + ε))

where:

  • epsilon ‘ε’ is a very small number to avoid dividing by zero (ε = 10^-8).

  • β1 and β2 are hyper-parameters that control the two exponentially weighted averages. In practice, we use the default values β1 = 0.9 and β2 = 0.999.

  • Alpha is the learning rate; a range of values should be tested to see what works best for different problems.
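Putting these steps together, a minimal Python/NumPy sketch of one Adam update might look as follows (the names are illustrative; t is the 1-based iteration counter used for bias correction):

import numpy as np

def adam_update(W, b, dW, db, VdW, Vdb, SdW, Sdb, t,
                learning_rate=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum-like exponentially weighted averages of the gradients.
    VdW = beta1 * VdW + (1 - beta1) * dW
    Vdb = beta1 * Vdb + (1 - beta1) * db
    # RMSprop-like exponentially weighted averages of the squared gradients.
    SdW = beta2 * SdW + (1 - beta2) * dW ** 2
    Sdb = beta2 * Sdb + (1 - beta2) * db ** 2
    # Bias correction, so early iterations are not biased toward zero.
    VdW_corr = VdW / (1 - beta1 ** t)
    Vdb_corr = Vdb / (1 - beta1 ** t)
    SdW_corr = SdW / (1 - beta2 ** t)
    Sdb_corr = Sdb / (1 - beta2 ** t)
    # Combined update (eps inside the square root, matching the formulas above).
    W = W - learning_rate * VdW_corr / np.sqrt(SdW_corr + eps)
    b = b - learning_rate * Vdb_corr / np.sqrt(Sdb_corr + eps)
    return W, b, VdW, Vdb, SdW, Sdb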

So what do these optimizer algorithms basically do?

Consider an example where we are trying to optimize a cost function whose contours look like the ones below, where the red dot denotes the position of the local optimum (minimum).

We start gradient descent from point ‘A’, and after one iteration we may end up at point ‘B’, on the other side of the ellipse. Another step of gradient descent may then end at point ‘C’. With each iteration, we move toward the local optimum with up-and-down oscillations. If we use a larger learning rate, the vertical oscillations have a higher magnitude. These vertical oscillations slow down gradient descent and prevent us from using a much larger learning rate.

This is where algorithms like momentum and RMSprop come into play.

In the case of momentum, by using an exponentially weighted average of dW and db, we tend to average the oscillations in the vertical direction out to close to zero, whereas in the horizontal direction all the derivatives point to the right, so the average in the horizontal direction remains large. This allows the algorithm to take a more direct path toward the local optimum and damps out the vertical oscillations. To achieve this, we use the hyper-parameter beta ‘β’ (momentum), which ranges from 0 to 1 and sets the weight given to the previous average versus the current gradient when calculating the new weighted average.

In the case of RMSprop, we slow down the updates for b, the parameter responsible for the vertical oscillations in this example. By slowing down the updates for b, we dampen the vertical oscillations, and because the updates for W remain comparatively large, we can still move quickly toward the local optimum.

In the case of Adam, we use both of the above approaches.
