neural network

Draft notes. neural networks and deep learning (http://neuralnetworksanddeeplearning.com/)

Link to back propagation algorithm details

Link to Softmax details

Link to Cross Entropy cost function details

Link to CNN formulas

1. Brief

A Neural Network has layers of neurons. The first layer is considered as the model input and simply emits input values. The last layer outputs results.

Every layer takes inputs from the previous layer. A link from a neuron to another neuron in the next layer carries a weight w.

Every neuron itself has a bias b (the negative of its triggering threshold), so the neuron fires only if its total input exceeds the threshold.

Therefore, every neuron can be modelled as wX + b.

w = [w1, w2, .. wn] are weights for the n input links. X = [x1, x2, ... xn]^T are the outputs from the previous layer / external signals.

If wX + b > 0, which means the total input wX is larger than the threshold -b, then the neuron fires and outputs 1; otherwise wX + b <= 0 and the output is 0.

This is a step function, so the output jumps directly from 1 to 0 or 0 to 1.

2. Sigmoid function

In the gradient search for an optimal solution of the model (discussed later), we don't want the output to flip suddenly from 0 to 1 or 1 to 0, because that gives no clue whether a small delta movement in the search is approaching a better solution or moving away from the best one.

So we introduce a transform function (a smoother) to convert wX + b into a smooth curve.

An example is the Sigmoid function: sigmoid(x) = 1/(1+e^-x)

When x goes from -infinity to 0, the function value slowly goes from 0 to 1/2. When x is close to 0 (function value near 1/2), the value goes up very quickly (simulates flipping from 0 to 1).

When x goes from 0 to infinity, the function further goes from 1/2 to 1 very quickly in the beginning and then much slower.

We can then use the sigmoid function value as the output from a neuron instead of the 0/1 stepwise function.

In this way, a delta change in w and b will also cause a change in the output so we know if the search is going towards the right direction or not.

The sigmoid function here is also called an activation function (it activates the neuron). Other activation functions are used as well.
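
As a tiny illustration (a sketch, not code from the book), the step function and its sigmoid smoother look like this:

    import numpy as np

    def step(z):
        # perceptron-style output: fires (1) only when w.X + b > 0
        return np.where(z > 0, 1.0, 0.0)

    def sigmoid(z):
        # smooth version of the step function, output strictly between 0 and 1
        return 1.0 / (1.0 + np.exp(-z))

    # a small change in z now produces a small change in the output
    print(sigmoid(np.array([-3.0, 0.0, 3.0])))   # ~[0.047, 0.5, 0.953]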

3. Feed forward networks

Neural networks that allow only feed forward flow (no loops) are called feed forward networks. We mainly talk about feed forward networks here.

Some neural networks have feedback loops, which means a neuron's output can come back to a previous layer. These are called Recurrent Neural Networks (RNNs). The idea is to allow some neurons to fire for a limited duration of time, which causes a cascade of neurons firing later.

RNNs have been less influential than feed forward networks, but they are closer to how our brains work and they solve some important problems that are difficult to solve with feed forward networks.

As mentioned before, the first layer of a network is called the input layer. It simply emits values.

The last layer is the output layer which outputs the result. E.g. two neurons represent 0 and 1 respectively.

For number recognition, there can be 10 neurons in the output layer, representing 0 - 9 respectively. Someone may encode the 10 different numbers using 4 digit binary code and use only 4 neurons in the last layer. That might not be a good idea because there will be an extra layer (more complexity) in encoding the numbers.

Recall the weights and biases.

Denote the weights of the links into layer L as w_L and the biases of layer L as b_L. w_L is a matrix and b_L is a vector.

Let X be the input to layer L (i.e. the output from layer L-1).

w_L * X + b_L is the weighted input (the z value) of layer L. Sigmoid(w_L * X + b_L) is the output from layer L.

In a feed forward network, starting from the initial input x, one can derive the output of each layer using w_L * X + b_L, layer by layer.
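
A minimal feed forward sketch in numpy (the layer sizes are made up for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # made-up layer sizes: 3 inputs -> 4 hidden -> 2 outputs
    sizes = [3, 4, 2]
    weights = [np.random.randn(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [np.random.randn(n_out, 1) for n_out in sizes[1:]]

    def feedforward(x):
        a = x                                # the input layer simply emits x
        for w_L, b_L in zip(weights, biases):
            a = sigmoid(w_L @ a + b_L)       # output of layer L = sigmoid(w_L * X + b_L)
        return a

    print(feedforward(np.random.randn(3, 1)))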

4. Gradient Descent

The goal of training a neural network is to tweak the weights w and biases b of the neurons so that the output of the network matches the actual outcome as closely as possible (i.e. the precision).

The idea of gradient descent is to continuously adjust w and b slightly towards the direction that increases the precision until it reaches the optimal result.

However, precision is not a smooth mathematical function: it is too hard to get its derivative, and sometimes changing w and b slightly doesn't change the precision at all.

Alternatively, we need an objective function that is smoother than the precision, such as absolute error or mean squared error. Those functions are smooth enough to guide the gradient descent search.

Regarding why we don't just solve the objective function directly for the optimal solution: it's usually impossible to find an analytical solution given hundreds or even millions of weights as variables in the network. Sometimes there is no analytical solution at all, so one can only approximate the solution.

The gradient descent algorithm simulates rolling down a ball on the slope of a valley. It eventually rolls to the bottom of the valley due to gravity attraction. 

The algorithm here simply uses the derivative (gradient) of the objective (loss) function.

  Cost = 1/2n * sum(y' - y)^2

For all training samples, sum up the squared difference between the network output y' and the actual y. This is a quadratic function, so we can take its derivative with respect to a particular set of w and b. Changing w and b slightly will cause the Cost to change slightly as well.

Denote the gradient/derivative of Cost as dCost = (dw1, dw2, dw3 ... dwn, db1, db2, ... dbn)^T with respect to w1...wn and b1.. bn

For every round of gradient descent, the ball moves a small delta in the opposite direction of dCost, -eta * dCost, where eta is a small positive number also called learning rate.

So the change in cost is approximately (-eta * dCost) <dot> dCost = -eta * |dCost|^2, which is always negative (or zero), so the cost keeps decreasing.

Assume the ball is at position v, then the new position is v' = v - eta * dCost after one move.

Repeating this movement again and again leads the ball towards the bottom (a minimum of the cost), as it always moves in the direction that reduces the cost (i.e. the opposite direction of dCost).

The learning rate eta needs to be small enough for the local approximation of the movement at a given point to hold. But it can't be too small, otherwise it takes too many rounds to get to the bottom. Choosing a proper eta is usually an important task in machine learning.
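
As a toy sketch of this update rule (the cost here is a made-up quadratic bowl, not a network cost):

    import numpy as np

    def cost(v):
        return 0.5 * np.sum(v ** 2)        # toy bowl-shaped cost

    def dCost(v):
        return v                           # gradient of the toy cost

    eta = 0.1                              # learning rate
    v = np.array([3.0, -2.0])              # starting position of the "ball"
    for step in range(100):
        v = v - eta * dCost(v)             # move against the gradient
    print(v, cost(v))                      # v approaches the minimum at (0, 0)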

Cost = 1/2n * sum(y' - y)^2 = 1/n * sum(Cost_x), x in all training samples

Cost_x = 1/2(y'-y)^2 is the cost for sample x.

In practice, to calculate the gradient dCost, we can calculate the gradient dCost_x for each training sample x and then average them.

i.e. dCost = 1/n * sum(dCost_x)

Be aware that NOT all cost functions can be calculated as an average of the sum of individual costs.

5. Stochastic Gradient Descent

To calculate the gradient/derivative of Cost, we need to sum up the gradient for every training sample and average them. When the number of samples n is too big, e.g. millions, the calculation can be very slow, i.e. dCost = 1/n * sum(dCost_x) takes too long to compute.

Alternatively, we can take a random subset of the samples (m samples) and use the derivative over the subset as an approximation.

dCost = 1/n * sum(dCost_x over all samples) is approximated by 1/m * sum(dCost_x over the subset) when m is big enough.

The random subset is called a mini-batch.

When it finishes one movement guided by the random subset, it chooses another random subset (disjoint from the previous ones) to continue. When all samples have been exhausted, it is said to have completed an Epoch of training.

Sometimes people omit the 1/n or 1/m in the Cost function (so no averaging). In theory this shouldn't change much, but in practice there may be small differences because the Cost is scaled by 1/n.

An extreme case is to use a mini-batch size of only 1, which means the training process adjusts the weights and biases one sample at a time. This is also called online learning.

It supports training a network with one sample at a time, which is similar to how humans learn from data: one example at a time. Obviously this is quick and continuous, but it might adjust the network in the wrong way if a sample happens to be erroneous; it may move in a sub-optimal direction instead of the globally optimal direction.
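
A sketch of the epoch / mini-batch bookkeeping; the `gradient` function and the `params` array below are hypothetical placeholders for whatever computes the averaged dCost of a batch:

    import random

    def sgd(training_data, epochs, mini_batch_size, eta, params, gradient):
        # training_data: list of (x, y) pairs
        # gradient(batch, params): hypothetical helper returning the averaged dCost for the batch
        for epoch in range(epochs):
            random.shuffle(training_data)
            # split the shuffled data into disjoint mini-batches
            batches = [training_data[k:k + mini_batch_size]
                       for k in range(0, len(training_data), mini_batch_size)]
            for batch in batches:
                params = params - eta * gradient(batch, params)   # one movement per mini-batch
            # one pass over all mini-batches = one epoch
        return params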

Calculating the gradient dCost_x/dw and dCost_x/db for every training sample is not straightforward though. The following Back Propagation algorithm is a solution.

6. Back Propagation

The purpose of Back Propagation is to help to calculate the gradient of Cost with respect to each weight w and each bias b.

Let's call the wx+b at each neuron the z value, so sigmoid(z) is the activation value / the output of the neuron.

A change of z value at a neuron can change the Cost value. dCost/dz_j is the derivative of Cost w.r.t. the z value of neuron j.

If dCost/dz_j is close to zero, then a small change of z value at neuron j will not change the Cost much at all, which implies that neuron j is already pretty near optimal.

If dCost/dz_j has a large value (either positive or negative), there is still potential to lower the Cost by changing the z value (in some way).

In this sense, define dCost/dz_j as Error.

The basic idea is to derive the Error of neurons for a layer based on the Error in the next layer.

Calculating the Error for the last layer (output layer) is simple enough by checking the difference between the actual value y and sigmoid(z value), i.e. the predicted value.

To get the Error of the previous layer, one only needs to combine the Error of the current layer, the weights of links to the current layer and the derivative of the activation function (e.g sigmoid) at the z value of the previous layer.

After running through a feedforward network layer by layer calculating wX +b, one has got the z values and activation values for every layer. Note this is for ONLY one training sample.

Then back propagation calculates the Error from the last layer backwards to the first layer.

Using the Error, you can further derive the derivative (gradient) with respect to weights and biases for every neuron in every layer.

Again this is the derivatives for ONLY one training sample.
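
As a sketch, the per-sample procedure (forward pass, output Error, backward propagation of the Error, then the gradients for every weight and bias) can be written as follows for a fully connected sigmoid network with a quadratic cost. This follows the standard equations, not code from the book; `weights` and `biases` are lists of layer matrices / column vectors as in the feed forward sketch above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1 - s)

    def backprop(x, y, weights, biases):
        # forward pass: remember every z value and activation
        a, activations, zs = x, [x], []
        for w, b in zip(weights, biases):
            z = w @ a + b
            zs.append(z)
            a = sigmoid(z)
            activations.append(a)
        # Error of the output layer (quadratic cost): dCost/dz = (a - y) * sigmoid'(z)
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        grad_w = [delta @ activations[-2].T]
        grad_b = [delta]
        # propagate the Error backwards, one layer at a time
        for l in range(2, len(weights) + 1):
            delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
            grad_w.insert(0, delta @ activations[-l - 1].T)
            grad_b.insert(0, delta)
        return grad_w, grad_b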

To get the derivatives over all training samples, Back Propagation relies on the assumption that the overall derivative is the average of all individual samples' derivatives.

This relies on the Cost function satisfying two assumptions:

Assumption 1. 

The cost function can be written as an average over the costs for all individual training samples.

dCost = 1/n * sum(dCost_x)

The reason is that back propagation calculates the partial derivatives dCost_x/dw for every single training example x and then gets the final dCost/dw by averaging the dCost_x/dw.

Assumption 2.

The cost can be written as a function of the output from the neural network.

Cost = 1/2 * sum(y - output of the last layer)^2

Obviously a quadratic cost function Cost = 1/2n * sum(y' - y)^2 satisfies both assumptions. Many other cost functions do too.

1/2n * sum(y' - y)^2 = 1/n * [ 1/2 * (y'_1 - y_1)^2 + 1/2 * (y'_2 - y_2)^2 + 1/2 * (y'_3 - y_3)^2 + .... ]

The details of Back Propagation: Back Propagation Algorithm

Why back propagation is smart

The key of the gradient descent search is to calculate the partial derivative of Cost with respect to every weight and every bias.

A simple way to calculate the gradient is to add a very small delta to a weight w and recompute the Cost; then dCost/dw approximates (Cost_new - Cost_old)/(w_new - w_old).

However, calculating the new Cost needs a feed forward pass, and for a different weight it needs another feed forward pass. So if there are a million weights in the network, you need a million feed forward passes to compute the gradients. That is EXTREMELY SLOW. This is what people faced in the 1950s/60s.

The smart thing about back propagation is that it uses only one feed forward pass and one backward pass to compute all the gradients. Much faster! Lucky people in the 80s/90s.

7. Cross Entropy cost function

With a quadratic cost function and sigmoid activation, when the initial weights and biases are far from optimal, the output of the network tends to be far from the actual value as well. If the output z value is either too big (e.g. > 3) or too small (e.g. < -3), then the derivative of the sigmoid function is small, because the sigmoid curve is very flat when the z value is large in magnitude. The consequence is that the network learns very slowly: the gradients with respect to the weights and biases are small, so the gradient descent search can only move in very small steps. For a quadratic cost function, the gradients of the cost with respect to the weights w and biases b in the last layer are:

    dCost/dw = (a-y)sigmoid'(z)x

    dCost/db = (a-y)sigmoid'(z)

The sigmoid'(z) term is what causes the slow learning problem. Here x is the activation value from the previous layer.

Cross Entropy is one solution to the slow learning problem in the last layer. Cross Entropy is a cost function whose derivative is big when the weights and biases are far from optimal, and small when they are close to optimal. The function is defined as:

    Cross Entropy = -1/n sum[yln(a) + (1-y)ln(1-a)]  for all input x

where a is the activation value output from a network. y is 0 or 1 so one of the two terms in the sum is zero. activation value a = sigmoid(z) is between 0 and 1.

When y is 1 and a is far away from 1 (i.e. near 0), -yln(a) is very big and the derivative is big too.

When y is 0 and a is far away from 0 (i.e. near 1), -(1-y)ln(1-a) is very big and so is the derivative.

When a is close to y, the cost tends to be very small and so does the derivative.

The gradients of cost with respect to weights w and biases b in the last layer are:

    dCost/dw = (a-y)x

    dCost/db = (a-y)

By using the ln() function it gets rid of the sigmoid'() term and purely relies on the difference between activation value a and the actual result y.

x is the activation values from the previous layer.
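
A small sketch of the cross entropy cost and the resulting output-layer error for one sample (a and y are vectors here; illustrative only):

    import numpy as np

    def cross_entropy_cost(a, y):
        # -[y ln(a) + (1-y) ln(1-a)], summed over the output neurons
        return np.sum(-y * np.log(a) - (1 - y) * np.log(1 - a))

    def output_error(a, y):
        # with sigmoid outputs + cross entropy, the sigmoid'(z) term cancels
        return a - y                      # dCost/dz for the last layer

    a = np.array([0.9, 0.1, 0.2])
    y = np.array([1.0, 0.0, 0.0])
    print(cross_entropy_cost(a, y), output_error(a, y))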

The details of Cross Entropy: Cross Entropy Function

The back propagation algorithm doesn't need any change to use the Cross Entropy cost (assuming sigmoid is used), except for updating the calculation of the Errors for the last layer.

i.e. dCost/dz = a-y

Note, this helps only the LAST layer. If neurons in the hidden layers saturate (their a values near 0/1), the learning speed (related to the derivative of sigmoid, da/dz) for the hidden layers is still low. That is why people also need to carefully select initial weights and biases to avoid saturated hidden layers if possible.

Another note: Cross Entropy is essentially the same as the Log Loss function used by Logistic Regression. It is the log of the maximum likelihood, which is the product of each sample's probability.

The output activation value a should be between 0 and 1 to simulate a probability. Therefore, the Cross Entropy cost function only works with an output layer whose values simulate probabilities. Usually a Softmax layer (introduced below) is a good option. Sigmoid also works, although it is a less natural candidate for the output layer even though its value is between 0 and 1. An activation function like the Rectified Linear Unit (also introduced later below) won't work because its value is unbounded.

8. Softmax layer

Another way to avoid the slow learning problem with sigmoid function is to have a Softmax output layer. 

A Softmax layer is the last layer of a network. It converts a Z value to some sort of probability.

The activation value a_j = e^(Z_j) / sum_i(e^(Z_i))

First it maps the Z value to a positive number using the exponential function e^(Z_j), and then divides by the sum of all neurons' exponential values sum_i(e^(Z_i)).

The activation value a_j is always between 0 and 1, and it's easy to show that the sum of all neurons' activation values sum(a_i) = 1. So this is a form of probability distribution, or at least it looks like one. The activation value a_j is the network's estimate of the 'probability' that the correct output is j.
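
A sketch of the softmax computation; the shift by max(Z) is a numerical-stability detail not mentioned above, and it does not change the result:

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))     # shift for numerical stability; result is unchanged
        return e / np.sum(e)          # a_j = e^(Z_j) / sum_i(e^(Z_i))

    a = softmax(np.array([2.0, 1.0, 0.1]))
    print(a, a.sum())                 # values in (0, 1) that sum to 1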

The gradients of the Cost with respect to the weights and biases of the last layer are:

   dCost/dw = (a-y)x

   dCost/db = (a-y)

This is exactly the same as what is achieved by the Cross Entropy cost function. x is the activation value from the previous layer.

The details of Softmax: Softmax

The back propagation algorithm needs a small change: update the calculation of the Errors for the last layer.

i.e. dCost/dz = a - y

And also update the multiplying factor (the derivative da/dz) in the Back Propagation algorithm, since softmax's da/dz differs from sigmoid's.

9. Regularization

Penalty of large weights

To avoid overfitting we usually add regularization (penalty) term to the cost function. 

Take cross entropy as an example:

     Cross Entropy = -1/n sum[yln(a) + (1-y)ln(1-a)] + lambda/2n sum(w^2)

The regularization term is the sum of squared weights (an L2 penalty). The lambda is a super parameter that adjusts the weight of the penalty in the cost function. Intuitively, if lambda is large then we are less tolerant of large weights, because a small increase in the penalty causes a big increase in the overall cost. If lambda is small, then we allow relatively large weights.

Note that the weights w are preferred to be small because large weights usually imply overfitting (refer to other books regarding that).

The 1/2 is simply a math trick to cancel the coefficient in the derivative of w^2. The 1/n keeps the regularization term at the same scale as the cross entropy.

Basically, any cost function can have the regularization term added:

     Cost = Cost + lambda/2n sum(w^2)

There is also L1 regularization:

     Cost = Cost + lambda/n sum(|w|)

This will change the gradient slightly in the gradient descent search.

dCost/dw now gains a small extra term d[lambda/2n sum(w^2)]/dw = lambda/n * w

dCost/db has no change as b is not in the regularization term.

The back propagation algorithm remains unchanged.

In stochastic gradient descent, we average a mini batch of m training samples to estimate dCost/dw, i.e. 1/n*sum(dCost/dw) ~ 1/m*sum(dCost/dw)

But the derivative of the regularization term still needs to be lambda/n * w. Use n here, not m.
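
A sketch of one regularized update step (grad_w and grad_b are the mini-batch-averaged gradients; lmbda is the regularization super parameter, n the full training set size):

    def update(w, b, grad_w, grad_b, eta, lmbda, n):
        # L2 weight decay: the penalty derivative uses lambda/n with the FULL training set size n
        w = w - eta * (grad_w + (lmbda / n) * w)
        b = b - eta * grad_b                 # the bias is not regularized
        return w, b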

A brief discussion on regularization.

Basically, the idea of regularization is to prefer simplicity. A simple solution is believed to model a problem better than a complex one. A simple (regularized, small-weight) model learns to respond only to evidence seen widely across the training data. In contrast, a complex model (big weights) can respond to rare samples, and a small change in the input can cause a big change in the output because of the big weights. Usually complex models learn too much about the noise in the training data and thus overfit the data. Increasing the size of the training data also helps to reduce overfitting.

However, there is actually NO logical reason to believe that a simpler solution is better, and sometimes a complex solution is better. An example is the orbit of Mercury. People observed that the orbit of the planet Mercury has a tiny deviation from what is calculated using Newton's law of gravity. That tiny deviation is well explained by Einstein's general theory of relativity. So given the orbit data of Mercury to learn a model of the orbits: if we prefer a simple solution and explain the tiny deviation as noise, we will learn Newton's law of gravity, which is not exactly accurate. But if we learn a more complex solution that also explains the tiny deviation, we will get Einstein's general theory of relativity, which is correct. In daily business, the cleanness of data is rarely guaranteed though, so there is a much higher chance of overfitting the noise than of learning a deep insight.

Note that biases don't need regularization. A bias is a constant attached to a neuron: whether the inputs are small or big, the same bias is added. Biases only describe the different reaction thresholds of different neurons.

Here are a few things to keep in mind. It can be quite a subtle business deciding which of two explanations is truly simpler. Always be careful of using simplicity as a guide. Always test a model with data that has not been seen before to determine the true predicting power.

Humans can regularize quite well. A kid only needs to see a few elephant photos to be able to recognize elephants amazingly well, even if the new elephant photos are vague, taken from different angles, or cartoon figures. How humans regularize a model so well with only a tiny training dataset is still unknown. I think a human brain is not trained with only the tiny dataset. To recognize an elephant, a kid has already seen many other animals and objects, so he is able to tell an animal by its shape, color, look, hair and probably many other things, i.e. he already knows what features describe an object. That knowledge has been built in advance, so when a kid is shown an elephant photo, he can quickly fit the elephant's characteristics into his existing recognition framework: big ears, trunky body, long nose, grey color, round shape, etc. Since birth, a kid has been learning from the surroundings continuously, which composes a massive training dataset. But this is just for recognizing elephants. The logical ability is hard to explain by model training though.

Dropout

This is another regularization strategy. 

The idea is to train only a random subnet of the network for every mini-batch. This is achieved by randomly selecting a subset of neurons (equivalently, disabling a subset of neurons) for training. The weights and biases are thus trained in a separate subnet for every mini-batch, and the weight of the same link can be updated independently in different rounds.

The training data is the same and divided into mini batches. The input layer and output layer remain the same. The feedforward and backpropagation algorithms remain the same as well.

Assume we randomly select half of the network every mini-batch. Essentially we are training two smaller neural networks using the same training data.

The training of every mini-batch adjusts the weights independently. As there are (logically) two independent subnets, the weights add up to roughly twice as big as they should be.

Considering a probability p of retaining a neuron in the training phase, the expected feed into the next layer from that neuron during training is p * activation * W.

When testing the model, no neuron is dropped, so the feed would become activation * W, which is too big. Therefore, we rescale W by p and the feed becomes activation * (p*W).

In this way, the feed keeps the same expected scale even though we don't actually drop neurons at test time.

Therefore, in every mini-batch, when adjusting the weights by delta, multiply the OUTGOING weights' delta by p (a sketch of this scheme is shown below).
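
A sketch of the scheme described here (retain probability p, a fresh random mask per mini-batch during training, weights rescaled by p at test time). This is one common way to do it, not necessarily how any particular library implements dropout:

    import numpy as np

    p = 0.5                                          # probability of RETAINING a neuron

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def hidden_train(a_prev, W, b):
        mask = (np.random.rand(W.shape[0], 1) < p)   # pick a random subnet for this mini-batch
        return sigmoid(W @ a_prev + b) * mask        # dropped neurons output 0

    def hidden_test(a_prev, W, b):
        return sigmoid((p * W) @ a_prev + b)         # no dropping; weights rescaled by p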

Another question is about biases. It sounds like the bias is added up too big as well. 

But no article seems to mention rescaling the bias to p*bias. My guess is we don't need to adjust the bias like we do for the weights.

When training subnets, every neuron learns its own bias (its activation threshold). When a bias has been adjusted to a proper value, it will not change much further even while we are training other subnets.

This assumes the activation threshold (bias) is similar in all subnets, which is possibly true. So we don't worry that it will grow too large when training multiple subnets.

The weights are different though. When half of the neurons are dropped, a subnet tries to increase the weights more than it should, because there is no input from the dropped neurons.

If we don't rescale the weights, the neural network still achieves about 80-90% accuracy, but the performance fluctuates a lot, sometimes down to 50% and then back to 90%.

This probably comes from overlarge weights; the network then tries to adjust them back, but again it negates too much.

The bias, rescaled or not, does not seem to impact the performance. If the bias is rescaled to p*bias, it seems to learn slightly slower, as the rescaling slows down the learning towards the proper value.

Empirically, a good dropout rate is between 50% and 80%; 59% seems optimal here.

Basically the dropout strategy simulates training multiple neural networks independently. Every network overfits in a different way, and hopefully averaging multiple networks averages out the overfitting. The key idea is a bit similar to a random forest, where each subnet/tree votes independently. A quote:

"Since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons" and "In other words, if we think of our network as a model which is making predictions, then we can think of dropout as a way of making sure that the model is robust to the loss of any individual piece of evidence."

This is kind of similar to L1 and L2 regularization that try to reduce weights so that losing any link (with weight) wouldn't affect the network much.

Max Norm

MaxNorm is a particular form of regularization, especially useful together with dropout. The idea is simple: constrain the norm of the incoming weight matrix of a hidden layer by an upper bound c. The c is typically something like 3 or 4.

If the norm of the matrix ||W|| > c, then rescale the matrix so that its norm falls back to c.

norm = square_root( sum( square (w)))

The rescaling is done by an element-wise multiplication with the rescaling factor:

w = w * c/norm

This is like projecting w into a ball with radius c. If w is too big then shrink w proportionally in all dimensions to fit it into the ball.

This makes it possible to use large learning rate without the risk of weights blowing up. 
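
A sketch of the max-norm constraint applied to a weight matrix after an update (c is the upper bound):

    import numpy as np

    def max_norm(W, c=3.0):
        norm = np.sqrt(np.sum(W ** 2))   # norm = sqrt(sum(w^2))
        if norm > c:
            W = W * (c / norm)           # shrink proportionally back into the ball of radius c
        return W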

Expanding the training data

When it's hard to get real training data, it is possible to make up training data artificially.

Take handwritten digits as an example: we can rotate the digit slightly clockwise or anti-clockwise by some degrees. For voice data, we can speed up or slow down the speech.

As long as the artificially made data makes sense and reflects what happens in real life, it helps to enrich the training data and thus increase the model precision. But if the made-up data doesn't make too much sense, it will screw up the model.

10. Tuning Neural Networks

Weight Initialisation

Considering a neural network with the Cross Entropy cost function and sigmoid activation function, the slow learning problem is solved for the LAST layer by using cross entropy. However, when back propagation propagates the Error back to previous layers, it needs to multiply by a factor da/dz, the derivative of the activation function, so the sigmoid function still plays a significant role there. The z values in the previous layers can be too big initially due to poor weight initialization, and thus da/dz tends to be small for a sigmoid function. That means the gradients calculated for the previous layers can still be too small, and the slow learning problem for the previous layers remains unresolved.

In a simplistic solution, all initial weights are randomized with a Gaussian distribution (mean 0, std deviation 1). Assume there are 1000 input neurons, so there are 1000 links to every neuron in the next layer. Assume half of the 1000 input neurons have an input of 1 and the other half have an input of 0; then any neuron in the next layer receives a sum of 500 Gaussian variables (mean 0, std 1, variance = std^2 = 1). The sum is a Gaussian with mean 0 and variance 500, i.e. std sqrt(500) ~ 22. There is a big chance that the sum is much larger than 1 or much smaller than -1, in which case the z value is big enough to shrink the derivative da/dz of the sigmoid function. There are so many neurons in the hidden layers that there is a big chance some neurons saturate and stop learning.

Then one might ask: why not set all weights to zero, so the z values in the hidden layers are all 0 and the derivative is big?

That doesn't work because the Back Propagation algorithm needs to multiply by the weights to propagate the Error back to previous layers. If all weights are zero, nothing propagates back. So we need small weights, but not zero.

One way is to use a Gaussian distribution N(0, 1/sqrt(n)) where n is the number of input neurons. When it goes to the next layer, the variance becomes n/2 * (1/sqrt(n))^2 = 1/2, assuming only n/2 of the input neurons are firing, i.e. std deviation = sqrt(1/2). This is a lot better. Adding a neuron's bias drawn from N(0,1) changes the variance to 1/2 + 1 = 3/2, i.e. std deviation = sqrt(3/2).

Repeating this for the weights into each of the hidden layers up to the output layer remedies the slow learning problem. Biases are fine left at N(0,1).
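
A small sketch of this initialization (weight std deviation 1/sqrt(n_in), biases left at N(0,1)); the code listing at the end of these notes uses the same idea:

    import numpy as np

    sizes = [784, 30, 10]                 # example layer sizes
    weights = [np.random.randn(n_out, n_in) / np.sqrt(n_in)
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [np.random.randn(n_out, 1) for n_out in sizes[1:]]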

The result of good weight initialization is obvious. The cost drops quickly and the precision starts high even in the first epoch, compared to training with purely N(0,1) random weights. But note that weight initialization usually doesn't improve the final precision (sometimes a little bit); it simply learns faster, so fewer epochs are needed.

Super parameters

For a neural network there are a lot of things to tweak.

Number of neurons per layer, number of layers, which cost function to use, which activation function to use, the learning rate, the lambda for the regularization term, the size of the mini-batch, the number of epochs, different ways to encode the output, etc.

There is no universal agreement on the right strategies to pick parameters. The problem is unsolved.

Usually a network with initial settings doesn't learn anything at all and gives no meaningful output. It's better to simplify the network layers first: start from a shallow network with fewer neurons, try to get some meaningful response first, and then tweak the learning rate and lambda to improve the performance. If the network layout changes again, the learning rate and lambda need to be tweaked again.

The learning rate is critical. If it's too large, it skips over the optimal solution by jumping too much at every step. If it's too small, it takes too many epochs to reach the optimal solution. So it is important to estimate the right learning rate. The search for a learning rate can start from something like 0.1; then check whether the cost decreases after a few epochs. If it doesn't, try the next order of magnitude down, 0.01, 0.001, etc. If it works, try increasing the learning rate by an order of magnitude, 1, 10, etc., until the cost oscillates. That gives an estimate of the order of magnitude of the learning rate. Further tweaking is needed, but the estimate is close enough to be neither too large nor too small.

The book says using the training cost to select the learning rate is reasonable, as its primary purpose is to control the step size of gradient descent. It is then reasonable to monitor the training cost to see if the step is too big or too small.

Early stopping

The network performance stops improving after a certain number of training epochs. So keep track of the performance after each epoch; if it doesn't improve in e.g. 10 epochs, stop the training. It can be 20, 30, etc.; it's just another super parameter.

Learning rate schedule

A big learning rate helps to drop the cost quickly in training. Once the improvement slows down, you may want to use a smaller learning rate to refine the result. This is called a learning rate schedule. It is achieved by monitoring the performance: drop the learning rate when the performance improvement slows down (oscillates), until the learning rate is a factor of roughly 1000 lower than the initial value.

Mini-batch size

There are two extremes. First, one training record per movement. This is also called online training as it receives one data record at a time and calculates the gradient with one record. Second, using all training records per movement, which is too computationally expensive. Note that the computational cost of training with a mini-batch of 100 is not as big as the cost of training with 100 records separately, because matrix calculations are usually optimized by hardware. So the cost of a batch of 100 is maybe only 50 times as big as that of a single record.

Note that the learning rate using a batch of 100 can be 100 times as big as the learning rate for online training. When changing the mini-batch size, change the learning rate accordingly.

The batch size is relatively independent of the other super parameters, so after figuring out a set of acceptable values for the other super parameters, try different batch sizes, such as 10, 100, etc., and scale the learning rate proportionally. Plot the performance; the best batch size gives the most rapid improvement in performance (accuracy versus elapsed time, not epochs). Note to use elapsed time rather than epochs, as an epoch's duration depends on the batch size.

11. Improvement to Gradient Descent

Hessian technique

Cost(w + dw) can be expanded using a Taylor expansion, keeping terms up to second order in dw. Cost(w + dw) then becomes a second order polynomial with respect to dw. For a certain value of dw the cost is minimized: calculate the analytical solution for that minimizing dw, and then set w = w + dw.

The Hessian technique uses information about second order changes in the Cost function and is thus able to avoid many pathologies of gradient descent, consequently requiring fewer steps in the gradient search.

However, to calculate the second order of derivative, a massive matrix of partial derivatives (Hessian Matrix) needs to be computed first, which is extremely computationally expensive.

Practically this technique is not used, but there are other techniques inspired by Hessian.

Momentum-based gradient search

A very simple change, define velocity v

v' = uv - learning_rate * gradient

w' = w + v

b' = b + v

The gradient descent update now changes the velocity. v gets bigger and bigger in the opposite direction of the gradient. This is like a ball rolling down a valley quicker and quicker. The u is a super parameter between 0 and 1, to prevent the velocity from accumulating to something too big.

The weight w or bias b is then moved according to the velocity. The bigger the velocity, the bigger the step size, analogous to step size = velocity * 1 unit of time.

v becomes an intermediary between the gradient and the weights. The velocity builds up over successive steps, so it acts a bit like momentum (not exactly physical momentum).
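
A sketch of the momentum update:

    def momentum_step(w, v, grad, eta, u=0.9):
        v = u * v - eta * grad     # v' = u*v - learning_rate * gradient
        w = w + v                  # w' = w + v
        return w, v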

Other techniques worth mentioning:

conjugate gradient descent

BFGS method

Nesterov's accelerated gradient technique

Other Activation Functions

tanh

tanh(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z))

This is simply re-scaled from sigmoid.

sigmoid(z) = 1/(1+exp(-z))

It's easy to prove that sigmoid(z) = (1 + tanh(z/2))/2

The main difference is that tanh goes from -1 to 1 while sigmoid goes from 0 to 1.

Arguably tanh is better than sigmoid because it allows negative outputs. Recall that the back propagation algorithm calculates the gradient as the Error times the activation values from the previous layer. Because the activation values are always positive with sigmoid, the Error times the activation values has the same sign for all the weights into a given neuron, i.e. that neuron's weights all increase or all decrease together, without the ability to increase some weights while decreasing others. Tanh however makes it possible to increase some weights while decreasing others.

Experiments have shown that the improvement is limited though. With sigmoid, one can still increase some weights and decrease others through multiple rounds of updates with different deltas.

Rectified Linear Neuron / Rectified Linear Unit, ReLu

The formula is simple:

Relu(z) = max(0, z), where z = wx+b

It simply keeps the positive part of the linear input and zero the negative part of it.

In theory, the combination of many ReLU outputs can approximate a wide class of functions (as piecewise linear approximations), similar in spirit to high order polynomials.

Experiments on image recognition have shown that ReLU works better than sigmoid / tanh, but people don't know exactly why.

Both sigmoid and tanh can saturate when the z value is close to the two ends and thus slow down the learning, but ReLU does not saturate for positive z.

The derivative of ReLU vanishes when the z value <= 0 though, so in that regime the neuron stops learning entirely.
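
The three activation functions side by side, as a sketch:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))   # output in (0, 1)

    def tanh(z):
        return np.tanh(z)                 # output in (-1, 1)

    def relu(z):
        return np.maximum(0.0, z)         # output in [0, inf), no saturation for z > 0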

So far there is NO solid theory for choosing an activation function from the infinite number of candidates. We do not know which one works better / faster for which problem. Overall, the mathematical foundation of neural networks is weak at the moment.

12. Deep Networks and problems

In theory you can approximate any function using only one layer of hidden neurons, i.e. sum of many many sigmoid / other activation functions.

However the number of required neurons can increase exponentially as the required output function gets complicated. This can be mitigated by using multiple layers. For example, to calculate the multiplication of two numbers, you may use the first layer to calculate bit additions, the second layer to calculate number addition, and finally the third layer for the multiplication. The total number of neurons will be smaller than the number required when using fewer layers.

Similarly in image recognition, you may use early layers to process pixels, later layers to construct geometric shapes, and the final layers to recognize objects. This is about reducing the total number of neurons required for a task, and also implementing a more logical design.

Vanishing / Exploding Gradient

When using the Gradient Descent and Back Propagation algorithms to train a deep network, problems emerge. The later layers can learn fast, but earlier layers tend to learn very slowly or even stop learning.

An example network with 2 hidden layers trained on MNIST showed the weights of the 1st hidden layer (2nd layer) changing by about 0.003 in the first few epochs, while the 2nd hidden layer (3rd layer) and the last layer (4th layer) changed their weights by about 0.1 - 2. The learning speed slows down further in later epochs. This is only one example run, but it shows that even for not-so-deep networks, the early layers learn extremely slowly. Usually when the number of layers increases, the learning speed of the early layers drops dramatically with every additional layer.

In a deep network, the gradient tends to get much smaller as we move backward through the hidden layers using back propagation. This is known as the Vanishing Gradient Problem. This is a fundamental problem with gradient descent, although sometimes we get the opposite: the gradient gets much bigger in earlier layers, and consequently overflows the data variables or overshoots the minimum. This is called the Exploding Gradient Problem.

The gradient in deep networks is unstable, tending to either explode or vanish in the early layers. This intrinsic instability is associated with learning by gradient descent and back propagation in deep, many-layer networks.

"A diploma thesis (Hochreiter, 1991) represented a milestone of explicit DL research. As mentioned in Sec. 5.6, by the late 1980s, experiments had indicated that traditional deep feedforward or recurrent networks are hard to train by backpropagation (BP). Hochreiter’s work formally identified a major reason: Typical deep NNs suffer from the now famous problem of vanishing or exploding gradients. With standard activation functions, cumulative backpropagated error signals (Sec. 5.5.1) either shrink rapidly, or grow out of bounds. In fact, they decay exponentially in the number of layers or CAP depth (Sec. 3), or they explode." Back in the 90s computers were also slow, so people considered neural networks as a hassle to work with. Around the same time, SVM appeared to work very well, so people turned to SVM and the neural network winter came. Geoff Hinton and some other people have to move to Canada to continue research on neural networks as they got some modest funding there.

Why vanishing / Exploding Gradient

Let's look at a simplified network with 3 hidden layers where every layer has only one neuron.

(input)--w1--(b1)--w2--(b2)--w3--(b3)--(output)

the gradient of Cost with respect to weight w1 is:

dC/dw1 = dC/dz1 * dz1/dw1 = dC/dz1 * d(w1x+b1)/dw1 = dC/dz1 * x

dC/dz1 = dC/da1 * da1/dz1 = dC/da1 * sigmoid'(z1)

dC/da1 = dC/dz2 * dz2/da1 = dC/dz2 * d(w2*a1+b2)/da1 = dC/dz2 * w2

dC/dz2 = dC/da2 * da2/dz2 = dC/da2 * sigmoid'(z2)

...

Easily we have:

dC/dw1 = x * sigmoid'(z1) * w2 * sigmoid'(z2) * w3 * sigmoid'(z3) * dC/da3

The derivative of the sigmoid function is between 0 and 1/4, so if the initial values of the weights are small (e.g. drawn from a Gaussian N(0,1)) then the chain of multiplications causes dC/dw1 to be a very small number, i.e. a vanishing gradient.

If the initial weights are big enough, the chain of multiplications causes dC/dw1 to grow exponentially, i.e. an exploding gradient.

The above single-neuron-per-layer example is a straightforward, informal illustration of the vanishing/exploding gradient problem. When there are multiple neurons in each layer, the situation is similar but harder to show mathematically.
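
A rough numeric sketch of the chain of factors in the one-neuron-per-layer example: with N(0,1) weights, the product of |w * sigmoid'(z)| over many layers shrinks very quickly (the exact numbers vary per run):

    import numpy as np

    def sigmoid_prime(z):
        s = 1.0 / (1.0 + np.exp(-z))
        return s * (1 - s)                   # peaks at 1/4 when z = 0

    np.random.seed(0)
    product = 1.0
    for layer in range(30):                  # 30 layers deep
        w = np.random.randn()                # weight ~ N(0, 1)
        z = np.random.randn()                # stand-in z value for that layer
        product *= abs(w * sigmoid_prime(z))
    print(product)                           # typically an extremely small number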

The problem here is actually not vanishing gradients or exploding gradients per se. It's that the gradient of an early layer is the product of terms from all later layers; when there are many layers, that is an intrinsically unstable situation. This unstable gradient problem is the real problem. Changing the activation function / weights wouldn't solve it unless there is some mechanism to balance the gradients at different layers.

To avoid a vanishing gradient, we need |w * sigmoid'(z)| >= 1. As the derivative of sigmoid peaks at 1/4, |w| must be >= 4.

z = wa + b, and let x = e^(wa+b)

sigmoid'(z) = sigmoid(z)(1-sigmoid(z)) = 1/(1+x) * 1/(1+x^-1) = 1/(2+x+x^-1)

So we have:

|w / (2+x+x^-1)| >= 1 

i.e

2+x+x^-1 <= |w|

i.e

x^2 + (2-|w|)x + 1 <=0

It's easy to show that x^2 + (2-|w|)x + 1 is less than zero when x is between |w|/2 - 1 - (|w|/2)sqrt(1 - 4/|w|) and |w|/2 - 1 + (|w|/2)sqrt(1 - 4/|w|).

As x = e^(wa+b) 

a is then in the range between (ln(x_low) - b)/w and (ln(x_high) - b)/w, where x_low and x_high are the two roots above. This is a very small window: its width is maximized at about 0.45 when |w| is around 6.9. Very tiny!

So there is only a very tiny chance that the vanishing gradient problem will not happen in a deep sigmoid network.

There are other problems with deep networks and many are still under ongoing research.

One of them involves the sigmoid activation function: it was found that under some conditions sigmoid causes the activations of the final hidden layer to saturate near 0 early in training.

Hinton and others also found that a careful selection of initial weights and of the momentum schedule for momentum-based stochastic gradient descent helps to train deep networks substantially.

Convolutional networks

Kernel

Define a filter / kernel as a small i x j matrix. The input image has a size of m x n.

Place the small matrix on the bigger image, multiply each pixel's value with the corresponding matrix element, and then sum up.

e.g. the (0,0) element of the kernel matrix is placed at (x,y) of the image:

 _________________________
|                         |
|    (x,y)____            |
|    |       |            |
|    |_______|            |
|                         |
|_________________________|

The area covered by the kernel matrix gets transformed into a single value:

    area <dot> kernel + bias = value

The value is fed into the neuron at (x,y) in the next hidden layer.

x ranges from 0 to m-i and y ranges from 0 to n-j, so the next hidden layer is an (m-i+1) by (n-j+1) matrix.

This step-by-step dot product between the filter matrix and the input matrix is called a convolution in mathematics, which is where the name convolutional network comes from.
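
A sketch of this 'valid' convolution (slide the kernel over the image, take the dot product and add the bias at every position), producing an (m-i+1) x (n-j+1) output:

    import numpy as np

    def conv2d_valid(image, kernel, bias=0.0):
        m, n = image.shape
        i, j = kernel.shape
        out = np.zeros((m - i + 1, n - j + 1))
        for x in range(m - i + 1):
            for y in range(n - j + 1):
                area = image[x:x + i, y:y + j]
                out[x, y] = np.sum(area * kernel) + bias   # area <dot> kernel + bias
        return out

    print(conv2d_valid(np.random.rand(28, 28), np.random.randn(5, 5)).shape)   # (24, 24)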


The key idea is that the filter (matrix) can be developed as a detector for a particular type of features (e.g. some particular shape).

The same filter is applied across the whole image, so similar shapes anywhere in the image will be detected by exactly the same filter.

The small window is called the Local Receptive Field, and the weights / bias of the kernel are shared by all hidden neurons in the next layer. Imagine the kernel is a detector of cat features; then scanning through the whole image step by step with the same kernel helps to figure out where a cat is in the image.

A convolutional layer typically has multiple filters so it can detect different features.

One big advantage of using filters is to reduce the number of parameters greatly. A fully connected layer of 784 neurons linked to a next layer of 30 hidden neurons will have 784 * 30 weights + 30 biases to learn. Using a convolutional layer with 20 filters of the size 5x5 reduces the number of parameters to 20 * (5*5 weights + 1 bias) = 520 which is a lot smaller.

In theory we can't directly compare a fully connected layer to a convolutional layer because they work in essentially different ways, but a convolutional layer captures features with far fewer parameters while achieving the same / similar performance, so training is quicker.

Pooling

Pooling is a down-sampling technique to further reduce the spatial size of the representation coming from a convolutional layer.

A commonly used setting is Max-Pooling which uses 2 x 2 kernel(filter) with a stride of 2, and outputs the max value from the 2 x 2 neurons.

This divides a matrix representation into a grid with a cell size of 2 x 2, and each cell is then fed to a neuron in the next layer. In this way the number of neurons is reduced to 1/4.

Other pooling types are L2 pooling (square root of the sum of squares of all the 2x2 neurons), average pooling, etc.

Pooling is applied to each filter's output separately.
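
A sketch of 2 x 2 max pooling with a stride of 2; the block_reduce helper from scikit-image, which the code listing at the end also imports, does exactly this kind of down-sampling:

    import numpy as np
    from skimage.measure import block_reduce

    feature_map = np.random.rand(24, 24)                  # one filter's output
    pooled = block_reduce(feature_map, (2, 2), np.max)    # keep the max of each 2x2 cell
    print(pooled.shape)                                   # (12, 12)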

Multi input channels and multi output channels

Assume there are 3 input channels of an image (the red, green and blue value of an image).

And there are 10 output channels. Each output channel corresponds to a filter so there are 10 filters.

Each filter is applied to all 3 input channels. The 3 results are simply summed up and a bias is added before passing the result on to an output channel.

Repeat the same for each filter and get 10 output channels.

Note it is also valid & useful to use separate filters on different input channels.

In the PyTorch library, the parameter 'groups' of nn.Conv2d specifies how input channels are mapped onto output channels.

E.g. if there are 3 input channels and 9 output channels, we can set groups = 3, so the first input channel goes only to the first 3 output channels, the second input channel goes to output channels 4-6, and the 3rd input channel goes to output channels 7-9.

When groups = 1, every filter is applied to all input channels and the results are summed up.
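
A small PyTorch sketch of the groups parameter (assuming PyTorch is available; only the shapes matter here, the weights are untrained):

    import torch
    import torch.nn as nn

    x = torch.randn(1, 3, 28, 28)                       # batch of 1, 3 input channels

    # groups=1 (default): each of the 9 filters sees all 3 input channels and sums them
    conv_all = nn.Conv2d(in_channels=3, out_channels=9, kernel_size=5, groups=1)

    # groups=3: input channel 1 feeds output channels 1-3, channel 2 feeds 4-6, and so on
    conv_grouped = nn.Conv2d(in_channels=3, out_channels=9, kernel_size=5, groups=3)

    print(conv_all(x).shape, conv_grouped(x).shape)     # both: [1, 9, 24, 24]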

The output channels of a convolution layer are flattened and fully connected to the next hidden layer, so the links between the output channels and the next hidden layer carry weights.

If you prefer the links not to carry weights (maybe for ease of programming), then the next hidden layer simply receives the original values from the output channels. That hidden layer is not doing anything but serving as a carousel to pass data from the output channels on to the second hidden layer.

To put all layers together:

(1) image as an input layer

(2) a convolution layer, which applies the kernels to the image and then applies an activation function to the output

(3) the convolution layer output feeds into a pooling layer, whose output is smaller than its input (i.e. down-sampled)

(4) the pooling layer applies no activation function but directly feeds to a fully connected layer

(5) other fully connected layers if needed.

The training process is still the same. Stochastic gradient descent (SGD) adjusts the weights and biases of both the fully connected layers and the convolution layers, so we learn the kernels too.

The back propagation algorithm is slightly different for the convolution and pooling layers.

The pooling layer (e.g. max) will pass on the Error (delta) to only the neuron with the max value.

The convolution layer's dC/dw and dC/db calculations are different too. Check the study notes for details: dC/db = sum(deltas from pooling), dC/dw = convolve(input activations, deltas).

If programming is too hard, many tools like Theano, TensorFlow and PyTorch can help. Simply specify the layers and that is it.
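
For example, the stack of layers described above could be specified in a few lines of PyTorch; the sizes here are assumptions for illustration, not the exact network from these notes:

    import torch.nn as nn

    model = nn.Sequential(
        nn.Conv2d(1, 20, kernel_size=5),   # (1)+(2): 1x28x28 image -> 20x24x24, learned kernels
        nn.ReLU(),
        nn.MaxPool2d(2),                   # (3): 20x24x24 -> 20x12x12
        nn.Flatten(),                      # (4): feed the pooled maps to a fully connected layer
        nn.Linear(20 * 12 * 12, 100),
        nn.ReLU(),
        nn.Linear(100, 10),                # (5): output layer, 10 digit classes
    )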

Ways to improve performance.

1. More than one convolution layer.

The kernels of the first conv layer abstract and condense the original image, but the result still has a lot of spatial structure for the next conv layer to learn patterns from.

2. A different activation function. 

Arguably, tanh is slightly better (converges faster) than sigmoid, but the evidence is not convincing.

The Rectified Linear Unit consistently performs better than tanh and sigmoid. However, so far we have only a poor understanding of why.

3. Expanding training data

We can modify the original dataset by moving the image by a pixel to the left/right/up/down, or by rotating the image a bit, or cropping images, etc, and create new training data.

This simple trick improves performance substantially as the new data set captures more scenarios.

4. Extra fully connected layer & extra neurons in an existing fully connected layer

This would help a bit but not much.

5. Drop out

Randomly drop e.g. half of the neurons from a fully connected layer. 

This gives substantial improvement, as Drop Out brings in randomness and reduces overfitting, and so it learns faster.

Note we apply dropout to the fully connected layers only. A convolution layer is already forced to scan through the whole image, so it's less likely to pick up (overfit) local idiosyncrasies.

6. Ensemble of networks

Even training a network with the same parameters you will still get different networks (different weights, etc.)

Each network might overfit in some ways but predict well in others.

So let all the networks vote to determine the final result. This gives a further improvement.

This ensemble trick is common in other models too, not just neural networks.

Why didn't deep learning work back in the 90s? How do we solve the vanishing / exploding gradient problems here???

* No, we haven't solved the vanishing/exploding gradient issues.

* However we have done a few things to help us proceed anyway.

1. Convolutional layers greatly reduce the number of parameters in those layers. So less weights and biases to learn.

2. More powerful regularization techniques beyond L1 and L2, notably Dropout, and convolution itself, to reduce overfitting.

3. Using the Rectified Linear Unit instead of Sigmoid / Tanh to speed up training by a factor of 3-5.

4. GPUs are very helpful, so we can train for a longer period of time (more epochs). And computers are more powerful these days.

5. Other small ideas: using the right cost function, using good weight initialization, expanding the dataset, etc.

All the above ideas (simple but powerful) enable us to start training deep networks. 3 or 4 layers already count as deep networks!

This is a real breakthrough as it's now practical to go beyond the shallow 1 or 2 hidden layer networks that dominated work until the mid-2000s!

Other types of networks

Recurrent Neural Networks

In a static feed forward network, a neuron takes in inputs from the previous layer and outputs a value as per the activation function. Everything is deterministic.

What if the behaviour of a hidden neuron is not just determined by the activations in the previous layers, but also by activations at earlier times? This is similar to how people react to something not just based on the current situation, but also on situations met before.

Long short-term memory units

In Recurrent Neural Networks, the gradient becomes even more unstable when propagated backward, because the gradient propagates back not only through the layers, but also backward through TIME. That makes the training of early layers extremely slow, slower than in a normal feed forward network.

Deep belief nets, generative models, boltzmann macines

A DBN is an example of a generative model. It can be used in a similar way to a feedforward network, but it can also run backward, given the values of some of the feature neurons. So a DBN can learn to recognize handwritten digits and also simulate writing handwritten digits.

Moreover, it can do unsupervised and semi-supervised learning.

It's getting less popular these days because CNN and RNN have been so successful and the winner takes all the thunder.

Deep learning + reinforcement learning

This learns to play video games...

What deep network is trying to do?

Using image pixels to answer a complex question, e.g. is it human face?, directly is very hard.

We can decompose a complex question into sub-questions, e.g. are there two eyes at the top? a nose in the middle? a mouth at the bottom? etc. If all the answers are yes or probably yes, then we can conclude that it's likely a face.

A question like 'is there a nose in the middle?' can be further decomposed into smaller questions about the shape, size, nostrils, skin color, etc.

Smaller questions like 'what is the shape?' are easier to answer by checking the pixels.

A neural network breaks down a very complicated question - does this image show a face or not - into very simple questions answerable at the level of single pixels. It does this through a series of many layers, with early layers answering very simple and specific questions about the input image, and later layers building up a hierarchy of ever more complex and abstract concepts. Networks with this kind of many-layer structure - two or more hidden layers - are called deep neural networks

//maybe more to add here later.

Appendix. Problem with back propagation and solutions

When the number of layers of a neural network increases, e.g. to 3, 4 or more, not only does the computational cost grow, but back propagation also tends to fail to calculate useful gradients.

For a deep network trained by back propagation, the calculated error shrinks rapidly, or grows out of bounds. This is called the vanishing / exploding gradient problem.


------------------a working Convolution NN with drop out for MNIST dataset -----------------

"""

A layer of convolution, a max pooling layer, a hidden layer, an output layer

Apply dropout to reduce overfitting

The dropout is simulated by applying a filter (mask) on the weight matrix. The dropped neurons' rows and columns in the matrix are zeroed

A convolution layer with num_kernels (10 below) kernels of size 5 x 5

followed by 2 x 2 max pooling on each kernel's output

Apply Relu on max pooling output

A hidden layer using Relu activation function

A softmax layer as the output

Cost function = cross entropy (negative log likelihood)

regularization = L2

use N(0, 1/n) distribution to initialize weights 

"""

import random

import numpy as np

from scipy.ndimage import convolve

from skimage.measure import block_reduce

import pickle

import gzip

import matplotlib.pyplot as plt

import sys

from datetime import datetime

class MNISTLearner(object):

    def __init__(self, sizes=None):

        

        '''

        the input are 28 x 28 = 784 pixels (directly into convolution layer)

        the output from convolution layer are 24 x 24 = 576 

        

        the 576 input directly into a 2x2 max pooling layer 

        the output from max pooling are 12 x 12 = 144

        Apply activation function (eg relu) on the max pooling output

        

        the 12x12 neurons per kernel from the max pooling output (144 * num_kernels in total) go straight to the next layer

        Consider it as a dummy layer to hold those neurons (logical)

        

        A hidden layer of 100 neurons

        

        A soft max layer of 10 neurons

        

        So layers are:

         28x28 -> 24x24 | 24x24 -> 12x12 (a logical layer of 144 * num_kernels) | 100 | 10

        '''

        

        '''the first layer - convolutional layer'''

        #kernel shape

        self.kernel_shape = (5, 5)

        self.maxpool_shape = (2, 2)

        #the number of kernels/filters

        self.num_kernels = 10

        #the weights of the 5 x 5 kernel

        self.kernel_weights = [np.random.randn(self.kernel_shape[0], self.kernel_shape[1]) for i in range(self.num_kernels)]

        #one bias per kernel, shared across the whole image

        self.kernel_bias = [np.random.randn() for i in range(self.num_kernels)]

        

        '''the second layer - 2x2 max pooling layer'''

        #the max pooling layer simply receives the 24x24=576 values per kernel from the convolution output

        #and maps them to an output of 12x12=144 per kernel

        #then ReLU is applied on the 144 outputs

               

        '''the layers after the max pooling layer'''

        #the 144 neurons per kernel output from max pooling fully connect to the next layer

        #treat the traditional (fully connected) part of the network as starting from these 144 * num_kernels neurons

        #the calculation of the convolutional and max pooling layers' errors & derivatives differs from the traditional layers

        self.sizes = [144 * self.num_kernels, 100, 10] 

        self.num_ordinary_layers = len(self.sizes) #let's call these the 'ordinary' layers (informal term)

        self.drop_rates = [0, 0.5, 0] #dropout rate per ordinary layer, no dropout for the input and output

        

        #biases for the hidden and output layers

        #randn generates random numbers from normal distribution with the shape (i, 1), i.e. column vector    

        #this doesn't include the convolutional and maxpooling layers

        self.biases = [np.random.randn(i, 1) for i in self.sizes[1:]]               

          

        #a list of matrices for the 2nd ordinary layer and after; this doesn't include the convolutional and max pooling layers

        #weights = [weights to layer2, weights layer3, ... weights to last layer]

        #the first layer has no weights as it's only an input

        self.weights = [np.random.randn(y, x)/np.sqrt(x) for x, y in zip(self.sizes[:-1], self.sizes[1:])]

        #more explanation below:

        #each weight matrix has a shape (k, j)

        #where j is the number of neurons in layer L-1, and k is the number of neurons in layer L (i.e current layer)

        #so matrix[k][j] is the weight of link from neuron j in the previous layer to neuron k in the current layer

        #in this way, given input data X as a column vector,

        # matrix <dot> X + bias gives the weighted input of the current layer, where X is the output of the previous layer

        # i.e. y = weights * x + b, which looks nice mathematically

        # here the weights are initialized as N(0, 1/n) where n is the number of neurons in the previous layer

        #this initialises the weights so that the hidden-layer neurons do not start out saturated

        #dividing N(0, 1) samples by sqrt(n) gives variance 1/n, since Var(x / sqrt(n)) = Var(x) / n = 1/n
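
        #quick numeric illustration (hypothetical numbers, not part of the model): with n = 1440 inputs,

        #np.random.randn(100, 1440) / np.sqrt(1440) gives weights with variance about 1/1440,

        #so the weighted input z = w <dot> x stays around O(1) instead of O(sqrt(1440))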

   

    '''

    Feed forward: input x goes through the convolutional layer, max pooling and the subsequent matrix multiplications

    to get the activation value output from the last layer

    after looping through all layers, return the final output

    image is a 28 x 28 array

    Note the image, the convolutional layer and the max pooling layer work with matrices, while the ordinary layers use column vectors.

    '''

    def feedforward(self, image):

        '''the first layer, convolutional layer, which takes in the image directly'''

        #the convolve function conducts the convolution operation

        #for areas where the image does not completely overlap with the kernel, convolve() still yields a value

        #keep only the overlapped area in the output

        outs = [convolve(image, w) + b for w, b in zip(self.kernel_weights, self.kernel_bias)]

        overlap = np.array(image.shape) - np.array(self.kernel_shape) + 1

        outs = [out[:overlap[0], :overlap[1]] for out in outs] #trim away the area that does not completely overlap with the kernel

        #no activation operation here, leave it to the maxpooling layer

        

        '''the second layer, a 2x2 max pooling layer'''

        outs = np.array([block_reduce(o, (2,2), np.max) for o in outs]) #pick the max value out of every 2x2 square

        x = outs.reshape((outs.size,1)) #reshaped as a column vector, the output is passed onto the next ordinary layer 

        x = self.relu(x)  #apply the activation function on the max pooling output

        

        '''continue wx + b calculation, use relu for all ordinary layers except the last'''

        for b, w in zip(self.biases[:-1], self.weights[:-1]):

            x = self.relu(np.dot(w, x)+b)

            

        '''use softmax in the last layer'''

        x = self.softmax(np.dot(self.weights[-1], x)+self.biases[-1])

        

        return x

       

    '''

    back propagation

    calculate the gradient with respect to weights and biases for the training sample (x, y)

    the current weights, biases and the training sample define the error for each neuron

    the error further defines the gradients

    the weight_filter and layer_filter are for excluding neurons, weights and biases that are dropped out.

    e.g. weight = [1,2,3,4,5]^T, weight_filter = [0,1,1,0,0]^T, then weight * weight_filter means the 1st, 4th and 5th elements are dropped out

    weight filter goes from the second layer to the last layer

    layer_filter goes from the second layer to the last layer 

    '''   

    def backprop(self, image, y, weight_filter, layer_filter):

        #temp variables for keeping z values and activation values

        zs = []   

        activations = []

        

        #initialize the gradient w.r.t weights and biases for the ordinary layers

        nabla_b = [np.zeros(b.shape) for b in self.biases]

        nabla_w = [np.zeros(w.shape) for w in self.weights] 

           

        '''feed forward, convolutional layer and max pooling layer'''      

        #input layer & also the convolutional layer

        #apply convolution op on the original image         

        outs = [convolve(image, w) + b for w, b in zip(self.kernel_weights, self.kernel_bias)]

        overlap = np.array(image.shape) - np.array(self.kernel_shape) + 1

        conv_outs = [out[:overlap[0], :overlap[1]] for out in outs] #trim away the area that does not completely overlap with the kernel

                      

        #max pooling layer

        maxpool_outs = np.array([block_reduce(out, (2,2), np.max) for out in conv_outs]) #pick the max value out of every 2x2 square

        z = maxpool_outs.reshape((maxpool_outs.size,1)) #reshaped as a column vector

        zs.append(z) #keep reduced matrix

        activation = self.relu(z)  #apply the activation function on the max pooling output

        activations.append(activation)        

        

        '''feed forward, ordinary layers '''      

        #run feed forward first to get the z values and the activation values for all layers except the last one

        for b,w, lf, wf in zip(self.biases[:-1], self.weights[:-1], layer_filter[:-1], weight_filter[:-1]):

            z = (np.dot(w * wf, activation) + b) * lf #get to the next layer by calculating wx + b.

            zs.append(z)           

            activation = self.relu(z) * lf   #activate the neuron

            activations.append(activation) #keep track of the output of each layer

            

        '''feed forward, the last layer'''

        z = np.dot(self.weights[-1], activation) + self.biases[-1] #no drop out for the last layer

        zs.append(z)

        activation = self.softmax(z)   #activate the neuron

        activations.append(activation)

        

      

        '''back propagation, last layer'''         

        #calculate the Error (delta) of the last layer    

        #refer to negative log loss and softmax

        delta = activations[-1] - y     

                

        #the gradient w.r.t bias equals delta

        nabla_b[-1] = delta

        

        #the gradient w.r.t. weight in layer L = the Error * activation from layer L-1

        #delta is a column vector, activation is also a column vector

        #delta <dot> activation.transpose yields a matrix with the same shape as the weight matrix (one entry per link from every neuron in L-1 to every neuron in L).         

        #the activations[-2] has the dropped neurons zeroed out already

        nabla_w[-1] = np.dot(delta, (activations[-2]).transpose())  

            

        '''back propagation, ordinary layers before the last'''  

        #use back propagation to calculate the Error for the previous layers

        #Also the biases and weights for the previous layers back to the first ordinary layer (after maxpooling)

        #delta_L = (w_L+1)^T <dot> delta_L+1 <elementwise multiplication> activation'(z)

        #note the loop accesses array element through negative index

        for i in range(2, self.num_ordinary_layers):

            layer = - i #negative index

            #exclude the weights of dropped neurons in layer L+1, and zero out the deltas of dropped neurons in layer L by multiplying by layer_filter

            delta = np.dot( (self.weights[layer + 1] * weight_filter[layer + 1]).transpose(), delta) * self.relu_prime(zs[layer]) * layer_filter[layer]

            nabla_b[layer] = delta

            nabla_w[layer] = np.dot(delta, activations[layer-1].transpose())            

        #at this point we have the derivatives w.r.t. the weights of links going from the 144 * num_kernels neurons (max pooling output) to the 100 neurons of the next ordinary layer       

        

        

        '''back propagation, max pooling layer'''

        #Propagate the errors (delta) back to the max pooling layer

        #the links into the max pooling layer are different from the links into an ordinary layer

        #so no need to calculate nabla_b or nabla_w for max pooling here; leave it to the convolutional layer

        delta = np.dot(self.weights[0].transpose(), delta) * self.relu_prime(zs[0])

        

        #reshape the max pooling layer's delta vector back to the (num_kernels, 12, 12) shape of the max pooling output

        #note these are the output neurons of max pooling

        delta = delta.reshape(maxpool_outs.shape) 

        

        #here calculate the error of maxpool's input neurons, 24 x 24      

        delta_maxp = np.array([np.zeros(co.shape) for co in conv_outs]) #the output from convolution, i.e. the input to max pooling, initialized as zeros
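
        #illustrative example (made-up numbers): if one 2x2 region of the convolution output is

        #[[0.1, 0.9], [0.3, 0.2]], its max sits at position (0, 1), so the whole error delta[k, i, j]

        #is routed to delta_maxp[k, 2*i, 2*j + 1] in the loop below and the other three positions stay 0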

        for k in range(delta.shape[0]):

            for i in range(delta.shape[1]):

                for j in range(delta.shape[2]):

                    #in maxp layer, the 4 neurons 2i+p, 2j+q (p, q in 0,1) map to one output neuron i, j

                    #get the index p,q of the max neuron in a 2x2 area

                    neurons2x2 = conv_outs[k][2*i:2*i+2,2*j:2*j+2] #get the corresponding 2x2 region in maxpool input, i.e. convolution output

                    max_pq = np.unravel_index(np.argmax(neurons2x2), neurons2x2.shape) #get the position (p, q) of the max value in the 2x2 region                

                    delta_maxp[k, 2*i + max_pq[0], 2*j + max_pq[1]] = delta[k, i, j]  #only the max neuron is passed on with the previous error, other neurons remain zero

        delta_maxp = [dm for dm in delta_maxp] #convert to a list

                

        ''' use the max pooling layer's Error (delta) to calculate the gradients of the kernel weights and biases of the convolutional layer '''                      

        #derive the kernel_nabla_b 

        kernel_nabla_b = [np.sum(dm) for dm in delta_maxp]

        

        #derive the kernel_nabla_w

        #this is the tricky part, refer to formulas for the convolutional layer        

        kernel_nabla_w = [convolve(image, dm) for dm in delta_maxp]

        overlap = np.array(image.shape) - np.array(delta_maxp[0].shape) + 1

        kernel_nabla_w = [kw[:overlap[0], :overlap[1]] for kw in kernel_nabla_w]

                          

        return (nabla_b, nabla_w, kernel_nabla_b, kernel_nabla_w)

    

    '''

    update the weights w and bias b using a mini batch of training sample

    calculate the gradient for each sample (x, y) separately and then average them to approximate the actual gradient

    '''

    def update_mini_batch(self, mini_batch, eta, n, lmbda):

        #initialize the gradient of Cost w.r.t. biases and weight

        sum_nabla_b = [np.zeros(b.shape) for b in self.biases] 

        sum_nabla_w = [np.zeros(w.shape) for w in self.weights]

        sum_kernel_nabla_b =[0 for b in self.kernel_bias]

        sum_kernel_nabla_w =[np.zeros(w.shape) for w in self.kernel_weights]

        

        #iterate through every training sample in the mini batch

        #calculate the gradient for each of the samples and sum up

        for x, y in mini_batch: 

            #apply drop rate to get a weight filter

            #ignore the first (input) layer, so drop_rates is sliced from index 1

            weight_filter=[] #masks applied to the weight matrices (rows and columns); the first layer has no weights

            layer_filter =[] #indicate which neurons are dropped per layer

            for dr, w, b in zip(self.drop_rates[1:], self.weights, self.biases):

                filter_rows = np.random.rand(w.shape[0]) #a list of random numbers between 0 and 1

                filter_rows = np.where(filter_rows >= dr, 1, 0) #drop a fraction dr of the rows; surviving elements are assigned 1, dropped ones 0

                filter_rows = filter_rows.reshape((len(filter_rows),1)) #convert to a column vector

                layer_filter.append(filter_rows)

                weight_row_filter = np.repeat(filter_rows, w.shape[1], axis=1) #repeat the column vector so only the surviving rows' columns are 1, else 0

                if len(layer_filter)>1 : #there is a previous filter

                    #the previous weight's row filter is the current weight's column filter, 

                    #as a dropped neuron affects the incoming and outgoing weights at rows and columns respectively

                    filter_cols = layer_filter[-2].reshape((1,len(layer_filter[-2]))) #convert back to a matrix with one row [[1,2,3...]]; note this is not a flattened row vector [1,2,3...]                    

                    weight_col_filter = np.repeat(filter_cols, w.shape[0], axis=0) #repeat the row 

                    weight_filter.append(weight_row_filter * weight_col_filter) #combine the two filters to form a matrix mask  

                else:

                    weight_filter.append(weight_row_filter) #there is no previous filter               
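
            #illustrative example (made-up sizes): if a layer of 3 neurons has neuron 1 dropped,

            #its layer_filter entry is [[1],[0],[1]]; row 1 of that layer's incoming weight mask is zeroed here,

            #and column 1 of the NEXT layer's weight mask is zeroed via filter_cols above,

            #because a dropped neuron cuts both its incoming and its outgoing links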

            #back propagation to calculate the gradient for sample x, y

            delta_nabla_b, delta_nabla_w, delta_kernel_nabla_b, delta_kernel_nabla_w = self.backprop(x,y, weight_filter, layer_filter)

            

            #sum up the gradient calculated from each training sample and take the average (later) outside of the loop

            sum_nabla_b = [nb + dnb for nb, dnb in zip(sum_nabla_b, delta_nabla_b)] 

            sum_nabla_w = [nw + dnw for nw, dnw in zip(sum_nabla_w, delta_nabla_w)] 

            sum_kernel_nabla_b = [nb + dnb for nb, dnb in zip(sum_kernel_nabla_b, delta_kernel_nabla_b)]

            sum_kernel_nabla_w = [nw + dnw for nw, dnw in zip(sum_kernel_nabla_w, delta_kernel_nabla_w)]      

                         

        #take the average of the gradients over all samples; some implementations omit the averaging and it is usually still fine (it just rescales the effective learning rate)

        #the average here is considered as an estimate of the actual gradient

        avg_nabla_b = [dnb/len(mini_batch) for dnb in sum_nabla_b]

        avg_nabla_w = [dnw/len(mini_batch) for dnw in sum_nabla_w]      

        avg_kernel_nabla_b = [dnb/len(mini_batch) for dnb in sum_kernel_nabla_b]

        avg_kernel_nabla_w = [dnw/len(mini_batch) for dnw in sum_kernel_nabla_w]

        

        #when applying dropout, each mini batch trains a different (random) subnet, but the weight and bias updates from the different subnets accumulate in the same parameters

        #so rescale the weights proportionally, e.g. with a 50% dropout rate the weights are multiplied by 50%

        #scale_factor = [(1-dr) for dr in self.drop_rates] #note the first layer (input) is included here

        

        #adjust the weights and biases by moving a small step in the opposite direction of the gradient

        #lmbda/n * w is the derivative of the L2 penalty term (lmbda / (2n)) * sum(w^2) w.r.t. w

        #self.weights = [ w - eta * nw - eta * lmbda/n * w   for w, nw, dr in zip(self.weights, avg_nabla_w)]

        #self.weights = [ w - eta * nw for w, nw in zip(self.weights, avg_nabla_w)] #no penalty term

        #rescale the weights due to dropout; note a layer's outgoing weights are scaled by its keep probability (1 - drop rate), not the incoming weights

        self.weights = [ w - eta * nw * (1-dr) - eta * lmbda/n * w *(1-dr)   for w, nw, dr in zip(self.weights, avg_nabla_w, self.drop_rates[:-1])]

        

        self.biases = [ b - eta * nb for b, nb in zip(self.biases, avg_nabla_b)]       

        #rescaling the biases seems unnecessary; note it would use the drop rate of the current layer

        #self.biases = [ (b - eta * nb) * (1-dr) for b, nb, dr in zip(self.biases, avg_nabla_b, self.drop_rates[1:])]         

        

        #adjust kernel weights separately

        self.kernel_weights = [w - eta * nw - eta * lmbda/n * w for w, nw in zip(self.kernel_weights,avg_kernel_nabla_w)]

        #self.kernel_weights = self.kernel_weights - eta * avg_kernel_nabla_w

        self.kernel_bias = [b - eta * nb for b, nb in zip(self.kernel_bias, avg_kernel_nabla_b)]

        

        #print(self.kernel_weights)

    

    '''

    stochastic gradient descent sgd

    every round of training uses a random mini batch (subset) of training data.

    Pre-divide the training data into mini batches and use them one after another.

    once all mini batches have been used, one epoch is done

    then shuffle the training data randomly again and start another epoch

    the function stops when the specified number of epochs of training have been done.

    the test data (if available) is for tracking the performance of each epoch

    testing and printing the result slows down training considerably, so comment it out if not needed.

    '''

    def sgd(self, training_data, epochs, mini_batch_size, eta, lmbda, test_data=None):      

        n = len(training_data)

           

        for i in range(epochs):

            #re-shuffle the training data every epoch

            random.shuffle(training_data) 

            #divide the training data set into mini batches as per the mini batch size

            mini_batches = [training_data[j:j+mini_batch_size] for j in range(0, len(training_data), mini_batch_size)]       

          

            #adjust weights and biases every mini batch

            for mini_batch in mini_batches:

                #update the weights w and biases b by moving one step 'eta' using a mini batch of data

                self.update_mini_batch(mini_batch, eta, n, lmbda) 

                

            #print out the current performance if test data available              

            if test_data:

                accuracy = self.evaluate(test_data)

                print("Epoch {0}: {1}     {2}".format(i, accuracy, datetime.now()))               

            else:

                print("Epoch {0} done".format(i))

                       

    

    def relu(self, z):

        return np.where(z > 0, z, 0)

    

    def relu_prime(self, z):

        return np.where(z > 0, 1, 0)

    

    '''

    the activation for the last layer  

    Subtracting C = max(z) avoids overflow in np.exp without changing the result

    '''
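
    #illustrative note: the shift leaves the result unchanged because exp(z - C) / sum(exp(z - C)) == exp(z) / sum(exp(z));

    #e.g. softmax on [1000, 1001] would overflow np.exp, but np.exp([-1, 0]) = [0.368, 1.0] normalises to the same answer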

    def softmax(self, z):

        C = np.max(z)

        expz = np.exp(z - C)

        return expz / np.sum(expz)   

       

    '''

    test the current neural network with test_data and return the % of correct predictions    

    '''

    def evaluate(self, test_data):

        #return the index of the largest output value for each test sample

        #the index is from 0 to 9 so index represents the actual recognised number

        test_results = [(np.argmax(self.feedforward(x)), np.argmax(y)) for (x,y) in test_data]

        #return the % of correct predictions

        return sum(int(x==y) for (x,y) in test_results) / float(len(test_data))

        

print('========== start running main method ==========')

#load data from zip file

f = gzip.open('c://temp//mnist.pkl.gz', 'rb')

training_data, validation_data, test_data = pickle.load(f, encoding='latin1')

f.close()

#reading data in

training_inputs = [np.reshape(x, (28,28)) for x in training_data[0]]

training_results = []

for y in training_data[1]:

    vector = np.zeros((10,1))

    vector[y] = 1.0

    training_results.append(vector) #vectorize y, so it can be used in back propagation for matrix multiplication.

tr_data = list(zip(training_inputs, training_results)) #in Python 3 zip returns an iterator, so convert it to a list

#validation_inputs = [np.reshape(x, (784,1)) for x in validation_data[0]]

#va_data = zip(validation_inputs, validation_data[1]) #no need to vectorize the result, it's used for validation only. Wont be used for matrix multiplication.

test_inputs = [np.reshape(x, (28,28)) for x in test_data[0]]

test_results = []

for y in test_data[1]:

    vector = np.zeros((10,1))

    vector[y] = 1.0

    test_results.append(vector)

te_data = list(zip(test_inputs, test_results))

print('========== finish loading data ==========')

print('========== start training {0} =========='.format(datetime.now()))

#nn = MNISTLearner(sizes = [784, 100, 10]) 

nn = MNISTLearner()       

nn.sgd(tr_data, epochs=30, mini_batch_size=10, eta = 0.01, lmbda = 0.1, test_data = te_data)

print('========== training is done ==========')