## 1. Overview

In this codelab, you will learn how to build and train a neural network that recognises handwritten digits. Along the way, as you enhance your neural network to achieve 99% accuracy, you will also discover the tools of the trade that deep learning professionals use to train their models efficiently.

This codelab uses the MNIST dataset, a collection of 60,000 labeled digits that has kept generations of PhDs busy for almost two decades. You will solve the problem with less than 100 lines of Python / TensorFlow code.

## What you'll learn

- What is a neural network and how to train it
- How to build a basic 1-layer neural network using TensorFlow
- How to add more layers
- Training tips and tricks: overfitting, dropout, learning rate decay...
- How to troubleshoot deep neural networks
- How to build convolutional networks
## What you'll need

- Python 2 or 3 (Python 3 recommended)
- TensorFlow
- Matplotlib (Python visualisation library)

Installation instructions are given in the next step of the lab.

## 2. Preparation: Install TensorFlow, get the sample code

Install the necessary software on your computer: Python, TensorFlow and Matplotlib. Full installation instructions are given here: INSTALL.txt

Clone the GitHub repository:

    $ git clone https://github.com/martin-gorner/tensorflow-mnist-tutorial

The repository contains multiple files. The only one you will be working in is

When you launch the initial Python script, you should see a real-time visualisation of the training process:

    $ python3 mnist_1.0_softmax.py

Troubleshooting: if you cannot get the real-time visualisation to run, or if you prefer working with only the text output, you can deactivate the visualisation by commenting out one line and uncommenting another. See the instructions at the bottom of the file.

The visualisation tool built for TensorFlow is TensorBoard. Its main goal is more ambitious than what we need here: it is built so that you can follow your distributed TensorFlow jobs on remote servers. For what we need in this lab, matplotlib will do, and we get real-time animations as a bonus. But if you do serious work with TensorFlow, make sure you check out TensorBoard.

## 3. Theory: train a neural network

We will first watch a neural network being trained. The code is explained in the next section so you do not have to look at it now.

Our neural network takes in handwritten digits and classifies them, i.e. states whether it recognises them as a 0, a 1, a 2 and so on up to a 9. It does so based on internal variables ("weights" and "biases", explained later) that need to have correct values for the classification to work well. These correct values are learned through a training process, also explained in detail later. What you need to know for now is that the training loop repeatedly feeds training digits into the network and adjusts the weights and biases so that the recognition gets better.

Let us go through the six panels of the visualisation one by one to see what it takes to train a neural network.

Here you see the training digits being fed into the training loop, 100 at a time. You also see whether the neural network, in its current state of training, has recognised them (white background) or misclassified them (red background, with the correct label in small print on the left side and the incorrect computed label on the right of each digit).

There are 50,000 training digits in this dataset. We feed 100 of them into the training loop at each iteration, so the system will have seen all the training digits once after 500 iterations. We call this an "epoch".

To test the quality of the recognition in real-world conditions, we must use digits that the system has NOT seen during training. Otherwise, it could learn all the training digits by heart and still fail at recognising an "8" that I just wrote. The MNIST dataset contains 10,000 test digits. Here you see about 1000 of them, with all the misrecognised ones sorted at the top (on a red background). The scale on the left gives you a rough idea of the accuracy of the classifier (% of correctly recognised test digits).

To drive the training, we will define a loss function, i.e. a value representing how badly the system recognises the digits, and try to minimise it. The choice of a loss function (here, "cross-entropy") is explained later. What you see here is that the loss goes down on both the training and the test data as the training progresses: that is good. It means the neural network is learning. The X-axis represents iterations through the learning loop.

The accuracy is simply the % of correctly recognised digits. This is computed both on the training and the test set. You will see it go up if the training goes well.

The final two graphs represent the spread of all the values taken by the internal variables, i.e. weights and biases, as the training progresses.
Here you see, for example, that biases started at 0 and ended up taking values spread roughly evenly between -1.5 and 1.5. These graphs can be useful if the system does not converge well. If you see weights and biases spreading into the 100s or 1000s, you might have a problem. The bands in the graphs are percentiles. There are 7 bands, so each band contains 100/7 ≈ 14% of all the values.
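
To make the percentile bands concrete, here is a minimal NumPy sketch. It is not the codelab's visualisation code: the `values` array is a random stand-in for the weights or biases at one training step, and the assumption is that the bands are delimited by 8 evenly spaced percentiles.

```python
import numpy as np

# Hypothetical stand-in for the weights or biases at one training step:
# 7850 values spread evenly between -1.5 and 1.5.
rng = np.random.default_rng(0)
values = rng.uniform(-1.5, 1.5, size=7850)

# 8 evenly spaced percentiles (0%, 14.3%, ..., 100%) delimit 7 bands.
boundaries = np.percentile(values, np.linspace(0, 100, 8))

# Count how many values fall inside each band.
counts, _ = np.histogram(values, bins=boundaries)
share = counts / values.size

print("band boundaries:", np.round(boundaries, 2))
print("share of values per band (%):", np.round(share * 100, 1))
```

Plotting these boundaries at every iteration would give bands like the ones in the last two panels. Values drifting into the 100s or 1000s would show up as boundaries moving far away from zero.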

What are "weights" and "biases"? How is the "cross-entropy" computed? How exactly does the training algorithm work? Jump to the next section to find out.

## 4. Theory: a 1-layer neural network

Handwritten digits in the MNIST dataset are 28x28 pixel greyscale images. The simplest approach for classifying them is to use the 28x28=784 pixels as inputs for a 1-layer neural network.

Each "neuron" in a neural network does a weighted sum of all of its inputs, adds a constant called the "bias" and then feeds the result through some non-linear activation function.

Here we design a 1-layer neural network with 10 output neurons since we want to classify digits into 10 classes (0 to 9).

For a classification problem, an activation function that works well is softmax. Applying softmax on a vector is done by taking the exponential of each element and then normalising the vector so that its elements sum to 1 (that is, dividing each exponential by the sum of all of them).

Why is "softmax" called softmax? The exponential is a steeply increasing function. It will increase differences between the elements of the vector. It also quickly produces large values. Then, as you normalise the vector, the largest element, which dominates the sum, will be normalised to a value close to 1 while all the other elements will end up divided by a large value and normalised to something close to 0. The resulting vector clearly shows which was its largest element, the "max", but retains the original relative order of its values, hence the "soft".

We will now summarise the behaviour of this single layer of neurons into a simple formula using a matrix multiply. Let us do so directly for a "mini-batch" of 100 images as the input, producing 100 predictions (10-element vectors) as the output.

Using the first column of weights in the weights matrix W, we compute the weighted sum of all the pixels of the first image. This sum corresponds to the first neuron.
Using the second column of weights, we do the same for the second neuron and so on until the 10th neuron. We can then repeat the operation for the remaining 99 images. If we call X the matrix containing our 100 images, all the weighted sums for our 10 neurons, computed on 100 images, are simply X.W (matrix multiply).

Each neuron must now add its bias (a constant). Since we have 10 neurons, we have 10 bias constants. We will call this vector of 10 values b. It must be added to each row of the previously computed matrix. Using a bit of magic called "broadcasting", we will write this with a simple plus sign.

"Broadcasting" is a standard trick used in Python and numpy, its scientific computation library. It extends how normal operations work on matrices with incompatible dimensions. "Broadcasting add" means "if you are adding two matrices but you cannot because their dimensions are not compatible, try to replicate the small one as much as needed to make it work."

We finally apply the softmax activation function and obtain the formula describing a 1-layer neural network, applied to 100 images: Y = softmax(X.W + b), where Y is the 100x10 matrix of predictions.

## 5. Theory: gradient descent

Now that our neural network produces predictions from input images, we need to measure how good they are, i.e. the distance between what the network tells us and what we know to be the truth. Remember that we have true labels for all the images in this dataset.

Any distance would work. The ordinary Euclidean distance is fine, but for classification problems one distance, called the "cross-entropy", is more efficient.

"One-hot" encoding means that you represent the label "6" by using a vector of 10 values, all zeros except the value at index 6 (counting from 0), which is 1. It is handy here because the format is very similar to how our neural network outputs its predictions, also as a vector of 10 values.
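
The layer formula, one-hot encoding and cross-entropy can be sketched together in plain NumPy. This is an illustrative sketch rather than the codelab's TensorFlow code. The helper names `softmax`, `one_hot` and `cross_entropy` are made up for the example, the images are random pixels standing in for MNIST, and the cross-entropy uses its usual definition: minus the sum of each true label value multiplied by the logarithm of the predicted value.

```python
import numpy as np

def softmax(logits):
    # Exponentiate each row, then divide by the row sum.
    # Subtracting the row maximum first avoids overflow without changing the result.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def one_hot(labels, num_classes=10):
    # Label 6 becomes [0, 0, 0, 0, 0, 0, 1, 0, 0, 0].
    encoded = np.zeros((labels.size, num_classes))
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

def cross_entropy(predictions, targets):
    # One loss value per image; the small constant guards against log(0).
    return -np.sum(targets * np.log(predictions + 1e-12), axis=1)

rng = np.random.default_rng(0)
X = rng.random((100, 784))               # stand-in mini-batch: 100 flattened 28x28 images
W = np.zeros((784, 10))                  # one column of weights per neuron
b = np.zeros(10)                         # one bias per neuron
labels = rng.integers(0, 10, size=100)   # stand-in true labels

Y = softmax(X @ W + b)                   # b is broadcast over the 100 rows
Y_true = one_hot(labels)
loss = cross_entropy(Y, Y_true)

print(Y.shape, loss.mean())
```

With all weights and biases at zero, every class gets the same probability, so the loss is identical for every image. Training consists of moving W and b away from this starting point so that the loss goes down.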
"Training" the neural network actually means using training images and labels to adjust weights and biases so as to minimise the cross-entropy loss function. Here is how it works. The cross-entropy is a function of weights, biases, pixels of the training image and its known label. If we compute the partial derivatives of the cross-entropy relatively to all the weights and all the biases we obtain a "gradient", computed for a given image, label and present value of weights and biases. Remember that we have 7850 weights and biases so computing the gradient sounds like a lot of work. Fortunately, TensorFlow will do it for us. The mathematical property of a gradient is that it points "up". Since we want to go where the cross-entropy is low, we go in the opposite direction. We update weights and biases by a fraction of the gradient and do the same thing again using the next batch of training images. Hopefully, this gets us to the bottom of the pit where the cross-entropy is minimal. In this picture, cross-entropy is represented as a function of 2 weights. In reality, there are many more. The gradient descent algorithm follows the path of steepest descent into a local minimum. The training images are changed at each iteration too so that we converge towards a local minimum that works for all images. "Learning rate": you cannot update your weights and biases by the whole length of the gradient at each iteration. It would be like trying to get to the bottom of a valley while wearing seven-league boots. You would be jumping from one side of the valley to the other. To get to the bottom, you need to do smaller steps, i.e. use only a fraction of the gradient, typically in the 1/1000th region. We call this fraction the "learning rate". To sum it up, here is how the training loop looks like:

Why work with "mini-batches" of 100 images and labels? You can definitely compute your gradient on just one example image and update the weights and biases immediately (this is called "stochastic gradient descent" in the scientific literature). Doing so on 100 examples gives a gradient that better represents the constraints imposed by different example images and is therefore likely to converge towards the solution faster. The size of the mini-batch is an adjustable parameter though. There is another, more technical reason: working with batches also means working with larger matrices, and these are usually easier to optimise on GPUs.

## Frequently Asked Questions