
Tensorflow and Deep Learning part 3

11. Theory: convolutional networks

In a layer of a convolutional network, one "neuron" does a weighted sum of the pixels just above it, across a small region of the image only. It then acts normally by adding a bias and feeding the result through its activation function. The big difference is that each neuron reuses the same weights, whereas in the fully-connected networks seen previously each neuron had its own set of weights.

In the animation above, you can see that by sliding the patch of weights across the image in both directions (a convolution) you obtain as many output values as there were pixels in the image (some padding is necessary at the edges though).
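The sliding-patch idea can be sketched in plain Python. This is only an illustration under assumptions: a single-channel image, a hypothetical 2x2 patch, zero padding at the edges, a made-up bias value and ReLU as the activation. It is not the code used in the lab.

```python
def conv2d_same(image, patch):
    """Slide `patch` over `image` (lists of lists); pixels outside the image count as zero."""
    h, w = len(image), len(image[0])
    ph, pw = len(patch), len(patch[0])
    top, left = (ph - 1) // 2, (pw - 1) // 2  # padding added above and to the left
    output = []
    for oy in range(h):
        row = []
        for ox in range(w):
            total = 0.0
            for dy in range(ph):
                for dx in range(pw):
                    y, x = oy + dy - top, ox + dx - left
                    if 0 <= y < h and 0 <= x < w:
                        total += image[y][x] * patch[dy][dx]
            row.append(total)
        output.append(row)
    return output

image = [[float(x + y) for x in range(6)] for y in range(6)]  # toy 6x6 "image"
patch = [[0.25, 0.25], [0.25, 0.25]]  # the same weights are reused at every position
bias = -2.0                           # hypothetical value, for illustration only

summed = conv2d_same(image, patch)
activated = [[max(0.0, v + bias) for v in row] for row in summed]  # bias, then ReLU
```

The output plane has one value per input pixel (6x6 here), and only four weights plus one bias were needed to produce all of them.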

To generate one plane of output values using a patch size of 4x4 and a color image as the input, as in the animation, we need 4x4x3=48 weights. That is not enough. To add more degrees of freedom, we repeat the same thing with a different set of weights.

The two (or more) sets of weights can be rewritten as one by adding a dimension to the tensor, and this gives us the generic shape of the weights tensor for a convolutional layer. Since the numbers of input and output channels are parameters, we can start stacking and chaining convolutional layers.
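The weight arithmetic above can be checked with a few lines of plain Python. The helper names are hypothetical; the shape ordering (patch height, patch width, input channels, output channels) follows the [4, 4, 3, 2] tensor used in the lab code of the next section.

```python
def conv_weights_shape(patch_height, patch_width, in_channels, out_channels):
    """Shape of the weights tensor of one convolutional layer."""
    return [patch_height, patch_width, in_channels, out_channels]

def count_values(shape):
    """Number of scalar weights held by a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count

one_plane = conv_weights_shape(4, 4, 3, 1)   # one output plane from a color image
two_planes = conv_weights_shape(4, 4, 3, 2)  # the same thing with two sets of weights

weights_one = count_values(one_plane)    # 4 * 4 * 3 * 1
weights_two = count_values(two_planes)   # 4 * 4 * 3 * 2
```

Each additional output channel costs another 48 weights in this example, regardless of the size of the image.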

One last issue remains. We still need to boil the information down. In the last layer, we still want only 10 neurons for our 10 classes of digits. Traditionally, this was done by a "max-pooling" layer. Even if there are simpler ways today, "max-pooling" helps understand intuitively how convolutional networks operate: if you assume that during training, our little patches of weights evolve into filters that recognise basic shapes (horizontal and vertical lines, curves, ...) then one way of boiling useful information down is to keep through the layers the outputs where a shape was recognised with the maximum intensity. In practice, in a max-pool layer, neuron outputs are processed in groups of 2x2 and only the largest one is retained.
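As a rough illustration of 2x2 max-pooling, here is a plain Python sketch on a single hypothetical 4x4 plane of neuron outputs. The function name and the sample values are made up for the example.

```python
def max_pool_2x2(plane):
    """Keep the largest value of each non-overlapping 2x2 group."""
    pooled = []
    for y in range(0, len(plane) - 1, 2):
        row = []
        for x in range(0, len(plane[0]) - 1, 2):
            group = (plane[y][x], plane[y][x + 1],
                     plane[y + 1][x], plane[y + 1][x + 1])
            row.append(max(group))
        pooled.append(row)
    return pooled

plane = [[1, 3, 2, 0],
         [4, 2, 1, 1],
         [0, 0, 5, 6],
         [1, 2, 7, 8]]

pooled = max_pool_2x2(plane)  # [[4, 2], [2, 8]]
```

The 16 input values are reduced to 4, one per group, which is how the layer boils the information down.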

There is a simpler way though: if you slide the patches across the image with a stride of 2 pixels instead of 1, you also obtain fewer output values. This approach has proven just as effective and many of today's convolutional networks use convolutional layers only.

Let us build a convolutional network for handwritten digit recognition. We will use three convolutional layers at the top, our traditional softmax readout layer at the bottom and connect them with one fully-connected layer.

Notice that the second and third convolutional layers have a stride of two which explains why they bring the number of output values down from 28x28 to 14x14 and then 7x7. The sizing of the layers is done so that the number of neurons goes down roughly by a factor of two at each layer: 28x28x4≈3000 → 14x14x8≈1500 → 7x7x12≈500 → 200. Jump to the next section for the implementation.
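The layer sizes quoted above can be reproduced with a short calculation. This sketch assumes 'SAME'-style padding, where the output size is the input size divided by the stride and rounded up. The strides and channel counts are the ones given in the text; the helper name is hypothetical.

```python
import math

def same_output_size(input_size, stride):
    """Output width (or height) of a padded convolution with the given stride."""
    return math.ceil(input_size / stride)

size = 28  # MNIST images are 28x28 pixels
layers = [(1, 4), (2, 8), (2, 12)]  # (stride, output channels) for the three layers

neuron_counts = []
for stride, channels in layers:
    size = same_output_size(size, stride)
    neuron_counts.append(size * size * channels)

# neuron_counts is [3136, 1568, 588], i.e. roughly 3000, 1500 and 500
```

The fully-connected layer of 200 neurons then continues the progression down to the 10 output classes.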

12. Lab: a convolutional network

To switch our code to a convolutional model, we need to define appropriate weights tensors for the convolutional layers and then add the convolutional layers to the model.

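We have seen that a convolutional layer requires a weights tensor of shape [patch height, patch width, input channels, output channels]. Here is the TensorFlow syntax for initialising the weights and biases:

# Weights shape: [patch height, patch width, input channels, output channels]
W = tf.Variable(tf.truncated_normal([4, 4, 3, 2], stddev=0.1))
B = tf.Variable(tf.ones([2])/10)  # one bias per output channel (2 here)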

Convolutional layers can be implemented in TensorFlow using the tf.nn.conv2d function which performs the scanning of the input image in both directions using the supplied weights. This is only the weighted sum part of the neuron. You still need to add a bias and feed the result through an activation function.

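stride = 1  # output is still 28x28
# strides are given per input dimension: [batch, height, width, channels]
Ycnv = tf.nn.conv2d(X, W, strides=[1, stride, stride, 1], padding='SAME')
Y = tf.nn.relu(Ycnv + B)  # add the bias, then apply the activation function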

Do not pay too much attention to the complex syntax for the stride. Look up the documentation for full details. With padding='SAME', TensorFlow pads the edges of the image with zeros so that, with a stride of 1, the output keeps the size of the input. Zero is also the background value of the MNIST digits, so this just extends the uniform background and should not add any unwanted shapes.
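To make the stride syntax a little less mysterious, here is a plain Python sketch of how the four-element strides list lines up with the [batch, height, width, channels] layout of the input, and of the output shape to expect with 'SAME' padding. The function name and the batch size of 100 are assumptions made for the example; this does not call TensorFlow.

```python
import math

def conv2d_output_shape(input_shape, weights_shape, stride):
    """Strides list and output shape for a 'SAME'-padded convolution.

    input_shape is [batch, height, width, channels];
    weights_shape is [patch height, patch width, input channels, output channels].
    """
    batch, height, width, in_channels = input_shape
    _, _, w_in, w_out = weights_shape
    if in_channels != w_in:
        raise ValueError("input channels do not match the weights tensor")
    strides = [1, stride, stride, 1]  # never skip images or channels, only pixels
    out_shape = [batch, math.ceil(height / stride), math.ceil(width / stride), w_out]
    return strides, out_shape

strides1, shape1 = conv2d_output_shape([100, 28, 28, 3], [4, 4, 3, 2], stride=1)
strides2, shape2 = conv2d_output_shape([100, 28, 28, 3], [4, 4, 3, 2], stride=2)
```

With a stride of 1 the spatial size stays at 28x28; with a stride of 2 it drops to 14x14, while the last dimension always becomes the number of output channels.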

Your turn to play. Modify your model to turn it into a convolutional model. You can use the values from the drawing above to size it. You can keep your learning rate decay as it was but please remove dropout at this point.
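One detail that is easy to miss when wiring the model: the last convolutional layer produces a 7x7x12 block of values per image, which has to be flattened before it can feed the fully-connected layer of 200 neurons and the 10-neuron softmax layer. The plain Python sketch below only works out the shapes involved; the helper name is hypothetical and it is not the lab's solution.

```python
def dense_layer_shapes(layer_sizes):
    """(weights shape, biases shape) for each pair of consecutive layer sizes."""
    return [([n_in, n_out], [n_out])
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]

flattened = 7 * 7 * 12  # values per image coming out of the third convolutional layer

shapes = dense_layer_shapes([flattened, 200, 10])
# shapes is [([588, 200], [200]), ([200, 10], [10])]

parameter_count = sum(w[0] * w[1] + b[0] for w, b in shapes)
```

Most of the parameters of this part of the network sit in the first weights matrix, of shape [588, 200].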

The solution can be found in the lab's solution file. Use it if you are stuck.

Your model should break the 98% barrier comfortably and end up just a hair under 99%. We cannot stop so close! Look at the test cross-entropy curve. Does a solution spring to mind?

13. Lab: the 99% challenge

A good approach to sizing your neural networks is to implement a network that is a little too constrained, then give it a few more degrees of freedom and add dropout to make sure it is not overfitting. This ends up with a fairly optimal network for your problem.

Here, for example, we used only 4 patches in the first convolutional layer. If you accept that those patches of weights evolve during training into shape recognisers, you can intuitively see that this might not be enough for our problem. Handwritten digits are made from more than 4 elemental shapes.

So let us bump up the patch sizes a little, increase the number of patches in our convolutional layers from 4, 8, 12 to 6, 12, 24 and then add dropout on the fully-connected layer. Why not on the convolutional layers? Their neurons reuse the same weights, so dropout, which effectively works by freezing some weights during one training iteration, would not work on them.
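As a reminder of what dropout does to the outputs of a layer, here is a minimal plain Python sketch of "inverted" dropout: each value is zeroed with probability 1 - pkeep and the survivors are scaled up by 1/pkeep so that the expected sum is unchanged. The names `dropout` and `pkeep`, the keep probability of 0.75 and the layer size of 200 are assumptions chosen for the example.

```python
import random

def dropout(activations, pkeep, rng):
    """Zero each value with probability 1 - pkeep and rescale the ones that remain."""
    return [a / pkeep if rng.random() < pkeep else 0.0 for a in activations]

rng = random.Random(0)     # fixed seed so that the example is repeatable
activations = [1.0] * 200  # stand-in for the outputs of the fully-connected layer

dropped = dropout(activations, pkeep=0.75, rng=rng)
kept = sum(1 for v in dropped if v != 0.0)  # number of neurons that survived

unchanged = dropout(activations, pkeep=1.0, rng=rng)  # pkeep=1.0 disables dropout
```

During evaluation on the test set, the keep probability is set to 1.0 so that no neuron is dropped.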

Go for it and break the 99% limit. Increase the patch sizes and channel numbers as on the picture above and add dropout on the fully-connected layer.

The solution can be found in the lab's solution file. Use it if you are stuck.

The model pictured above misses only 72 out of the 10,000 test digits. The world record, which you can find on the MNIST website, is around 99.7%. We are only 0.4 percentage points away from it with our model built with 100 lines of Python / TensorFlow.
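The figures above follow from a quick calculation, shown here as a small Python sketch using only the numbers quoted in the text.

```python
test_digits = 10_000
missed = 72

accuracy = 100.0 * (test_digits - missed) / test_digits  # in percent
world_record = 99.7                                      # approximate figure from the text
gap = world_record - accuracy                            # in percentage points
```

72 misses correspond to an accuracy of 99.28%, which leaves a gap of about 0.4 percentage points.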

To finish, here is the difference dropout makes to our bigger convolutional network. Giving the neural network the additional degrees of freedom it needed bumped the final accuracy from 98.9% to 99.1%. Adding dropout not only tamed the test loss but also allowed us to sail safely above 99% and even reach 99.3%.


You have built your first neural network and trained it all the way to 99% accuracy. The techniques learned along the way are not specific to the MNIST dataset; they are very widely used when working with neural networks. As a parting gift, here is the "cliff's notes" card for the lab, in cartoon version. You can use it to recall what you have learned:

Next steps

  • After fully-connected and convolutional networks, you should have a look at recurrent neural networks.
  • In this tutorial, you have learned how to build a TensorFlow model at the matrix level. TensorFlow also has higher-level APIs, called tf.learn.
  • To run your training or inference in the cloud on a distributed infrastructure, we provide the Cloud ML service.
  • Finally, we love feedback. Please tell us if you see something amiss in this lab or if you think it should be improved. We handle feedback through GitHub issues [feedback link].

The author: Martin Görner
Twitter: @martin_gorner

All cartoon images in this lab copyright: alexpokusay / 123RF stock photos