### Handwritten digit recognition with Convolutional Neural Networks

Requirements:

Torch5 machine learning library
Lua language interpreter (which comes with torch5).

Credits

David Grangier suggested this problem to me as a way of getting familiar with the torch5 library (during my 2010 summer internship at NEC Labs).

Files:

data: usps.zip
code: usps_cnn.lua

Description

For example, the images of zeros and eights look like this:

Each image contains a total of 1100 examples of one digit, organized in a grid of 33 columns and 34 rows. Each example is 16x16 pixels. For every digit, we will use 1000 examples for training and the remaining 100 for testing.
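The layout described above can be turned into pixel coordinates with a little arithmetic. The following is a minimal sketch in plain Lua (no torch5 calls). It assumes the examples are stored row by row, starting at the top-left corner, and that the first 1000 examples of each digit are the training ones; neither assumption is stated in the data description. Note that a 33 x 34 grid has 1122 cells, so under this assumption the last cells are unused. The names `tile_origin` and `split_indices` are made up for this illustration.

```lua
TILE_SIZE = 16   -- each example is 16x16 pixels
GRID_COLUMNS = 33 -- examples per row of the big image

-- Return the 1-based (x, y) pixel position of the top-left corner
-- of example number k (1-based), assuming row-major storage.
function tile_origin(k)
  local row = math.floor((k - 1) / GRID_COLUMNS)
  local col = (k - 1) % GRID_COLUMNS
  return col * TILE_SIZE + 1, row * TILE_SIZE + 1
end

-- Split example indices 1..nb_examples into a training list (the first
-- nb_train indices, an assumption) and a test list (the rest).
function split_indices(nb_examples, nb_train)
  local train, test = {}, {}
  for k = 1, nb_examples do
    if k <= nb_train then
      train[#train + 1] = k
    else
      test[#test + 1] = k
    end
  end
  return train, test
end

local train, test = split_indices(1100, 1000)
print(#train, #test)
print(tile_origin(34)) -- first example of the second row
```
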

The first layer of the network is a set of local filters applied convolutionally across the image. A sub-sampling layer follows, which reduces the data dimensionality and introduces a bit of translation invariance. Next comes a non-linear transfer function (hyperbolic tangent), which keeps the responses of that layer bounded in [-1,1]. We then have a linear layer with 10 outputs (one for each digit). Finally, a SoftMax operation maps these outputs to values in [0,1] that sum to 1 over all classes, so they can be interpreted as probabilities (of the input belonging to each class). Here we actually use a LogSoftMax, which gives log-probabilities.
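To make the last step concrete, here is a small sketch of a log-softmax written in plain Lua on an ordinary table of scores. It is only an illustration of the formula log p_i = s_i - log(sum_j exp(s_j)); it is not the torch5 implementation, and the function name `log_softmax` is chosen for this example. Subtracting the maximum score first is a common trick to avoid overflow in `math.exp`.

```lua
-- Log-softmax of a plain Lua array of scores.
-- Returns a new array of log-probabilities.
function log_softmax(scores)
  -- find the largest score so the exponentials cannot overflow
  local max = scores[1]
  for i = 2, #scores do
    if scores[i] > max then max = scores[i] end
  end
  -- log of the sum of exponentials, computed in a stable way
  local sum = 0
  for i = 1, #scores do
    sum = sum + math.exp(scores[i] - max)
  end
  local log_sum = max + math.log(sum)
  -- subtract it from every score
  local out = {}
  for i = 1, #scores do
    out[i] = scores[i] - log_sum
  end
  return out
end

-- The exponentials of the result are probabilities that sum to 1.
local logp = log_softmax({1, 2, 3})
local total = 0
for i = 1, #logp do total = total + math.exp(logp[i]) end
print(total)
```
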

Creating the neural network architecture in torch5 is quite simple:

`function create_network(nb_outputs)`

`local ann = nn.Sequential();  `
`                      -- input is 16x16x1`
`ann:add(nn.SpatialConvolution(1,6,5,5))   -- becomes  12x12x6`
`ann:add(nn.SpatialSubSampling(6,2,2,2,2)) -- becomes  6x6x6 `

`ann:add(nn.Reshape(6*6*6))`
`ann:add(nn.Tanh())`
`ann:add(nn.Linear(6*6*6,nb_outputs))`
`ann:add(nn.LogSoftMax())`

`return ann`
`end`
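The sizes in the comments of the code above follow from the usual rule for a filter applied without padding: output size = (input size - filter size) / stride + 1. The sketch below checks that arithmetic in plain Lua; `output_size` is a helper written for this example, not a torch5 function.

```lua
-- Size of one spatial dimension after applying a filter of width
-- kernel_size with the given stride and no padding.
function output_size(input_size, kernel_size, stride)
  stride = stride or 1
  return math.floor((input_size - kernel_size) / stride) + 1
end

local after_conv = output_size(16, 5)            -- 5x5 filters, stride 1: 12
local after_sub = output_size(after_conv, 2, 2)  -- 2x2 sub-sampling, stride 2: 6
local nb_maps = 6                                -- number of feature maps

-- number of inputs of the linear layer: 6*6*6 = 216
print(after_conv, after_sub, after_sub * after_sub * nb_maps)
```

This is why the `Reshape` and `Linear` layers both use `6*6*6`: six feature maps of 6x6 values each.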

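The negative log likelihood criterion is our loss function for the multi-class problem, and we train the network using stochastic gradient descent, picking one training example at random at each step (`maxIterations` and `learningRate` are settings defined outside this function):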

`function train_network( network, dataset)`
`print( "Training network" )`
`local criterion = nn.ClassNLLCriterion()`
`for iteration=1,maxIterations do`
`local index = 1 + random.random() % dataset:size() -- pick example at random`
`local input = dataset[index][1]`
`local output = dataset[index][2]`
`criterion:forward(network:forward(input), output)`

`network:zeroGradParameters()`
`network:backward(input, criterion:backward(network.output, output))`
`network:updateParameters(learningRate)`
`end`
`end`
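To see what one step of this loop does numerically, the sketch below works on a bare array of class scores in plain Lua, without any network. It relies on a standard result: the gradient of the negative log likelihood of a log-softmax with respect to the scores is the softmax output minus a one-hot vector for the target class. All function names here are made up for the illustration, and the update is applied directly to the scores rather than to network parameters, which is a simplification.

```lua
-- Log-softmax of an array of scores (stable version).
function log_softmax(scores)
  local max = scores[1]
  for i = 2, #scores do max = math.max(max, scores[i]) end
  local sum = 0
  for i = 1, #scores do sum = sum + math.exp(scores[i] - max) end
  local out = {}
  for i = 1, #scores do out[i] = scores[i] - max - math.log(sum) end
  return out
end

-- Negative log likelihood of the target class.
function nll(logp, target)
  return -logp[target]
end

-- Gradient of nll(log_softmax(scores), target) w.r.t. the scores:
-- softmax(scores) minus a one-hot vector at the target.
function nll_grad(scores, target)
  local logp = log_softmax(scores)
  local grad = {}
  for i = 1, #scores do
    grad[i] = math.exp(logp[i]) - (i == target and 1 or 0)
  end
  return grad
end

-- One gradient descent step on the scores themselves.
function sgd_step(scores, target, learning_rate)
  local grad = nll_grad(scores, target)
  local updated = {}
  for i = 1, #scores do
    updated[i] = scores[i] - learning_rate * grad[i]
  end
  return updated
end

local scores = {0, 0, 0}
print(nll(log_softmax(scores), 2))                       -- loss before the step
print(nll(log_softmax(sgd_step(scores, 2, 0.5)), 2))     -- smaller loss after it
```
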

Performance

In this setup, I obtained test errors of around 3% to 4%. Note that training the network with stochastic gradient descent gives slightly different results on each run, because the next training sample is selected at random.
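A test error like the one above is just the fraction of test examples whose most likely class differs from the true label. Here is a minimal sketch in plain Lua, assuming the network outputs for the test set have already been collected as plain arrays of log-probabilities; `argmax` and `error_rate` are names chosen for this example. If all ten digits are used with 100 test examples each, an error of 3% corresponds to about 30 misclassified images.

```lua
-- Index of the largest value in an array (first one in case of ties).
function argmax(values)
  local best = 1
  for i = 2, #values do
    if values[i] > values[best] then best = i end
  end
  return best
end

-- Fraction of examples whose predicted class differs from the label.
-- outputs: array of arrays of log-probabilities, one per example
-- labels:  array of true class indices (1-based)
function error_rate(outputs, labels)
  local errors = 0
  for n = 1, #outputs do
    if argmax(outputs[n]) ~= labels[n] then
      errors = errors + 1
    end
  end
  return errors / #outputs
end

-- Toy data with two classes: the third example is misclassified.
local outputs = { {-0.1, -2.4}, {-3.0, -0.05}, {-0.2, -1.7} }
local labels = { 1, 2, 2 }
print(error_rate(outputs, labels))
```
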

Possible improvements:

I tried to keep this tutorial as simple as possible, but one can expand it in several directions:

Speeding-up training:
• Normalize data before training
• Make the learning rate depend on the number of neurons in each layer
Model selection:
• Use a validation dataset to avoid over-fitting
Visualization:
• Show current error during training
• Display misclassified examples at test time
Parameter sharing:
• Imagine you now want to recognize letters. One can jointly train two networks (one for digits and another for letters) and share the first layer parameters of the networks. This is really easy in torch5, using the clone and share methods.
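As an illustration of the first suggestion in the list above (normalizing the data), here is a minimal sketch in plain Lua that shifts and scales the pixel values of one image to zero mean and unit standard deviation. Normalizing each image separately is only one simple option and an assumption of this sketch; computing the statistics over the whole training set is another. The function name `normalize` is made up for this example.

```lua
-- Return a copy of a flat array of pixel values with zero mean and
-- unit standard deviation. A constant image is mapped to all zeros.
function normalize(pixels)
  local n = #pixels
  local mean = 0
  for i = 1, n do mean = mean + pixels[i] end
  mean = mean / n

  local variance = 0
  for i = 1, n do
    variance = variance + (pixels[i] - mean) ^ 2
  end
  local std = math.sqrt(variance / n)
  if std == 0 then std = 1 end -- avoid dividing by zero

  local out = {}
  for i = 1, n do
    out[i] = (pixels[i] - mean) / std
  end
  return out
end

local normalized = normalize({0, 255, 0, 255})
print(normalized[1], normalized[2])
```
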

Y. LeCun, L. Bottou, G. Orr and K. Müller: Efficient BackProp, in G. Orr and K. Müller (Eds.), Neural Networks: Tricks of the Trade, Springer, 1998.