
Deep Learning with Torch: the 60-minute blitz


Goal of this talk

  • Understand torch and the neural networks package at a high-level.
  • Train a small neural network on CPU and GPU

What is Torch?

Torch is a scientific computing framework based on Lua[JIT] with strong CPU and CUDA backends.

Strong points of Torch:

  • Efficient Tensor library (like NumPy) with an efficient CUDA backend
  • Neural Networks package -- build arbitrary acyclic computation graphs with automatic differentiation
    • also with fast CUDA and CPU backends
  • Good community and industry support - several hundred community-built and maintained packages.
  • Easy-to-use multi-GPU support and parallelization of neural networks

http://torch.ch

https://github.com/torch/torch7/wiki/Cheatsheet

Before getting started

* Based on Lua and runs on LuaJIT (a just-in-time compiler), which is fast.

* Lua is pretty close to JavaScript.

   * variables are global by default, unless `local` keyword is used

* Only has one data structure built-in, a table: `{}`. Doubles as a hash-table and an array.

* 1-based indexing.

* `foo:bar()` is the same as `foo.bar(foo)`
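
As a tiny sketch (plain Lua, no Torch needed), these points look like this:

x = 10              -- global by default
local y = 20        -- local to the enclosing block
t = {}              -- a table: doubles as a hash-table and an array
t[1] = 'first'      -- arrays are 1-based
t.name = 'demo'     -- the same table used as a hash-table
function t:greet() print('hello ' .. self.name) end
t:greet()           -- same as t.greet(t); prints 'hello demo'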



Getting Started

Strings, numbers, tables - a tiny introduction

In [1]: a = 'hello'

In [2]: print(a)

Out[2]:hello

In [3]:b = {}

In [4]:b[1] = a

In [5]:print(b)

Out[5]:

{

  1 : hello

}

In [6]: b[2] = 30

In [7]: for i=1,#b do -- the # operator is the length operator in Lua

    print(b[i]) 

end

Out[7]:

hello

30

Tensors

In [8]:a = torch.Tensor(5,3) -- construct a 5x3 matrix, uninitialized

In [9]: a = torch.rand(5,3); print(a)

Out[9]:

 0.7685  0.5678  0.0411

 0.4389  0.5611  0.2613

 0.5992  0.9157  0.9524

 0.3407  0.5547  0.2985

 0.1578  0.1507  0.1953

[torch.DoubleTensor of size 5x3]

In [10]: b=torch.rand(3,4)

In [11]: a*b  -- matrix-matrix multiplication: syntax 1

Out[11]:

 0.1736  0.8714  0.8363  0.2541

 0.3021  0.5968  0.5468  0.4308

 0.8966  0.9450  0.8224  1.1489

 0.3174  0.5094  0.4575  0.4577

 0.1868  0.2163  0.1927  0.2236

[torch.DoubleTensor of size 5x4]

In [12]: torch.mm(a,b) -- matrix-matrix multiplication: syntax 2

 Out[12]:

 0.1736  0.8714  0.8363  0.2541

 0.3021  0.5968  0.5468  0.4308

 0.8966  0.9450  0.8224  1.1489

 0.3174  0.5094  0.4575  0.4577

 0.1868  0.2163  0.1927  0.2236

[torch.DoubleTensor of size 5x4]

In [13]: 

-- matrix-matrix multiplication: syntax 3

c=torch.Tensor(5,4)

c:mm(a,b) -- store the result of a*b in c

CUDA Tensors

Tensors can be moved onto the GPU using the :cuda() function.

In [14]: 

require 'cutorch';

a = a:cuda()

b = b:cuda()

c = c:cuda()

c:mm(a,b) -- done on GPU

Exercise: Add two tensors

a = torch.ones(5,2)

b = torch.Tensor(2,5):fill(4)

print(a+b)


Out[16]:

 1  1

 1  1

 1  1

 1  1

 1  1

[torch.DoubleTensor of size 5x2]
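
Since a is 5x2 and b is 2x5, their shapes do not line up directly. One possible way to complete the exercise (just a sketch) is to transpose b first:

print(a + b:t()) -- b:t() is the 5x2 transpose of b; every element of the sum is 1 + 4 = 5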

Neural Networks

Neural networks in Torch can be constructed using the nn package.

In [17]:require 'nn';

Modules are the bricks used to build neural networks. Each module is itself a neural network, but modules can be combined with containers to create more complex networks. For example, look at this network that classifies digit images. It is a simple feed-forward network: it takes the input, feeds it through several layers one after the other, and finally gives the output. One such container is nn.Sequential, which feeds the input through its layers in order.

net = nn.Sequential()

net:add(nn.SpatialConvolution(1, 6, 5, 5)) -- 1 input image channel, 6 output channels, 5x5 convolution kernel

net:add(nn.ReLU())                       -- non-linearity 

net:add(nn.SpatialMaxPooling(2,2,2,2))     -- A max-pooling operation that looks at 2x2 windows and finds the max. non-overlapping maxpooling

net:add(nn.SpatialConvolution(6, 16, 5, 5))

net:add(nn.ReLU())                       -- non-linearity 

net:add(nn.SpatialMaxPooling(2,2,2,2))

net:add(nn.View(16*5*5))                    -- reshapes from a 3D tensor of 16x5x5 into 1D tensor of  size 16*5*5

net:add(nn.Linear(16*5*5, 120))             -- fully connected layer (matrix multiplication between input and weights)

net:add(nn.ReLU())                       -- non-linearity 

net:add(nn.Linear(120, 84))

net:add(nn.ReLU())                       -- non-linearity 

net:add(nn.Linear(84, 10))                   -- 10 is the number of outputs of the network (in this case, 10 digits)

net:add(nn.LogSoftMax())                     -- converts the output to a log-probability. Useful for classification problems


print('Lenet5\n' .. net:__tostring());

Out[18]:

Lenet5

nn.Sequential {

  [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> output]

  (1): nn.SpatialConvolution(1 -> 6, 5x5)

  (2): nn.ReLU

  (3): nn.SpatialMaxPooling(2x2, 2,2)

  (4): nn.SpatialConvolution(6 -> 16, 5x5)

  (5): nn.ReLU

  (6): nn.SpatialMaxPooling(2x2, 2,2)

  (7): nn.View(400)

  (8): nn.Linear(400 -> 120)

  (9): nn.ReLU

  (10): nn.Linear(120 -> 84)

  (11): nn.ReLU

  (12): nn.Linear(84 -> 10)

  (13): nn.LogSoftMax

}


Other examples of nn containers (such as nn.Parallel and nn.Concat) exist as well.


Every neural network module in torch has automatic differentiation. It has a :forward(input) function that computes the output for a given input, flowing the input through the network, and a :backward(input, gradient) function that differentiates each neuron in the network w.r.t. the gradient that is passed in. This is done via the chain rule.

In [19]:input = torch.rand(1,32,32) -- pass a random tensor as input to the network

In [20]:output = net:forward(input)

In [21]:print(output)

Out[21]:

-2.3226

-2.3660

-2.2903

-2.2924

-2.4075

-2.3529

-2.3763

-2.1950

-2.2143

-2.2326

[torch.DoubleTensor of size 10]

In [22]:net:zeroGradParameters() -- zero the internal gradient buffers of the network (will come to this later)

In [23]:gradInput = net:backward(input, torch.rand(10))   -- A backpropagation step consists in computing the gradients at the input, given gradOutput (the gradients w.r.t. the output of the module). :backward performs this using two function calls: updateGradInput(input, gradOutput) and accGradParameters(input, gradOutput, scale).

In [24]: print(#gradInput)

Out[24]:

  1

 32

 32

[torch.LongStorage of size 3]
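
For the curious: :backward on an nn module is roughly equivalent to making the two lower-level calls named in the comment above yourself. A sketch (you would not normally call these directly):

gradOutput = torch.rand(10)
gradInput = net:updateGradInput(input, gradOutput)  -- gradients w.r.t. the input
net:accGradParameters(input, gradOutput, 1)         -- accumulate gradients w.r.t. the parameters (scale = 1)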

Criterion: Defining a loss function

When you want a model to learn to do something, you give it feedback on how well it is doing. This function that computes an objective measure of the model's performance is called a loss function.

A typical loss function takes in the model's output and the groundtruth and computes a value that quantifies the model's performance.

The model then corrects itself to have a smaller loss.

In torch, loss functions are implemented just like neural network modules, and have automatic differentiation.

They have two functions: :forward(input, target) and :backward(input, target).

For example:

In [25]: 

criterion = nn.ClassNLLCriterion() -- a negative log-likelihood criterion for multi-class classification

criterion:forward(output, 3) -- let's say the groundtruth was class number: 3

gradients = criterion:backward(output, 3)

In [26]:gradInput = net:backward(input, gradients)

Review of what you learnt so far

  • Network can have many layers of computation
  • Network takes an input and produces an output in the :forward pass
  • Criterion computes the loss of the network, and its gradients w.r.t. the output of the network.
  • Network takes an (input, gradients) pair in its backward pass and calculates the gradients w.r.t. each layer (and neuron) in the network.

Missing details

A neural network layer can have learnable parameters or not.

A convolution layer learns its convolution kernels to adapt to the input data and the problem being solved.

A max-pooling layer has no learnable parameters. It only finds the max of local windows.

A layer in torch that has learnable weights will typically have the fields .weight (and optionally .bias).

In [27]: 

m = nn.SpatialConvolution(1,3,2,2) -- learn 3 2x2 kernels

print(m.weight) -- initially, the weights are randomly initialized

Out[27]:

(1,1,.,.) = 

 -0.0272 -0.3090

  0.4946  0.3547

(2,1,.,.) = 

  0.4385  0.1939

  0.0417  0.1340

(3,1,.,.) = 

  0.4814  0.3931

 -0.3507  0.4167

[torch.DoubleTensor of size 3x1x2x2]

In [28]:print(m.bias) -- The operation in a convolution layer is: output = convolution(input,weight) + bias

Out[28]:

 0.3951

-0.1270

-0.0866

[torch.DoubleTensor of size 3]

There are also two other important fields in a learnable layer: gradWeight and gradBias. The gradWeight accumulates the gradients w.r.t. each weight in the layer, and the gradBias the gradients w.r.t. each bias in the layer.
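
A minimal sketch (reusing the convolution module m defined above, with a made-up random input) to see these fields get filled in:

input = torch.rand(1, 5, 5)             -- 1 channel, 5x5 image
m:zeroGradParameters()                  -- clear gradWeight and gradBias
output = m:forward(input)               -- a 2x2 convolution on a 5x5 input gives a 3x4x4 output
m:backward(input, torch.rand(3, 4, 4))  -- backpropagate a random gradOutput
print(#m.gradWeight)                    -- same shape as m.weight: 3x1x2x2
print(#m.gradBias)                      -- same shape as m.bias: 3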

Training the network

For the network to adjust itself, it typically does this operation (if you do Stochastic Gradient Descent):

weight = weight - learningRate * gradWeight [equation 1]

This update over time will adjust the network weights such that the output loss is decreasing.
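
As a sketch, equation 1 could be applied by hand to the small module m from the previous section (nn modules also provide :updateParameters(learningRate), which performs this update for every parameter of a network):

learningRate = 0.01
m.weight:add(-learningRate, m.gradWeight) -- weight = weight - learningRate * gradWeight
m.bias:add(-learningRate, m.gradBias)     -- the same update for the bias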

Okay, now it is time to discuss one missing piece. Who visits each layer in your neural network and updates the weight according to Equation 1?

There are multiple answers, but we will use the simplest answer.

We shall use the simple SGD trainer shipped with the neural networks package: nn.StochasticGradient.

It has a function :train(dataset) that takes a given dataset and simply trains your network by showing different samples from your dataset to the network.

What about data?

Generally, when you have to deal with image, text, audio or video data, you can use standard functions like image.load or audio.load to load your data into a torch.Tensor or a Lua table, as convenient.
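
For example, a minimal sketch of loading a single image with the image package (the filename here is made up):

require 'image'
img = image.load('some_image.png', 3, 'float') -- 3 channels, values in [0,1], as a FloatTensor
print(#img)                                    -- 3 x height x width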

Let us now use some simple data to train our network.

We shall use the CIFAR-10 dataset, which has the classes: 'airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'.

The images in CIFAR-10 are of size 3x32x32, i.e. 3-channel color images of 32x32 pixels in size.

The dataset has 50,000 training images and 10,000 test images in total.

We now have 5 steps left to do in training our first torch neural network

  1. Load and normalize data
  2. Define Neural Network
  3. Define Loss function
  4. Train network on training data
  5. Test network on test data.

1. Load and normalize data

Today, in the interest of time, we prepared the data beforehand as 4D torch ByteTensors of size 10000x3x32x32 (training, a subset of the full 50,000 training images) and 10000x3x32x32 (testing). Let us load the data and inspect it.

In [29]: 

require 'paths'

if (not paths.filep("cifar10torchsmall.zip")) then

    os.execute('wget -c https://s3.amazonaws.com/torch7/data/cifar10torchsmall.zip')

    os.execute('unzip cifar10torchsmall.zip')

end

trainset = torch.load('cifar10-train.t7')

testset = torch.load('cifar10-test.t7')

classes = {'airplane', 'automobile', 'bird', 'cat',

           'deer', 'dog', 'frog', 'horse', 'ship', 'truck'}

Out[29]:

Archive:  cifar10torchsmall.zip

  inflating: cifar10-test.t7         

Out[29]:  inflating: cifar10-train.t7        

In [30]:print(trainset)

Out[30]:

{

  data : ByteTensor - size: 10000x3x32x32

  label : ByteTensor - size: 10000

}

In [31]:print(#trainset.data)

Out[31]:

 10000

     3

    32

    32

[torch.LongStorage of size 4]

For fun, let us display an image:

In [32]:

itorch.image(trainset.data[100]) -- display the 100-th image in dataset

print(classes[trainset.label[100]])

Out[32]:automobile

Now, to prepare the dataset to be used with nn.StochasticGradient, a couple of things have to be done according to its documentation.

  1. The dataset has to have a :size() function.
  2. The dataset has to have a [i] index operator, so that dataset[i] returns the i-th sample in the dataset.

Both can be done quickly:

In [33]:-- ignore setmetatable for now, it is a feature beyond the scope of this tutorial. It sets the index operator.

setmetatable(trainset, 

    {__index = function(t, i) 

                    return {t.data[i], t.label[i]} 

                end}

);

trainset.data = trainset.data:double() -- convert the data from a ByteTensor to a DoubleTensor.

function trainset:size() 

    return self.data:size(1)

end

In [34]:print(trainset:size()) -- just to test

Out[34]:10000

In [35]: 

print(trainset[33]) -- load sample number 33.

itorch.image(trainset[33][1])

Out[35]:

{

  1 : DoubleTensor - size: 3x32x32

  2 : 2

}

One of the most important things you can do in conditioning your data (in general in data science or machine learning) is to make your data have a mean of 0.0 and a standard deviation of 1.0.

Let us do that as a final step of our data processing.

To do this, we introduce you to the tensor indexing operator. It is shown by example:

In [36]:redChannel = trainset.data[{ {}, {1}, {}, {}  }] -- this picks {all images, 1st channel, all vertical pixels, all horizontal pixels}

In [37]:print(#redChannel)

Out[37]:

 10000

     1

    32

    32

[torch.LongStorage of size 4]

In this indexing operator, you start with [{ }]. You can pick all elements in a dimension using {} or pick a particular element using {i}, where i is the element index. You can also pick a range of elements using {i1, i2}; for example, {3,5} gives us the 3rd, 4th and 5th elements.

Exercise: Select the 150th to 300th data elements of the data

In [38]: 

-- TODO: fill
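
One possible solution (a sketch using the range syntax just described; elements 150 through 300, inclusive):

slice = trainset.data[{ {150,300}, {}, {}, {} }] -- images 150..300, all channels, all pixels
print(#slice)                                    -- 151 x 3 x 32 x 32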

Moving back to mean-subtraction and standard-deviation based scaling, doing this operation is simple, using the indexing operator that we learnt above:

In [39]:

mean = {} -- store the mean, to normalize the test set in the future

stdv  = {} -- store the standard-deviation for the future

for i=1,3 do -- over each image channel

    mean[i] = trainset.data[{ {}, {i}, {}, {}  }]:mean() -- mean estimation

    print('Channel ' .. i .. ', Mean: ' .. mean[i])

    trainset.data[{ {}, {i}, {}, {}  }]:add(-mean[i]) -- mean subtraction

    

    stdv[i] = trainset.data[{ {}, {i}, {}, {}  }]:std() -- std estimation

    print('Channel ' .. i .. ', Standard Deviation: ' .. stdv[i])

    trainset.data[{ {}, {i}, {}, {}  }]:div(stdv[i]) -- std scaling

end

Out[39]:

Channel 1, Mean: 125.83175029297

Channel 1, Standard Deviation: 63.143400842609

Channel 2, Mean: 123.26066621094

Channel 2, Standard Deviation: 62.369209019002

Channel 3, Mean: 114.03068681641

Channel 3, Standard Deviation: 66.965808411114

As you notice, our training data is now normalized and ready to be used.


2. Time to define our neural network

Exercise: Copy the neural network from the Neural Networks section above and modify it to take 3-channel images (instead of 1-channel images as it was defined).

Hint: You only have to change the first layer; change the number 1 to 3.

Solution:

In [40]: 

net = nn.Sequential()

net:add(nn.SpatialConvolution(3, 6, 5, 5)) -- 3 input image channels, 6 output channels, 5x5 convolution kernel

net:add(nn.ReLU())                       -- non-linearity 

net:add(nn.SpatialMaxPooling(2,2,2,2))     -- A max-pooling operation that looks at 2x2 windows and finds the max.

net:add(nn.SpatialConvolution(6, 16, 5, 5))

net:add(nn.ReLU())                       -- non-linearity 

net:add(nn.SpatialMaxPooling(2,2,2,2))

net:add(nn.View(16*5*5))                    -- reshapes from a 3D tensor of 16x5x5 into 1D tensor of 16*5*5

net:add(nn.Linear(16*5*5, 120))             -- fully connected layer (matrix multiplication between input and weights)

net:add(nn.ReLU())                       -- non-linearity 

net:add(nn.Linear(120, 84))

net:add(nn.ReLU())                       -- non-linearity 

net:add(nn.Linear(84, 10))                   -- 10 is the number of outputs of the network (in this case, 10 digits)

net:add(nn.LogSoftMax())                     -- converts the output to a log-probability. Useful for classification problems

3. Let us define the Loss function

Let us use a Log-likelihood classification loss. It is well suited for most classification problems.

In [41]:criterion = nn.ClassNLLCriterion()

4. Train the neural network

This is when things start to get interesting.

Let us first define an nn.StochasticGradient object. Then we will give our dataset to this object's :train function, and that will get the ball rolling.

In [42]: 

trainer = nn.StochasticGradient(net, criterion)

trainer.learningRate = 0.001

trainer.maxIteration = 5 -- just do 5 epochs of training.

In [43]:trainer:train(trainset)

Out[43]:

# StochasticGradient: training

# current error = 2.1450560652679

# current error = 1.8109001357508

# current error = 1.6416001125639

# current error = 1.5412038401354

# current error = 1.4593692280714

# StochasticGradient: you have reached the maximum number of iterations

# training error = 1.4593692280714

5. Test the network, print accuracy

We have trained the network for 5 passes over the training dataset.

But we need to check if the network has learnt anything at all.

We will check this by predicting the class label that the neural network outputs, and checking it against the ground-truth. If the prediction is correct, we add the sample to the list of correct predictions.

Okay, first step. Let us display an image from the test set to get familiar.

In [44]: 

print(classes[testset.label[100]])

itorch.image(testset.data[100])

Out[44]:horse

Now that we are done with that, let us normalize the test data with the mean and standard-deviation from the training data.

In [45]:

testset.data = testset.data:double()   -- convert from Byte tensor to Double tensor

for i=1,3 do -- over each image channel

    testset.data[{ {}, {i}, {}, {}  }]:add(-mean[i]) -- mean subtraction    

    testset.data[{ {}, {i}, {}, {}  }]:div(stdv[i]) -- std scaling

end

In [46]: 

-- for fun, print the mean and standard-deviation of example-100

horse = testset.data[100]

print(horse:mean(), horse:std())

Out[46]:0.59066009532189 1.0665356205025

Okay, now let us see what the neural network thinks these examples above are:

In [47]: 

print(classes[testset.label[100]])

itorch.image(testset.data[100])

predicted = net:forward(testset.data[100])

Out[47]:horse

In [48]:print(predicted:exp()) -- the output of the network is Log-Probabilities. To convert them to probabilities, you have to take e^x 

Out[48]:

 0.0152

 0.0131

 0.0707

 0.0862

 0.0815

 0.0888

 0.0725

 0.4044

 0.0058

 0.1618

[torch.DoubleTensor of size 10]

You can see the network predictions. The network assigned a probability to each class, given the image.

To make it clearer, let us tag each probability with its class name:

In [49]: 

for i=1,predicted:size(1) do

    print(classes[i], predicted[i])

end



Out[49]:

airplane 0.015171323533291

automobile 0.013053149574058

bird 0.070705943251212

cat 0.08624462169525

deer 0.081496902509075

dog 0.088808268367215

frog 0.072485396958548

horse 0.40441363678814

ship 0.0058456059974606

truck 0.16177515132574

Alright, fine. The network got this one example right, but how many in total seem to be correct over the test set?

In [50]: 

correct = 0

for i=1,10000 do

    local groundtruth = testset.label[i]

    local prediction = net:forward(testset.data[i])

    local confidences, indices = torch.sort(prediction, true) -- true means sort in descending order

    if groundtruth == indices[1] then

        correct = correct + 1

    end

end

In [51]:print(correct, 100*correct/10000 .. ' % ')

Out[51]:4481 44.81 % 

That looks waaay better than chance, which is 10% accuracy (randomly picking a class out of 10 classes). Seems like the network learnt something.

Hmmm, what are the classes that performed well, and the classes that did not perform well:

In [52]: 

class_performance = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0}

for i=1,10000 do

    local groundtruth = testset.label[i]

    local prediction = net:forward(testset.data[i])

    local confidences, indices = torch.sort(prediction, true) -- true means sort in descending order

    if groundtruth == indices[1] then

        class_performance[groundtruth] = class_performance[groundtruth] + 1

    end

end

In [53]: 

for i=1,#classes do

    print(classes[i], 100*class_performance[i]/1000 .. ' %')

end

Out[53]:

airplane 28 %

automobile 51.3 %

bird 17.8 %

cat 17.4 %

deer 50 %

dog 40.6 %

frog 66.8 %

horse 62.9 %

ship 56.2 %

truck 57.1 %

Okay, so what next? How do we run this neural network on GPUs?

cunn: neural networks on GPUs using CUDA

In [54]:require 'cunn';

The idea is pretty simple. Take a neural network, and transfer it over to GPU:

In [55]:net = net:cuda()

Also, transfer the criterion to GPU:

In [56]:criterion = criterion:cuda()

Ok, now the data:

In [57]: 

trainset.data = trainset.data:cuda()

trainset.label = trainset.label:cuda()

Okay, let's train on GPU :) #sosimple

In [58]: 

trainer = nn.StochasticGradient(net, criterion)

trainer.learningRate = 0.001

trainer.maxIteration = 5 -- just do 5 epochs of training.

In [59]:trainer:train(trainset)

Out[59]:

# StochasticGradient: training

# current error = 1.3934907696846

# current error = 1.3217509180375

# current error = 1.2544273688514

# current error = 1.1865880507031

# current error = 1.1181808366985

# StochasticGradient: you have reached the maximum number of iterations

# training error = 1.1181808366985

Why don't I notice a MASSIVE speedup compared to CPU? Because your network is really small.

Exercise: Try increasing the size of your network (arguments 1 and 2 of nn.SpatialConvolution(...)) and see what kind of speedup you get.
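
One possible wider variant is sketched below; the channel counts 64 and 128 are arbitrary choices, and the nn.View/nn.Linear sizes must be updated to match:

net = nn.Sequential()
net:add(nn.SpatialConvolution(3, 64, 5, 5))   -- 64 feature maps instead of 6
net:add(nn.ReLU())
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.SpatialConvolution(64, 128, 5, 5)) -- 128 feature maps instead of 16
net:add(nn.ReLU())
net:add(nn.SpatialMaxPooling(2,2,2,2))
net:add(nn.View(128*5*5))                     -- the spatial size is still 5x5 for 32x32 inputs
net:add(nn.Linear(128*5*5, 120))
net:add(nn.ReLU())
net:add(nn.Linear(120, 84))
net:add(nn.ReLU())
net:add(nn.Linear(84, 10))
net:add(nn.LogSoftMax())
net = net:cuda()                              -- move the bigger network to the GPU as before

You would then recreate the trainer with nn.StochasticGradient(net, criterion) and call :train(trainset) again, timing how long each epoch takes.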

Goals achieved:

  • Understand torch and the neural networks package at a high-level.
  • Train a small neural network on CPU and GPU

Where do I go next?

A good next stop is the Torch cheatsheet (https://github.com/torch/torch7/wiki/Cheatsheet), which links to community packages, tutorials and demos.