PyTorch

Install

Choose the version of PyTorch to install by platform, Python version, CUDA support, etc.

https://pytorch.org/get-started/locally/

E.g. the following wheel files are for the Windows platform, Python 3.5, without CUDA.

pip3 install https://download.pytorch.org/whl/cpu/torch-1.1.0-cp35-cp35m-win_amd64.whl

pip3 install https://download.pytorch.org/whl/cpu/torchvision-0.3.0-cp35-cp35m-win_amd64.whl

Or download the wheel file first and then run pip install [path to wheel file]

Tensor

Pretty much everything available for a numpy array is available in torch.tensor

    x = torch.tensor([1,2,3,4], dtype = torch.float)

or

    x = torch.randn(1,4)

To reshape

    x = x.view(2,2)


Convert a tensor to a numpy array, or load a numpy array into a tensor

    a = x.numpy()

    x = torch.from_numpy(a)


CUDA Tensor

A tensor can be moved onto any device (GPU or CPU)

    if torch.cuda.is_available():

        device1 = torch.device("cuda")          # a CUDA device object

        device2 = torch.device("cpu")           # a CPU device object

        x = torch.randn((1,4), device=device1)  # directly create a tensor on the GPU

        x = x.to(device2)                       # move x to the CPU


Auto Gradient

Tensors support automatic gradient calculation.

You only need to create the tensor with requires_grad=True

    x = torch.tensor([2.0, 3.0], requires_grad=True)

Specify y as a function of x

    y = x*2 + 1

It assigns a gradient function to y automatically.

    print(y)

tensor([5., 7.], grad_fn=<AddBackward0>)


One may further specify z as a function of y and so forth.

Let's look at y only for now.

The first element of y is y[0].

Calling y[0].backward() triggers the calculation of the gradient with respect to x.

    y[0].backward()

Print the gradient

    x.grad

tensor([2., 0.])

Note that once backward() is called, the gradient has been accumulated into x.grad and the graph's buffers are freed.

If you call backward() again, even on a different element such as y[1], it will throw an error:

    RuntimeError: Trying to backward through the graph a second time, but the buffers
    have already been freed. Specify retain_graph=True when calling backward the first time.


So let's try it again with retain_graph=True:

    x = torch.tensor([2.0, 3.0], requires_grad=True)

    y = x ** 2

    y[0].backward(retain_graph=True)

    print(x.grad)

Will get tensor([4., 0.])     # gradient of y[0] = 2x where x = 2

    y[1].backward(retain_graph=True)

    print(x.grad)

Will get tensor([4., 6.])     # gradient of y[1] = 2x where x = 3

Be careful if you run y[1].backward(retain_graph=True) yet again!

The gradient keeps accumulating even though the result is no longer the gradient you want.

    y[1].backward(retain_graph=True)

    print(x.grad)

Will get tensor([ 4., 12.])   # the gradient of y[1] = 6 is added to the original 6 so it's 12 now.

The array form of calculating gradient:

    x = torch.tensor([2.0, 3.0], requires_grad=True)

    y = x ** 2

    y.backward(torch.ones(y.size()))

    print(x.grad)

Will get tensor([4., 6.])

The torch.ones(y.size()) is the upstream gradient dL/dy for each element of y (all ones here); backward() computes the corresponding vector-Jacobian product (see the next section).

If you run y.backward(torch.tensor([2.0, 2.0])), you will get tensor([8., 12.]) because the upstream gradient is doubled.

torch.no_grad():

Leaving requires_grad=True in the calculations is expensive. It tracks the history and uses a lot of memory.

A variable x might be used both for training a model and for evaluating it (e.g. cross validation). During training it needs to track history so the gradient can be calculated. During evaluation no gradient is needed, but you don't want to set requires_grad=False because it may still be needed for the next round of training.

To achieve that, wrap the evaluation code block in torch.no_grad():

    with torch.no_grad():

        ...  # your code here
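
For example, a minimal evaluation sketch, assuming a trained `model` and hypothetical tensors val_data / val_labels:

    model.eval()                     # also switches dropout / batch norm to eval behaviour
    with torch.no_grad():            # nothing inside this block is recorded for autograd
        output = model(val_data)             # hypothetical held-out inputs
        predictions = output.argmax(dim=1)   # predicted class per sample
        accuracy = (predictions == val_labels).float().mean()
    model.train()                    # back to training mode; requires_grad was never touched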

More about Auto Grad

A vector-valued function Y = f(X), where Y = (y1, y2, ..., ym) and X = (x1, x2, ..., xn)

Then the gradient of Y with respect to X is a Jacobian matrix

        | dy1/dx1  dy1/dx2  ...  dy1/dxn |
    J = | dy2/dx1  dy2/dx2  ...  dy2/dxn |
        | ...                            |
        | dym/dx1  dym/dx2  ...  dym/dxn |


If there is a scalar function L = g(Y), e.g. L = sum(Y)

The gradient dL/dY = (dL/dy1, dL/dy2 ... dL/dym)^T

Then the gradient of L with respect to X can be calculated as:

    | dL/dx1 |   | dy1/dx1  dy2/dx1  ...  dym/dx1 |   | dL/dy1 |         | dL/dy1 |
    | dL/dx2 | = | dy1/dx2  dy2/dx2  ...  dym/dx2 | * | dL/dy2 | = J^T * | dL/dy2 |
    | ...    |   | ...                            |   | ...    |         | ...    |
    | dL/dxn |   | dy1/dxn  dy2/dxn  ...  dym/dxn |   | dL/dym |         | dL/dym |


This vector-Jacobian product makes it very convenient to feed an external gradient (dL/dY) into a model that has a non-scalar output Y and get dL/dX back.

An example for a 3-layer neural network (layer X has n neurons, layer Y has m, layer Z has k):

    x1    y1    z1
    x2    y2    z2
    ..    ..    ..
    xn    ym    zk

The loss function is L (scalar function).

The gradient of L with respect to layer Z is [dL/dz1, dL/dz2 ... dL/dzk]^T

The gradient of layer Y with respect to layer X is dY/dX, which is a Jacobian matrix:

        | dy1/dx1  dy1/dx2  ...  dy1/dxn |
    J = | dy2/dx1  dy2/dx2  ...  dy2/dxn |
        | ...                            |
        | dym/dx1  dym/dx2  ...  dym/dxn |

If you want to calculate the gradient of the loss L with respect to layer X, apply the chain rule backwards through the layers: first dL/dY = (dZ/dY)^T * dL/dZ, then

    dL/dX = J^T * dL/dY
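
A small sketch of feeding an external gradient into backward() and checking it against the explicit Jacobian; it reuses the y = x ** 2 example above, and torch.autograd.functional.jacobian is available in newer PyTorch versions:

    import torch

    x = torch.tensor([2.0, 3.0], requires_grad=True)
    y = x ** 2                       # non-scalar output Y = f(X)
    v = torch.tensor([2.0, 2.0])     # external gradient dL/dY fed in from a later stage

    y.backward(v)                    # computes the vector-Jacobian product J^T * v
    print(x.grad)                    # tensor([ 8., 12.])

    # check against the explicit Jacobian (newer PyTorch versions)
    J = torch.autograd.functional.jacobian(lambda t: t ** 2, torch.tensor([2.0, 3.0]))
    print(J.t() @ v)                 # tensor([ 8., 12.])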

   

PyTorch constructs a dynamic computation graph on the fly as you define tensors and functions of tensors, e.g. tensor x and y = x ** 2.

It runs the feed-forward process and keeps everything needed in buffers for calculating the gradient.

When backward() is called, it calculates the gradient and frees the buffers unless told not to (retain_graph=True).

Helloworld Neural Network

The following creates a neural network with two convolutional layers and three fully connected layers.

28 x 28 (conv 1 5x5) <--image input

|

24 x 24 (maxpool 1)

|

12 x 12 (conv 2 3x3)

|

10 x 10 (maxpool 2)

|

5 x 5    <--maxpool output, 20 channels of 5 x 5 images, this is not a layer. 

|

20 x 5 x 5, 100  <--fully connected layer with 100 neurons

|

100, 50

|

50, 10  <--output

import torch.nn as nn

import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):

        super(Net, self).__init__()

        

        # first convolutional layer:

        #    1 input image channel, 10 output channels, 5x5 square convolution

        #    the 10 output channels are the input channels for the next layer

        #    so 10 kernels and 10 biases

        #    input image is 28 x 28 pixels, 5x5 conv outputs 24 x 24, 2x2 maxpool outputs 12 x 12

        # second convolutional layer:

        #    10 input channels (12 x 12 images), 20 output channels, 3x3 convolution

        #    so 20 kernels and 20 biases. Each kernel is applied on all the 10 inputs and results are summed up.

        #    input image is 12 x 12 pixels, 3x3 conv outputs 10 x 10, 2x2 maxpool outputs 5 x 5

        self.conv1 = nn.Conv2d(1, 10, 5)

        self.conv2 = nn.Conv2d(10, 20, 3)

        # first fully connected layer

        #    the previous convolutional layer's output channels fully connect to this hidden layer

        #    5 x 5 images from the second maxpool

        self.fc1 = nn.Linear(20 * 5 * 5, 100) 

        

        # second fully connected layer

        self.fc2 = nn.Linear(100, 50)

        

        # output layer, use softmax

        self.fc3 = nn.Linear(50, 10)

    def forward(self, x):

        #the shape of x is (N, C, H, W)

        #where N is the batch size, there can be multiple training samples in x

        # C is the number of channels, there can be multiple channels per sample

        # H and W are the height and width of an image, e.g. 28 x 28

        # e.g. x = torch.tensor([[t] for t in test_data[:10]], dtype=torch.float32) gets the first 10 samples

        # e.g. x = torch.tensor([[c1, c2] for c1, c2 in zip(channel1[:10], channel2[:10])]) gets the first 10 samples with 2 channels

        

        #first conv layer

        x = self.conv1(x) #first conv

        x = F.relu(x)     #activation.

        x = F.max_pool2d(x, (2, 2)) # 2x2 maxpool

        

        #second conv layer

        x = self.conv2(x) #second conv

        x = F.relu(x)     #activation

        x = F.max_pool2d(x, (2, 2)) # 2x2 maxpool

        

        #first fully connected layer

        num_features = x.shape[1] * x.shape[2] * x.shape[3]  #ie. C * H * W

        x = x.view(-1, num_features) #flatten x into (N, C * H * W)

        x = self.fc1(x)   #first fc layer

        x = F.relu(x) 

        

        #second fully connected layer

        x = self.fc2(x)

        x = F.relu(x)

        

        #third layer, output

        x = self.fc3(x)

        

        return x

    def num_flat_features(self, x):  # helper equivalent to the inline num_features calculation above (not used in forward here)

        size = x.size()[1:]  # all dimensions except the batch dimension

        num_features = 1

        for s in size:

            num_features *= s

        return num_features

net = Net()

print(net)

SGD

Calculate the gradient and run SGD. The following is a made-up example:

input = torch.randn(10, 1, 28, 28)  #random input data with 10 samples, single channel and 28 x 28 images.

target = torch.randn(10, 10)        #10 random targets matching the output shape: 10 samples x 10 output neurons

output = net(input)                 #this runs through the feedforward

criterion = nn.MSELoss()            #use mean square error loss function

loss = criterion(output, target)    #calculate the loss value

net.zero_grad()             #zero the gradients

loss.backward()             #autograd, calculate the gradient for all tensors within the network

print(net.conv1.bias.grad)  #e.g. print the gradient of the first convolutional layer's bias

#Recall that weight = weight - learning_rate * gradient

learning_rate = 0.01        #set learning rate and adjust weights

for f in net.parameters():

    f.data.sub_(f.grad.data * learning_rate) #adjust the weights/biases of every layer by gradient * learning_rate; sub_ applies the change to f.data in place.

Optimizer

Alternatively, use existing packages to implement gradient descent.

import torch.optim as optim                                      #optimizer

optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9)   #learning rate, momentum, weight_decay(L2), etc.

optimizer.zero_grad()                                            # zero the gradient buffers

output = net(input)

loss = criterion(output, target)

loss.backward()

optimizer.step()    # Does the update

Loss function again

torch.nn has heaps of different loss functions to use. Common ones are mean squared error, cross entropy, etc.

Cross Entropy loss function in pytorch

    loss_fn = nn.CrossEntropyLoss()

    loss = loss_fn(inputs, target)

Inputs are the predictions (output) from a model and have a shape of (N, C), where N is the batch size (# of samples) and C is the # of classes.

When there are C classes, the output layer has C neurons. For every input sample (row), there are C columns holding the score (float32) for each class (neuron).

Note the scores are not probabilities. nn.CrossEntropyLoss applies Softmax internally to convert scores to probabilities before calculating loss = -log(p) for the correct class.

The target (actual labels) has a shape of (N), i.e. an N-element vector (a different shape from the inputs).

Each element is the index of the correct class's neuron, i.e. the first class is 0, the second class is 1, and so on.

The element's data type is Long (int64).
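
A minimal sketch of the shapes described above (the numbers are made up):

    import torch
    import torch.nn as nn

    loss_fn = nn.CrossEntropyLoss()
    scores = torch.randn(3, 4)                           # (N=3, C=4) raw scores from a model
    labels = torch.tensor([1, 0, 3], dtype=torch.long)   # (N,) correct class index per sample
    loss = loss_fn(scores, labels)
    print(loss.item())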

SGD again

The whole training framework is as follows:

for each epoch:

    randomly shuffle the training data

    break the training data into mini batches

    for each mini batch:

        data, label = mini batch converted into tensors

        net.zero_grad()     #zero the parameter gradients for the whole net

        output = net(data)  #feed forward

        loss = cross_entropy(output, label)  #calculate the loss

        loss.backward()     #calculate the gradients

        for param in net.parameters():

            param.data.sub_(param.grad.data * learning_rate) #adjust the weights by gradient * learning rate

    if test data is available:

        print the performance for every epoch

        early stop if needed
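
A runnable-style sketch of the loop above, assuming the `net` model from earlier and hypothetical tensors train_data (float) and train_labels (long):

    import torch
    import torch.nn as nn
    from torch.utils.data import TensorDataset, DataLoader

    criterion = nn.CrossEntropyLoss()
    learning_rate = 0.01
    # DataLoader shuffles and breaks the data into mini batches every epoch
    loader = DataLoader(TensorDataset(train_data, train_labels), batch_size=32, shuffle=True)

    for epoch in range(10):
        for data, label in loader:
            net.zero_grad()                      # zero the parameter gradients
            output = net(data)                   # feed forward
            loss = criterion(output, label)      # cross entropy loss
            loss.backward()                      # calculate the gradients
            with torch.no_grad():                # manual SGD step, no history tracking
                for param in net.parameters():
                    param -= learning_rate * param.grad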


If a GPU and the CUDA library are available:

device = torch.device("cuda:0")

net.to(device)                                           #convert the whole network to cuda tensors

inputs, labels = data[0].to(device), data[1].to(device)  #make sure data at every step is sent to gpu as well

If you have multiple GPUs, use DataParallel.
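
A minimal sketch, assuming more than one GPU is visible and `net` / `device` from the snippet above:

    import torch.nn as nn

    if torch.cuda.device_count() > 1:
        net = nn.DataParallel(net)   # splits each input batch across the available GPUs
    net.to(device)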

Dropout 

Dropout randomly drops a fraction p of a layer's neurons.

A common practice is to place it after every fully connected layer before the output layer. Some research applies dropout to convolutional layers as well, but at a lower rate (p ~ 0.1).

In the __init__ method of the model, e.g.

    self.fc = nn.Linear(100, 50)

    self.dropout = nn.Dropout(p=0.5)

In the forward method, 

    x = self.fc(x)

    x = F.relu(x)

    x = self.dropout(x)

Dropout in PyTorch doesn't physically remove neurons; it zeroes the neurons' outputs (and scales the remaining ones by 1/(1-p) during training).

The input and output of the dropout module have the same shape, so it doesn't change the shape seen by the subsequent layer.
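
A quick sketch illustrating this behaviour:

    import torch
    import torch.nn as nn

    drop = nn.Dropout(p=0.5)
    x = torch.ones(2, 10)
    print(drop(x))        # roughly half the entries are zeroed, the rest scaled by 1/(1-p) = 2
    drop.eval()
    print(drop(x))        # in eval mode dropout is a no-op and returns x unchanged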

 

Batch Normalization

Batch normalization is another effective way to help a model generalize.

For a mini-batch of m samples X1, X2, ..., Xm, the mean u and variance var of the batch are

   u = (X1 + X2... + Xm)/m

   var = sum((Xi - u)^2)/m

Every sample Xi = (xi1, xi2, ..., xik) has k dimensions; u and var are calculated separately for every dimension.

Let the mean and variance of the k-th dimension be uk and vark.

Then normalize every xik

   xik' = (xik - uk) / sqrt(vark + epsilon)

    i.e.

    xik' = (xik - uk) / standard deviation

Where epsilon is a very small constant to avoid a zero denominator.

Here xik' follows a zero-mean, unit-variance distribution.

To restore the representation power of the network, a further transformation of xik' is applied:

    yik' = gamma_k * xik' + beta_k

   

Imagine gamma_k = the standard deviation sqrt(vark) and beta_k = the mean uk; then gamma_k * xik' + beta_k reverses the previous normalization and yik' = xik.

The gamma and beta here are actually new parameters for the network to learn.

So the Batch Norm layer introduces 2 more parameters per dimension.
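
A minimal sketch of adding batch norm after a fully connected layer (BNNet and its layer sizes are made up for illustration):

    import torch.nn as nn
    import torch.nn.functional as F

    class BNNet(nn.Module):
        def __init__(self):
            super(BNNet, self).__init__()
            self.fc1 = nn.Linear(100, 50)
            self.bn1 = nn.BatchNorm1d(50)   # learns gamma (weight) and beta (bias) per dimension
            self.fc2 = nn.Linear(50, 10)

        def forward(self, x):
            x = F.relu(self.bn1(self.fc1(x)))  # normalize, then scale/shift, then activate
            return self.fc2(x)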

Why batch norm works so well is still an open question. The original paper says batch norm controls the "internal covariate shift", but later work has cast doubt on that explanation.

In experiments where covariate shift was deliberately added to a network after batch norm, the results were still good and better than a network without batch norm.

Another paper argues that batch norm smooths the objective function, making it easier for SGD to find a good solution.

Because batch norm normalizes a layer's outputs, it also helps avoid vanishing / exploding gradients (it keeps re-aligning values towards a zero-mean, unit-variance distribution).

Imagine the sigmoid function: when the input is too far from 0, the gradient tends towards zero, and it becomes even smaller after propagating through a few layers.

Save and load model

model.state_dict()

All layers' weights and biases. Batch norm layers' running means and running variances.

optimizer.state_dict()

The optimizer's hyperparameters and state, e.g. momentum buffers.

torch.save(model.state_dict(), 'c:/Temp/model.pth')

Save a model's state to a file. It uses pickle to serialize, and the recommended file extension is .pth / .pt.

model = TheModelClass(*args, **kwargs)

model.load_state_dict(torch.load('c:/Temp/model.pth'))

model.eval()

When loading state into a model, you need the model class definition and must create an instance of the model first.

eval() puts the model in evaluation mode so dropout and batch norm layers switch to their inference behaviour.

Alternatively you can save the whole model

torch.save(model, PATH)

# Model class must be defined somewhere

model = torch.load(PATH)

model.eval()
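
A minimal sketch of saving and restoring a full training checkpoint (the file name, dictionary keys and the `epoch` variable are arbitrary/illustrative):

    checkpoint = {
        'model_state': model.state_dict(),
        'optimizer_state': optimizer.state_dict(),
        'epoch': epoch,
    }
    torch.save(checkpoint, 'c:/Temp/checkpoint.pth')

    checkpoint = torch.load('c:/Temp/checkpoint.pth')
    model.load_state_dict(checkpoint['model_state'])
    optimizer.load_state_dict(checkpoint['optimizer_state'])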