PyTorch
Install
Choose the PyTorch build to install according to your platform, Python version, CUDA version, etc.
https://pytorch.org/get-started/locally/
E.g. the following wheel files are for the Windows platform, Python 3.5, without CUDA.
pip3 install https://download.pytorch.org/whl/cpu/torch-1.1.0-cp35-cp35m-win_amd64.whl
pip3 install https://download.pytorch.org/whl/cpu/torchvision-0.3.0-cp35-cp35m-win_amd64.whl
or download the wheel file and run pip3 install [path to wheel file]
Tensor
Pretty much everything available for a NumPy array is also available for torch.Tensor.
import torch
x = torch.tensor([1, 2, 3, 4], dtype=torch.float)
or
x = torch.randn(1,4)
To reshape
x = x.view(2,2)
Convert a tensor to a NumPy array, or wrap a NumPy array as a tensor (on CPU both share the same memory)
a = x.numpy()
x = torch.from_numpy(a)
CUDA Tensor
A tensor can be moved onto any device (GPU or CPU)
if torch.cuda.is_available():
    device1 = torch.device("cuda")  # a CUDA device object
    device2 = torch.device("cpu")  # a CPU device object
    x = torch.randn((1, 4), device=device1)  # directly create a tensor on GPU
    x = x.to(device2)  # move x to CPU
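A common way to write code that runs with or without a GPU is to pick the device once and pass it around. The following is only a minimal sketch of that pattern; the variable names are illustrative.

```python
import torch

# Pick the GPU when one is visible to PyTorch, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(1, 4, device=device)  # created directly on the chosen device
y = (x * 2).cpu()                     # results can always be brought back to the CPU
print(x.device, y.device)
```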
Auto Gradient
Tensors support automatic gradient calculation.
You only need to create the tensor with requires_grad=True
x = torch.tensor([2.0, 3.0], requires_grad=True)
Specify y as a function of x
y = x*2 + 1
It assigns a gradient function automatically to y.
print(y)
tensor([5., 7.], grad_fn=<AddBackward0>)
One may further specify z as a function of y and so forth.
Let's look at y only for now.
The first element of y is y[0].
Calling y[0].backward() triggers the calculation of the gradient of y[0] with respect to x.
y[0].backward()
Print the gradient
x.grad
tensor([2., 0.])
Note that once backward() has been called, the gradient has already been accumulated into x.grad and the graph's buffers have been freed.
If you call backward() again, even on a different element such as y[1], it will throw an error
RuntimeError: Trying to backward through the graph a second time, but the buffers
have already been freed. Specify retain_graph=True when calling backward the first time.
So let's try it again with retain_graph=True
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x ** 2
y[0].backward(retain_graph=True)
print(x.grad)
Will get tensor([4., 0.]) # gradient of y[0] = 2x where x = 2
y[1].backward(retain_graph=True)
print(x.grad)
Will get tensor([4., 6.]) # gradient of y[1] = 2x where x = 3
Note what happens if you run y[1].backward(retain_graph=True) again:
the gradient keeps accumulating into x.grad, even though the result is no longer the true gradient.
y[1].backward(retain_graph=True)
print(x.grad)
Will get tensor([ 4., 12.]) # the gradient of y[1] = 6 is added to the original 6 so it's 12 now.
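To avoid the unwanted accumulation, the stored gradient can be reset before the next backward pass. A minimal sketch, using the same x and y as above and the in-place x.grad.zero_() call:

```python
import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x ** 2

y[1].backward(retain_graph=True)
y[1].backward(retain_graph=True)  # accumulates: x.grad[1] is now 6 + 6
accumulated = x.grad.clone()

x.grad.zero_()                    # reset the accumulated gradient in place
y[1].backward()                   # last call, so the graph may be freed
fresh = x.grad.clone()
print(accumulated, fresh)
```

This is what net.zero_grad() and optimizer.zero_grad() do for every parameter of a model.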
The array form of calculating gradient:
x = torch.tensor([2.0, 3.0], requires_grad=True)
y = x ** 2
y.backward(torch.ones(y.size()))
print(x.grad)
Will get tensor([4., 6.])
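The torch.ones argument is the external gradient dL/dy that gets multiplied into the Jacobian; all ones means every element of y is weighted by 1 (as if L = sum(y)).
If you instead run y.backward(torch.tensor([2.0, 2.0])) on a fresh x and y, you will get tensor([8., 12.]) because every element of dL/dy is 2, so the gradient is doubled.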
torch.no_grad()
Tracking gradients for tensors with requires_grad=True is expensive: autograd records the history of operations and uses a lot of memory.
The same tensors may be used both for training a model and for evaluating it (e.g. cross validation). During training, the history must be tracked
so that the gradient can be calculated. During evaluation, no gradient is needed,
but you don't want to set requires_grad=False because it will be needed again in the next round of training.
To achieve that, wrap the evaluation code in
with torch.no_grad():
    ...  # your code
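A small sketch of the effect, assuming a hypothetical parameter tensor w: the same arithmetic is tracked outside the block and untracked inside it, while w itself keeps requires_grad=True.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

tracked = (w * 3).sum()        # recorded in the computation graph
with torch.no_grad():
    untracked = (w * 3).sum()  # same arithmetic, no history kept

print(tracked.requires_grad, untracked.requires_grad)
print(w.requires_grad)         # the flag on w is not changed by the block
```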
More about Auto Grad
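Consider a vector valued function Y = f(X) where Y = (y1, y2, ..., ym) and X = (x1, x2, ..., xn).
Then the gradient of Y with respect to X is a Jacobian matrix
\[
J = \begin{pmatrix}
\frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} & \cdots & \frac{dy_1}{dx_n} \\
\frac{dy_2}{dx_1} & \frac{dy_2}{dx_2} & \cdots & \frac{dy_2}{dx_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{dy_m}{dx_1} & \frac{dy_m}{dx_2} & \cdots & \frac{dy_m}{dx_n}
\end{pmatrix}
\]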
If there is a scalar function L = g(Y), e.g. L = sum(Y)
The gradient dL/dY = (dL/dy1, dL/dy2 ... dL/dym)^T
Then the gradient of L with respect to X can be calculated as:
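\[
\begin{pmatrix} \frac{dL}{dx_1} \\ \frac{dL}{dx_2} \\ \vdots \\ \frac{dL}{dx_n} \end{pmatrix}
=
\begin{pmatrix}
\frac{dy_1}{dx_1} & \frac{dy_2}{dx_1} & \cdots & \frac{dy_m}{dx_1} \\
\frac{dy_1}{dx_2} & \frac{dy_2}{dx_2} & \cdots & \frac{dy_m}{dx_2} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{dy_1}{dx_n} & \frac{dy_2}{dx_n} & \cdots & \frac{dy_m}{dx_n}
\end{pmatrix}
\begin{pmatrix} \frac{dL}{dy_1} \\ \frac{dL}{dy_2} \\ \vdots \\ \frac{dL}{dy_m} \end{pmatrix}
= J^T \, \frac{dL}{dY}
\]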
This vector-Jacobian product makes it very convenient to feed an external gradient (dL/dY) into a model that has a non-scalar output Y, and obtain dL/dX without ever building the full Jacobian dY/dX.
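The product can be checked by hand on a tiny case. The function below is a made-up example (not from the notes above): y1 = x1 * x2, y2 = x1 + x2, y3 = x1^2, with L = sum(Y), evaluated at X = (2, 3).

```python
# Hypothetical example function:
#   y1 = x1 * x2,  y2 = x1 + x2,  y3 = x1 ** 2,  L = y1 + y2 + y3
x1, x2 = 2.0, 3.0

# Jacobian J (m = 3 rows, n = 2 columns): J[i][j] = dy_i / dx_j
J = [
    [x2, x1],         # dy1/dx1, dy1/dx2
    [1.0, 1.0],       # dy2/dx1, dy2/dx2
    [2.0 * x1, 0.0],  # dy3/dx1, dy3/dx2
]

# External gradient dL/dY for L = sum(Y)
dL_dY = [1.0, 1.0, 1.0]

# dL/dX = J^T * dL/dY: entry j sums dy_i/dx_j * dL/dy_i over all outputs i
dL_dX = [sum(J[i][j] * dL_dY[i] for i in range(3)) for j in range(2)]
print(dL_dX)
```

Differentiating L = x1*x2 + x1 + x2 + x1^2 directly gives dL/dx1 = x2 + 1 + 2*x1 and dL/dx2 = x1 + 1, which agrees with the product.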
An example for a 3-layer neural network:
x1 y1 z1
x2 y2 z2
.. .. ..
xn ym zk
The loss function is L (scalar function).
The gradient of L with respect to layer Z is [dL/dz1, dL/dz2 ... dL/dzk]^T
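The gradient of layer Y with respect to layer X is dY/dX, which is a Jacobian matrix:
\[
J = \begin{pmatrix}
\frac{dy_1}{dx_1} & \frac{dy_1}{dx_2} & \cdots & \frac{dy_1}{dx_n} \\
\frac{dy_2}{dx_1} & \frac{dy_2}{dx_2} & \cdots & \frac{dy_2}{dx_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{dy_m}{dx_1} & \frac{dy_m}{dx_2} & \cdots & \frac{dy_m}{dx_n}
\end{pmatrix}
\]
Likewise, let J_Z = dZ/dY be the Jacobian of layer Z with respect to layer Y.
If you want to calculate the gradient of the loss L with respect to layer X, simply apply the chain rule layer by layer:
dL/dY = J_Z^T * dL/dZ
dL/dX = J^T * dL/dY = J^T * J_Z^T * dL/dZ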
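The chain rule through two layers can be compared against autograd. This is a sketch under simplifying assumptions: both layers are plain linear maps (hypothetical weights W1 and W2, no bias or activation), so dY/dX = W1 and dZ/dY = W2, and the loss is L = sum(Z).

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(4, 3)  # Y = W1 @ X, so dY/dX = W1 (m = 4, n = 3)
W2 = torch.randn(2, 4)  # Z = W2 @ Y, so dZ/dY = W2 (k = 2, m = 4)

X = torch.randn(3, requires_grad=True)
Y = W1 @ X
Z = W2 @ Y
L = Z.sum()             # scalar loss, so dL/dZ is a vector of ones
L.backward()

dL_dZ = torch.ones(2)
manual = W1.t() @ (W2.t() @ dL_dZ)  # (dY/dX)^T * (dZ/dY)^T * dL/dZ
print(torch.allclose(X.grad, manual, atol=1e-5))
```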
PyTorch constructs a dynamic computation graph on the fly as you define tensors and compute functions of them, e.g. tensor x and y = x ** 2.
It runs the feed forward process and keeps everything needed for calculating the gradient in buffers.
When backward() is called, it calculates the gradient and frees the buffers unless told not to (retain_graph=True).
Helloworld Neural Network
The following creates a neural network with two convolutional layers and 3 fully connected layers.
28 x 28 (conv 1 5x5) <--image input
|
24 x 24 (maxpool 1)
|
12 x 12 (conv 2 3x3)
|
10 x 10 (maxpool 2)
|
5 x 5 <--maxpool output, 20 channels of 5 x 5 images, this is not a layer.
|
20 x 5 x 5, 100 <--fully connected layer with 100 neurons
|
100, 50
|
50, 10 <--output
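The sizes in the diagram follow from the usual output-size arithmetic for convolution and pooling. A small sketch, assuming stride 1 and no padding for the convolutions and non-overlapping 2x2 pooling; the helper names are made up for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution over a square input."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel):
    """Output width/height of a non-overlapping max pool (stride == kernel)."""
    return size // kernel


sizes = [28]                           # input image
sizes.append(conv_out(sizes[-1], 5))   # conv 1, 5x5
sizes.append(pool_out(sizes[-1], 2))   # maxpool 1, 2x2
sizes.append(conv_out(sizes[-1], 3))   # conv 2, 3x3
sizes.append(pool_out(sizes[-1], 2))   # maxpool 2, 2x2

flat_features = 20 * sizes[-1] * sizes[-1]  # 20 channels flattened for the first fc layer
print(sizes, flat_features)
```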
import torch
import torch.nn as nn
import torch.nn.functional as F
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # first convolutional layer:
        # 1 input image channel, 10 output channels, 5x5 square convolution
        # the 10 output channels are the input channels for the next layer
        # so 10 kernels and 10 biases
        # input image is 28 x 28 pixels, 5x5 conv outputs 24 x 24, 2x2 maxpool outputs 12 x 12
        # second convolutional layer:
        # 10 input channels (12 x 12 images), 20 output channels, 3x3 convolution
        # so 20 kernels and 20 biases. Each kernel is applied on all the 10 inputs and results are summed up.
        # input image is 12 x 12 pixels, 3x3 conv outputs 10 x 10, 2x2 maxpool outputs 5 x 5
        self.conv1 = nn.Conv2d(1, 10, 5)
        self.conv2 = nn.Conv2d(10, 20, 3)
        # first fully connected layer
        # the previous convolutional layer's output channels fully connect to this hidden layer
        # 5 x 5 images from the second maxpool
        self.fc1 = nn.Linear(20 * 5 * 5, 100)
        # second fully connected layer
        self.fc2 = nn.Linear(100, 50)
        # output layer: returns raw scores, softmax is not applied here
        self.fc3 = nn.Linear(50, 10)
    def forward(self, x):
        # the shape of x is (N, C, H, W)
        # where N is the batch size, there can be multiple training samples in x
        # C is the number of channels, there can be multiple channels per sample
        # H and W are the height and width of an image, e.g. 28 x 28
        # e.g. x = torch.tensor([[t] for t in test_data[:10]], dtype=torch.float32) gets the first 10 samples with 1 channel
        # e.g. x = torch.tensor([[c1, c2] for c1, c2 in zip(channel1[:10], channel2[:10])]) gets the first 10 samples with 2 channels
        # first conv layer
        x = self.conv1(x)  # first conv
        x = F.relu(x)  # activation
        x = F.max_pool2d(x, (2, 2))  # 2x2 maxpool
        # second conv layer
        x = self.conv2(x)  # second conv
        x = F.relu(x)  # activation
        x = F.max_pool2d(x, (2, 2))  # 2x2 maxpool
        # first fully connected layer
        num_features = x.shape[1] * x.shape[2] * x.shape[3]  # i.e. C * H * W
        x = x.view(-1, num_features)  # flatten x into (N, C * H * W)
        x = self.fc1(x)  # first fc layer
        x = F.relu(x)
        # second fully connected layer
        x = self.fc2(x)
        x = F.relu(x)
        # third layer, output
        x = self.fc3(x)
        return x
    def num_flat_features(self, x):
        # helper that computes C * H * W; forward() above does the same calculation inline
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
net = Net()
print(net)
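The "kernels and biases" bookkeeping in the comments can be turned into a parameter count per layer. A plain-Python sketch with made-up helper names; the total should agree with sum(p.numel() for p in net.parameters()) for the network above.

```python
def conv_params(in_channels, out_channels, kernel):
    """One kernel x kernel filter per (input, output) channel pair, plus one bias per output channel."""
    return out_channels * in_channels * kernel * kernel + out_channels


def linear_params(in_features, out_features):
    """One weight per (input, output) pair, plus one bias per output neuron."""
    return in_features * out_features + out_features


counts = {
    "conv1": conv_params(1, 10, 5),
    "conv2": conv_params(10, 20, 3),
    "fc1": linear_params(20 * 5 * 5, 100),
    "fc2": linear_params(100, 50),
    "fc3": linear_params(50, 10),
}
total = sum(counts.values())
print(counts, total)
```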
SGD
Calculate the gradient and run SGD. The following is a made-up example:
input = torch.randn(10, 1, 28, 28)  # random input data with 10 samples, single channel and 28 x 28 images
target = torch.randn(10, 10)  # 10 random targets with 10 output neurons each, same shape as the network output
output = net(input)  # this runs through the feedforward
criterion = nn.MSELoss()  # use mean squared error loss function
loss = criterion(output, target)  # calculate the loss value
net.zero_grad()  # zero the gradients
loss.backward()  # autograd, calculate the gradient for all parameters within the network
print(net.conv1.bias.grad)  # e.g. print the gradient of the first convolutional layer's bias
# Recall that weight = weight - learning_rate * gradient
learning_rate = 0.01  # set learning rate and adjust weights
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)  # adjust the weights/biases of every layer by gradient * learning_rate; sub_ applies the change to f.data in place
Optimizer
Alternatively, use existing packages to implement gradient descent.
import torch.optim as optim #optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01, momentum=0.9) #learning rate, momentum, weight_decay(L2), etc.
optimizer.zero_grad() # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step() # Does the update
Loss function again
torch.nn has heaps of different loss functions to use. Common ones are mean squared error, cross entropy, etc.
Cross Entropy loss function in pytorch
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(inputs, target)
Inputs are the prediction (output) from a model, and have a shape of (N, C) where N the batch size (# of samples) and C is the # of classes.
When there are C classes, it means the output layer has C neurons. For every input sample (row), there are C columns indicating the score (float32) for each class (neuron).
Note the score is not a probability. nn.CrossEntropyLoss applies softmax internally to convert the scores to probabilities before calculating loss = -log(p), where p is the probability of the correct class.
The target (actual labels) has a shape of (N), i.e. a one-dimensional tensor of N elements (a different shape from the inputs).
Each element is the index of the correct class's neuron. i.e. first class is 0, second class is 1, and so on.
The element's data type is Long (int64).
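For a single sample, the calculation described above can be written out by hand. A plain-Python sketch with made-up scores; the batched loss averages this value over the N samples by default.

```python
import math

scores = [2.0, 1.0, 0.1]  # hypothetical raw scores for one sample, C = 3 classes
target = 0                # index of the correct class

exps = [math.exp(s) for s in scores]
probs = [e / sum(exps) for e in exps]  # softmax: scores -> probabilities
loss = -math.log(probs[target])        # cross entropy for this sample
print(probs, loss)
```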
SGD again
The whole training framework is as follows:
for each epoch:
    randomly shuffle training data
    break training data into mini batches
    for each mini batch:
        data, label = training data converted into tensors
        net.zero_grad()  # zero the parameter gradients for the whole net
        output = net(data)  # feed forward
        loss = cross_entropy(output, label)  # calculate the loss
        loss.backward()  # calculate the gradients
        for param in net.parameters():
            param.data.sub_(param.grad.data * learning_rate)  # adjust the weights by gradient * learning rate
    if test data is available:
        print the performance for every epoch.
        early stop if needed.
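The framework above could be made runnable along these lines. This is only a sketch on toy data: the random features and labels, the single nn.Linear standing in for a real network, and the batch size and learning rate are all assumptions for illustration. Shuffling and batching are delegated to torch.utils.data.DataLoader.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Hypothetical toy data: 64 samples, 8 features, 3 classes
features = torch.randn(64, 8)
labels = torch.randint(0, 3, (64,))  # int64 class indices
loader = DataLoader(TensorDataset(features, labels), batch_size=16, shuffle=True)

net = nn.Linear(8, 3)                # stand-in for a real network
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.1

epoch_losses = []
for epoch in range(5):
    total = 0.0
    for data, label in loader:       # the loader reshuffles every epoch
        net.zero_grad()              # zero the parameter gradients
        loss = loss_fn(net(data), label)
        loss.backward()              # calculate the gradients
        with torch.no_grad():        # the update itself must not be tracked
            for param in net.parameters():
                param.sub_(param.grad * learning_rate)
        total += loss.item() * data.size(0)
    epoch_losses.append(total / len(features))
print(epoch_losses)
```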
If a GPU and the CUDA library are available:
device = torch.device("cuda:0")
net.to(device)  # move the whole network's parameters to the GPU
inputs, labels = data[0].to(device), data[1].to(device)  # make sure the data at every step is sent to the GPU as well
If you have multiple GPUs, use nn.DataParallel.
Dropout
Dropout randomly drops each neuron of a layer with probability p during training.
A common practice is to place it after every fully connected layer before the output. Some research also applies dropout to convolutional layers, but at a lower rate (p ~ 0.1).
In the __init__ method of the model, e.g.
self.fc = nn.Linear(100, 50)
self.dropout = nn.Dropout(p=0.5)
In the forward method,
x = self.fc(x)
x = F.relu(x)
x = self.dropout(x)
Dropout in PyTorch doesn't physically remove the neurons; it zeroes their outputs and, during training, scales the remaining outputs by 1/(1-p).
The input and output of the dropout module have the same shape, so it doesn't affect the dimensions of the subsequent layer.
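The behaviour can be observed directly on a tensor of ones. A minimal sketch: in training mode each output is either zeroed or scaled, and in evaluation mode the module passes its input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()         # training mode: outputs are randomly zeroed
train_out = dropout(x)
dropout.eval()          # evaluation mode: dropout does nothing
eval_out = dropout(x)

print(train_out.shape == x.shape)  # the shape is preserved
print(train_out.unique())          # survivors are scaled by 1 / (1 - p)
print(torch.equal(eval_out, x))
```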
Batch Normalization
Batch normalization is another effective way to help a model generalize.
Assuming a mini-batch of m samples X1, X2, ..., Xm, the mean u and variance var of the batch are
u = (X1 + X2 + ... + Xm)/m
var = sum((Xi - u)^2)/m
Every sample Xi = (xi1, xi2, ..., xik) has k dimensions, and u and var are calculated separately for every dimension.
Let uj and varj be the mean and variance of the jth dimension (j = 1..k).
Then normalize every xij
xij' = (xij - uj) / sqrt(varj + epsilon)
i.e.
xij' = (xij - uj) / standard deviation
where epsilon is a very small constant to avoid a zero denominator.
Here xij is normalized to zero mean and unit variance.
To restore the representation power of the network, a further transformation is applied to xij':
yij = gamma_j * xij' + beta_j
Imagine gamma_j = standard deviation sqrt(varj) and beta_j = mean uj; then gamma_j * xij' + beta_j reverses the previous normalization and yij = xij.
The gamma and beta here are actually new parameters for the network to learn.
So the batch norm layer introduces 2 more parameters per dimension.
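The normalization and the scale-and-shift step can be worked through on a tiny batch. A plain-Python sketch with made-up numbers; here gamma and beta are set by hand to the values that undo the normalization, whereas in a real layer they are learned.

```python
import math

batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]  # hypothetical batch: m = 3 samples, k = 2 dimensions
eps = 1e-5
m, k = len(batch), len(batch[0])

# per-dimension mean and (biased) variance over the batch
mean = [sum(row[j] for row in batch) / m for j in range(k)]
var = [sum((row[j] - mean[j]) ** 2 for row in batch) / m for j in range(k)]

# normalize every element with the statistics of its own dimension
normalized = [[(row[j] - mean[j]) / math.sqrt(var[j] + eps) for j in range(k)] for row in batch]

# gamma = sqrt(var + eps) and beta = mean reverse the normalization
gamma = [math.sqrt(v + eps) for v in var]
beta = mean
restored = [[gamma[j] * row[j] + beta[j] for j in range(k)] for row in normalized]
print(normalized)
print(restored)
```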
Why batch norm works well is still not fully understood. The original article says batch norm controls the "internal covariate shift", but later work has challenged this explanation.
In experiments where covariate shift is deliberately added to a network after batch norm, the result is still good and better than a network without batch norm.
Another paper says it smooths the objective function, which makes it easier for SGD to work and find a better solution.
As batch norm normalizes a layer's output, it also helps against vanishing / exploding gradients (it always re-centers values to zero mean and unit variance).
Consider the sigmoid function: when the input is too far from 0, the gradient tends to zero, and it becomes even smaller after propagating through a few layers.
Save and load model
model.state_dict()
Contains all layers' weights and biases, plus batch norm layers' running means and running variances.
optimizer.state_dict()
Contains the optimizer's (optim) hyperparameters and momentum buffers, etc.
torch.save(model.state_dict(), 'c:/Temp/model.pth')
Save a model's state to a file. It uses pickle to serialize and a recommended file extension is pth / pt.
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load('c:/Temp/model.pth'))
model.eval()
When loading a state into a model, the model's class definition is needed and an instance of the model must be created first.
eval() puts the model in evaluation mode, so dropout is disabled and batch norm uses its running statistics instead of the batch statistics.
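Alternatively, you can save the whole model:
torch.save(model, PATH)
# The model class must be defined (importable) where the model is loaded.
# Recent PyTorch releases may require torch.load(PATH, weights_only=False) to unpickle a whole model.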
model = torch.load(PATH)
model.eval()
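A round trip through state_dict can be checked end to end. This is a minimal sketch: nn.Linear stands in for TheModelClass, and a temporary directory replaces the hard-coded path.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)          # stand-in for TheModelClass
path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(model.state_dict(), path)

restored = nn.Linear(4, 2)       # same architecture, different random weights
restored.load_state_dict(torch.load(path))
restored.eval()                  # switch to evaluation mode

same = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same)
```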