## Supervised Learning

In this tutorial, we're going to learn how to define a model and train it, using a supervised approach, to solve a multiclass classification task. Some of the material here is based on this existing tutorial.

The tutorial demonstrates how to:

- pre-process the (train and test) data, to facilitate learning
- describe a model to solve a classification task
- choose a loss function to minimize
- define a sampling procedure (stochastic, mini-batches), and apply one of several optimization techniques to train the model's parameters
- estimate the model's performance on unseen (test) data

Each of these 5 steps is accompanied by a script, provided on GitHub, on this page:

- 1_data.lua
- 2_model.lua
- 3_loss.lua
- 4_train.lua
- 5_test.lua

A top script,

At the end of each section, I propose a couple of exercises, which are mostly intended to make you modify the code and get a good idea of the effect of each parameter on the global procedure. Although the exercises are proposed at the end of each section, they should be done after you've read the complete tutorial, as they (almost) all require you to run the

The complete dataset is big, and we don't have time to play with the full set in this short tutorial session. The script

The example scripts provided are quite verbose, on purpose. Instead of relying on opaque classes, dataset creation and the training loop are exposed right here. Although a bit challenging at first, this should help new users quickly become independent and able to tweak the code for their own problems.

On top of the scripts above, I provide an extra script,

You can now follow these steps, in order:

## Step 1: Data

The code for this section is in

    th -i 1_data.lua

This will give you an interpreter to play with the data once it's loaded/preprocessed.

For this tutorial, we'll be using the Street View House Numbers (SVHN) dataset (http://ufldl.stanford.edu/housenumbers/). SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirement on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real-world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.

Overview of the dataset:

- 10 classes, 1 for each digit. Digit '1' has label 1, '9' has label 9 and '0' has label 10.
- 73257 digits for training, 26032 digits for testing, and 531131 additional, somewhat less difficult samples, to use as extra training data
- Comes in two formats:
  - Original images with character-level bounding boxes.
  - MNIST-like 32-by-32 images centered around a single character (many of the images do contain some distractors at the sides).

We will be using the second format. In terms of dimensionality:

- the inputs (images) are 3x32x32
- the outputs (targets) are 10-dimensional
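
The label convention above (digit '0' stored as class 10) is easy to trip over when reading predictions back. Here is a minimal plain-Lua sketch of the mapping; `labelToDigit`, `digitToLabel` and `ninputs` are hypothetical names chosen for illustration, not functions from the tutorial scripts.

```lua
-- Hypothetical helpers (not part of the tutorial scripts): SVHN stores the
-- digit '0' as class 10, and the digits '1'..'9' as classes 1..9.
local function labelToDigit(label)
   assert(label >= 1 and label <= 10, 'label must be in 1..10')
   return label % 10
end

local function digitToLabel(digit)
   assert(digit >= 0 and digit <= 9, 'digit must be in 0..9')
   if digit == 0 then return 10 end
   return digit
end

-- number of scalar values in one 3x32x32 input image
local ninputs = 3 * 32 * 32

print(labelToDigit(10), digitToLabel(0), ninputs)
```

Keeping class indices in 1..10 suits Lua's 1-based indexing, which is presumably why '0' is moved to the end rather than the start.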

In this first section, we are going to preprocess the data to facilitate training.

The script provided automatically retrieves the dataset; all we have to do is load it:

    -- We load the dataset from disk, and re-arrange it to be compatible
    -- with Torch's representation. Matlab uses a column-major representation,
    -- Torch is row-major, so we just have to transpose the data.

    -- Note: the data, in X, is 4-d: the 1st dim indexes the samples, the 2nd
    -- dim indexes the color channels (RGB), and the last two dims index the
    -- height and width of the samples.

    loaded = torch.load(train_file,'ascii')
    trainData = {
       data = loaded.X:transpose(3,4),
       labels = loaded.y[1],
       size = function() return (#trainData.data)[1] end
    }

    loaded = torch.load(test_file,'ascii')
    testData = {
       data = loaded.X:transpose(3,4),
       labels = loaded.y[1],
       size = function() return (#testData.data)[1] end
    }

Preprocessing requires a floating point representation (the original data is stored as bytes). Types can be easily converted in Torch, in general by doing:

    trainData.data = trainData.data:float()
    testData.data = testData.data:float()

We now preprocess the data. Preprocessing is crucial when applying pretty much any kind of machine learning algorithm.

For natural images, we use several intuitive tricks:

- images are mapped into YUV space, to separate luminance information from color information
- the luminance channel (Y) is locally normalized, using a contrastive normalization operator: for each neighborhood, defined by a Gaussian kernel, the mean is suppressed, and the standard deviation is normalized to one.
- color channels are normalized globally, across the entire dataset; as a result, each color component has zero mean and unit standard deviation across the dataset.
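
To make the global normalization step concrete, here is a small plain-Lua sketch on toy numbers. The helper names `meanStd` and `normalize` are hypothetical, and the standard deviation uses the n-1 estimator as an assumption of this sketch. The point it illustrates is that the test values are normalized with the statistics measured on the training values.

```lua
-- Plain-Lua sketch of global normalization (hypothetical helper names).
local function meanStd(values)
   local n = #values
   local sum = 0
   for i = 1, n do sum = sum + values[i] end
   local mean = sum / n
   local sq = 0
   for i = 1, n do sq = sq + (values[i] - mean)^2 end
   -- assumption of this sketch: unbiased (n-1) estimator
   return mean, math.sqrt(sq / (n - 1))
end

local function normalize(values, mean, std)
   local out = {}
   for i = 1, #values do out[i] = (values[i] - mean) / std end
   return out
end

local train = {2, 4, 4, 4, 5, 5, 7, 9}
local test = {3, 6}

-- statistics are estimated on the training values only...
local mean, std = meanStd(train)
local trainN = normalize(train, mean, std)
-- ...and reused, unchanged, for the test values
local testN = normalize(test, mean, std)

local checkMean, checkStd = meanStd(trainN)
print(checkMean, checkStd)
```

After this step the training values have (numerically) zero mean and unit standard deviation, while the test values are only approximately normalized, which is expected.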

In the script, these steps are implemented as follows:

    -- Convert all images to YUV
    print '==> preprocessing data: colorspace RGB -> YUV'
    for i = 1,trainData:size() do
       trainData.data[i] = image.rgb2yuv(trainData.data[i])
    end
    for i = 1,testData:size() do
       testData.data[i] = image.rgb2yuv(testData.data[i])
    end

    -- Name channels for convenience
    channels = {'y','u','v'}

    -- Normalize each channel, and store mean/std
    -- per channel. These values are important, as they are part of
    -- the trainable parameters. At test time, test data will be normalized
    -- using these values.

    print '==> preprocessing data: normalize each feature (channel) globally'
    mean = {}
    std = {}
    for i,channel in ipairs(channels) do
       -- normalize each channel globally:
       mean[i] = trainData.data[{ {},i,{},{} }]:mean()
       std[i] = trainData.data[{ {},i,{},{} }]:std()
       trainData.data[{ {},i,{},{} }]:add(-mean[i])
       trainData.data[{ {},i,{},{} }]:div(std[i])
    end

    -- Normalize test data, using the training means/stds
    for i,channel in ipairs(channels) do
       -- normalize each channel globally:
       testData.data[{ {},i,{},{} }]:add(-mean[i])
       testData.data[{ {},i,{},{} }]:div(std[i])
    end

    -- Local normalization
    print '==> preprocessing data: normalize Y (luminance) channel locally'

    -- Define the normalization neighborhood:
    neighborhood = image.gaussian1D(7)

    -- Define our local normalization operator (It is an actual nn module,
    -- which could be inserted into a trainable model):
    normalization = nn.SpatialContrastiveNormalization(1, neighborhood):float()

    -- Normalize all Y channels locally:
    for i = 1,trainData:size() do
       trainData.data[{ i,{1},{},{} }] = normalization(trainData.data[{ i,{1},{},{} }])
    end
    for i = 1,testData:size() do
       testData.data[{ i,{1},{},{} }] = normalization(testData.data[{ i,{1},{},{} }])
    end

At this stage, it's good practice to verify that data is properly normalized:

    for i,channel in ipairs(channels) do
       local trainMean = trainData.data[{ {},i }]:mean()
       local trainStd = trainData.data[{ {},i }]:std()

       local testMean = testData.data[{ {},i }]:mean()
       local testStd = testData.data[{ {},i }]:std()

       print('training data, '..channel..'-channel, mean: ' .. trainMean)
       print('training data, '..channel..'-channel, standard deviation: ' .. trainStd)

       print('test data, '..channel..'-channel, mean: ' .. testMean)
       print('test data, '..channel..'-channel, standard deviation: ' .. testStd)
    end

We can then get an idea of how the preprocessing transformed the data by displaying it:

    -- Visualization is quite easy, using itorch.image() (or image.display()
    -- outside of a notebook). Check out: help(image.display), for more info
    -- about options.

    first256Samples_y = trainData.data[{ {1,256},1 }]
    first256Samples_u = trainData.data[{ {1,256},2 }]
    first256Samples_v = trainData.data[{ {1,256},3 }]
    itorch.image(first256Samples_y)
    itorch.image(first256Samples_u)
    itorch.image(first256Samples_v)

## Exercise:

This is not the only kind of normalization! Data can be normalized in different manners, for instance, by normalizing individual features across the dataset (in this case, the pixels). Try these different normalizations, and see the impact they have on the training convergence.

tutorial_supervised_1_data.txt · Last modified: 2015/01/28 18:14 by clement

## Step 2: Model Definition

The code for this section is in

    th -i 2_model.lua -model linear
    th -i 2_model.lua -model mlp
    th -i 2_model.lua -model convnet

In this file, we describe three different models: convolutional neural networks (CNNs, or ConvNets), multi-layer neural networks (MLPs), and a simple linear model (which becomes a logistic regression if used with a negative log-likelihood loss).

Linear regression is the simplest type of model. It is parametrized by a weight matrix W, and a bias vector b.
Mathematically, it can be written as:

$$ y^n = Wx^n + b $$

In Torch, this model is described as follows:

    model = nn.Sequential()
    model:add(nn.Reshape(ninputs))
    model:add(nn.Linear(ninputs, noutputs))

A slightly more complicated model is the multi-layer neural network (MLP). This model is parametrized by two weight matrices, and two bias vectors:

$$ y^n = W_2 \, \text{sigmoid}(W_1 x^n + b_1) + b_2 $$

where the sigmoid function is a pointwise non-linearity (a tanh in the code below):

    model = nn.Sequential()
    model:add(nn.Reshape(ninputs))
    model:add(nn.Linear(ninputs,nhiddens))
    model:add(nn.Tanh())
    model:add(nn.Linear(nhiddens,noutputs))

Compared to the linear regression model, the 2-layer neural network can learn arbitrary non-linear mappings between its inputs and outputs. In practice, it can be quite hard to train fully-connected MLPs to classify natural images.

Convolutional Networks are a particular form of MLP, which was tailored to efficiently learn to classify images. Convolutional Networks are trainable architectures composed of multiple stages. The input and output of each stage are sets of arrays called feature maps. For example, if the input is a color image, each feature map would be a 2D array containing a color channel of the input image (for an audio input each feature map would be a 1D array, and for a video or volumetric image, it would be a 3D array). At the output, each feature map represents a particular feature extracted at all locations on the input. Each stage is composed of three layers: a filter bank layer, a non-linearity layer, and a feature pooling layer. A typical ConvNet is composed of one, two or three such 3-layer stages, followed by a classification module. Each layer type is now described for the case of image recognition.

Trainable hierarchical vision models, and more generally image processing algorithms, are usually expressed as sequences of operations or transformations. They can be well described by a modular approach, in which each module processes an input image bank and produces a new bank. The figure above is a nice graphical illustration of this approach.
Each module requires the previous bank to be fully (or at least partially) available before computing its output. This causality prevents simple parallelism from being implemented across modules. However, parallelism can easily be introduced within a module, and at several levels, depending on the kind of underlying operations. These forms of parallelism are exploited in Torch7.

Typical ConvNets rely on a few basic modules:

Filter bank layer: the input is a 3D array with $n_1$ 2D feature maps of size $n_2 \times n_3$. Each component is denoted $x_{ijk}$, and each feature map is denoted $x_i$. The output is also a 3D array, $y$, composed of $m_1$ feature maps of size $m_2 \times m_3$. A trainable filter (kernel) $k_{ij}$ in the filter bank has size $l_1 \times l_2$ and connects input feature map $x_i$ to output feature map $y_j$. The module computes

$$ y_j = b_j + \sum_i k_{ij} * x_i $$

where $*$ is the 2D discrete convolution operator and $b_j$ is a trainable bias parameter. Each filter detects a particular feature at every location on the input. Hence spatially translating the input of a feature detection layer will translate the output but leave it otherwise unchanged.

Non-Linearity Layer: In traditional ConvNets this simply consists of a pointwise tanh() sigmoid function applied to each site (ijk). However, recent implementations have used more sophisticated non-linearities. A useful one for natural image recognition is the rectified sigmoid $R_{abs}$: $\text{abs}(g_i \cdot \tanh(\cdot))$, where $g_i$ is a trainable gain parameter. The rectified sigmoid is sometimes followed by a subtractive and divisive local normalization N, which enforces local competition between adjacent features in a feature map, and between features at the same spatial location.

Feature Pooling Layer: This layer treats each feature map separately. In its simplest instance, it computes the average values over a neighborhood in each feature map. Recent work has shown that more selective poolings, based on the $L_p$ norm, tend to work best, with P=2, or P=inf (also known as max pooling).
The neighborhoods are stepped by a stride larger than 1 (but smaller than or equal to the pooling neighborhood). This results in a reduced-resolution output feature map which is robust to small variations in the location of features in the previous layer. The average operation is sometimes replaced by a max, $P_M$. Traditional ConvNets use a pointwise tanh() after the pooling layer, but more recent models do not. Some ConvNets dispense with the separate pooling layer entirely, but use strides larger than one in the filter bank layer to reduce the resolution. In some recent versions of ConvNets, the pooling also pools similar features at the same location, in addition to the same feature at nearby locations.
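
The three layer types described above can be traced by hand on a single small feature map. The following is a plain-Lua sketch, not Torch code: `filter2d`, `tanhMap` and `l2pool` are hypothetical names, the filter is applied as a sliding window without flipping the kernel, and L2 pooling is taken here to mean the square root of the sum of squares over each window.

```lua
-- Sketch of one ConvNet stage on a single feature map, in plain Lua.
-- All function names are hypothetical; nothing here uses Torch.

-- "valid" sliding-window filtering (kernel not flipped) plus a bias
local function filter2d(x, k, bias)
   local kh, kw = #k, #k[1]
   local oh, ow = #x - kh + 1, #x[1] - kw + 1
   local y = {}
   for r = 1, oh do
      y[r] = {}
      for c = 1, ow do
         local s = bias
         for i = 1, kh do
            for j = 1, kw do
               s = s + k[i][j] * x[r + i - 1][c + j - 1]
            end
         end
         y[r][c] = s
      end
   end
   return y
end

-- pointwise tanh, written with math.exp so it runs on any Lua version
local function tanhMap(x)
   local y = {}
   for r = 1, #x do
      y[r] = {}
      for c = 1, #x[r] do
         local e = math.exp(2 * x[r][c])
         y[r][c] = (e - 1) / (e + 1)
      end
   end
   return y
end

-- L2 pooling: square root of the sum of squares over each window
local function l2pool(x, size, stride)
   local y = {}
   local orow = 0
   for r = 1, #x - size + 1, stride do
      orow = orow + 1
      y[orow] = {}
      local ocol = 0
      for c = 1, #x[1] - size + 1, stride do
         ocol = ocol + 1
         local s = 0
         for i = 0, size - 1 do
            for j = 0, size - 1 do
               s = s + x[r + i][c + j]^2
            end
         end
         y[orow][ocol] = math.sqrt(s)
      end
   end
   return y
end

-- 6x6 input of ones, 3x3 averaging kernel -> 4x4 map -> pooled to 2x2
local input = {}
for r = 1, 6 do
   input[r] = {}
   for c = 1, 6 do input[r][c] = 1 end
end
local kernel = {}
for i = 1, 3 do
   kernel[i] = {}
   for j = 1, 3 do kernel[i][j] = 1 / 9 end
end

local filtered = filter2d(input, kernel, 0)
local pooled = l2pool(tanhMap(filtered), 2, 2)
print(#filtered, #filtered[1], #pooled, #pooled[1])
```

Note how the filter shrinks the map by `kernel size - 1` in each dimension, and how a 2x2 pooling with stride 2 halves it again.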
Here is an example of ConvNet that we will use in this tutorial:

    -- parameters
    -- (nfeats, the number of input channels, and noutputs, the number of
    -- classes, are assumed to be defined beforehand: 3 and 10 here)
    nstates = {16,256,128}
    fanin = {1,4}
    filtsize = 5
    poolsize = 2
    normkernel = image.gaussian1D(7)

    -- Container:
    model = nn.Sequential()

    -- stage 1 : filter bank -> squashing -> L2 pooling -> normalization
    model:add(nn.SpatialConvolutionMap(nn.tables.random(nfeats, nstates[1], fanin[1]), filtsize, filtsize))
    model:add(nn.Tanh())
    model:add(nn.SpatialLPPooling(nstates[1],2,poolsize,poolsize,poolsize,poolsize))
    model:add(nn.SpatialSubtractiveNormalization(nstates[1], normkernel))

    -- stage 2 : filter bank -> squashing -> L2 pooling -> normalization
    model:add(nn.SpatialConvolutionMap(nn.tables.random(nstates[1], nstates[2], fanin[2]), filtsize, filtsize))
    model:add(nn.Tanh())
    model:add(nn.SpatialLPPooling(nstates[2],2,poolsize,poolsize,poolsize,poolsize))
    model:add(nn.SpatialSubtractiveNormalization(nstates[2], normkernel))

    -- stage 3 : standard 2-layer neural network
    -- (the maps are 5x5 at this point, which happens to equal filtsize)
    model:add(nn.Reshape(nstates[2]*filtsize*filtsize))
    model:add(nn.Linear(nstates[2]*filtsize*filtsize, nstates[3]))
    model:add(nn.Tanh())
    model:add(nn.Linear(nstates[3], noutputs))

A couple of comments about this model:

- the input has 3 feature maps, each 32x32 pixels. It is the convention for all nn.Spatial* layers to work on 3D arrays, with the first dimension indexing different features (here normalized YUV), and the next two dimensions indexing the height and width of the image/map.
- the first layer applies 16 filters to the input maps (choosing randomly among its different layers [see `fanin` parameter]), each being 5x5. The receptive field of this first layer is 5x5, and the maps produced by it are therefore 16x28x28. This linear transform is then followed by a non-linearity (tanh), and an L2-pooling function, which pools regions of size 2x2, and uses a stride of 2x2. The result of that operation is a 16x14x14 array, which represents a 14x14 map of 16-dimensional feature vectors. The receptive field of each pooled unit is 6x6 (a 2x2 block of 5x5 windows, each offset by one pixel), before the normalization step.
- the second layer is very much analogous to the first, except that now the 16-dim feature maps are projected into 256-dim maps, with a random connection table of fan-in 4 (see `fanin`): each unit in the output array is influenced by a 4x5x5 neighborhood of features in the previous layer. That layer therefore has 4x256x5x5 trainable kernel weights (and 256 biases). The result of the complete layer (conv+pooling) is a 256x5x5 array.
- at this stage, the 5x5 array of 256-dimensional feature vectors is flattened into a 6400-dimensional vector, which we feed to a two-layer neural net. The final prediction (10-dimensional distribution over classes) is influenced by a 32x32 neighborhood of input variables (YUV pixels).
- recent work (Jarrett et al.) has demonstrated the advantage of locally normalizing sets of internal features, at each stage of the model. The use of smoother pooling functions, such as the L2 norm instead of the harsher max-pooling, has also been shown to yield better generalization (Sermanet et al.). We use these two ingredients in this model.
- one other remark: it is typically not a good idea to use fully connected layers in internal layers. In general, favoring large numbers of features (over-completeness) over density of connections helps achieve better results (empirical evidence of this was reported in several papers, as in Hadsell et al.). The SpatialConvolutionMap module accepts tables of connectivities (maps) that allow one to create arbitrarily sparse connections between two layers. A couple of standard maps/tables are provided in nn.tables.
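
The map sizes quoted in the comments above follow from simple arithmetic, which can be checked with a few lines of plain Lua. This is only a sketch: `outSize` is a hypothetical helper, and it assumes unpadded filters and pooling windows, which is what the sizes in the text imply.

```lua
-- output width of an unpadded filter or pooling window (hypothetical helper)
local function outSize(inSize, window, stride)
   return math.floor((inSize - window) / stride) + 1
end

local width = 32
local filtsize, poolsize = 5, 2
local nstates = {16, 256, 128}
local fanin = {1, 4}

local sizes = {}
for stage = 1, 2 do
   width = outSize(width, filtsize, 1)          -- filter bank, stride 1
   sizes[#sizes + 1] = width
   width = outSize(width, poolsize, poolsize)   -- pooling, stride = pool size
   sizes[#sizes + 1] = width
end

-- size of the vector fed to the 2-layer classifier
local flattened = nstates[2] * width * width
-- kernel weights of the second filter bank: fan-in x outputs x 5 x 5
local kernelWeights2 = fanin[2] * nstates[2] * filtsize * filtsize

print(table.concat(sizes, ' -> '), flattened, kernelWeights2)
```

The same helper is handy when working through the exercises, since changing the filter or pooling size changes the input size of the final classifier.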

## Exercises:

The number of meta-parameters to adjust can be daunting at first. Try to get a feeling for the influence of these parameters on the learning convergence:

- going from the MLP to a ConvNet of similar size (you will need to think a little bit about the equivalence between the ConvNet states and the MLP states)
- replacing the 2-layer MLP on top of the ConvNet by a simpler linear classifier
- replacing the L2-pooling function by a max-pooling
- replacing the two-layer ConvNet by a single layer ConvNet with a much larger pooling area (to conserve the size of the receptive field)

tutorial_supervised_2_model.txt · Last modified: 2014/02/05 00:17 by clement

## Step 3: Loss Function

Now that we have a model, we need to define a loss function to be minimized, across the entire training set:

$$ L = \sum_n l(y^n,t^n) $$

One of the simplest loss functions we can minimize is the mean-square error between the predictions (outputs of the model), and the groundtruth labels, across the entire dataset:

$$ l(y^n,t^n) = \frac{1}{2} \sum_i (y_i^n - t_i^n)^2 $$

or, in Torch:

    criterion = nn.MSECriterion()

The MSE loss is typically not a good one for classification, as it forces the model to exactly predict the values imposed by the targets (labels).

Instead, a more commonly used, probabilistic objective is the negative log-likelihood. To minimize a negative log-likelihood, we first need to turn the predictions of our models into properly normalized log-probabilities. For the linear model, this is achieved by feeding the output units into a softmax:

$$ P(Y=i|x^n,W,b) = \text{softmax}_i(Wx^n+b) = \frac{ e^{W_i x^n+b_i} }{ \sum_j e^{W_j x^n+b_j} } $$

where $W_i$ is the i-th row of $W$ and $b_i$ is the i-th component of $b$. As we're interested in classification, the final prediction is then achieved by taking the argmax of this distribution:

$$ y^n = \arg\max_i P(Y=i|x^n,W,b) $$

in which case the output y is a scalar.

More generally, the output of any model can be turned into normalized log-probabilities, by stacking a LogSoftMax module on top of it:

    model:add( nn.LogSoftMax() )

We want to maximize the likelihood of the correct (target) class, for each sample in the dataset. This is equivalent to minimizing the negative log-likelihood (NLL), or minimizing the cross-entropy between the predictions of our model and the targets (training data). Mathematically, the per-sample loss can be defined as:

$$ l(x^n,t^n) = -\log(P(Y=t^n|x^n,W,b)) $$

Given that our model already produces log-probabilities (thanks to the LogSoftMax module), the corresponding criterion is:

    criterion = nn.ClassNLLCriterion()

Finally, another type of classification loss is the multi-class margin loss, which is closer to the well-known SVM loss.
This loss function doesn't require normalized outputs, and can be implemented like this:

    criterion = nn.MultiMarginCriterion()

The margin loss typically works on par with the negative log-likelihood. I haven't tested this thoroughly, so it's time for more exercises.

## Exercises:

The obvious exercise now is to play with these different loss functions, and see how they affect convergence. In particular try to:

- swap the loss from NLL to MultiMargin, and if it doesn't work as well, think a little bit more about the scaling of the gradients, and whether you should rescale the learning rate.
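
To make the negative log-likelihood from this section concrete, here is a plain-Lua sketch that turns raw scores into log-probabilities, reads off the per-sample loss, and takes the argmax as the prediction. The function names are hypothetical and the scores are made-up toy values; subtracting the maximum score before exponentiating is a standard numerical-stability trick, added here as an assumption of the sketch rather than something the text prescribes.

```lua
-- log-softmax, with the maximum subtracted for numerical stability
local function logSoftMax(scores)
   local maxScore = scores[1]
   for i = 2, #scores do maxScore = math.max(maxScore, scores[i]) end
   local sum = 0
   for i = 1, #scores do sum = sum + math.exp(scores[i] - maxScore) end
   local logZ = maxScore + math.log(sum)
   local out = {}
   for i = 1, #scores do out[i] = scores[i] - logZ end
   return out
end

-- per-sample negative log-likelihood of the target class
local function nll(logProbs, target)
   return -logProbs[target]
end

-- index of the largest value (the predicted class)
local function argmax(values)
   local best = 1
   for i = 2, #values do
      if values[i] > values[best] then best = i end
   end
   return best
end

local scores = {1.0, 3.0, 0.5}        -- toy model outputs for 3 classes
local logProbs = logSoftMax(scores)
local loss = nll(logProbs, 2)         -- target class is 2
local prediction = argmax(logProbs)

local total = 0
for i = 1, #logProbs do total = total + math.exp(logProbs[i]) end
print(loss, prediction, total)
```

The loss is small when the target class already has the highest score, and grows as the target's probability shrinks.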

tutorial_supervised_3_loss.txt · Last modified: 2012/10/02 13:28 (external edit)

## Step 4: Training Procedure

We now have some training data, a model to train, and a loss function to minimize. We define a training procedure, which you will find in this file:

A very important aspect of supervised training of non-linear models (ConvNets and MLPs) is the fact that the optimization problem is not convex anymore. This reinforces the need for a stochastic estimation of gradients, which has been shown to produce much better generalization results for several problems.

In this example, we show how the optimization algorithm can be easily set to either L-BFGS, CG, SGD or ASGD. In practice, it's very important to start with a few epochs of pure SGD, before switching to L-BFGS or ASGD (if switching at all). The intuition for that is related to the non-convex nature of the problem: at the very beginning of training (random initialization), the landscape might be highly non-convex, and no assumption should be made about the shape of the energy function. Often, SGD is the best we can do. Later on, batch methods (L-BFGS, CG) can be used more safely.

Interestingly, in the case of large convex problems, stochasticity is also very important, as it allows much faster (rough) convergence. Several works have explored these techniques, in particular, this recent paper from Byrd/Nocedal, and work on pure stochastic gradient descent by Bottou.

Here is our full training function, which demonstrates that you can switch the optimization you're using at runtime (if you want to), and also modify the batch size you're using at runtime. You can do all these things because we create the evaluation closure each time we create a new batch. If the batch size is 1, then the method is purely stochastic. If the batch size is set to the complete dataset, then the method is a pure batch method.

    -- classes
    classes = {'1','2','3','4','5','6','7','8','9','0'}

    -- This matrix records the current confusion across classes
    confusion = optim.ConfusionMatrix(classes)

    -- Log results to files
    trainLogger = optim.Logger(paths.concat(opt.save, 'train.log'))
    testLogger = optim.Logger(paths.concat(opt.save, 'test.log'))

    -- Retrieve parameters and gradients:
    -- this extracts and flattens all the trainable parameters of the model
    -- into a 1-dim vector
    if model then
       parameters,gradParameters = model:getParameters()
    end

    -- Training function
    function train()

       -- epoch tracker
       epoch = epoch or 1

       -- local vars
       local time = sys.clock()

       -- shuffle at each epoch
       local shuffle = torch.randperm(trainData:size())

       -- do one epoch
       print('==> doing epoch on training data:')
       print("==> online epoch # " .. epoch .. ' [batchSize = ' .. opt.batchSize .. ']')
       for t = 1,trainData:size(),opt.batchSize do
          -- disp progress
          xlua.progress(t, trainData:size())

          -- create mini batch
          local inputs = {}
          local targets = {}
          for i = t,math.min(t+opt.batchSize-1,trainData:size()) do
             -- load new sample
             local input = trainData.data[shuffle[i]]:double()
             local target = trainData.labels[shuffle[i]]
             table.insert(inputs, input)
             table.insert(targets, target)
          end

          -- create closure to evaluate f(X) and df/dX
          local feval = function(x)
             -- get new parameters
             if x ~= parameters then
                parameters:copy(x)
             end

             -- reset gradients
             gradParameters:zero()

             -- f is the average of all criterions
             local f = 0

             -- evaluate function for complete mini batch
             for i = 1,#inputs do
                -- estimate f
                local output = model:forward(inputs[i])
                local err = criterion:forward(output, targets[i])
                f = f + err

                -- estimate df/dW
                local df_do = criterion:backward(output, targets[i])
                model:backward(inputs[i], df_do)

                -- update confusion
                confusion:add(output, targets[i])
             end

             -- normalize gradients and f(X)
             gradParameters:div(#inputs)
             f = f/#inputs

             -- return f and df/dX
             return f,gradParameters
          end

          -- optimize on current mini-batch
          if opt.optimization == 'CG' then
             config = config or {maxIter = opt.maxIter}
             optim.cg(feval, parameters, config)

          elseif opt.optimization == 'LBFGS' then
             config = config or {learningRate = opt.learningRate,
                                 maxIter = opt.maxIter,
                                 nCorrection = 10}
             optim.lbfgs(feval, parameters, config)

          elseif opt.optimization == 'SGD' then
             config = config or {learningRate = opt.learningRate,
                                 weightDecay = opt.weightDecay,
                                 momentum = opt.momentum,
                                 learningRateDecay = 5e-7}
             optim.sgd(feval, parameters, config)

          elseif opt.optimization == 'ASGD' then
             config = config or {eta0 = opt.learningRate,
                                 t0 = trainData:size() * opt.t0}
             _,_,average = optim.asgd(feval, parameters, config)

          else
             error('unknown optimization method')
          end
       end

       -- time taken
       time = sys.clock() - time
       time = time / trainData:size()
       print("==> time to learn 1 sample = " .. (time*1000) .. 'ms')

       -- print confusion matrix
       print(confusion)

       -- update logger/plot (before the confusion matrix is reset)
       trainLogger:add{['% mean class accuracy (train set)'] = confusion.totalValid * 100}
       if opt.plot then
          trainLogger:style{['% mean class accuracy (train set)'] = '-'}
          trainLogger:plot()
       end

       -- reset the confusion matrix for the next epoch
       confusion:zero()

       -- save/log current net
       local filename = paths.concat(opt.save, 'model.net')
       os.execute('mkdir -p ' .. sys.dirname(filename))
       print('==> saving model to '..filename)
       torch.save(filename, model)

       -- next epoch
       epoch = epoch + 1
    end

We could then run the training procedure like this:

    while true do
       train()
    end

## Exercises:

So, a bit on purpose, I've given you this blob of training code with rather few explanations. Try to understand what's going on, to do the following things:

- modify the batch size (and possibly the learning rate) and observe the impact on training accuracy, and test accuracy (generalization)
- change the optimization method, and in particular, try to start with L-BFGS from the very first epoch. What happens then?
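
The skeleton of the training function (shuffle, slice into mini-batches, average the gradients over the batch, take one step) is easier to see on a toy problem. Below is a plain-Lua sketch, not Torch code: it fits a one-dimensional linear model `w*x + b` to four made-up samples with hand-written gradients. The data, the learning rate, the number of epochs and the `randperm` helper are all assumptions chosen for illustration.

```lua
-- Toy problem: recover w = 2, b = 1 from samples of y = 2x + 1.
local xs = {0, 1, 2, 3}
local ys = {1, 3, 5, 7}

-- Fisher-Yates shuffle of the indices 1..n (hypothetical helper)
local function randperm(n)
   local p = {}
   for i = 1, n do p[i] = i end
   for i = n, 2, -1 do
      local j = math.random(i)
      p[i], p[j] = p[j], p[i]
   end
   return p
end

local w, b = 0, 0
local learningRate, batchSize = 0.05, 2
math.randomseed(1)

for epoch = 1, 2000 do
   -- shuffle at each epoch
   local shuffle = randperm(#xs)
   for t = 1, #xs, batchSize do
      local last = math.min(t + batchSize - 1, #xs)
      local gw, gb, n = 0, 0, 0
      for i = t, last do
         local x, y = xs[shuffle[i]], ys[shuffle[i]]
         local err = (w * x + b) - y   -- derivative of 0.5 * err^2 w.r.t. the output
         gw = gw + err * x
         gb = gb + err
         n = n + 1
      end
      -- average the gradients over the mini-batch, then take one step
      w = w - learningRate * gw / n
      b = b - learningRate * gb / n
   end
end
print(w, b)
```

Setting `batchSize` to 1 gives the purely stochastic case, and setting it to `#xs` gives the pure batch case, mirroring the remark made before the training function.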

tutorial_supervised_4_train.txt · Last modified: 2012/10/02 13:28 (external edit)

## Step 5: Test the Model

A common thing to do is to test the model's performance while we train it. Usually, this test is done on a subset of the training data that is kept for validation. Here we simply define the test procedure on the available test set:

    function test()
       -- local vars
       local time = sys.clock()

       -- averaged param use?
       if average then
          cachedparams = parameters:clone()
          parameters:copy(average)
       end

       -- test over test data
       print('==> testing on test set:')
       for t = 1,testData:size() do
          -- disp progress
          xlua.progress(t, testData:size())

          -- get new sample
          local input = testData.data[t]:double()
          local target = testData.labels[t]

          -- test sample
          local pred = model:forward(input)
          confusion:add(pred, target)
       end

       -- timing
       time = sys.clock() - time
       time = time / testData:size()
       print("==> time to test 1 sample = " .. (time*1000) .. 'ms')

       -- print confusion matrix
       print(confusion)

       -- update log/plot (before the confusion matrix is reset)
       testLogger:add{['% mean class accuracy (test set)'] = confusion.totalValid * 100}
       if opt.plot then
          testLogger:style{['% mean class accuracy (test set)'] = '-'}
          testLogger:plot()
       end

       -- reset the confusion matrix
       confusion:zero()

       -- averaged param use?
       if average then
          -- restore parameters
          parameters:copy(cachedparams)
       end
    end

The train/test procedure now looks like this:

    while true do
       train()
       test()
    end

## Exercises:

As mentioned above, validation is the proper (and only!) way to train a model and estimate how well it does on unseen data:

- modify the code above to extract a subset of the training data to use for validation
- once you have that, add a stopping condition to the script, such that it terminates once the validation error starts rising above a certain threshold. This is called early-stopping.
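
Both the training and the test functions summarize their results through a confusion matrix. To show what such a matrix holds, here is a plain-Lua sketch with hypothetical helper names and made-up predictions. It does not reproduce the internals of `optim.ConfusionMatrix`; it only illustrates two common summaries, the global accuracy and the accuracy averaged over classes, which can differ noticeably when classes are unbalanced.

```lua
-- rows index the target class, columns the predicted class
local function newConfusion(nclasses)
   local mat = {}
   for i = 1, nclasses do
      mat[i] = {}
      for j = 1, nclasses do mat[i][j] = 0 end
   end
   return mat
end

local function addPrediction(mat, predicted, target)
   mat[target][predicted] = mat[target][predicted] + 1
end

-- returns (global accuracy, accuracy averaged over non-empty classes)
local function accuracies(mat)
   local correct, total = 0, 0
   local classSum, classCount = 0, 0
   for i = 1, #mat do
      local rowTotal = 0
      for j = 1, #mat do rowTotal = rowTotal + mat[i][j] end
      correct = correct + mat[i][i]
      total = total + rowTotal
      if rowTotal > 0 then
         classSum = classSum + mat[i][i] / rowTotal
         classCount = classCount + 1
      end
   end
   return correct / total, classSum / classCount
end

-- toy run: class 1 is over-represented and always predicted correctly
local confusion = newConfusion(3)
local predictions = {1, 1, 1, 1, 2, 3, 3, 1}
local targets     = {1, 1, 1, 1, 2, 2, 3, 3}
for i = 1, #targets do
   addPrediction(confusion, predictions[i], targets[i])
end

local globalAcc, meanClassAcc = accuracies(confusion)
print(globalAcc, meanClassAcc)
```

In this toy run the global accuracy is higher than the class-averaged one, because the frequent class is the easy one.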

## All Done!

The final step, of course, is to run

## Final Exercise

If time allows, you can try to replace this dataset by other datasets, such as MNIST, which you should already have working (from day 1). Try to think about what you have to change/adapt to work with other types of images (non RGB, binary, infrared?).

tutorial_supervised_5_test.txt · Last modified: 2012/10/02 13:28 (external edit)

## Tips, going further

## Tips and tricks for MLP training

There are several hyper-parameters in the above code, which are not (and, generally speaking, cannot be) optimized by gradient descent. The design of outer-loop algorithms for optimizing them is a topic of ongoing research. Over the last 25 years, researchers have devised various rules of thumb for choosing them. A very good overview of these tricks can be found in Efficient BackProp by Yann LeCun, Leon Bottou, Genevieve Orr, and Klaus-Robert Mueller. Here, we summarize the same issues, with an emphasis on the parameters and techniques that we actually used in our code.

## Tips and Tricks: Nonlinearity

Which non-linear activation function should you use in a neural network? Two of the most common ones are the logistic sigmoid and the tanh functions. For reasons explained in Section 4.4 of that paper, nonlinearities that are symmetric around the origin are preferred because they tend to produce zero-mean inputs to the next layer (which is a desirable property). Empirically, we have observed that the tanh has better convergence properties.

## Tips and Tricks: Weight initialization

At initialization we want the weights to be small enough around the origin so that the activation function operates near its linear regime, where gradients are the largest. Otherwise, the gradient signal used for learning is attenuated by each layer as it is propagated from the classifier towards the inputs.
Proper weight initialization is implemented in all the modules provided in

## Tips and Tricks: Learning Rate

Optimization by stochastic gradient descent is very sensitive to the step size or learning rate. Section 4.7 details procedures for choosing a learning rate for each parameter (weight) in our network and for choosing them adaptively based on the error of the classifier.

## Tips and Tricks: Number of hidden units

The number of hidden units that gives best results is dataset-dependent. Generally speaking, the more complicated the input distribution is, the more capacity the network will require to model it, and so the larger the number of hidden units that will be needed.

## Tips and Tricks: Norm Regularization

Typical values to try for the L1/L2 regularization parameter are 10^-2 or 10^-3. It is usually only useful to regularize the topmost layers of the MLP (closest to the classifier), if not the classifier only. An L2 regularization is really easy to implement:

    -- model:
    model = nn.Sequential()
    model:add( nn.Linear(100,200) )
    model:add( nn.Tanh() )
    model:add( nn.Linear(200,10) )

    -- weights to regularize (here: the classifier, i.e. the last layer):
    reg = {}
    reg[1] = model:get(3).weight
    reg[2] = model:get(3).bias

    -- optimization:
    while true do
       -- ...
       optim.sgd(...)

       -- after each optimization step (gradient descent), regularize weights
       for _,w in ipairs(reg) do
          w:add(-weightDecay, w)
       end
    end

## Tips and tricks for ConvNet training

ConvNets are especially tricky to train, as they add even more hyper-parameters than a standard MLP. While the usual rules of thumb for learning rates and regularization constants still apply, the following should be kept in mind when optimizing ConvNets.

## Number of filters

Since feature map size decreases with depth, layers near the input layer will tend to have fewer filters while layers higher up can have many more. In fact, to equalize computation at each layer, the product of the number of features and the number of pixel positions is typically picked to be roughly constant across layers.
Preserving the information about the input would require the total number of activations (number of feature maps times number of pixel positions) to be non-decreasing from one layer to the next (of course we could hope to get away with less when we are doing supervised learning). The number of feature maps directly controls capacity, so the right choice depends on the number of available examples and the complexity of the task.

## Filter Shape

Common filter shapes found in the literature vary greatly, usually based on the dataset. Best results on MNIST-sized images (28x28) are usually in the 5x5 range on the first layer, while natural image datasets (often with hundreds of pixels in each dimension) tend to use larger first-layer filters of shape 7x7 to 12x12.

The trick is thus to find the right level of "granularity" (i.e. filter shapes) in order to create abstractions at the proper scale, given a particular dataset.

It's also possible to use multiscale receptive fields, to allow the ConvNet to have a much larger receptive field while keeping its computational complexity low. This type of procedure was proposed for scene parsing (where context is crucial to recognize objects) in this paper.

## Pooling Shape

Typical values for pooling are 2x2. Very large input images may warrant 4x4 pooling in the lower layers. Keep in mind, however, that this will reduce the dimension of the signal by a factor of 16, and may result in throwing away too much information. In general, the pooling region is independent of the stride at which you discard information. In Torch, all the pooling modules (L2, average, max) have separate parameters for the pooling size and the strides, for example:

    nn.SpatialMaxPooling(pool_x, pool_y, stride_x, stride_y)

tutorial_supervised_a_tips.txt · Last modified: 2012/10/02 13:28 (external edit)
