Machine Learning, Neural Network & Deep Learning

Deep Learning

Notes

A complex multi-layer Artificial Neural Network (ANN) with two or more hidden layers is known as a deep learning network. The complex problem is hierarchically divided and sub-divided into smaller, more specific problems, which are implemented through separate sub-networks using the concept of layer abstraction. For example [1], the face detection problem is divided into sub-problems such as "is there an eye in the top left?", "is there a nose in the middle?", "is there hair on the top?", etc., which represent the sub-networks of the ANN for face detection, as shown in Fig. 1. Each sub-problem is further divided into sub-problems represented by sub-layers, such as "is there an eyebrow?", as depicted in Fig. 2.

Fig. 1: Neural Network decomposed into sub-networks to solve sub-problems [1]

Fig. 2: The sub-networks for the block "Is there an eye in the top left ?" [1]

A deep learning network can learn through the stochastic gradient descent (SGD) algorithm just as a shallow neural network does; however, the learning can be unacceptably slow irrespective of the learning rate, especially when the output deviates widely from the target value, because the output then lies close to the flat (saturated) portion of the transfer function's curve. This learning slowdown can be addressed by replacing the quadratic cost function with the cross-entropy cost function, which penalizes outputs in proportion to their inaccuracy, so the network learns faster when the error is large.
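For reference, the two cost functions can be written as follows for a network output a(x) and target y(x) over n training inputs (standard definitions, as in [1]):

\[
C_{\text{quadratic}} = \frac{1}{2n} \sum_x \lVert y(x) - a(x) \rVert^2,
\qquad
C_{\text{cross-entropy}} = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln (1 - a) \right]
\]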

Deep learning models need to perform well on new and exploratory data sets that can differ significantly from the training sets, since they are applied to extremely complicated domains such as images, audio sequences, and text. Complex models therefore need to be regularized appropriately. Regularization techniques such as cross-validation, a softmax layer, L1/L2 regularization, early stopping, pruning, dropout, artificial expansion of the training data (e.g. rotating digit images), and Bayesian priors prevent over-fitting (over-training), so that the neural network, instead of memorizing the training data, adjusts better to unseen or new data. In Fig. 3, although the training data (dots) are fitted perfectly by the polynomial function (blue line), the linear function (black line) is expected to generalize better to new data sets, i.e. it makes better predictions. The polynomial model is therefore over-fitting.

Fig. 3: Fitting data with polynomial and linear function [2]
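One of the techniques listed above, artificial expansion of the training data, can be illustrated with a minimal MATLAB sketch. It assumes a hypothetical cell array xTrainImages of grayscale digit images and a matching target matrix tTrain (one column per image); imrotate is from the Image Processing Toolbox:

angles = [-10 10];                 % small rotations keep the digits recognizable
xAug = xTrainImages;               % start from the original images ...
tAug = tTrain;                     % ... and their labels
for k = 1:numel(xTrainImages)
    for a = angles
        % add a rotated copy of the image, cropped back to its original size
        xAug{end+1} = imrotate(xTrainImages{k}, a, 'bilinear', 'crop');
        tAug(:, end+1) = tTrain(:, k);   % the rotated digit keeps the same label
    end
end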

In the Neural Network "XOR" example, the performance function of the network is set to 'mse' (Mean Squared Error), which measures performance according to the quadratic cost function. To use the cross-entropy function in a DNN, set the performance function to 'crossentropy' and the regularization parameter λ (net.performParam.regularization) to a value between 0 and 1; it is 0 by default.

net.performFcn = 'crossentropy';

net.performParam.regularization = 0.1;

It is difficult to determine the optimum value for λ, so automated regularization can be implemented with the Bayesian method by using the training function "trainbr" in MATLAB.

net.trainFcn = 'trainbr';

or

net = feedforwardnet(3,'trainbr');

The trainbr algorithm generally works best when the network inputs and targets are scaled so that they fall approximately in the range [−1,1]. If the inputs and targets do not fall in this range, use the function mapminmax or mapstd to perform the scaling.
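A minimal sketch of that scaling step, assuming hypothetical input and target matrices p and t (one column per sample):

[pn, ps] = mapminmax(p);               % scale each input row to [-1, 1]
[tn, ts] = mapminmax(t);               % scale each target row to [-1, 1]
net = feedforwardnet(3, 'trainbr');    % Bayesian-regularized training
net = train(net, pn, tn);              % train on the scaled data
an  = net(pn);                         % outputs in the scaled range
a   = mapminmax('reverse', an, ts);    % map predictions back to the original units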

A softmax layer of neurons is based on a concept similar to cross-entropy: the output of the softmax activation function can be thought of as a probability distribution. In MATLAB, it is implemented as:

net.layers{i}.transferFcn = 'softmax';
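To see why the softmax outputs behave like a probability distribution, here is a quick hand computation using the standard softmax formula (the vector z is just a made-up example of raw output activations):

z = [2.0; 1.0; 0.1];            % raw activations of an output layer
a = exp(z) ./ sum(exp(z));      % softmax: positive values in (0,1)
sum(a)                          % the outputs sum to 1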

A softmax layer can also be trained separately and stacked into a network, as in [3]:

softnet = trainSoftmaxLayer(feat2,tTrain,'MaxEpochs',400);

deepnet = stack(autoenc1,autoenc2,softnet);

By dividing the data into training, validation, and test sets with a data division function, the error on the validation set can be monitored during the training process. In xorNetwork.m, the "dividerand" function is set as follows to randomly divide the data into training, validation, and test sets with the default ratios of 0.7, 0.15, and 0.15 respectively.

 net.divideFcn = 'dividerand';
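The split ratios can also be set explicitly through the division parameters; the values below simply restate the defaults mentioned above:

net.divideParam.trainRatio = 0.70;   % 70% of the samples for training
net.divideParam.valRatio   = 0.15;   % 15% for validation (drives early stopping)
net.divideParam.testRatio  = 0.15;   % 15% held out for testing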

The validation error normally decreases during the initial phase of training, as does the training set error. However, when the network begins to overfit the data, the error on the validation set typically begins to rise. The default regularization method, early stopping, halts training once the validation error has increased for a set number of consecutive iterations (net.trainParam.max_fail). After training, inspect the performance plot; if the test curve increases significantly before the validation curve does, some over-fitting may have occurred, and the number of epochs can be reduced to the point where the error curves start to rise.

plotperf(tr)

net.trainParam.max_fail = 5

By creating a dropout layer that randomly sets about 50% (the default) of its inputs to zero during training, as shown in Fig. 4, over-fitting can be reduced. The MATLAB function to create a dropout layer is:

droplayer = dropoutLayer()

Fig. 4: Dropout layer with dotted structures having inputs set to zero.
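As a sketch of where such a layer would sit, the following hypothetical layer array (Deep Learning Toolbox layer objects, not taken from the scripts in this section) places a dropout layer between two fully connected layers of a small digit classifier:

layers = [
    imageInputLayer([28 28 1])   % 28x28 grayscale digit images
    fullyConnectedLayer(100)
    reluLayer
    dropoutLayer(0.5)            % zero ~50% of the activations during training
    fullyConnectedLayer(10)      % one output per digit class
    softmaxLayer
    classificationLayer];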

Let's create a Deep Neural Network (DNN) in MATLAB by stacking two auto-encoders (unsupervised) and a softmax layer (supervised), as shown in Fig. 5, to classify images of the digits 0 to 9 [3] by training one layer at a time (see the MATLAB implementation under the section "Using Software"). The script "myDNNScript.m" is available at /usr/local/doc/DEEPLEARNING/neural-network/. The auto-encoders are used to replicate the input images at their outputs, i.e. to extract features. The encoder maps an input to a hidden representation, and the decoder attempts to reverse this mapping to reconstruct the original input.

Fig. 5: Deep Neural Network (DNN)
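The layer-wise procedure can be summarized roughly as follows (a simplified sketch following [3]; hiddenSize1, hiddenSize2, xTrainImages, xTrain, and tTrain are assumed stand-ins for the corresponding variables in "myDNNScript.m", and the tuning options of the real script are omitted):

hiddenSize1 = 100;                               % size of the first hidden representation
hiddenSize2 = 50;                                % size of the second hidden representation
autoenc1 = trainAutoencoder(xTrainImages, hiddenSize1, 'MaxEpochs', 400);
feat1    = encode(autoenc1, xTrainImages);       % features extracted by the first encoder
autoenc2 = trainAutoencoder(feat1, hiddenSize2, 'MaxEpochs', 100);
feat2    = encode(autoenc2, feat1);              % features extracted by the second encoder
softnet  = trainSoftmaxLayer(feat2, tTrain, 'MaxEpochs', 400);   % supervised output layer
deepnet  = stack(autoenc1, autoenc2, softnet);   % assemble the deep network (Fig. 5)
% Final supervised fine-tuning on the raw images (xTrain holds the images
% unrolled into columns); this is the step that boosts the accuracy below.
deepnet  = train(deepnet, xTrain, tTrain);
plotconfusion(tTrain, deepnet(xTrain));          % confusion matrix as in Fig. 6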

The final supervised training boosts the overall accuracy from 83.1% to 99.6%, as viewed in the confusion matrix (Fig. 6).

Fig. 6: Confusion Matrix

Neural Network/Deep Learning Software

Find the MATLAB GPU implementation of a Convolutional Neural Network (CNN) in [4] and a Python implementation in [5]. The other neural network packages listed in the HPC Software Guide are TensorFlow, NumPy/SciPy, Torch, Caffe, Neuron, and more. Find other packages and their pros and cons in [6].

Using Software

Copy the directory "neural-network" from usr/local/doc/DEEPLEARNING and cd to it.

cp -r /usr/local/doc/DEEPLEARNING/neural-network .

cd neural-network

Request a compute node

srun --x11 --pty bash

MATLAB Implementation of Image Category Classification [4]:

Load the matlab module

module load matlab

Run matlab script

matlab -r myDNNScript

You will see some of the digit images, the training of the different layers, and confusion matrix plots before and after the supervised fine-tuning.

Python Implementation of a CNN for Visual Recognition [5]:

Load the python module

module load python

Run the script:

python visualRecgnition.py

See the visual display:

display spiral_net.png

Torch Implementation of LRCN

The LRCN (Long-term Recurrent Convolutional Networks) model proposed by Jeff Donahue et al. has been implemented as torch-lrcn [7] using the Torch7 framework. The algorithm for sequential motion recognition consists of a convolutional neural network (CNN) and a long short-term memory (LSTM) network. We are trying to speed up LRCN by enabling GPU acceleration with CUDA on the Kepler K40 GPUs available in CWRU HPC.

Copy the job file "job.slurm" from /usr/local/doc/TORCH to your home directory

cp /usr/local/doc/TORCH/torch-lrcn-master.tar.gz .

Untar the file and change directory to "torch-lrcn-master"

tar xzvf torch-lrcn-master.tar.gz

cd torch-lrcn-master

Copy the job file "job.slurm" from /usr/local/doc/TORCH to your home directory

cp /usr/local/doc/TORCH/job.slurm .

In the Torch script "train.lua", find the line "cmd:option('-cuda', 0)". For the GPU implementation, replace 0 with 1.

Submit the job

sbatch job.slurm

Check the execution time in the log file "TorchJob.o<JobID>"

13:21:22 Epoch 6 validation loss: nan

13:21:23 Saved checkpoint model and opt at checkpoints/checkpoint_6.t7

4  8  12  16  20  24  28  32  36  40  44  ....  500

13:31:46 Epoch 30 training loss: 1.609733

13:31:46 Starting loss testing on the val split

13:31:46 Epoch 30 validation loss: nan

13:31:47 Saved checkpoint model and opt at checkpoints/checkpoint_final.t7

13:31:47 Finished training

Execution time with GPU:

real    16m41.871s

user    10m49.276s

sys     2m12.542s

Execution time without GPU:

real    131m9.290s

user    130m14.878s

sys     0m32.280s

Caffe Implementation - Training LeNet on MNIST

TensorFlow Implementation - Multilayer CNN [9]

FASTAI Implementation - FastAI Tutorial

Appendix

References:

[1] Neural Network & Deep Learning - http://neuralnetworksanddeeplearning.com/chap1.html

[2] Wikipedia - www.wikipedia.com

[3] Digit Classification - http://www.mathworks.com/help/nnet/examples/training-a-deep-neural-network-for-digit-classification.html

[4] Image Category Classification - https://www.mathworks.com/help/vision/examples/image-category-classification-using-deep-learning.html

[5] CNN for Visual Recognition - http://cs231n.github.io/neural-networks-case-study/

[6] Pros and Cons of Deep Learning Packages - https://deeplearning4j.org/compare-dl4j-torch7-pylearn

[7] Torch implementation of LCRN

[8] Online Book - An Introduction to Statistical Learning

[9] Tensorflow - MNIST Tutorial