
By Hamid Palangi

### Associate Researcher II at Microsoft Research

Two weeks ago I attended the deep learning summer school in Montreal, organized by Yoshua Bengio and Aaron Courville. Below is a summary of what I learned. It starts with basic concepts and continues with more advanced topics.

### 1. Essence of regularization

Two popular regularizers in machine learning / deep learning are L2 (keeps the L2 norm of the weights bounded and results in a non-sparse set of weights, i.e., the weights of irrelevant features are small but NOT zero) and L1 (results in a sparse set of weights, computationally more expensive than L2). They help to adjust the hypothesis complexity, e.g., if the hypothesis has high variance (overfitting), they can help to alleviate the problem. From a Bayesian point of view, L2 regularization is equivalent to a circular (isotropic) Gaussian prior on the weights, and L1 regularization is equivalent to a double exponential (Laplace) prior. Note that the regularization is usually applied only to weights, NOT biases. Other popular regularization techniques that help generalization are dropout [Hinton et al, JMLR 2014] and using unsupervised training to initialize supervised training, e.g., using RBMs to initialize an autoencoder's weights as explained in [Hinton & Salakhutdinov, Science 2006]. In practice, using a large model with regularization (e.g., injecting noise) usually works better than using a small fully parametric model without regularization.
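The Bayesian reading above can be made explicit with a short derivation. This is only a sketch under assumptions that are not in the original notes: the symbols $w$ (weights), $\mathcal{D}$ (training data), $\sigma$ and $b$ (prior scales) and $\lambda$ (penalty strength) are introduced here for illustration, and the prior is assumed to be zero-mean and independent across weights.

```latex
% MAP estimation: maximize log-likelihood plus log-prior
\hat{w}_{\mathrm{MAP}} = \arg\max_{w} \; \log p(\mathcal{D} \mid w) + \log p(w)

% Isotropic Gaussian prior  p(w) \propto \exp\!\left(-\frac{\|w\|_2^2}{2\sigma^2}\right)  gives the L2 penalty
\hat{w}_{\mathrm{MAP}} = \arg\min_{w} \; -\log p(\mathcal{D} \mid w) + \lambda \|w\|_2^2,
\qquad \lambda = \frac{1}{2\sigma^2}

% Laplace (double exponential) prior  p(w) \propto \exp\!\left(-\frac{\|w\|_1}{b}\right)  gives the L1 penalty
\hat{w}_{\mathrm{MAP}} = \arg\min_{w} \; -\log p(\mathcal{D} \mid w) + \lambda \|w\|_1,
\qquad \lambda = \frac{1}{b}
```

A narrower prior (smaller $\sigma$ or $b$) corresponds to a larger $\lambda$, i.e., stronger regularization.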

### 2. Why do we need more than one neuron?

A single neuron can only solve a linearly separable problem, e.g., the AND operation. It cannot solve a problem that is not linearly separable, e.g., the XOR operation. Nevertheless, with a better representation of the input data it can solve XOR. For example, if the inputs are x1 and x2, apply the non-linear transformation y1 = AND(NOT(x1), x2) and y2 = AND(x1, NOT(x2)). Then XOR(x1, x2) = OR(y1, y2), which a single neuron can compute.
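The XOR example can be checked with a few lines of Python. This is a minimal sketch: the threshold neuron, the weights (1, 1) and the bias -0.5 are illustrative choices made here, not values from the lecture.

```python
def neuron(inputs, weights, bias):
    """Single threshold neuron: outputs 1 when the weighted sum plus bias is positive."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def transform(x1, x2):
    """Non-linear features from the text: y1 = AND(NOT x1, x2), y2 = AND(x1, NOT x2)."""
    y1 = int((not x1) and x2)
    y2 = int(x1 and (not x2))
    return y1, y2


def xor_via_single_neuron(x1, x2):
    # On the transformed features, XOR reduces to OR(y1, y2), which is
    # linearly separable: weights (1, 1) and bias -0.5 (assumed values).
    return neuron(transform(x1, x2), weights=(1.0, 1.0), bias=-0.5)


truth_table = {(a, b): xor_via_single_neuron(a, b) for a in (0, 1) for b in (0, 1)}
```

The single neuron never sees x1 and x2 directly; all of the non-linearity lives in the feature transformation, which is the point of the example.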

### 3. What non-linearity to choose for neurons?

The rule of thumb for selecting a non-linearity is to always start with ReLU (Rectified Linear Unit). It reduces the computational cost of backpropagation and usually results in sparse activations. The non-differentiable point at 0 in ReLU is not a problem (sub-gradients address it). Question: Is it a good idea to use different non-linearities in different layers? No success yet, except when we want to impose some structure on the output, e.g., in the attention mechanism.

### 4. Practical tips to train a neural network

• Initialization: To break symmetry we use random initialization, for example see [Glorot & Bengio, 2010].
• Hyper-parameter selection: (a): Grid search, i.e., trying all combinations of a predefined set of values for each hyper-parameter. This is computationally expensive. (b): Random search [Bergstra & Bengio, 2012], i.e., specifying a distribution over the values of each hyper-parameter and then sampling from each of them independently. (c): Bayesian optimization [Snoek et al, NIPS 2012], which needs fewer trials to find good hyper-parameters.
• Early stopping: Since it has zero cost, it is better to always do it.
• Validation set choice: This can become very important. The validation set size should be large enough so that the model does not overfit on the validation set. This type of overfitting also depends on how many validation tests we run on the validation set.
• Normalization: For real valued data, normalization speeds up the training.
• Learning rate: Start with a large learning rate and then decay it, or use methods with adaptive learning rates like Adagrad, RMSprop or Adam.
• Gradient check: Very helpful for debugging the implementation of backprop. We simply compare the backprop gradient with a finite difference approximation of it. Question: Can the finite difference approximation of the gradient replace backprop? No, because it is less numerically stable, and it also needs at least one extra forward pass per parameter.
• Sanity check: Always make sure the model can overfit a small dataset.
• What to do if training is hard? First, make sure the backpropagation implementation is not buggy and the learning rate is not too large. Then, if the model is underfitting, use better optimization methods, larger models, etc. If it is overfitting, use better regularization, e.g., unsupervised initialization, dropout, etc.
• Batch Normalization [Ioffe & Szegedy, JMLR 2015]: A very helpful technique, which shows that normalization at higher layers further improves the performance. It can be done in 4 steps: (a): Normalize each hidden layer before applying the non-linearity. (b): During training, compute the mean and standard deviation for each minibatch. (c): During backpropagation, take the normalization from the forward pass into account, i.e., backpropagate through it. A scale and shift operation is applied after the normalization, and the scale and shift parameters should also be learned because the derivative with respect to the hidden layers depends on them. (d): At test time, use the global mean and standard deviation, NOT the ones calculated for each minibatch.
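The difference between the training-time and test-time behaviour of batch normalization can be sketched for a single hidden unit as follows. This is a simplified illustration, not the reference implementation: the function names, the `running` dictionary, the `momentum` value and the use of an exponential moving average as the "global" statistics are all assumptions made here, and the backward pass is omitted.

```python
import math


def batch_norm_train(batch, gamma, beta, running, momentum=0.9, eps=1e-5):
    """Normalize one hidden unit over a minibatch and update running statistics.

    batch:   list of pre-activation values of one unit, one per example.
    gamma:   learned scale; beta: learned shift.
    running: dict with keys "mean" and "var", updated in place (assumed here
             to approximate the global statistics used at test time).
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    running["mean"] = momentum * running["mean"] + (1 - momentum) * mean
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    # Normalize with the minibatch statistics, then scale and shift.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


def batch_norm_test(x, gamma, beta, running, eps=1e-5):
    """At test time, use the stored global statistics, not minibatch ones."""
    return gamma * (x - running["mean"]) / math.sqrt(running["var"] + eps) + beta


running_stats = {"mean": 0.0, "var": 1.0}
normalized = batch_norm_train([1.0, 2.0, 3.0], gamma=1.0, beta=0.0, running=running_stats)
```

The normalized minibatch has (approximately) zero mean and unit variance before the non-linearity is applied, and a single test example can be normalized without needing a minibatch around it.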

### 5. How important is depth?

Nicely explained by Rob Fergus. We can investigate the importance of depth by inspecting different parts of Krizhevsky's Convolutional Neural Network (CNN), which has 8 layers and is trained on ImageNet. The architecture of Krizhevsky's CNN [Krizhevsky et al, NIPS 2012], along with the results of applying an SVM to the features of different layers, is shown below [picture from Rob Fergus's presentation]:

Another important observation is that if we remove layers 3, 4 (convolutional layers) and 6, 7 (fully connected layers), the performance drops by 33.5%.

It is important to note that simply adding many more layers does not always improve the performance. For example, the results of simply using 20 layers and 56 layers on CIFAR-10 are shown below [picture from He et al, CVPR 2016]:

A similar phenomenon has been observed on ImageNet, which means that learning better models is not always equivalent to adding more layers. Note that the above problem is NOT caused by overfitting, as is obvious from the training error curves above (the deeper network also has a higher training error). One reason might be that with deeper networks the error signal during backpropagation is too weak by the time it arrives at the lower layers. To resolve this problem, the residual network is proposed in [He et al, CVPR 2016], which simply adds skip connections to the CNN architecture. One example is shown below [picture from He et al, CVPR 2016]:

Note that the skip connection is applied before the non-linear activation function.
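The placement of the skip connection can be illustrated with a toy residual block. This is a sketch under simplifying assumptions: the two weight matrices `W1` and `W2` are fully connected stand-ins for the convolutional layers used in [He et al, CVPR 2016], and details such as batch normalization are left out.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """Compute relu(F(x) + x) with F(x) = W2 * relu(W1 * x)."""
    f = matvec(W2, relu(matvec(W1, x)))
    # Skip connection: the input is added BEFORE the final non-linearity.
    return relu([fi + xi for fi, xi in zip(f, x)])


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
passthrough = residual_block([1.0, 2.0], zeros, zeros)
```

With all weights at zero the block simply passes its (non-negative) input through, which illustrates why extra residual layers are easy to make harmless: the layers only need to learn a correction on top of the identity.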

### 6. Which is more important: designing a better feature extractor below, or designing a better classifier on top?

Using a powerful feature extractor (e.g., a CNN or deep residual network for vision tasks) is far more important than designing the classifier on top.

### 7. Evolution of image databases to big data

Below is a summary of image databases from 1970 until now [picture from Antonio Torralba's presentation]:

### 8. Convolutional Generative Adversarial Networks

Assume that we want to find a generative model that can generate data similar to the samples in our dataset. For example, we want to build a generative model that can generate images similar to those in the MNIST or CIFAR datasets. Generally, this is a very difficult task because of the many intractable probabilistic computations involved in maximum likelihood and related methods. One elegant idea for this task is Generative Adversarial Networks (GANs), proposed in [Goodfellow et al, NIPS 2014]. In GANs, two models are trained simultaneously: a generative model (G) and a discriminative model (D). G generates an image, and D is a binary classifier that classifies a given image as either a sample from the dataset (true data) or a sample generated by G (artificially generated data). G is trained to maximize the probability that D makes a mistake (a min-max two-player game). As a result, after training, G estimates the distribution of the data. Some sample images generated by G for MNIST and CIFAR-10 are shown below (picture from [Goodfellow et al, NIPS 2014]):

In [Radford et al, ICLR 2016], a form of convolutional network is proposed that is more stable under adversarial training than other methods. Other related references for GANs are "Adversarial examples in the physical world (http://arxiv.org/abs/1607.02533)", "Improved techniques for training GANs (http://arxiv.org/abs/1606.03498)" and "Virtual adversarial training for semi-supervised text classification (http://arxiv.org/abs/1605.07725)". GANs have even been used to generate new Pokemon GO species! (https://www.youtube.com/watch?v=rs3aI7bACGc).
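The min-max game between G and D described at the start of this section can be written as a single value function, following [Goodfellow et al, NIPS 2014]. The notation is introduced here for illustration: $p_{\mathrm{data}}$ is the data distribution, $p_z$ is the noise distribution fed to G, and $D(x)$ is the probability that D assigns to $x$ being a true sample.

```latex
\min_{G} \max_{D} \; V(D, G) =
\mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \big[ \log D(x) \big]
+ \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```

D tries to increase both terms by labelling true and generated samples correctly, while G tries to decrease the second term by producing samples that D labels as true.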

### 9. Which deep learning toolkit to use?

There is no silver bullet! It depends on the target task and application. Below is a comparison from Alex Wiltschko's presentation:

There is also a great comparison among Caffe, CNTK, TensorFlow, Theano and Torch, with much more detail, in this post by Kenneth Tran.

### 10. What are the new advances in recurrent neural networks research?

Recurrent neural networks (mainly LSTMs and GRUs) have recently been very successful. They are mainly used for converting a sequence to a vector (e.g., Sentence Embedding [Palangi et al, 2015]), a sequence to a sequence (e.g., Machine Translation [Sutskever et al, 2014][Bahdanau et al, 2014]) and a vector to a sequence (e.g., Image Captioning [Vinyals et al, 2014]). Vanilla RNNs have not been as successful at capturing long term dependencies due to the vanishing/exploding gradient problems. Nevertheless, in the limit of infinite training time (which is not practical), a vanilla RNN will eventually learn long term dependencies. Below is a list of recent works related to RNNs which got my attention during Yoshua Bengio's presentation about RNNs:

(a): Assume that we want to train a neural language model using an LSTM. The basic task is to predict the next word given the previous words, for which we minimize the perplexity as the cost function. During training, we give all "true" previous words to the model and use them to predict the next word. But during inference, we give all "predicted" previous words to the model and use them to predict the next word. To resolve this mismatch between training and inference, a method is proposed in [Bengio et al, 2015] where, during training, weak supervision from words previously generated by the model is also used. This results in a significant performance improvement.
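The idea of mixing true and model-generated previous words during training can be sketched as follows. This is an illustration under assumptions, not the authors' code: `predict_next` is a hypothetical callable standing in for the language model, and `p_true` is the probability of feeding the true word; in [Bengio et al, 2015] this probability is decreased over the course of training.

```python
import random


def build_decoder_inputs(true_words, predict_next, p_true, rng):
    """Choose the input word fed to the model at each step of one training sequence.

    true_words:   ground-truth sequence; true_words[0] is the start token.
    predict_next: hypothetical callable mapping the inputs so far to the
                  model's predicted next word.
    p_true:       probability of feeding the true word instead of the
                  model's own prediction.
    rng:          a random.Random instance (passed in for reproducibility).
    """
    inputs = [true_words[0]]
    for t in range(1, len(true_words)):
        predicted = predict_next(inputs)
        use_truth = rng.random() < p_true
        inputs.append(true_words[t] if use_truth else predicted)
    return inputs


def dummy_model(previous_inputs):
    """Stand-in model that always predicts the same word."""
    return "x"


rng = random.Random(0)
teacher_forced = build_decoder_inputs(["<s>", "a", "b"], dummy_model, p_true=1.0, rng=rng)
free_running = build_decoder_inputs(["<s>", "a", "b"], dummy_model, p_true=0.0, rng=rng)
```

With `p_true=1.0` this reduces to standard training on true previous words, and with `p_true=0.0` the model only sees its own predictions, as it does at inference time.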

(b): Multiplicative integration with RNNs, proposed in [Wu et al, 2016]. The main idea is to replace the summation with a Hadamard product in RNNs. This simple modification results in the significant performance improvement reported in the above reference.
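In symbols, the change can be sketched as below. The notation is an assumption introduced here: $x_t$ is the input at step $t$, $h_{t-1}$ the previous hidden state, $W$ and $U$ weight matrices, $b$ a bias, $\phi$ the non-linearity and $\odot$ the Hadamard (element-wise) product. Only the simplest form is shown; as far as I recall, the paper also considers a more general variant with extra bias terms.

```latex
% Standard (additive) recurrent update
h_t = \phi\big( W x_t + U h_{t-1} + b \big)

% Multiplicative integration: the sum of the two terms becomes a Hadamard product
h_t = \phi\big( W x_t \odot U h_{t-1} + b \big)
```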

(c): How to understand and measure the architectural complexity of a given RNN model? In [Zhang et al, 2016], three measures are proposed which are: (c.1): recurrent depth (length of longest path divided by sequence length), (c.2): feedforward depth (length of longest path from input to nearest output) and (c.3): skip coefficient (length of shortest path divided by sequence length).

(d): Pixel RNNs (ICML 2016 best paper award) [Oord et al, 2016]: This work proposes a method to model the probability distribution of a natural image. The main idea is to factorize the probability distribution of the input image into a product of conditional probabilities. To do this, a Diagonal BiLSTM unit is proposed that efficiently captures the entire available context (all the pixels above and to the left of the current pixel) of the image (see Fig. 2 of the paper). Residual skip connections are also used in the architecture. It has resulted in state-of-the-art performance in terms of log-likelihood. Below are a number of natural images generated by the model trained on ImageNet [picture from Oord et al, 2016]:
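Separately from the generated samples, the factorization into conditional probabilities mentioned in (d) can be written out. The notation is introduced here for illustration: the image $x$ has $n \times n$ pixels $x_1, \dots, x_{n^2}$, ordered row by row from the top-left corner.

```latex
p(x) = \prod_{i=1}^{n^2} p\big( x_i \mid x_1, \dots, x_{i-1} \big)
```

Each factor is the distribution of one pixel given all previously generated pixels, which is why an image is sampled one pixel at a time.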

### 11. Can all problems be mapped to y=f(x)?

No! Example tasks for which the simple y=f(x) fails are: (a): Cloze style QA, where the task is to read and comprehend a text (e.g., a book) and then answer questions about it. (b): Given a text, the task is to fill in the blanks. (c): Chatbots.

As explained nicely in Sumit Chopra's presentation, the model needs to: (a): Remember the external context. (b): Given an input, know where to look in the context. (c): Know what to look for in the context. (d): Know how to reason using this external context. (e): Handle a changing external context.

Therefore, introducing a notion of memory to capture the external context is important. One proposal is to use the hidden states of RNNs as memory. For example, running an RNN on the context (book, text, etc.) to get its representation, and then using this representation to map a question to an answer. There are two problems with this approach: (a): It does not scale. (b): The idea that the hidden states of an RNN are both the memory and the controller of the memory is not appropriate. We should separate these two.

The main idea of a memory network [Weston et al, 2015] is to separate the controller of the memory from the memory itself. In other words, it combines a large memory with a learning component that can read and write to the memory.

Memory networks perform better than LSTMs on QA tasks, but the performance of the two is close on language modelling tasks. One reason might be that language modelling does not need dependencies as long term as those in QA and dialogue related tasks. One shortcoming of current memory networks is that there is no memory compression. If the memory is full, they simply recycle it.

### 12. Large scale deep learning with TensorFlow presented by Jeff Dean

Generally, the important features that are desirable in a machine learning system are (from Jeff Dean's presentation): (a): Ease of expression: for many machine learning algorithms. (b): Scalability: to be able to run experiments quickly. (c): Portability: so that we can run experiments on various platforms. (d): Reproducibility: which helps to share and reproduce research. (e): Production readiness: from research to real products.

TensorFlow (TF) has been designed with careful consideration of the above features. Other notes about TF are: (a): The core of TF is written in C++, which results in very low overhead. (b): The TF system automatically decides which operations should run on CPU or GPU. This usually reduces the running time of experiments significantly. (c): The first version of the scalable deep learning system at Google, i.e., DistBelief [Dean et al, NIPS 2012], is not as flexible as TF for research purposes. DistBelief has separate parameter servers, i.e., separate code for the parameter servers vs. the rest of the system, which results in a non-uniform and more complicated system. (d): The TF session interface provides "extend", which can be used to add nodes to the computation graph, and "run", which in addition to running the full computation graph can also be used to run an arbitrary subgraph of it. (e): Question: How does TF make distributed training easy? It uses model parallelism (partitioning the model across machines) and data parallelism. It is easy to express both types of parallelism in TF with minimal changes to single-device model code. (f): TF can take care of device / graph placement. In other words, given a computation graph and a set of devices, TF also allows the user to decide which device executes each node.

### 13. History of Statistical Language Modelling?

Statistical language modelling is all about how probable a sentence is. We generally maximize the log probabilities of the sentences in the corpora. This, however, was not obvious to everyone in the 90s (review of the Brown et al, 1990 paper) [from Kyunghyun Cho's presentation]:

which reads: "The validity of statistical (information theoretic) approach to MT has indeed been recognized ... as early as 1949. And was universally recognized as mistaken [sic] by 1950 ... The crude force of computers is not science."

### 14. What are the issues with non-parametric language modelling (e.g., n-grams)?

In n-gram language modelling, we basically collect n-gram statistics from a large corpus (i.e., counting). Some issues with this approach are: (a): False conditional independence assumption: in an n-gram language model we assume that each word is only conditioned on the previous n-1 words. (b): Data sparsity: if a co-occurrence of some words has never been observed in the training set, it is assigned zero probability, which makes the probability of the whole sentence zero. Conventional solutions for this problem are smoothing and backoff. (c): Lack of generalization across domains.
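The independence assumption in (a) and the counting in the first sentence can be stated compactly. The notation is introduced here for illustration: $w_1, \dots, w_T$ are the words of a sentence and $c(\cdot)$ is the number of times a word sequence occurs in the training corpus.

```latex
% Exact factorization of the sentence probability (chain rule)
p(w_1, \dots, w_T) = \prod_{t=1}^{T} p\big( w_t \mid w_1, \dots, w_{t-1} \big)

% n-gram assumption: condition only on the previous n-1 words
p\big( w_t \mid w_1, \dots, w_{t-1} \big) \approx p\big( w_t \mid w_{t-n+1}, \dots, w_{t-1} \big)

% Estimate by counting (unsmoothed)
p\big( w_t \mid w_{t-n+1}, \dots, w_{t-1} \big)
  = \frac{c(w_{t-n+1}, \dots, w_{t-1}, w_t)}{c(w_{t-n+1}, \dots, w_{t-1})}
```

If the numerator count is zero for any position, the whole product is zero, which is the data sparsity problem in (b).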

As an example, an n-gram language model might fail on the sentence "The dogs chasing the cat bark". The tri-gram probability P(bark | the, cat) is very low (the model has not observed it in a natural language corpus, because a cat never barks and the plural verb "bark" appears after the singular noun "cat"), but the whole sentence makes perfect sense.
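The zero-probability problem and the effect of smoothing can be reproduced on a toy corpus. This is a minimal sketch with assumptions stated up front: it uses bigrams rather than the tri-grams of the example to stay short, the two-sentence corpus is made up, and add-k smoothing is just one simple choice among the smoothing methods mentioned above.

```python
from collections import Counter


def train_bigram(sentences):
    """Collect bigram statistics by counting."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        vocab.update(words[1:])                 # words that can be predicted
        contexts.update(words[:-1])             # words that have a successor
        bigrams.update(zip(words, words[1:]))   # (previous, current) pairs
    return contexts, bigrams, vocab


def sentence_prob(sentence, model, add_k=0.0):
    """Probability of a sentence under the bigram model, with optional add-k smoothing."""
    contexts, bigrams, vocab = model
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        denominator = contexts[prev] + add_k * len(vocab)
        if denominator == 0:
            return 0.0  # unseen context and no smoothing
        prob *= (bigrams[(prev, cur)] + add_k) / denominator
    return prob


model = train_bigram(["the dogs bark", "the cat sleeps"])
unsmoothed = sentence_prob("the cat bark", model)
smoothed = sentence_prob("the cat bark", model, add_k=1.0)
```

Without smoothing, the unseen pair ("cat", "bark") makes the whole sentence probability zero; with add-one smoothing it becomes small but non-zero. Neither variant captures that "bark" agrees with "dogs", which lies outside the n-gram window.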

### 15. Parametric and Neural Language Modelling

The basic idea of a neural language model is to create continuous space word representations and use them for language modelling. For example, in [Bengio et al 2003], a feedforward neural network with a softmax layer on top is used for language modelling, as represented below (picture from Kyunghyun Cho's presentation):

Better choices for neural language modelling are RNNs (LSTMs, GRUs, ...) or Memory Networks, which have resulted in state-of-the-art performance in terms of perplexity. For example, see the paper "Exploring the Limits of Language Modelling" by Jozefowicz et al, 2016. A simple example of an unfolded vanilla RNN language model is represented below, where the model reads the input word, updates the hidden state representation and predicts the next word (picture from Kyunghyun Cho's presentation):
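As a complement to the unfolded diagram, one step of a vanilla RNN language model can be written as two equations. This is a sketch with notation introduced here: $x_t$ is the continuous representation (embedding) of the input word at step $t$, $h_t$ is the hidden state, $W$, $U$, $V$ are weight matrices and $b$, $c$ are biases. The $\tanh$ non-linearity is a common choice assumed for illustration.

```latex
% Read the input word and update the hidden state
h_t = \tanh\big( W x_t + U h_{t-1} + b \big)

% Predict a distribution over the next word
p\big( w_{t+1} \mid w_1, \dots, w_t \big) = \mathrm{softmax}\big( V h_t + c \big)
```

Unlike an n-gram model, the hidden state $h_t$ can in principle carry information from all previous words, not just the last n-1.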

### 16. Character-Level Neural Machine Translation

The task in machine translation is to generate a sentence in the target language, given a sentence in the source language. In Neural Machine Translation (NMT), an RNN (LSTM, GRU, etc.) is used to encode the source sentence into a vector, and another RNN is used to decode the vector from the encoder into a sequence of words in the target language (sequence to sequence learning). This is shown in the following diagram (picture from Kyunghyun Cho's presentation):

The above model can be improved if we use an attention based decoder [Bahdanau et al, ICLR 2015]. The idea is to compute a set of attention weights and use a weighted sum of the encoder's annotation vectors in the decoder. This approach allows the decoder to automatically focus on just the parts of the source sentence that are relevant for predicting each target word. It is shown in the following diagram (picture from Kyunghyun Cho's presentation):
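As a complement to the diagram, the two steps of the attention mechanism described above (compute weights, then take a weighted sum of annotation vectors) can be sketched in plain Python. One simplification is made here and should be read as an assumption: relevance is scored with a dot product between the decoder state and each annotation vector, whereas [Bahdanau et al, ICLR 2015] learn the scoring function with a small neural network. The function and variable names are illustrative.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, annotations):
    """Return (attention weights, context vector) for one decoding step.

    decoder_state: list of floats, the current decoder hidden state.
    annotations:   list of encoder annotation vectors, one per source word.
    """
    # Dot-product scoring (a simplification of the learned scoring network).
    scores = [sum(d * a for d, a in zip(decoder_state, ann)) for ann in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    context = [sum(w * ann[i] for w, ann in zip(weights, annotations)) for i in range(dim)]
    return weights, context


source_annotations = [[1.0, 0.0], [0.0, 1.0]]
uniform_weights, uniform_context = attend([0.0, 0.0], source_annotations)
focused_weights, focused_context = attend([10.0, 0.0], source_annotations)
```

When the decoder state carries no preference, the weights are uniform and the context is the average of the annotations; when it matches the first annotation strongly, almost all of the weight moves to the first source position.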

The main issue with the above models is that they use words as the basic units of language. For example, "run", "runs", "ran" and "running" all come from one lexeme, "run", but the above models assign them four independent vectors. It is also not always easy to segment a sentence into words. The question is, can we use character level NMT to address the above issues? In [Chung et al, 2016], it is shown that character level NMT works surprisingly well. It is also interesting to note that an RNN implicitly segments a character sequence automatically. For example, see the demonstration below (from Kyunghyun Cho's presentation):

### 17. Why Generative Models?

As nicely explained in Shakir Mohamed's presentation, we need generative models for moving beyond associating inputs to outputs, semi-supervised classification, data manipulation, filling in the blank, inpainting, denoising, one-shot generalization [Rezende et al, ICML 2016] and many more applications. Progress in generative models is presented in the following diagram (note that the vertical axis should be negative log-likelihood) [from Shakir Mohamed's presentation]:

### 18. What are different types of generative models?

Generative models can be classified into three groups:

(a): Fully Observed Models: The model observes the data directly without introducing any new unobserved local variables. These types of models can directly encode the relationship among observed points. For directed graphical models, it is easy to scale up to large models, and parameter learning is simple because the log-likelihood can be computed directly (no need for approximation). For undirected models, parameter learning is difficult as we need to compute normalization constants. Generation in fully observed models can be slow. The diagram below shows different fully observed generative models [from Shakir Mohamed's presentation]:

(b): Transformation Models: The model transforms an unobserved noise source using a parameterised function. It is easy to (1): sample from these models and (2): compute expectations without knowing the final distribution. They can be used with large scale classifiers and convolutional neural networks. Nevertheless, it is difficult to maintain invertibility and to extend these models to generic data types. The diagram below shows different transformation generative models [from Shakir Mohamed's presentation]:

(c): Latent Variable Models: In these models, an unobserved local random variable is introduced that represents hidden causes. It is easy to sample from these models and to include hierarchy and depth. It is also possible to do scoring and model selection using the marginalized likelihood. Nevertheless, it is difficult to determine the latent variables corresponding to an input. The diagram below shows different latent variable generative models [from Shakir Mohamed's presentation]:
