Deep Learning Reading List

Background: Why I make this list

I became interested in Deep Learning at this year's (2014) ICML conference. Before that, I was a committed Bayesian and indifferent to the deep learning fad. At ICML'14, I was impressed by the audience size of the deep learning sessions. Later, at a poster session, I asked Yoshua Bengio a few naive questions in front of his poster, and he answered them seriously and patiently. I realized that he, as one of the founders of this area, is a very serious scholar, which made me get serious about this field too.

Like many other newbies in this field, I've been reading a pile of old papers while catching up with the new ones published every few days. I decided to write notes on the new insights and understandings I acquire as I move ahead, and to comment on some models and methods along the way. One reason for doing this is personal: I hope that one day, looking back, I can say "wow, I've read so many papers" and easily trace where my understanding came from. Another reason is that learning a new field is painful, and hopefully this list can flatten the learning curve a little for others.

The List

(Disclaimer: this list reflects my personal biases and is by no means comprehensive.)

1. A Unified Energy-Based Framework for Unsupervised Learning, by Marc’Aurelio Ranzato, Y-Lan Boureau, Sumit Chopra and Yann LeCun. AISTATS'2007.

This paper is not dedicated to deep learning models; its subject is the more general problem of energy-based unsupervised learning: the landscape of the energy surface, which determines a model's generalization performance. Since RBMs and autoencoders are typical instances of energy-based models, they are discussed in detail, so this paper sheds light on deep learning as well.

A good model should have a sharp energy surface: it should assign high probability to positive examples ("observed points" in the paper) and low probability to negative examples ("unobserved points" in the paper). But without negative training examples (the typical situation in unsupervised learning), the learning algorithm can hardly shape the model this way. Some probabilistic models, such as GMMs, naturally suffer less from this problem, because their mixture components are sharp distributions (e.g. Gaussians, whose probability drops quickly away from the mean). But typical deep learning models, which are based on linear energy functions and the softmax activation function, are less lucky.
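As a tiny numerical illustration of this "sharpness" (my own sketch, not from the paper): a Gaussian component assigns rapidly vanishing probability away from its mean, so observed points near the mean get far more mass than unobserved points a few standard deviations away.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2); it decays exponentially away from the mean."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Near the mean (an "observed point") the density is high ...
near = gaussian_pdf(0.1)
# ... while three standard deviations away (an "unobserved point") it is tiny.
far = gaussian_pdf(3.0)
print(near / far)  # the observed point gets orders of magnitude more mass
```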

As an aside, this principle reminds me of the well-known word embedding model: skip-gram with negative sampling (a.k.a. word2vec). It uses randomly generated word pairs as negative examples, and this trick may be an important reason for word2vec's excellent performance.
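To make the connection concrete, here is a toy sketch of the negative-sampling objective (my own minimal implementation, not the word2vec code; the vectors are illustrative): the observed (center, context) pair is pushed toward a high score, while randomly sampled words serve as negative examples pushed toward a low score.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neg_sampling_loss(center_vec, context_vec, negative_vecs):
    """Skip-gram negative-sampling loss for one (center, context) pair.
    The observed pair should score high; each sampled negative should score low."""
    loss = -math.log(sigmoid(dot(center_vec, context_vec)))
    for neg in negative_vecs:
        loss += -math.log(sigmoid(-dot(center_vec, neg)))
    return loss

# Toy vectors: the observed pair is aligned, the negative word is not,
# so the loss is small; swapping them makes the loss large.
good = neg_sampling_loss([1.0, 0.0], [1.0, 0.0], [[-1.0, 0.0]])
bad = neg_sampling_loss([1.0, 0.0], [-1.0, 0.0], [[1.0, 0.0]])
print(good, bad)
```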

The loss function determines the shape of the energy surface. This paper analyzes the loss functions used in various models and explains why each one produces the energy shape it does.

Besides, by imposing sparsity constraints, we can make the learned energy landscape more peaked around the observed points: the energy (the probability mass) is concentrated on a low-dimensional manifold instead of being spread out over the whole space.
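A minimal sketch of one common way to impose such a constraint (my own toy example, not the paper's formulation; the L1 weight `lam` is illustrative): adding an L1 penalty on the hidden code to a reconstruction loss makes codes with few active units cheaper, pushing the model toward low-dimensional representations.

```python
def sparse_autoencoder_loss(x, x_hat, hidden, lam=0.1):
    """Squared reconstruction error plus an L1 sparsity penalty on the hidden code."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(h) for h in hidden)
    return recon + lam * sparsity

# With identical reconstructions, the sparser code incurs the lower loss,
# so training favors representations that use only a few active units.
x, x_hat = [1.0, 2.0], [0.9, 2.1]
sparse_code = [0.0, 0.0, 3.0, 0.0]
dense_code = [0.8, 0.8, 0.8, 0.8]
print(sparse_autoencoder_loss(x, x_hat, sparse_code))
print(sparse_autoencoder_loss(x, x_hat, dense_code))
```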

2. Optimization: Stochastic Gradient Descent, a blog post

Quote:

Momentum

If the objective has the form of a long shallow ravine leading to the optimum and steep walls on the sides, standard SGD will tend to oscillate across the narrow ravine since the negative gradient will point down one of the steep sides rather than along the ravine towards the optimum. The objectives of deep architectures have this form near local optima and thus standard SGD can lead to very slow convergence particularly after the initial steep gains. Momentum is one method for pushing the objective more quickly along the shallow ravine. The momentum update is given by,

    v = γv + α∇θ J(θ)
    θ = θ − v
In the above equation v is the current velocity vector which is of the same dimension as the parameter vector θ. The learning rate α is as described above, although when using momentum α may need to be smaller since the magnitude of the gradient will be larger. Finally γ∈(0,1] determines for how many iterations the previous gradients are incorporated into the current update. Generally γ is set to 0.5 until the initial learning stabilizes and then is increased to 0.9 or higher.
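The quoted update rule can be sketched in a few lines (my own minimal implementation following the formulation above; the quadratic "ravine" objective and the hyperparameter values are illustrative):

```python
def sgd_momentum(grad, theta0, alpha=0.01, gamma=0.9, steps=300):
    """Gradient descent with classical momentum:
        v     <- gamma * v + alpha * grad(theta)
        theta <- theta - v
    The velocity v accumulates past gradients, smoothing oscillation
    across the ravine while accelerating along it."""
    theta = list(theta0)
    v = [0.0] * len(theta)
    for _ in range(steps):
        g = grad(theta)
        v = [gamma * vi + alpha * gi for vi, gi in zip(v, g)]
        theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta

# Minimize f(x, y) = x^2 + 10*y^2, a shallow ravine along x with steep walls in y.
grad = lambda th: [2 * th[0], 20 * th[1]]
print(sgd_momentum(grad, [3.0, 1.0]))  # approaches the optimum at (0, 0)
```

Note the schedule mentioned in the quote: starting with gamma = 0.5 and raising it to 0.9 once the initial learning stabilizes corresponds to passing different `gamma` values across training phases.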