
How does deep learning work and how is it different from normal neural networks?

A good place to start exploring is the work-in-progress survey paper by Schmidhuber ([1404.7828] Deep Learning in Neural Networks: An Overview).
And try to read these two papers:

http://www.cs.toronto.edu/~hinto...
http://machinelearning.wustl.edu...

-------------------------------------------------------------------------------------------------

A deep neural network is a feedforward network with many hidden layers

This is more or less all there is to say about the definition. Neural networks can be recurrent or feedforward; feedforward ones do not have any loops in their graph and can be organized in layers. If there are many layers, then we say that the network is deep.

How many layers does a network have to have in order to qualify as deep? There is no definite answer to this (it's a bit like asking how many grains make a heap), but usually having two or more hidden layers counts as deep. I suspect that there will be some inflation going on here, and in 10 years people might think that anything with fewer than 10 layers is shallow and suitable only for kindergarten exercises. Informally, "deep" suggests that the network is tough to handle.

Here is an illustration:

[Figure: Deep vs non-deep neural network]

Why would having many layers be beneficial?

You wrote that

10 years ago in class I learned that having several layers or one layer (not counting the input and output layers) was equivalent in terms of the functions a neural network is able to represent [...]

This is not correct: it is not equivalent. What you are perhaps remembering is that for a network with linear units the number of layers does not matter: whatever the number of layers, such a network can only represent linear functions. For example, one of the first successful neural networks in history, the perceptron developed in 1957 by Frank Rosenblatt, was a linear network with a single layer of weights. There is no sense in adding more layers to a perceptron; this will not improve its performance.

However, if the units are nonlinear (as they always are in modern applications), this is not the case anymore. If, e.g., all your hidden units were to perform an x -> x^2 transformation, then having one layer only allows the network to represent quadratic functions, but having two or three would allow it to represent polynomials of 4th or 8th order. Nonlinear units that are often used nowadays are rectified linear units; their transfer function is f(x) = x for x >= 0 and f(x) = 0 for x < 0, i.e. f(x) = max(0, x). Having more layers means nesting these functions inside each other, and this certainly allows the network to represent more and more complex functions.
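To make the nesting concrete, here is a minimal numpy sketch (the layer widths and random weights are arbitrary, chosen just for illustration): a two-hidden-layer ReLU network is nothing more than the one-hidden-layer computation fed through another layer of the same form.

    import numpy as np

    rng = np.random.default_rng(0)

    def relu(z):
        return np.maximum(z, 0.0)          # f(z) = z for z >= 0, 0 otherwise

    def layer(x, w, b):
        return relu(x @ w + b)             # one fully connected layer + nonlinearity

    # scalar input, hidden layers of width 8 (arbitrary illustrative sizes)
    x = np.linspace(-3, 3, 7).reshape(-1, 1)
    w1, b1 = rng.normal(size=(1, 8)), rng.normal(size=8)
    w2, b2 = rng.normal(size=(8, 8)), rng.normal(size=8)
    w_out = rng.normal(size=(8, 1))

    shallow = layer(x, w1, b1) @ w_out                 # one hidden layer
    deep = layer(layer(x, w1, b1), w2, b2) @ w_out     # the same layer nested inside another

    print(shallow.ravel())
    print(deep.ravel())

The deep version is literally relu(relu(x W1 + b1) W2 + b2) W_out; every extra layer adds one more level of nesting.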

Update: Perhaps you were referring to the universal approximation theorem, which states that any continuous function can be approximated arbitrarily well by a neural network with a single hidden layer. I am not very familiar with it, but as far as I understand, this can require exponentially many neurons in the hidden layer and hence is completely impractical. See an interesting answer on cstheory.SE and also this blog post.

If more layers are beneficial, then why not have a lot of them?

The problem is that deep neural networks are (or at least used to be) very hard to train. The standard algorithm for training neural networks is simply gradient descent (that's what backpropagation is all about). When there are many layers, this algorithm runs into a problem known as vanishing gradients (you can read up on it elsewhere). Informally, backpropagation computes the derivatives for all network weights using the chain rule, and for deep layers the chain becomes so long that the derivatives are very hard to estimate reliably. So the algorithm breaks down.
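To get a feel for why the chain becomes a problem, here is a toy calculation along a single path through a deep network of sigmoid units (the random weights and the depth of 30 are arbitrary assumptions, just to show the order of magnitude): the sigmoid derivative never exceeds 0.25, so the chain-rule product shrinks roughly exponentially with depth.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(1)
    depth = 30
    grad = 1.0                                   # gradient arriving from the loss

    for _ in range(depth):
        z = rng.normal()                         # pre-activation of a unit in this layer
        local = sigmoid(z) * (1 - sigmoid(z))    # sigmoid derivative, at most 0.25
        w = rng.normal()                         # weight on the path to the layer below
        grad *= local * w                        # chain rule: multiply local derivatives

    print(abs(grad))                             # typically astronomically small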

This was realized in the 1980s, and this is why almost nobody was working on neural networks in the 1990s. By the beginning of the 2000s, everybody in machine learning thought that neural networks were essentially dead.

So what changed in the mid 2000s?

One of the people who kept working on neural networks through all these dark years was Geoffrey Hinton. After 20+ years of doing this without any interest from anybody, he finally published a couple of breakthrough papers in 2006 suggesting an effective way to train deep neural networks (a Science paper and a Neural Computation paper). The trick was to use unsupervised pre-training before the final training with standard methods. Hinton called these networks deep belief networks. These papers revolutionized the field, and deep neural networks became hot and sexy almost overnight (well, maybe over a year or two).

For a couple of years people thought that this unsupervised pre-training was the key.

Then it turned out that it was not, really.

In 2010, Martens showed that deep neural networks can be trained with so-called Hessian-free methods (which are essentially clever second-order methods, not first-order methods like gradient descent) and can outperform networks trained with pre-training: Deep learning via Hessian-free optimization.

In 2013 Sutskever et al. showed that deep neural networks can be trained with stochastic gradient descent with some very clever modifications and tricks and can outperform networks trained with Hessian-free methods: On the importance of initialization and momentum in deep learning.
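For reference, the classical momentum update at the heart of that line of work fits in a few lines. This is only a sketch on a toy quadratic loss, with an arbitrary learning rate and momentum coefficient rather than the schedules from the paper.

    import numpy as np

    def momentum_step(w, v, grad, lr=0.01, mu=0.9):
        v = mu * v - lr * grad    # velocity accumulates the direction of past gradients
        return w + v, v

    # toy loss 0.5 * ||w||^2, whose gradient is simply w
    w = np.array([5.0, -3.0])
    v = np.zeros_like(w)
    for _ in range(300):
        w, v = momentum_step(w, v, grad=w)
    print(w)                      # close to the minimum at the origin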

As you see, this is very recent research. People keep coming up with more and more effective ways to train deep networks. What seemed like a key insight 10 years ago is not necessarily a key insight today. All of that is largely driven by trial and error and there is little understanding of what makes some things work so well and some other things not. Training deep networks is like a big bag of tricks. Successful tricks are usually rationalized post factum.

I can't find an exact quote now, but in some video lectures that I watched a couple of years ago, Hinton said that the two key things that have changed since the mid-1980s and that allowed the current success of neural networks are:

  1. massive increase in computing power and
  2. massive increase in the amount of available training data.

Consider, e.g., that Hinton himself is now working at Google; just imagine the size of the available datasets (think of all the pictures Google can find on the web) and the available computing power that his team can use.

Further reading

If you want a nice and very recent soft overview, read LeCun, Bengio & Hinton, "Deep Learning", Nature 2015.

---------------------------------------------------------------------------------------------------------------------

As far as I know, what is called a Deep Neural Network (DNN) today is not fundamentally or philosophically different from the old standard Neural Network (NN). Although, in theory, one can approximate an arbitrary NN using a shallow NN with only one hidden layer, this does not mean that the two networks will perform similarly when trained using the same algorithm and training data. In fact, there is growing interest in training shallow networks that perform similarly to deep networks. The way this is done, however, is by training a deep network first, and then training the shallow network to imitate the final output (i.e. the output of the penultimate layer) of the deep network. What makes deep architectures favorable is that today's training techniques (backpropagation) happen to work better when the neurons are laid out in a hierarchical structure.
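As a rough illustration of that mimic-training idea, here is a hypothetical PyTorch sketch: a shallow network with one wide hidden layer is trained to regress onto the outputs of a deep network. The architectures, the data, and the fact that the "deep" network here is randomly initialized rather than actually trained are all assumptions made just to keep the sketch runnable.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # stand-in for a deep network that has already been trained
    # (randomly initialized here only so the sketch runs on its own)
    deep_net = nn.Sequential(
        nn.Linear(100, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, 10),
    )
    x = torch.randn(512, 100)                    # stand-in for unlabeled inputs

    # a shallow mimic with a single (wide) hidden layer
    shallow_net = nn.Sequential(nn.Linear(100, 2048), nn.ReLU(), nn.Linear(2048, 10))

    optimizer = torch.optim.Adam(shallow_net.parameters(), lr=1e-3)
    for step in range(200):
        with torch.no_grad():
            target = deep_net(x)                 # the deep network's outputs
        loss = nn.functional.mse_loss(shallow_net(x), target)   # imitate them
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()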

Another question that may be asked is: why did neural networks (DNNs in particular) suddenly become so popular? To my understanding, the magic ingredients that made DNNs so popular recently are the following:

A. Improved datasets and data processing capabilities

1. Large scale datasets with millions of diverse images became available

2. Fast GPU implementation was made available to public

B. Improved training algorithms and network architectures

1. Rectified Linear Units (ReLU) instead of sigmoid or tanh

2. Deep network architectures evolved over the years


A-1) Until very recently, at least in Computer Vision, we couldn't train models on millions of labeled images, simply because labeled datasets of that size did not exist. It turns out that, besides the number of images, the granularity of the label set is also a very crucial factor in the success of DNNs (see Figure 8 in this paper by Azizpour et al.).

A-2) A lot of engineering effort has gone into making it possible to train DNNs that work well in practice, most notably the advent of GPU implementations. One of the first successful GPU implementations of DNNs ran on two parallel GPUs; even so, it took about a week to train a DNN on 1.2 million images of 1,000 categories using high-end GPUs (see this paper by Krizhevsky et al.).

B-1) The use of simple Rectified Linear Units (ReLU) instead of sigmoid and tanh functions is probably the biggest building block in making the training of DNNs possible. Note that both the sigmoid and tanh functions have an almost-zero gradient almost everywhere, depending on how fast they transition from the low activation level to the high one; in the extreme case, when the transition is sudden, we get a step function whose slope is zero everywhere except at the single point where the transition happens.
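A quick numeric comparison makes the point (a sketch; the probe values of the pre-activation z are arbitrary): the sigmoid and tanh derivatives are essentially zero once a unit saturates, while the ReLU derivative stays at exactly 1 for every active unit.

    import numpy as np

    z = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])   # a few pre-activation values

    sigmoid = 1.0 / (1.0 + np.exp(-z))
    d_sigmoid = sigmoid * (1.0 - sigmoid)        # never exceeds 0.25, ~0 when saturated
    d_tanh = 1.0 - np.tanh(z) ** 2               # also ~0 once the unit saturates
    d_relu = (z > 0).astype(float)               # exactly 1 for every positive input

    print(d_sigmoid)   # approx [0.0025, 0.105, 0.25, 0.105, 0.0025]
    print(d_tanh)      # approx [0.00002, 0.071, 1.0, 0.071, 0.00002]
    print(d_relu)      # [0, 0, 0, 1, 1]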

B-2) The story of how neural network architectures developed over the years reminds me of how evolution changes an organism's structure in nature. Parameter sharing (e.g. in convolutional layers), dropout regularization, initialization, learning rate schedules, spatial pooling, sub-sampling in the deeper layers, and many other tricks that are now considered standard in training DNNs were developed, evolved, and tailored over the years to make the training of deep networks possible the way it is today.

----------------------------------------------------------------------------------------------

"Normal" neural networks usually have one to two hidden layers and are used for SUPERVISED prediction or classification.

SVMs are typically used for binary classification, but occasionally for other SUPERVISED learning tasks.

Deep learning neural network architectures differ from "normal" neural networks because they have more hidden layers. Deep learning networks differ from "normal" neural networks and SVMs because they can be trained in an UNSUPERVISED or SUPERVISED manner for both UNSUPERVISED and SUPERVISED learning tasks.

Moreover, people often talk about training a deep network in an unsupervised manner, before training the network in a supervised manner.

------------------------------------------------------------------------------------------------

How do you train an unsupervised neural network?

Usually, with a supervised neural network, you try to predict a target vector y from a matrix of inputs x. But when you train an unsupervised neural network, you try to predict the matrix x using the very same matrix x as the input. In doing this, the network can learn something intrinsic about the data without the help of a target or label vector that is often created by humans. The learned information is stored as the weights of the network.

Another consequence of unsupervised training is that the network will have the same number of input units as target units, because there are the same number of columns in the input x matrix as in the target x matrix. This leads to the hourglass shape that is common when training unsupervised, deep neural networks.

In the diagram below, there are the same number of input units as target units, and each of these units represents a pixel in a small picture of a digit.

You might think it sounds easy to predict x from x. Sometimes it is too easy, and the network becomes overtrained on the x matrix, so people typically add some noise (random numbers) to x to prevent overtraining.

One of the fancy names for this kind of architecture is: "stacked denoising autoencoder". You might also hear "restricted Boltzmann machine".
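Here is a minimal sketch of such a denoising autoencoder in PyTorch. The layer sizes, the noise level, and the use of random numbers in place of actual digit images are all illustrative assumptions; the point is only the shape of the computation: corrupt x, push it through the hourglass, and score the reconstruction against the clean x.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    x = torch.rand(256, 784)     # stand-in for a batch of flattened digit images in [0, 1]

    encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 2))
    decoder = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 784), nn.Sigmoid())
    autoencoder = nn.Sequential(encoder, decoder)    # the hourglass: 784 -> 2 -> 784

    optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for step in range(200):
        noisy_x = x + 0.2 * torch.randn_like(x)      # add noise to the input ("denoising")
        reconstruction = autoencoder(noisy_x)        # predict x ...
        loss = loss_fn(reconstruction, x)            # ... from the corrupted copy of x
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()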

---

Why so many layers?

Deep learning works because of the architecture of the network AND the optimization routine applied to that architecture.

The network is a directed graph, meaning that each hidden unit is connected to many other hidden units below it. So each hidden layer going further into the network is a NON-LINEAR combination of the layers below it, because of all the combining and recombining of the outputs from the previous units, passed through their activation functions.

When the OPTIMIZATION routine is applied to the network, each hidden layer then becomes an OPTIMALLY WEIGHTED, NON-LINEAR combination of the layer below it.

When each sequential hidden layer has fewer units than the one below it, each hidden layer also becomes a LOWER DIMENSIONAL PROJECTION of the layer below it. So the information from the layer below is nicely summarized by a NON-LINEAR, OPTIMALLY WEIGHTED, LOWER DIMENSIONAL PROJECTION in each subsequent layer of the deep network.

In the picture above, the outputs from the small middle hidden layer are a two-dimensional, optimal, non-linear projection of the input columns (i.e. pixels) in the input matrix (i.e. the set of pictures). Figs. 3a and 3b in the Hinton paper above actually plot similar outputs. Notice that the network has basically clustered the digits 0 through 9 without a label vector. So, the unsupervised training process has resulted in unsupervised learning.
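Continuing the hypothetical autoencoder sketch above, the two outputs of the bottleneck layer can be read off directly; plotting them gives exactly this kind of two-dimensional map of the inputs.

    # reuses `encoder` and `x` from the autoencoder sketch above
    with torch.no_grad():
        codes = encoder(x)       # shape (256, 2): one 2-D point per input image
    print(codes.shape)
    # scattering these points (coloured by digit label, when labels are known)
    # gives the kind of 2-D map shown in Figs. 3a and 3b of the Hinton paper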

---

How do you make predictions?

That's the easy part. One approach is to break the hourglass network in half and swap out x as the target matrix in favor of y, where y is some more typical target or label vector.

In the picture above, you could throw away all the layers above the middle layer, and put a single target unit for y right above the middle hidden layer.

What you are really using from the bottom half of the hourglass network is the weights from the unsupervised training phase. Remember, the weights represent what was learned during unsupervised training. They will now become the initial starting points for the supervised training optimization routine using the target vector y. (In the case of the digit pictures, y contains the label 0-9 of the digit.) So, the supervised training phase basically just refines the weights from the unsupervised training phase to best predict y. Since we have changed the architecture of the network to a more "normal" supervised network, the actual mechanism of prediction is similar to a "normal" neural network.
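In code, this amounts to keeping only the pretrained encoder (the bottom half of the hourglass), bolting a small supervised head on top, and continuing to train with the labels. Again a hypothetical sketch that reuses the autoencoder above; the 10-way output and the random stand-in labels correspond to the digit classes 0-9.

    # reuses `encoder` and `x` from the autoencoder sketch above;
    # the pretrained encoder weights are the starting point for supervised training
    classifier = nn.Sequential(encoder, nn.Linear(2, 10))

    labels = torch.randint(0, 10, (256,))        # stand-in for the true digit labels
    optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(200):
        logits = classifier(x)
        loss = loss_fn(logits, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                         # supervised training refines the weights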

-------------------------------------------------------------------------------------------------------

I think a nice way to sum it up appeared in LeCun's slideshow (Page on nyu.edu): in regular applications of neural networks and SVMs, you hand-craft elaborate feature extractors that create suitable higher-level features from the raw data, and actually you spend most of your time tuning those. In the deep learning paradigm, high-level feature extraction is part of the automated solution. This can be framed in a variety of ways: either simply as extra layers of your neural network (somewhat specialized to avoid the vanishing gradient problem; ConvNets), or, e.g., as "dimensionality reduction" (like PCA but more awesome; auto-encoders).

I think a good place to start exploring is the work-in-progress survey paper by Schmidhuber ([1404.7828] Deep Learning in Neural Networks: An Overview). Aside from LeCun's slides above, a nice slideshow focused (largely but not exclusively) on NLP applications and RNNs is Socher's tutorial (Richard Socher - Deep Learning Tutorial).

-------------------------------------------------------------------------------------------------------

Deep neural networks arose from a set of techniques that were discovered to overcome the vanishing gradient problem, which was severely limiting the depth of neural networks.

It’s as simple as that.

Neural networks are trained using backpropagation gradient descent. That is, you update the weights of each layer as a function of the derivatives propagated back from the layer above it. The problem is that the update signal gets lost as you increase the depth of the neural network. The math is pretty simple. Check this chapter of this online book: Neural networks and deep learning.

Therefore, in the old days, people pretty much only used neural networks with a single hidden layer.

These new techniques include things like using ReLUs instead of sigmoids as activation functions. ReLUs are of the form f(x) = max(0, x), so they have a non-vanishing derivative wherever they are active. But there are other techniques, like using the sign of the derivative, rather than its magnitude, in the backpropagation optimization problem.
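As a toy illustration of the "sign of the derivative" idea, here is a simplified sign-based update (in the spirit of methods like Rprop, but not any specific published algorithm; the fixed step size is an arbitrary choice). A vanishingly small gradient still produces a full-size step, which is exactly what a magnitude-based update fails to do.

    import numpy as np

    def sign_update(w, grad, step=0.01):
        # move each weight a fixed distance against the sign of its gradient,
        # ignoring the gradient's magnitude
        return w - step * np.sign(grad)

    w = np.array([2.0, -1.5, 0.3])
    grad = np.array([1e-8, -4.2, 0.0])    # one tiny, one large, one zero gradient
    print(sign_update(w, grad))           # [1.99, -1.49, 0.3]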

Now, what is cool about them is that, by enabling us to build very big neural networks, they have opened the door to such things as auto-encoders for unsupervised problems, convolutional neural networks for classifying images, recurrent neural networks for time series, etc. It was a revolution. But essentially, it's the same old neural networks, just with bigger and cooler network topologies that can learn more advanced and exciting stuff. Some of these start to resemble the human brain.

With regard to SVMs, I did not understand your question. You can see an SVM as a type of neural network; its discovery was in fact inspired by neural networks. Instead of non-linear transformations of linear combinations, they use kernel functions to combine variables. It's a very different beast. The optimization is nicer and not as prone to getting stuck in local minima, but, on the other hand, it is not as versatile. And, by the way, there are both supervised and unsupervised SVMs, unlike what another responder said. Take a one-class SVM, for example. But there are others.
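For instance, a one-class SVM is fit without any labels at all; here is a minimal scikit-learn sketch (the data and the kernel parameters are illustrative assumptions):

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)

    # unlabeled data: a tight cluster plus a few points scattered far away
    inliers = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
    outliers = rng.uniform(low=-6.0, high=6.0, size=(10, 2))
    X = np.vstack([inliers, outliers])

    # no target vector is passed to fit(): this is an unsupervised SVM
    model = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X)
    pred = model.predict(X)          # +1 for points deemed "normal", -1 for outliers
    print((pred == -1).sum())        # roughly the number of scattered points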

-------------------------------------------------------------------------------------------------------

LeCun has tried to understand Deep Learning in terms of modern spin glass theory
"The Unreasonable Effectiveness of Deep Learning"

IMHO, this analysis is probably not sufficient.

Beyond this, it has been suggested that Unsupervised Deep Learning Networks implement a form of Variational Renormalization Group Theory:
http://arXiv.org/pdf/1410.3831v1.pdf

Others have suggested that Deep Learning is learning local group structures:
"WHY DOES UNSUPERVISED DEEP LEARNING WORK? - A PERSPECTIVE FROM GROUP THEORY" (on arxiv.org)
although this is probably equivalent to saying that the RG flow map has either a good fixed point or several useful cycles.

This is probably all correct, and in addition to these, I suspect that Deep Learning algorithms are related to Spin Funnels from protein folding
Why does Deep Learning work?

and to some other deep areas of theoretical physics, like Renormalization Group theory
Why Deep Learning Works II: the Renormalization Group
