Trang chủ‎ > ‎IT‎ > ‎DEEP LEARNING‎ > ‎

Deep Learning Basics: Neural Networks, Backpropagation and Stochastic Gradient Descent

In the last couple of years Deep Learning has received a great deal of press. This press is not without warrant - Deep Learninghas produced stat-of-the-art results in many computer vision and speech processing tasks. However, I believe that the press has given people the impression that Deep Learning is some kind of imprenetrable, esoteric field that can only be understood by academics. In this blog post I want to try to erase that impression and provide a practical overview of some of Deep Learning's basic concepts.

At its core, Deep Learning is a class of of neural network models. That is, a model with an input layer, an output layer, and an arbitrary number of hidden layers. These layers are made up of neurons or neural units. They are called neurons because they share some similarities with the behaviour of the neurons present in the human brain (though this comparison has drawn a lot of criticism from neuroscientists). For our purposes, we can think of a neuron as a nonlinear function of the weighted sum of its inputs. Since the neuron is really the most basic part of any Deep Learning model it is a good place to start.

The Single Neuron Model

A neuron is a function that maps an input vector {x1,...,xK}{x1,...,xK} to a scalar output yy via a weight vector {w1,...,wK}{w1,...,wK} and a nonlinear function ff.

general graphical model

The function ff takes a weighted sum of the inputs and returns yy.

y=f(i=0Kwixi)=f(wTx)y=f(∑i=0Kwixi)=f(wTx)

Often an additional element is added to the input vector that is always equal to 11 with a corresponding additional weight element which acts as a bias. The function ff is called the link function which provides the nonlinearity between the input and output. A common choice for this link function is the logistic function which is defined as

f(u)=11+euf(u)=11+eu

With the appropriate substitutions the final formula for the single neuron model becomes

y=11+ewTxy=11+ewTx

If you plot the logistic function,

general graphical model

you can see that it is smooth and differentiable and bound between 00 and 11. We shall see that these are two important properties. The derivative of the logistic function is simply

df(u)du=f(u)(1f(u))=f(u)f(u)df(u)du=f(u)(1−f(u))=f(u)f(−u)

This derivative will be used when we learn the weight vector ww via stochastic gradient descent.

Like any optimization problem, our goal is to minimize an objective function. Traditionally, the objective function measures the difference between the actual output tt and the predicted output f(wTx)f(wTx). In this case we will be using the squared loss function

E=12(ty)2=12(tf(wTx))2E=12(t−y)2=12(t−f(wTx))2

We want to find the weights ww such that the above objective is minimized. We do this with stochastic gradient descent (SGD). In SGD we iteratively update our weight parameters in the direction of the gradient of the loss function until we have reached a minimum. Unlike traditional gradient descent, we do not use the entire dataset to compute the gradient at each iteration. Instead, at each iteration we randomly select a single data point from our dataset and move in the direction of the gradient with respect to that data point. Obviously this is only an approximation of the true gradient but it can be proven that we will eventually reach the minimum by following this noisey gradient. There are several advantages to using stochastic gradient descent over traditional gradient descent.

  1. Gradient descent requires loading the entire dataset into main memory. If your dataset is large this can be problematic. Stochastic gradient descent only requires one data point at a time (or sometimes a minibatch of data points) which is much less memory intensive.
  2. Most datasets have redundancy in them. Traditional gradient descent requires one full pass over the data until an update is made. Due to redundancy, a meaningful update can often be made without iterating over the entire dataset as with stochastic gradient descent.
  3. As a consequence of the previous point, stochastic gradient descent can often converge faster than traditional gradient descent. It is also guaranteed to find the global minimum if the loss function is convex.

Our objective function EE is already defined in terms of a single data point so let's procede to compute its gradient with respect to an aribtrary element of our weight vector wiwi.

Ewi=Eyyuuwi=(yt)y(1y)xi(1)(2)(1)∂E∂wi=∂E∂y⋅∂y∂u⋅∂u∂wi(2)=(y−t)⋅y(1−y)⋅xi

Now we are able to obtain the stochastic gradient descent update equation (in vector notation)

wnew=woldη(yt)y(1y)xwnew=wold−η⋅(y−t)⋅y(1−y)⋅x

Where η>0η>0 is the step size. As stated previously, (x,y)(x,y) data points are sequentially fed into this update equation until the weights ww converge to their optimal value. This is how we use stochastic gradient descent to learn the weights for the single neuron model.

What we just did is also known as logistic regression and if we had replaced our logistic function with a unit step function we would have made what is known as a perceptron! Now let's extend this relatively simple model to something a bit more complex...

The Neural Network

A neural network consists of an input layer, output layer, and hidden layer. Our input layer consists of the input vector x={x1,...,xK}x={x1,...,xK}. The hidden layer consists of a vector of NN neurons h={h1,...,hN}h={h1,...,hN}. Finally there is an output layer with one neuron for every element of the output vector y={y1,...,yM}y={y1,...,yM}. Every element in the input layer is connected to every neuron in the hidden layer with wkiwki indicating the weight associated with the connection between the kthkth input element and the ithith hidden neuron. The same connection structure is present between the hidden and output layers with wijwij′ indicating the weight associated with the connection between the ithith hidden neuron and the jthjth output neuron. This network structure is better illustrated in the below diagram.

general graphical model

It is helpful to think of the weight wkiwki as the the (k,i)th(k,i)th entry in a K×NK×N weight matrix WW and similarly weight wijwij′ as the (i,j)th(i,j)th entry in a N×MN×M weight matrix WW′. The output of each neuron in the hidden and output layer is computed in the exact same way as before. It is simply the logistic function applied to the weighted sum of the neuron's inputs. For example, the output of an arbitrary neuron in the hidden layer hihi is

hi=f(ui)=f(k=1Kwkixk)hi=f(ui)=f(∑k=1Kwkixk)

and similarly for the output of an arbitrary output neuron yjyj is

yj=f(uj)=f(i=1Nwijhi)yj=f(uj′)=f(∑i=1Nwij′hi)

The objective function is also the same as before except now it is summed over all elements in the output layer.

E=12j=1M(yjtj)2E=12∑j=1M(yj−tj)2

Unlike before, we need to construct update equations for both sets of weights - the input-to-hidden layer weights wkiwki and the hidden-to-output weights wijwij′. In order to do this we need to compute the gradient of our objective function EE with respect to wkiwki as well as the gradient with respect to wijwij′. We must start with the gradient with respect to wijwij′ (the hidden-to-output weights) and we shall see why later. In order to compute Ewij∂E∂wij′ we must recall our high-school calculus, specifically the chain rule. From the chain rule, we must first take the derivative of EE with respect to yjyj′. Then we must take the derivative of yjyj (i.e. the logistic function) with respect to wijwij′ which needs yet another application of the chain rule. We first take the derivative of the logistic function with respect to its input ujuj′, then finally we can take the derivative of this input with respect to wijwij′ and we arrive at our desired value. This process is clearly defined below.

From the chain rule,

Ewij=Eyjyjujujwij∂E∂wij′=∂E∂yj⋅∂yj∂uj′⋅∂uj′∂wij′

The derivative of EE with respect to yjyj is simply,

Eyj=yjtj∂E∂yj=yj−tj

From the last section we saw that the derivative of the logistic function ff with respect to its input uu is f(u)(1f(u))f(u)(1−f(u)). If we apply this we get,

yjuj=yj(1yj)∂yj∂uj′=yj(1−yj)

where yj=f(uj)yj=f(uj′). Next we compute the derivative of uj=Ni=1wijhiuj′=∑i=1Nwij′hi with respect to a particular wijwij′ which is simply hihi. So, after making the appropriate subsitutions, we get

Ewij=(yjtj)yj(1yj)hi∂E∂wij′=(yj−tj)⋅yj(1−yj)⋅hi

With this gradient we can construct the update equation for wijwij′

wnewij=woldijη(yjtj)yj(1yj)hiwij′new=wij′old−η⋅(yj−tj)⋅yj(1−yj)⋅hi

Now let's turn our attention to the gradient of the objective function with respect to the input-to-hidden weights wkiwki. As we shall see, this gradient has already been partially computed when we computed the previous gradient.

Using the chain rule, the full gradient is

Ewki=j=1M(Eyjyjujujhi)hiuiuiwki∂E∂wki=∑j=1M(∂E∂yj⋅∂yj∂uj′⋅∂uj′∂hi)⋅∂hi∂ui⋅∂ui∂wki

The sum is due to the fact that the hidden unit that wkiwki connects to is itself connected to every output unit, thus each of these gradients need to be taken into account as well. We have already computed both Eyj∂E∂yj and yjuj∂yj∂uj′ which means that

Eyjyjuj=(yjtj)yj(1yj)∂E∂yj⋅∂yj∂uj′=(yj−tj)⋅yj(1−yj)

Now we need to compute the remaining derivatives ujhi∂uj′∂hihiui∂hi∂ui, and uiwki∂ui∂wki. So let's do just that.

ujhi=Ni=1wijhihi=wij∂uj′∂hi=∂∑i=1Nwij′hi∂hi=wij′

and, again using the derivative of the logistic function

hi