Introduction to Multi-Layer Perceptrons (Feedforward Neural Networks)

Multi-Layer Neural Networks

An MLP (for Multi-Layer Perceptron) or multi-layer neural network defines a family of functions. Let us first consider the most classical case of a single hidden layer neural network, mapping a d-vector to an m-vector (e.g. for regression):

g(x) = b + W \tanh(c + V x)    output is an affine transformation of the hidden layer 

where x is a d-vector (the input), V is a k \times d matrix (called input-to-hidden weights), c is a k-vector (called hidden units offsets or hidden unit biases), b is an m-vector (called output units offset or output units biases), and W is an m \times k matrix (called hidden-to-output weights).

The vector-valued function h(x)=\tanh(c + V x) is called the output of the hidden layer. Note how, in the above network, the output is an affine transformation of the hidden layer; in some network architectures a non-linearity is applied to it as well. The elements of the hidden layer are called hidden units.

The kind of operation computed by the above h(x) can be applied on h(x) itself, but with different parameters (different biases and weights). This would give rise to a feedforward multi-layer network with two hidden layers. More generally, one can build a deep neural network by stacking more such layers. Each of these layers may have a different dimension (k above). A common variant is to have skip connections, i.e., a layer can take as input not only the layer at the previous level but also some of the lower layers.
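As a concrete illustration, the single-hidden-layer network g(x) = b + W \tanh(c + V x) can be sketched in a few lines of NumPy; all dimensions and the random initialization below are illustrative, not prescribed by the text:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, m = 4, 5, 3              # input, hidden, output dimensions (illustrative)
V = rng.normal(size=(k, d))    # input-to-hidden weights
c = np.zeros(k)                # hidden unit biases
W = rng.normal(size=(m, k))    # hidden-to-output weights
b = np.zeros(m)                # output unit biases

def g(x):
    """Single-hidden-layer MLP: affine output of a tanh hidden layer."""
    h = np.tanh(c + V @ x)     # hidden layer output h(x)
    return b + W @ h           # output is affine in the hidden layer

x = rng.normal(size=d)
y = g(x)
```

Stacking more layers, as described above, amounts to feeding h back through another affine map followed by tanh, with its own weights and biases.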

Most Common Training Criteria and Output Non-Linearities

Let f(x)=r(g(x)) with r representing the output non-linearity function. In supervised learning, the output f(x) can be compared with a target value y through a loss functional L(f,(x,y)). Here are common loss functionals, with the associated output non-linearity:

  • for ordinary (L2) regression: no non-linearity (r(a)=a), squared loss L(f,(x,y))=||f(x)-y||^2 = \sum_i (f_i(x) - y_i)^2.
  • for median (L1) regression: no non-linearity (r(a)=a), absolute value loss L(f,(x,y))=|f(x)-y|_1 = \sum_i |f_i(x) - y_i|.
  • for 2-way probabilistic classification: sigmoid non-linearity (r(a)={\rm sigmoid}(a)=1/(1+e^{-a}), applied element by element), and cross-entropy loss L(f,(x,y))= -y \log f(x) -(1-y)\log(1-f(x)) for y binary. Note that the sigmoid output f(x) is in the (0,1) interval, and corresponds to an estimator of P(y=1|x). The predicted class is 1 if f(x)>\frac{1}{2}.
  • for multiple binary probabilistic classification: each output element is treated as above.
  • for 2-way hard classification with hinge loss: no non-linearity (r(a)=a) and the hinge loss is L(f,(x,y))= \max(0,1 - (2y-1) f(x)) (again for binary y). This is the SVM classifier loss.
  • the above can be generalized to multiple classes by separately considering the binary classifications of each class against the others.
  • multi-way probabilistic classification: softmax non-linearity (r_i(a) = e^{a_i}/\sum_j e^{a_j} with one output per class) with the negative log-likelihood loss L(f,(x,y)) = - \log f_y(x). Note that \sum_i f_i(x)=1 and 0<f_i(x)<1. Note also how this is equivalent to the cross-entropy loss in the 2-class case (the output for the one of the classes is actually redundant).
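The softmax output non-linearity and its negative log-likelihood loss from the last bullet can be written directly (the input values below are arbitrary; the max-subtraction is a standard numerical-stability trick that leaves the result unchanged):

```python
import numpy as np

def softmax(a):
    # subtracting the max does not change the result but avoids overflow
    e = np.exp(a - np.max(a))
    return e / e.sum()

def nll(f, y):
    """Negative log-likelihood loss -log f_y for target class y."""
    return -np.log(f[y])

a = np.array([2.0, 0.5, -1.0])   # illustrative pre-softmax activations
p = softmax(a)                    # sums to 1, each entry in (0, 1)
loss = nll(p, 0)
```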

The Back-Propagation Algorithm

We just apply the recursive gradient computation algorithm seen previously to the graph formed naturally by the MLP, with one node for each input unit, hidden unit and output unit. Note that each parameter (weight or bias) also corresponds to a node, and the final node of the graph corresponds to the loss.

Let us formalize a notation for MLPs with more than one hidden layer. Let us denote with h_i the output vector of the i-th layer, starting with h_0=x (the input), and finishing with a special output layer h_L which produces the prediction or output of the network.

With tanh units in the hidden layers, we have (in matrix-vector notation):

  • for k = 1 to L-1:
  • h_k = {\rm tanh}(b_k + W_k h_{k-1})

    where b_k is a vector of biases and W_k is a matrix of weights connecting layer k-1 to layer k. The scalar computation associated with a single unit i of layer k is

    h_{ki} = {\rm tanh}(b_{ki} + \sum_j W_{kij} h_{k-1,j})

In the case of a probabilistic classifier, we would then have a softmax output layer, e.g.,

  • p = h_L = {\rm softmax}(b_L + W_L h_{L-1})

where we used p to denote the output because it is a vector indicating a probability distribution over classes. And the loss is

  • L = - \log p_y

where y is the target class, i.e., we want to maximize p_y=P(Y=y|x), an estimator of the conditional probability of class y given input x.
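The forward recursion above (tanh hidden layers, softmax output, loss L = -\log p_y) can be sketched as follows; the layer sizes and random initialization are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(x, weights, biases):
    """Forward pass: tanh hidden layers, softmax output layer (h_0 = x)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(b + W @ h)           # h_k = tanh(b_k + W_k h_{k-1})
    a = biases[-1] + weights[-1] @ h     # pre-softmax activations a_L
    e = np.exp(a - a.max())
    return e / e.sum()                   # p = softmax(a_L)

sizes = [4, 6, 5, 3]                     # layer dimensions, input to output
weights = [rng.normal(size=(n, p)) for p, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.normal(size=sizes[0])
p = forward(x, weights, biases)
y = 2                                    # target class (arbitrary here)
loss = -np.log(p[y])                     # L = -log p_y
```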

Let us now see how the recursive application of the chain rule in flow graphs is instantiated in this structure. First of all, let us denote

a_k=b_k + W_k h_{k-1}

(for the argument of the non-linearity at each level) and note (from a small derivation) that

\frac{\partial(- \log p_y)}{\partial a_{L,i}} = (p_i - 1_{y=i})

and that

\frac{\partial \tanh(u)}{\partial u} = (1-\tanh(u)^2).
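Both derivatives can be sanity-checked numerically with finite differences (the activation values and target class below are arbitrary):

```python
import numpy as np

a = np.array([0.3, -1.2, 0.8])   # illustrative pre-softmax activations
y = 2                             # illustrative target class

def neg_log_py(a):
    e = np.exp(a - a.max())
    p = e / e.sum()
    return -np.log(p[y]), p

eps = 1e-6
loss, p = neg_log_py(a)
grads = []
for i in range(len(a)):
    ap = a.copy(); ap[i] += eps
    am = a.copy(); am[i] -= eps
    grads.append((neg_log_py(ap)[0] - neg_log_py(am)[0]) / (2 * eps))
analytic = p - np.array([0.0, 0.0, 1.0])   # p_i - 1_{y=i}

u = 0.4
tanh_grad = (np.tanh(u + eps) - np.tanh(u - eps)) / (2 * eps)
# tanh_grad should match 1 - tanh(u)^2
```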

Now let us apply the back-propagation recipe in the corresponding flow graph. Each parameter (each weight and each bias) is a node, each neuron potential a_{ki} and each neuron output h_{ki} is also a node.

  • starting at the output node: \frac{\partial L}{\partial L} = 1

  • then compute the gradient with respect to each pre-softmax sum a_{L,i}: \frac{\partial L}{\partial a_{L,i}} = \frac{\partial L}{\partial L} \frac{\partial L}{\partial a_{L,i}} = (p_i - 1_{y=i})

  • We can now repeat the same recipe for each layer. For k = L down to 1:

    • obtain trivially the gradient wrt biases: \frac{\partial L}{\partial b_{k,i}} = \frac{\partial L}{\partial a_{k,i}} \frac{\partial a_{k,i}}{\partial b_{k,i}}=\frac{\partial L}{\partial a_{k,i}}

    • compute the gradient wrt weights: \frac{\partial L}{\partial W_{k,i,j}} = \frac{\partial L}{\partial a_{k,i}}\frac{\partial a_{k,i}}{\partial W_{k,i,j}} = \frac{\partial L}{\partial a_{k,i}} h_{k-1,j}

    • back-propagate the gradient into lower layer, if k>1:

      • \frac{\partial L}{\partial h_{k-1,j}} = \sum_i \frac{\partial L}{\partial a_{k,i}} \frac{\partial a_{k,i}}{\partial h_{k-1,j}} = \sum_i \frac{\partial L}{\partial a_{ki}} W_{k,i,j}
      • \frac{\partial L}{\partial a_{k-1,j}} = \frac{\partial L}{\partial h_{k-1,j}} \frac{\partial h_{k-1,j}}{\partial a_{k-1,j}}=\frac{\partial L}{\partial h_{k-1,j}} (1 - h_{k-1,j}^2)
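The full recursion can be implemented in a few lines; the sketch below follows the steps above exactly (gradient wrt biases, wrt weights, then back-propagation into the layer below), with illustrative sizes and a finite-difference check of one weight gradient:

```python
import numpy as np

rng = np.random.default_rng(2)

sizes = [3, 4, 2]                      # illustrative layer sizes
Ws = [rng.normal(size=(n, p)) for p, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Return hidden outputs h_0..h_{L-1} and output probabilities p."""
    hs = [x]
    for W, b in zip(Ws[:-1], bs[:-1]):
        hs.append(np.tanh(b + W @ hs[-1]))
    a = bs[-1] + Ws[-1] @ hs[-1]
    e = np.exp(a - a.max())
    return hs, e / e.sum()

def backward(x, y):
    """Back-propagation following the recursion in the text."""
    hs, p = forward(x)
    dW = [None] * len(Ws)
    db = [None] * len(bs)
    da = p.copy()
    da[y] -= 1.0                       # dL/da_L = p_i - 1_{y=i}
    for k in reversed(range(len(Ws))):
        db[k] = da                     # dL/db_k = dL/da_k
        dW[k] = np.outer(da, hs[k])    # dL/dW_kij = dL/da_ki * h_{k-1,j}
        if k > 0:
            dh = Ws[k].T @ da          # back-propagate into layer below
            da = dh * (1 - hs[k] ** 2) # times the tanh derivative
    return dW, db, -np.log(p[y])

x = rng.normal(size=sizes[0])
dW, db, loss = backward(x, y=1)

# finite-difference check of one weight gradient
eps = 1e-6
Ws[0][0, 0] += eps
_, p_plus = forward(x)
Ws[0][0, 0] -= 2 * eps
_, p_minus = forward(x)
Ws[0][0, 0] += eps
num = (-np.log(p_plus[1]) + np.log(p_minus[1])) / (2 * eps)
```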

Logistic Regression

Logistic regression is a special case of the MLP with no hidden layer (the input is directly connected to the output) and the cross-entropy (sigmoid output) or negative log-likelihood (softmax output) loss. It corresponds to a probabilistic linear classifier, and the training criterion is convex in terms of the parameters, which guarantees that any local minimum is also a global minimum.
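Because the criterion is convex, plain gradient descent suffices. A minimal sketch on hypothetical, linearly separable 2D data (sizes, learning rate, and step count are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# hypothetical toy data: label 1 when x_0 + x_1 > 0
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
lr = 0.5

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for _ in range(200):
    f = sigmoid(X @ w + b)             # f(x) estimates P(y=1 | x)
    # gradient of the mean cross-entropy wrt the pre-sigmoid activation
    # has the familiar form (f - y), analogous to (p_i - 1_{y=i})
    grad_a = (f - y) / len(y)
    w -= lr * (X.T @ grad_a)
    b -= lr * grad_a.sum()

acc = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
```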

Training Multi-Layer Neural Networks

Many algorithms have been proposed to train multi-layer neural networks but the most commonly used ones are gradient-based.

Two fundamental issues guide the various strategies employed in training MLPs:

  • training as efficiently as possible, i.e., getting the training error down as quickly as possible, avoiding getting stuck in narrow valleys or even local minima of the cost function,
  • controlling capacity so as to avoid overfitting, i.e., to minimize generalization error.