An MLP (for Multi-Layer Perceptron) or multi-layer neural network defines a family of functions. Let us first consider the most classical case of a single hidden layer neural network, mapping a $d$-vector to an $m$-vector (e.g. for regression):

$$g(x) = b + W \tanh(c + V x) \qquad \text{(output is an affine transformation of the hidden layer)}$$

where $x$ is a $d$-vector (the input), $V$ is an $h \times d$ matrix (called input-to-hidden weights), $c$ is an $h$-vector (called hidden units offsets or hidden unit biases), $b$ is an $m$-vector (called output units offset or output unit biases), and $W$ is an $m \times h$ matrix (called hidden-to-output weights).
The vector-valued function $h(x) = \tanh(c + V x)$ is called the output of the hidden layer. Note how the output is an affine transformation of the hidden layer, in the above network. A non-linearity may be tacked on to it in some network architectures. The elements of the hidden layer are called hidden units.
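As a concrete illustration, here is a minimal NumPy sketch of this single-hidden-layer network; the dimensions $d$, $h$, $m$ and the random initialization are hypothetical choices for the example:

```python
import numpy as np

# Hypothetical dimensions: d inputs, h hidden units, m outputs.
d, h, m = 3, 5, 2
rng = np.random.default_rng(0)

V = rng.standard_normal((h, d))   # input-to-hidden weights
c = np.zeros(h)                   # hidden unit biases
W = rng.standard_normal((m, h))   # hidden-to-output weights
b = np.zeros(m)                   # output unit biases

def forward(x):
    hidden = np.tanh(c + V @ x)   # hidden layer output h(x)
    return b + W @ hidden         # output: affine transformation of the hidden layer

x = rng.standard_normal(d)
g = forward(x)
print(g.shape)
```

With zero biases, a zero input gives a zero hidden layer and hence a zero output, which is a quick sanity check on the shapes and the affine structure.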
The kind of operation computed by the above $h(x)$ can be applied on $h(x)$ itself, but with different parameters (different biases and weights). This would give rise to a feedforward multi-layer network with two hidden layers. More generally, one can build a deep neural network by stacking more such layers. Each of these layers may have a different dimension ($h$ above). A common variant is to have skip connections, i.e., a layer can take as input not only the layer at the previous level but also some of the lower layers.
Let $f(x) = o(g(x))$, with $o(\cdot)$ representing the output non-linearity function. In supervised learning, the output $f(x)$ can be compared with a target value $y$ through a loss functional $L(f(x), y)$. Here are common loss functionals, with the associated output non-linearity:

- squared error $L(f(x), y) = \|f(x) - y\|^2$, with the identity as output non-linearity (for regression);
- cross-entropy $L(f(x), y) = -y \log f(x) - (1-y) \log(1 - f(x))$, with the sigmoid as output non-linearity (for binary classification);
- negative conditional log-likelihood $L(f(x), y) = -\log f_y(x)$, with the softmax as output non-linearity (for multi-class classification).
We just apply the recursive gradient computation algorithm seen previously to the graph formed naturally by the MLP, with one node for each input unit, hidden unit and output unit. Note that each parameter (weight or bias) also corresponds to a node, and the final node of the graph computes the loss.
Let us formalize a notation for MLPs with more than one hidden layer. Let us denote by $h^k(x)$ the output vector of the $k$-th layer, starting with $h^0(x) = x$ (the input), and finishing with a special output layer $h^L$ which produces the prediction or output of the network.
With tanh units in the hidden layers, we have (in matrix-vector notation), for $k = 1, \ldots, L-1$:

$$h^k(x) = \tanh(b^k + W^k h^{k-1}(x))$$

where $b^k$ is a vector of biases and $W^k$ a matrix of weights for layer $k$.
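The layer-by-layer recursion above can be sketched in NumPy as follows; the layer sizes and the small random initialization are arbitrary choices for illustration, and the last layer is left affine (its non-linearity, if any, is applied separately):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical layer sizes: input of 4, two hidden layers of 6 and 5, output of 3.
sizes = [4, 6, 5, 3]
params = [(rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward_layers(x):
    """Return the list [h^0, h^1, ..., h^L] of layer outputs."""
    hs = [x]                                 # h^0 = x (the input)
    for k, (Wk, bk) in enumerate(params):
        a = bk + Wk @ hs[-1]                 # pre-activation of layer k+1
        # tanh on the hidden layers; the output layer stays affine here
        hs.append(np.tanh(a) if k < len(params) - 1 else a)
    return hs

hs = forward_layers(rng.standard_normal(4))
print([layer.shape for layer in hs])
```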
In the case of a probabilistic classifier, we would then have a softmax output layer, e.g.,

$$p(x) = \mathrm{softmax}(b^L + W^L h^{L-1}(x)), \qquad \mathrm{softmax}(a)_i = \frac{e^{a_i}}{\sum_j e^{a_j}}$$
where we used $p(x)$ to denote the output because it is a vector indicating a probability distribution over classes. And the loss is

$$L = -\log p_y(x)$$
where $y$ is the target class, i.e., we want to maximize $p_y(x)$, an estimator of the conditional probability $P(Y = y \mid x)$ of class $y$ given input $x$.
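A small sketch of the softmax output and the corresponding negative log-likelihood loss; the max-subtraction is a standard numerical-stability trick, not part of the definition:

```python
import numpy as np

def softmax(a):
    # subtract the max before exponentiating, for numerical stability
    e = np.exp(a - a.max())
    return e / e.sum()

def nll(a, y):
    """Negative log-likelihood -log p_y for pre-activations a and target class y."""
    p = softmax(a)
    return -np.log(p[y])

a = np.array([2.0, 1.0, -1.0])   # hypothetical output pre-activations
p = softmax(a)
print(p.sum())                   # probabilities sum to 1
print(nll(a, 0))                 # loss when the target class is 0
```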
Let us now see how the recursive application of the chain rule in flow graphs is instantiated in this structure. First of all, let us denote

$$a^k(x) = b^k + W^k h^{k-1}(x)$$

(for the argument of the non-linearity at each level) and note (from a small derivation) that, for the softmax output layer with the above negative log-likelihood loss,

$$\frac{\partial L}{\partial a^L_k} = p_k(x) - 1_{y=k}.$$
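This gradient of the negative log-likelihood with respect to the output pre-activations can be checked numerically; the sketch below compares the analytic form $p_k - 1_{y=k}$ against a centered finite-difference approximation:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def nll(a, y):
    return -np.log(softmax(a)[y])

a = np.array([0.5, -0.3, 1.2])   # hypothetical output pre-activations
y = 2                            # target class

# Analytic gradient: dL/da_k = p_k - 1_{y=k}
grad = softmax(a).copy()
grad[y] -= 1.0

# Centered finite-difference approximation of the same gradient
eps = 1e-6
num = np.array([(nll(a + eps * np.eye(3)[k], y) - nll(a - eps * np.eye(3)[k], y)) / (2 * eps)
                for k in range(3)])
print(np.allclose(grad, num, atol=1e-5))  # True
```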
Now let us apply the back-propagation recipe in the corresponding flow graph. Each parameter (each weight and each bias) is a node, and each neuron potential $a^k_i$ and each neuron output $h^k_i$ is also a node.
Logistic regression is a special case of the MLP with no hidden layer (the input is directly connected to the output) and the cross-entropy (sigmoid output) or negative log-likelihood (softmax output) loss. It corresponds to a probabilistic linear classifier, and the training criterion is convex in the parameters (which guarantees that every local minimum is a global minimum).
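In the binary (sigmoid output, cross-entropy) case, logistic regression thus reduces to a single affine map followed by a sigmoid; a minimal sketch, with hypothetical weights:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
w = rng.standard_normal(4)   # hypothetical weights for a 4-dimensional input
b0 = 0.0                     # output bias

def predict(x):
    # no hidden layer: the input is directly (affinely) connected to the output
    return sigmoid(b0 + w @ x)

def cross_entropy(p, y):
    # binary cross-entropy loss for a target y in {0, 1}
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

x = rng.standard_normal(4)
print(cross_entropy(predict(x), 1))
```

Because the model is linear in its parameters before the sigmoid, the cross-entropy training criterion is convex, unlike the criterion of an MLP with hidden layers.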
Many algorithms have been proposed to train multi-layer neural networks but the most commonly used ones are gradient-based.
Two fundamental issues guide the various strategies employed in training MLPs: