Introduction to MultiLayer Perceptrons (Feedforward Neural Networks)MultiLayer Neural NetworksAn MLP (for MultiLayer Perceptron) or multilayer neural network defines a family of functions. Let us first consider the most classical case of a single hidden layer neural network, mapping a vector to an vector (e.g. for regression): output is an affine transformation of the hidden layer where is a vector (the input), is an matrix (called inputtohidden weights), is a vector (called hidden units offsets or hidden unit biases), is an vector (called output units offset or output units biases), and is an matrix (called hiddentooutput weights). The vectorvalued function is called the output of the hidden layer. Note how the output is an affine transformation of the hidden layer, in the above network. A nonlinearity may be tacked on to it in some network architectures. The elements of the hidden layer are called hidden units. The kind of operation computed by the above can be applied on itself, but with different parameters (different biases and weights). This would give rise to a feedforward multilayer network with two hidden layers. More generally, one can build a deep neural network by stacking more such layers. Each of these layers may have a different dimension ( above). A common variant is to have skip connections, i.e., a layer can take as input not only the layer at the previous level but also some of the lower layers. Most Common Training Criteria and Output NonLinearitiesLet with representing the output nonlinearity function. In supervised learning, the output can be compared with a target value through a loss functional . Here are common loss functionals, with the associated output nonlinearity:
The BackPropagation AlgorithmWe just apply the recursive gradient computation algorithm seen previously to the graph formed naturally by the MLP, with one node for each input unit, hidden unit and output unit. Note that each parameter (weight or bias) also corresponds to a node, and the final Let us formalize a notation for MLPs with more than one hidden layer. Let us denote with the output vector of the ith layer, starting with (the input), and finishing with a special output layer which produces the prediction or output of the network. With tanh units in the hidden layers, we have (in matrixvector notation):
In the case of a probabilistic classifier, we would then have a softmax output layer, e.g., where we used to denote the output because it is a vector indicating a probability distribution over classes. And the loss is where is the target class, i.e., we want to maximize , an estimator of the conditional probability of class given input . Let us now see how the recursive application of the chain rule in flow graphs is instantiated in this structure. First of all, let us denote (for the argument of the nonlinearity at each level) and note (from a small derivation) that and that . Now let us apply the backpropagation recipe in the corresponding flow graph. Each parameter (each weight and each bias) is a node, each neuron potential and each neuron output is also a node.
Logistic RegressionLogistic regression is a special case of the MLP with no hidden layer (the input is directly connected to the output) and the crossentropy (sigmoid output) or negative loglikelihood (softmax output) loss. It corresponds to a probabilistic linear classifier and the training criterion is convex in terms of the parameters (which garantees that there is only one minimum, which is global). Training MultiLayer Neural NetworksMany algorithms have been proposed to train multilayer neural networks but the most commonly used ones are gradientbased. Two fundamental issues guide the various strategies employed in training MLPs:

Artificial Intelligence + NLP + deep learning > AI > Machine Learning > Neural Networks > Deep Learning > python > Theano MNIST > 2 Multilayer Perceptron >