Neural Networks 101

Deep learning models are built on neural networks, but differ from shallower models in the number of hidden layers and, as a result, in the complexity of the phenomena they can model.

On this page, we will describe how neural networks work, which should give you a foundation for understanding how deep neural networks work.

What is a neural network?

A neural network can be defined as a computing system made up of a number of simple, highly interconnected processing elements, which respond to input data by applying simple functions to it. These processing elements are called nodes. Each node in a neural network imitates a biological neuron by taking input data, performing a simple operation on it, and passing the result on to other nodes, as determined by its weights and an activation function.
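As a rough sketch, a single node might be implemented in Python like this. The sigmoid activation and the specific weights and bias are illustrative choices, not fixed parts of the definition:

```python
import math

def node_output(inputs, weights, bias):
    # One node: take a weighted sum of the inputs, then apply an
    # activation function (here, the sigmoid) to produce the output
    # that gets passed on to the nodes in the next layer.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid squashes z into (0, 1)

# Two inputs with hand-picked illustrative weights.
print(node_output([0.5, -1.2], [0.8, 0.3], bias=0.1))
```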

The neural network is structured in layers made up of these nodes. Each node is connected to nodes in the next layer by weighted edges, and these weights, together with the activation functions, determine how input is related to output. The middle layers of the network are called 'hidden' layers, because their inner workings are hidden from the user of the model. This structure is shown in the graphic below.
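One minimal way to represent this layered structure in code: each layer is a list of nodes, and each node stores one weight per node in the previous layer, plus a bias. The layer sizes here (2 inputs, 3 hidden nodes, 1 output) are an arbitrary illustrative choice:

```python
# Each node is a (weights, bias) pair; each layer is a list of nodes.
network = [
    # Hidden layer: 3 nodes, each with 2 incoming weighted edges.
    [([0.2, -0.4], 0.1), ([0.7, 0.3], 0.0), ([-0.5, 0.9], -0.2)],
    # Output layer: 1 node with 3 incoming weighted edges.
    [([0.6, -0.1, 0.8], 0.05)],
]
```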

The series of layers between input and output perform feature identification and processing in a series of stages, just as our brains seem to. The degree to which this mimics processes in the brain is a bit of a philosophical discussion, which is outside the scope of this document, but some aspects of this process are preserved.

The number of hidden layers distinguishes deep neural networks from "shallow" neural networks, with deep neural networks typically having more than two hidden layers. A shallow neural network looks something like this, and we have had good algorithms for learning the weights for these networks for around 30 years.

But those algorithms for learning the weights struggled in networks with more hidden layers, as shown below.

As you add more layers, you run into the problem of the 'vanishing gradient': the error signal used to correct the weights shrinks as it is propagated back through each layer, so the time taken to train the initial layers increases drastically. This obstacle had to be overcome before deeper networks could be introduced.
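A rough back-of-the-envelope illustration: during back-propagation the gradient is multiplied by the activation function's derivative at every layer, and the sigmoid's derivative is at most 0.25, so the signal reaching the earliest layers shrinks exponentially with depth:

```python
# The sigmoid's derivative never exceeds 0.25, so a gradient passing
# back through n sigmoid layers is scaled by at most 0.25 ** n.
SIGMOID_DERIVATIVE_MAX = 0.25

for depth in (2, 5, 10, 20):
    print(depth, SIGMOID_DERIVATIVE_MAX ** depth)
# 20 layers scale the gradient by at most ~9e-13: effectively zero,
# which is why the initial layers learn so slowly.
```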

Walking through the training process

Let's take a simple example and walk through how inputs are processed by each node.

Assume we have 2 inputs: x1 and x2. These are real-valued inputs (numbers) that represent something important in our data. Let's say for our purposes that x1 is the dose of a drug to treat high cholesterol and x2 is some patient characteristic, like their starting cholesterol level. Suppose we give this drug in varying doses (x1) to 10,000 patients whose initial cholesterol levels we have recorded (x2), and measure whether the drug reduces their cholesterol levels or not (output = Y or N).

First, we take 70% of the patient data for 'training', to build the model, and 30% of the patient data for 'testing', to see how well our model performs.
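A sketch of that split, using made-up patient records (the value ranges and the random labels are purely illustrative):

```python
import random

# Hypothetical records: (dose, starting_cholesterol, responded 1/0).
patients = [(random.uniform(5, 80), random.uniform(150, 300),
             random.choice([0, 1])) for _ in range(10_000)]

random.shuffle(patients)                  # avoid any ordering bias
split = int(0.7 * len(patients))
train_set, test_set = patients[:split], patients[split:]
print(len(train_set), len(test_set))      # 7000 3000
```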

Let's walk through how training on this data works.

Let's say we take one patient's data, their x1 and x2 values. Our starting neural network has assigned random initial weights for these inputs. Here is what happens with these inputs within one neuron.

This gives us the output from one neuron at the very first layer. We repeat this for two more neurons in the first layer as shown below.

Now we have three outputs from the three neurons in the first layer: y1, y2, and y3. These then serve as inputs to the next layer, which has 2 neurons. These are similarly multiplied by randomly assigned weights, summed together, and passed through an activation function to get the output to the final layer, as shown below.

Finally, the outputs from layer 2 (y4 and y5) are combined in the final layer to produce a decision: is the patient responding to the drug or not?
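Putting the whole walk-through together, here is a minimal sketch of the forward pass through a 2-3-2-1 network like the one described above. The sigmoid activation, zero biases, and uniform random weight initialisation are illustrative assumptions (in practice, the raw inputs would also be normalised first):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, biases):
    # Each node: weighted sum of the previous layer's outputs, then sigmoid.
    return [sigmoid(sum(x * w for x, w in zip(inputs, ws)) + b)
            for ws, b in zip(weights, biases)]

random.seed(0)
w1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
w2 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w3 = [[random.uniform(-1, 1) for _ in range(2)]]

x = [40.0, 220.0]                        # one patient's dose and cholesterol
h1 = layer_forward(x, w1, [0.0] * 3)     # y1, y2, y3
h2 = layer_forward(h1, w2, [0.0] * 2)    # y4, y5
out = layer_forward(h2, w3, [0.0])[0]    # final output in (0, 1)
print("responds to drug?", out > 0.5)
```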

If we are training, we know how this patient actually responded to the drug (yes). Say that our randomly assigned weights produce the wrong result (no). We need to 'fix' this result by computing a 'loss' and using that loss to gradually correct the weights at the previous layers.

This process of moving back through the network to fix the weights is called back-propagation. Here are some pictures to illustrate how it works.

First, we compute the error signal δ of the output layer.

Then we propagate the error signal back through the entire network, as shown below.

Then we use this error to 'fix' the weights, by very gradually moving them in the right direction. Recall from calculus that the derivative, shown in the equation below, essentially gives you a slope (or a direction). We take a small step in the right direction, and control the size of the step using a 'learning rate' parameter.
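Here is a sketch of that update for a single output neuron, assuming a squared-error loss and a sigmoid output (under those assumptions the error signal δ works out to (y − target) · y · (1 − y)); the inputs, weights, and learning rate are made-up values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

inputs = [0.6, 0.3]      # y4, y5 from the previous layer
weights = [0.5, -0.4]
bias = 0.0
target = 1.0             # the patient actually responded
learning_rate = 0.1      # controls how large each corrective step is

z = sum(x * w for x, w in zip(inputs, weights)) + bias
y = sigmoid(z)

# Error signal (delta) for squared error with a sigmoid output:
# delta = dLoss/dz = (y - target) * sigmoid'(z), and sigmoid'(z) = y * (1 - y).
delta = (y - target) * y * (1.0 - y)

# Gradient-descent step: move each weight a small step against its gradient.
weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
bias -= learning_rate * delta
```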

After repeating this process of correcting the weights, we can take a new set of input values (x1, x2) and outputs (y), and run the same process of feeding the inputs forward through the network and back-propagating the loss to fix the errors, until we reach some optimum.

The weights we arrive at will ideally produce a high success rate when we evaluate the effectiveness of the neural network on the testing data.
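To keep the sketch short, here is that whole loop (train on 70% of the data, then measure the success rate on the held-out 30%) for a single neuron rather than the full network; the data-generating rule, learning rate, and epoch count are all made up for illustration:

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(0)
# Toy data with a made-up rule: the drug works (label 1.0) when the
# normalised dose exceeds the normalised starting cholesterol level.
data = [[random.random(), random.random()] for _ in range(1000)]
data = [(x, 1.0 if x[0] > x[1] else 0.0) for x in data]
train, test = data[:700], data[700:]      # the 70/30 split from above

weights, bias, lr = [0.0, 0.0], 0.0, 0.5
for epoch in range(200):                  # feed forward, back-propagate, repeat
    for x, target in train:
        y = sigmoid(sum(xi * wi for xi, wi in zip(x, weights)) + bias)
        delta = (y - target) * y * (1.0 - y)   # error signal, as above
        weights = [w - lr * delta * xi for w, xi in zip(weights, x)]
        bias -= lr * delta

# Success rate on the testing data.
correct = sum((sigmoid(sum(xi * wi for xi, wi in zip(x, weights)) + bias) > 0.5)
              == (target == 1.0)
              for x, target in test)
print(correct / len(test))
```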

(Credit to Brian Ziebart's machine learning lecture slides at UIC for the images in this section, which have been slightly adapted from the originals).

Activation functions

The activation function of a node in a neural network defines that node's output for a given set of inputs. This is the function you apply to the summed inputs*weights from the previous layer.

There are various types of activation functions, most of which are shown in the figure below. One of the most commonly used activation functions is the sigmoid.
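For concreteness, here are three common activation functions; the sigmoid is the one used in the walk-through above, while tanh and ReLU are widely used alternatives:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # output in (0, 1)

def tanh(z):
    return math.tanh(z)                # output in (-1, 1)

def relu(z):
    return max(0.0, z)                 # zero for negative inputs, linear otherwise

for z in (-2.0, 0.0, 2.0):
    print(z, sigmoid(z), tanh(z), relu(z))
```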

Picking a good activation function and the 'hyperparameters' for these functions is part science, part art. Here is a good guide if you want to learn more.

http://cs231n.github.io/neural-networks-1/#actfun