The perceptron is a supervised learning algorithm that can be used for simple linear regression. This basically means that if your dataset has a linear pattern, this algorithm can detect it. It is the simplest such algorithm and the easiest to implement. Here, we'll discuss the mathematics behind the perceptron and use it to fit a line through some data (which is the simplest form of regression).
Before we get started, let's establish a few conventions. These may differ from source to source; we'll be using the following conventions throughout this page. Keep a note of them, and don't worry if you don't understand everything right away. You can always scroll back to these lines...
As you can see, the inputs are denoted by 'x' and are represented as a matrix with 'm' rows and 'n' columns, each column representing some feature. Since these inputs are used to train our algorithm, they're also called the training examples. That makes the first three conventions clear. Bear with me here; I'll explain the +1 in the 'n' value later on this page when we discuss the working, but for now, understand that the first column is generally all 1s and the following columns hold the actual data (we'll call this first column the bias column, and I'll explain it later on this page).
Since we're doing supervised learning, we need labelled data. By labelled data, I mean that each training example must come with a known, definite output. These are called the desired outputs or actual outputs and are denoted by 'y'. As you can observe from the image, they form a matrix with 'm' rows and 1 column (they can have more, but a single perceptron can have only one output).
Now that we understand how a dataset looks and how to represent it as a matrix, let's understand the working of a perceptron.
First, have a look at the image below. We'll discuss what all of this means, but for now just keep a note of it.
Mathematical model of a perceptron
In the above diagram, 'w' stands for the weights of the connections. Every feature, including the bias column, is assigned a weight; think of it as the importance the perceptron gives that feature when deciding the output. The output of the perceptron is denoted by 'v'. Keep these in mind as we go ahead.
Activation function step
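To make this step concrete, here's a minimal NumPy sketch of the forward pass. The dataset, weights, and numbers are made up for illustration; the names X, w, z and v simply mirror the conventions above, and the first column of X is the bias column of 1s.

```python
import numpy as np

# Toy training set: m = 4 examples, n = 3 columns (bias column of 1s + 2 features).
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 0.5, 1.5],
              [1.0, 3.0, 0.5],
              [1.0, 1.0, 1.0]])

w = np.array([0.1, -0.2, 0.4])   # one weight per column; w[0] pairs with the bias column

z = X @ w                        # weighted sum for every training example
v = np.where(z < 0, -1.0, 1.0)   # signum-style activation -> predicted outputs
print(v)
```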
We've obtained the predictions 'v', and we have the outputs 'y' in the training set. Now we need an assessment function. This function gives us an estimate of how well our perceptron is doing, i.e. how close our predicted outputs are to the actual outputs. It essentially gives us the error measure of our perceptron, hence it's called an error function or cost function, and it returns a single number (the error value) by which we grade our outputs. We denote the cost by 'J', which is a scalar (a single value), and the cost function by 'G'.
Cost function step
This is the step that incorporates the learning part of machine learning. We calculate the change in the weights that we need to make in order to minimize the cost 'J'. The partial derivatives of a function give you the direction of steepest ascent, also known as the gradient vector (the change in input that increases the output the most); we need the cost to descend, so we move in the direction opposite to the gradient. This algorithm is called Gradient Descent.
Given below is a graphical image depicting this in 2D, but the same concepts are applied to higher dimensions. The partial derivative equation is also shown below.
Gradient Descent in 2D
For more on gradient descent, check "Additional resources -> Gradient Descent" section of this page.
The derivative equation
You can visualize how the update moves by taking any equation as an example and trying to reach its minimum through this iterative method.
Rule to iteratively update the weights. Alpha 'α' is also known as the learning rate; it decides how big a step you take at each iteration.
Okay, we can finally discuss this. We've seen that the first column of the inputs 'x' consists of all 1s; I told you the reason behind this is to create a 'bias'. Well, the first weight (w1) is known as the bias. With the bias, we give the perceptron the ability to produce an output even when no input is given (when all the features in your input are 0s). There are many applications where a bias exists, for example:
So basically, we're enabling the perceptron to produce an output even when a zero vector is given as the input.
The above processing steps are performed iteratively over the input dataset to obtain weights that give better and better outputs (i.e. lower the cost).
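Tying the steps together, here's a rough sketch of the full loop under some assumptions not stated above: an identity activation (so the perceptron fits a straight line, matching the regression use-case) and the Mean Squared Error cost discussed later on this page. The dataset and learning rate are made up.

```python
import numpy as np

# Toy linear data: y = 2*x + 1, with a bias column of 1s in X.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w = np.zeros(2)      # initial weights
alpha = 0.05         # learning rate
m = len(y)

for epoch in range(2000):
    v = X @ w                        # weighted sum (identity activation)
    error = v - y
    J = (error ** 2).mean() / 2      # Mean Squared Error cost
    grad = X.T @ error / m           # partial derivatives of J w.r.t. each weight
    w -= alpha * grad                # move against the gradient

print(w)   # approaches [1.0, 2.0], i.e. the line y = 2x + 1
```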
Let's analyze different cost functions and activation functions
Here, we'll discuss the different activation functions and cost functions used. These concepts are not just for the perceptron; we can extend them to more sophisticated algorithms, as we'll see later.
As discussed in step 2 above, once we obtain the weighted sum, we pass it through a function known as the activation function. The basic job of this function, just like any function, is to take an input 'z' (the weighted sum) and return an output 'v' (the predicted output).
Note that these functions are applied element-wise, that is, to individual elements of the array independently. This actually improves computational efficiency because the work can be done in parallel.
This function is basically the sign of the number entered. If the number it receives is -ve, it returns -1, else it returns +1. Since it has both +ve and -ve outputs, it's bipolar, and since the outputs take only two states, it's binary as well. Such functions are called bipolar binary functions.
Signum function: A binary bipolar function. It is also called the hard limit symmetric function
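As a sketch (assuming the common convention that an input of exactly 0 maps to +1):

```python
import numpy as np

def signum(z):
    """Bipolar binary activation: -1 for negative inputs, +1 otherwise."""
    return np.where(z < 0, -1.0, 1.0)

print(signum(np.array([-2.5, 0.0, 3.1])))   # [-1.  1.  1.]
```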
It's the same as signum, but instead of -1, it returns 0. This makes it unipolar (only +ve outputs), but it still remains binary. Thus such functions are called unipolar binary functions.
Hard limit unipolar binary function
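A corresponding sketch, again mapping an input of exactly 0 to the positive output by assumption:

```python
import numpy as np

def hard_limit(z):
    """Unipolar binary activation: 0 for negative inputs, 1 otherwise."""
    return np.where(z < 0, 0.0, 1.0)

print(hard_limit(np.array([-2.5, 0.0, 3.1])))   # [0. 1. 1.]
```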
This function compresses the entire input space non-uniformly into the range 0 to 1. This gives a one-to-one mapping and is unipolar. Also, lambda 'λ' is usually taken as 1 for convenience. Since it yields a continuous output, such functions are called unipolar continuous functions.
Sigmoid activation function
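A sketch of the sigmoid, with lambda exposed as a parameter that defaults to 1:

```python
import numpy as np

def sigmoid(z, lam=1.0):
    """Unipolar continuous activation: squashes z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-lam * z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx [0.0067, 0.5, 0.9933]
```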
It's the same as above, but it goes from -1 to +1 instead of 0 to +1 like the sigmoid. Its equation is scaled and shifted accordingly; there's no other big difference. Since this is bipolar, such functions are called bipolar continuous activation functions.
Bipolar Sigmoid activation function
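And a sketch of the bipolar version, which works out to be a scaled tanh:

```python
import numpy as np

def bipolar_sigmoid(z, lam=1.0):
    """Bipolar continuous activation: squashes z into (-1, 1).
    Equivalent to tanh(lam * z / 2)."""
    return 2.0 / (1.0 + np.exp(-lam * z)) - 1.0

print(bipolar_sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx [-0.9866, 0.0, 0.9866]
```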
These are functions that give us a number measuring the performance of our hypothesis against the actual outputs. In essence, a cost function takes in two things, the desired outputs 'y' and the predicted outputs 'v', and returns the cost 'J', a single scalar value.
Let's look at a few cost functions.
Mean Squared Error cost function
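A minimal sketch of this cost, assuming the common form of half the mean of the squared differences (some texts drop the 1/2 or use a plain sum):

```python
import numpy as np

def mse_cost(y, v):
    """Mean Squared Error: half the average of the squared differences."""
    return ((v - y) ** 2).mean() / 2

y = np.array([1.0, 0.0, 1.0])
v = np.array([0.9, 0.2, 0.6])
print(mse_cost(y, v))   # 0.035
```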
Cross Entropy cost function
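And a sketch of the (binary) cross entropy cost, assuming the predictions 'v' lie strictly between 0 and 1 (as they do when they come from a sigmoid); the clipping is just a numerical safeguard:

```python
import numpy as np

def cross_entropy_cost(y, v, eps=1e-12):
    """Binary cross entropy: -mean( y*log(v) + (1-y)*log(1-v) )."""
    v = np.clip(v, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(v) + (1 - y) * np.log(1 - v))

y = np.array([1.0, 0.0, 1.0])
v = np.array([0.9, 0.2, 0.6])
print(cross_entropy_cost(y, v))   # roughly 0.28
```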
This is how you can compute the cost, essentially.
Here, we'll gain some intuition into the working of the gradient descent algorithm. Let's explore Gradient Descent in detail by studying the following figure.
Forget about our conventions for a moment; we'll speak just in terms of mathematical functions and variables now. Here are the details of what's happening:
This is the function sketched in blue, the function whose minimum we want to reach. We pass it a value and it returns a value; all such points are shown as circles on the blue curve in the graph above.
This is the derivative of the function 'f' defined above. It is also a function that takes a value and returns a value. Its value is the slope of the dotted lines, the coefficient of 'x' in each line's equation. It essentially tells you how steeply uphill the graph is at a point.
Updating rule in Gradient Descent
This is the essential process of gradient descent. The value of 'x' in the next iteration (x[i+1]) is determined by the current value minus the learning rate times the derivative at the current value: x[i+1] = x[i] − α · f'(x[i]).
Note that x[1] is chosen at random; it's 3 in the example above. This is called random initialization.
So now that we have the mathematical knowledge, here are the things happening in the example above:
Each dotted line touches the blue curve at one of the circled points and has a slope equal to the derivative there; one of them, for instance, has the equation y = 8x − 6. Such a line is also called a tangent. That's all that's happening here. I want you to focus on a few more things:
As the iterations proceed, x keeps moving towards x = −1, which is the minimum of the graph. So this algorithm can be used to find a minimum, given an equation and its derivative function.
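Here's a small sketch of that iterative process. The exact function used in the figure isn't reproduced in the text, so the code assumes, for illustration, f(x) = x² + 2x + 3, which has its minimum at x = −1 and a tangent of slope 8 at x = 3, consistent with the numbers in the example.

```python
def f(x):
    return x ** 2 + 2 * x + 3      # assumed example function, minimum at x = -1

def f_prime(x):
    return 2 * x + 2               # its derivative (slope of the tangent lines)

x = 3.0          # random initialization, x[1] = 3 as in the example
alpha = 0.1      # learning rate

for i in range(50):
    x = x - alpha * f_prime(x)     # x[i+1] = x[i] - alpha * f'(x[i])

print(x)         # approaches -1.0, the minimum of f
```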