The perceptron is a supervised learning algorithm that can be used for simple linear regression. This basically means that if your dataset has a linear pattern, this algorithm can detect it. It is the simplest such algorithm and the easiest to implement. Here, we'll discuss the mathematics behind the perceptron and use it to fit a line through some data (which is the simplest form of regression).
Before we get started, let's establish a few conventions. These may differ from source to source; we'll be using the following conventions throughout this page. Keep a note of them, and don't worry if you don't understand everything right away. You can always scroll back to these lines...
As you can see, the inputs are denoted by 'x' and are represented as a matrix with 'm' rows and 'n' columns, each column representing some feature. Since these inputs are used to train our algorithm, they're also called the training examples. That makes the first three conventions clear. Bear with me here; I'll explain the +1 in the 'n' value later on this page when we discuss the working, but for now, understand that the first column is generally all 1s and the following columns hold the actual data (we'll call this first column the bias column, and I'll explain it later on this page).
Since we're doing supervised learning, we need labelled data. By labelled data, I mean that each training example must come with a known, definite output. These are called the desired outputs or actual outputs and are denoted by 'y'. As you can observe from the image, they form a matrix with 'm' rows and 1 column (they can have more, but a single perceptron can have only one output).
Now that we understand how a dataset looks and how to represent it as a matrix, let's understand the working of a perceptron.
First, have a look at the image below. We'll discuss what all of this means, but for now just keep a note of it.
Mathematical model of a perceptron
In the above diagram, 'w' stands for the weights of the connections. Every feature, including the bias column, is assigned a weight; think of it as the importance the perceptron gives that feature when deciding the output. The output of the perceptron is denoted by 'v'. Keep these in mind as we go ahead.
Activation function step
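To make this step concrete, here's a minimal NumPy sketch of the forward pass. The dataset, weights, and numbers are made up for illustration; the names X, w, z and v simply mirror the conventions above, and the first column of X is the bias column of 1s.

```python
import numpy as np

# Toy training set: m = 4 examples, n = 3 columns (bias column of 1s + 2 features).
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 0.5, 1.5],
              [1.0, 3.0, 0.5],
              [1.0, 1.0, 1.0]])

w = np.array([0.1, -0.2, 0.4])   # one weight per column; w[0] pairs with the bias column

z = X @ w                        # weighted sum for every training example
v = np.where(z < 0, -1.0, 1.0)   # signum-style activation -> predicted outputs
print(v)
```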
We've obtained the predictions 'v', and we have the outputs 'y' in the training set. Now we need an assessment function. This function gives us an estimate of how well our perceptron is doing, i.e. how close our predicted outputs are to the actual outputs. It essentially gives us the error measure of our perceptron, hence it's called an error function or cost function, and it returns a single number (the error value) by which we grade our outputs. We denote the cost by 'J', which is a scalar (a single value), and the cost function by 'G'.
Cost function step
This is the step that incorporates the learning part of machine learning. We calculate the change in the weights that we need to make in order to minimize the cost 'J'. The partial derivatives of a function give you the direction of steepest ascent, also known as the gradient vector (the change in input that increases the output the most); we need the cost to descend, so we move in the direction opposite to the gradient. This algorithm is called Gradient Descent.
Given below is a graphical image depicting this in 2D, but the same concepts are applied to higher dimensions. The partial derivative equation is also shown below.
Gradient Descent in 2D
For more on gradient descent, check "Additional resources -> Gradient Descent" section of this page.
The derivative equation
You can visualize how the update moves by taking any equation as an example and trying to reach its minimum through this iterative method.
Rule to iteratively update the weights. Alpha 'α' is also known as the learning rate; it decides how big a step you take at each iteration.
Okay, we can finally discuss this. We've seen that the first column of the inputs 'x' consists of all 1s; I told you the reason behind this is to create a 'bias'. Well, the first weight (w1) is known as the bias. With the bias, we give the perceptron the ability to produce an output even when no input is given (when all the features in your input are 0s). There are many applications where a bias exists, for example:
So basically, we're enabling the perceptron to produce an output even when a zero vector is given as the input.
The above processing steps are performed iteratively over the input dataset to obtain weights that give better and better outputs (i.e. lower the cost).
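Tying the steps together, here's a rough sketch of the full loop under some assumptions not stated above: an identity activation (so the perceptron fits a straight line, matching the regression use-case) and the Mean Squared Error cost discussed later on this page. The dataset and learning rate are made up.

```python
import numpy as np

# Toy linear data: y = 2*x + 1, with a bias column of 1s in X.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w = np.zeros(2)      # initial weights
alpha = 0.05         # learning rate
m = len(y)

for epoch in range(2000):
    v = X @ w                        # weighted sum (identity activation)
    error = v - y
    J = (error ** 2).mean() / 2      # Mean Squared Error cost
    grad = X.T @ error / m           # partial derivatives of J w.r.t. each weight
    w -= alpha * grad                # move against the gradient

print(w)   # approaches [1.0, 2.0], i.e. the line y = 2x + 1
```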
Let's analyze different cost functions and activation functions
Here, we'll discuss the different activation functions and cost functions used. These concepts are not just for the perceptron; we can extend them to more sophisticated algorithms, as we'll see later.
As discussed in step 2 above, once we obtain the weighted sum, we pass it through a function known as the activation function. The basic job of this function, just like any function, is to take an input 'z' (the weighted sum) and return an output 'v' (the predicted output).
Note that these functions are applied element-wise, that is, to individual elements of the array independently. This actually improves computational efficiency because the work can be done in parallel.
This function is basically the sign of the number entered. If the number it receives is -ve, it returns -1, else it returns +1. Since it has both +ve and -ve outputs, it's bipolar, and since the outputs take only two states, it's binary as well. Such functions are called bipolar binary functions.
Signum function: A binary bipolar function. It is also called the hard limit symmetric function
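As a sketch (assuming the common convention that an input of exactly 0 maps to +1):

```python
import numpy as np

def signum(z):
    """Bipolar binary activation: -1 for negative inputs, +1 otherwise."""
    return np.where(z < 0, -1.0, 1.0)

print(signum(np.array([-2.5, 0.0, 3.1])))   # [-1.  1.  1.]
```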
It's the same as signum, but instead of -1, it returns 0. This makes it unipolar (only +ve outputs), but it still remains binary. Thus such functions are called unipolar binary functions.
Hard limit unipolar binary function
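A corresponding sketch, again mapping an input of exactly 0 to the positive output by assumption:

```python
import numpy as np

def hard_limit(z):
    """Unipolar binary activation: 0 for negative inputs, 1 otherwise."""
    return np.where(z < 0, 0.0, 1.0)

print(hard_limit(np.array([-2.5, 0.0, 3.1])))   # [0. 1. 1.]
```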
This function compresses the entire input space non-uniformly into the range 0 to 1. This gives a one-to-one mapping and is unipolar. Also, lambda 'λ' is usually taken as 1 for convenience. Since it yields a continuous output, such functions are called unipolar continuous functions.
Sigmoid activation function
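A sketch of the sigmoid, with lambda exposed as a parameter that defaults to 1:

```python
import numpy as np

def sigmoid(z, lam=1.0):
    """Unipolar continuous activation: squashes z into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-lam * z))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx [0.0067, 0.5, 0.9933]
```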
It's the same as above, but it goes from -1 to +1 instead of 0 to +1 like the sigmoid. Its equation is scaled and shifted accordingly; there's no other big difference. Since this is bipolar, such functions are called bipolar continuous activation functions.
Bipolar Sigmoid activation function
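And a sketch of the bipolar version, which works out to be a scaled tanh:

```python
import numpy as np

def bipolar_sigmoid(z, lam=1.0):
    """Bipolar continuous activation: squashes z into (-1, 1).
    Equivalent to tanh(lam * z / 2)."""
    return 2.0 / (1.0 + np.exp(-lam * z)) - 1.0

print(bipolar_sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx [-0.9866, 0.0, 0.9866]
```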
These are functions that give us a number measuring the performance of our hypothesis against the actual outputs. In essence, a cost function takes in two things, the desired outputs 'y' and the predicted outputs 'v', and returns the cost 'J', a single scalar value.
Let's look at a few cost functions.
Mean Squared Error cost function
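A minimal sketch of this cost, assuming the common form of half the mean of the squared differences (some texts drop the 1/2 or use a plain sum):

```python
import numpy as np

def mse_cost(y, v):
    """Mean Squared Error: half the average of the squared differences."""
    return ((v - y) ** 2).mean() / 2

y = np.array([1.0, 0.0, 1.0])
v = np.array([0.9, 0.2, 0.6])
print(mse_cost(y, v))   # 0.035
```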
Cross Entropy cost function
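And a sketch of the (binary) cross entropy cost, assuming the predictions 'v' lie strictly between 0 and 1 (as they do when they come from a sigmoid); the clipping is just a numerical safeguard:

```python
import numpy as np

def cross_entropy_cost(y, v, eps=1e-12):
    """Binary cross entropy: -mean( y*log(v) + (1-y)*log(1-v) )."""
    v = np.clip(v, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(v) + (1 - y) * np.log(1 - v))

y = np.array([1.0, 0.0, 1.0])
v = np.array([0.9, 0.2, 0.6])
print(cross_entropy_cost(y, v))   # roughly 0.28
```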
This is how you can compute the cost, essentially.
Here, we'll gain some intuition into the working of the gradient descent algorithm. Let's explore Gradient Descent in detail by studying the following figure.
Forget about our conventions for a moment; we'll speak just in terms of mathematical functions and variables now. Here are the details of what's happening:
This is the function sketched in blue, the function whose minimum we want to reach. We pass it a value and it returns a value; all such points are shown as circles on the blue curve in the graph above.
This is the derivative of the function 'f' defined above. It is also a function that takes a value and returns a value. Its value is the slope of the dotted lines, the coefficient of 'x' in each line's equation. It essentially tells you how steeply uphill the graph is at a point.
Updating rule in Gradient Descent
This is the essential process of gradient descent. The value of 'x' in the next iteration (x[i+1]) is determined by the current value minus the learning rate times the derivative at the current value: x[i+1] = x[i] − α · f'(x[i]).
Note that x[1] is chosen at random; it's 3 in the example above. This is called random initialization.
So now that we have the mathematical knowledge, here are the things happening in the example above:
Each dotted line touches the blue curve at one of the circled points and has a slope equal to the derivative there; one of them, for instance, has the equation y = 8x − 6. Such a line is also called a tangent. That's all that's happening here. I want you to focus on a few more things:
As the iterations proceed, x keeps moving towards x = −1, which is the minimum of the graph. So this algorithm can be used to find a minimum, given an equation and its derivative function.
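Here's a small sketch of that iterative process. The exact function used in the figure isn't reproduced in the text, so the code assumes, for illustration, f(x) = x² + 2x + 3, which has its minimum at x = −1 and a tangent of slope 8 at x = 3, consistent with the numbers in the example.

```python
def f(x):
    return x ** 2 + 2 * x + 3      # assumed example function, minimum at x = -1

def f_prime(x):
    return 2 * x + 2               # its derivative (slope of the tangent lines)

x = 3.0          # random initialization, x[1] = 3 as in the example
alpha = 0.1      # learning rate

for i in range(50):
    x = x - alpha * f_prime(x)     # x[i+1] = x[i] - alpha * f'(x[i])

print(x)         # approaches -1.0, the minimum of f
```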