In this part, we review a set of popular supervised learning methods that are widely used in finance and insurance. The list, however, is not limited to the ones reviewed here.
Supervised learning is a group of algorithms that try to characterize the relation between inputs and their associated outputs. For instance, consider a survey of people's incomes, where for each person we record characteristics such as age, education, and gender, together with the person's income. Supervised learning tries to find a relation between the characteristics (here age, education, gender, etc.) and the associated value or class (here income). Moreover, the resulting model should let us make a sound prediction of the associated value (here income) for a new individual, given only their characteristics.
Definition. In ML, the input characteristics of the individuals are called features, and the associated output(s) are called labels. Features are also known as inputs, independent variables, or predictors; labels are also known as outputs. Supervised learning finds a relationship with good forecasting power that maps features to labels.
Supervised learning is categorized into two sub-categories:
Regression: the label takes a continuous value (or a real vector).
Classification: the label belongs to one of a discrete set of classes (often two classes, labeled 0 and 1).
In the picture, linear regression is shown on the left, where one tries to find a linear relationship between the features and the labels by fitting a straight line. On the right-hand side, there are two classes of red and blue objects that a model tries to separate.
Regression is the type of supervised learning where labels take continuous values. For instance, suppose we want to predict the price of an asset. We can take major economic factors, such as inflation and the unemployment rate, together with the price of the same asset over the last seven days, as inputs (features), and the asset price as the output (label). We are essentially interested in finding a relation between these factors and previous asset prices in order to predict today's asset price. As one can see, the label, the asset price, takes a continuous value.
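As a minimal sketch of such a regression (with a synthetic random-walk price series standing in for real data; all values are illustrative), one can build the seven-day lag features and fit an ordinary-least-squares model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic daily prices (a random walk) standing in for real asset prices.
prices = 100 + np.cumsum(rng.normal(0, 1, 200))

# Features: the previous 7 days' prices; label: today's price.
window = 7
X = np.array([prices[t - window:t] for t in range(window, len(prices))])
y = prices[window:]

# Ordinary least squares: y ≈ X @ w + b (b via an appended column of ones).
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

mse = np.mean((A @ coef - y) ** 2)
print(f"in-sample MSE: {mse:.4f}")
```

A real application would add the economic factors as extra columns of `X`; the mechanics are the same.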
In classification, the aim is to predict the class of a given sample; the labels belong to two or more classes. It is usually enough to consider two classes, since any multiclass problem can be reduced to a collection of two-class problems.
A classifier is a function that is trained to correctly distinguish two or more classes (e.g., 0 and 1), given the inputs. In mathematical terms, a classifier is a function from the individuals' features to the set {0, 1}.
In practice, however, a classifier usually returns a probability rather than just the numbers 0 and 1. To obtain a hard classifier, we round the probability up to 1 if it is greater than 0.5, and down to 0 otherwise.
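The rounding step can be sketched as follows (the probabilities are made-up values):

```python
import numpy as np

# Hypothetical predicted probabilities from some classifier.
probs = np.array([0.12, 0.55, 0.50, 0.91, 0.49])

# Hard labels: 1 if the probability is greater than 0.5, else 0.
labels = (probs > 0.5).astype(int)
print(labels)  # [0 1 0 1 0]
```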
For instance, consider the same example of asset price prediction, but this time suppose we only want to predict the direction of the asset price. This lets us design a trading strategy: buy the asset if the prediction is that the price goes up, and sell (or short) if it goes down. Here we have two classes: 0 for a downward price move and 1 for an upward one. The task is to predict the direction of the asset price given the economic factors and the prices of the last seven days.
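Turning a price series into such direction labels might look like this (with made-up prices):

```python
import numpy as np

# Made-up closing prices for a few consecutive days.
prices = np.array([100.0, 101.5, 101.2, 103.0, 102.4])

# Class 1 = upward move, class 0 = downward (or flat) move.
direction = (np.diff(prices) > 0).astype(int)
print(direction)  # [1 0 1 0]
```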
Any ML algorithm has many parameters that need to be estimated. However, as discussed in the introduction, the beauty of ML is to reduce a complex parameter-estimation exercise to a low-dimensional problem: choosing the degree of model complexity. The model complexity is represented by a few extra parameters that must be set before estimation or training; these parameters are called hyperparameters. Let us discuss an example. Consider a linear regression with n independent variables (now known as features). One can introduce more features by taking products of the original features, which results in higher-order features. The order is a hyperparameter.
In the following, we discuss how we can measure the goodness of both parameters and hyperparameters.
Here we discuss how to partition the data set for training (estimation), validation, and testing. The whole data set is usually divided into three parts:
The largest part, called the training set, is used to fit (train) the model.
The second, smaller part is called the validation set and is used to validate the model across a range of model complexities. In other words, for models of different complexities, we compare the in-sample error (from training) with the out-of-sample error (from validation) to find an optimal complexity value.
Finally, the third part of the data is called the test set. It is used to assess the validated model and check that it really performs well.
The split can be done in different proportions, either 60/20/20, 70/15/15, or 80/10/10 percent.
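A 60/20/20 split, for example, can be obtained with two calls to scikit-learn's `train_test_split` (the toy arrays are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 toy samples with 2 features
y = np.arange(50)

# Carve out 20% for testing, then 25% of the rest (= 20% of the total)
# for validation, giving a 60/20/20 split.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```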
Here you can see a schematic view of a supervised learning model. In the first n rows, we map p features (predictors) to the labels (outputs). These n rows are used to train a model that minimizes the error (i.e., the expected loss); the model then needs to be validated on the validation set. Once validated, it is evaluated on the test set.
A central task in ML is model validation. As we have seen, models can be validated on a validation set, but a more comprehensive approach is K-fold cross-validation. In this method, the data is partitioned into K equal parts (for instance, K = 10), and the model is trained K times, each time on K − 1 parts and validated on the remaining part. Finally, we take the average of the training and validation errors as the total error.
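A sketch of 10-fold cross-validation on synthetic data, averaging the per-fold validation errors (the data-generating process here is made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 100)

# Train on K-1 folds, validate on the held-out fold, K = 10 times.
kf = KFold(n_splits=10, shuffle=True, random_state=0)
val_errors = []
for train_idx, val_idx in kf.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    val_errors.append(np.mean((model.predict(X[val_idx]) - y[val_idx]) ** 2))

# The cross-validated error is the average over the K folds.
print(f"10-fold CV error: {np.mean(val_errors):.4f}")
```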
The pictures summarize the ML process: on the left-hand side, the model is trained and validated; if it passes validation, it is tested (middle). One can also use a more sophisticated validation method such as K-fold cross-validation.
First, the data is imported and then partitioned into training and validation sets. Usually, the validation split is around 10/90, 20/80, or 30/70 percent. This helps us find the hyperparameters.
Then the model is trained, evaluated on the validation set, and finally scored.
Going one step further, we can apply k-fold cross-validation to validate the model, repeating the training/validation process k times. To make sure no information leaks from the cross-validation process, we hold out a separate test set and test the final model on it.
In an ML algorithm, we have to measure the error of the predictions. The mean square error (MSE) is the average of the squared errors. If it equals 0, the predictions are exact; in reality, we can only aim to minimize it, since making it exactly zero is impossible.
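Numerically, for example (with made-up labels and predictions):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])  # illustrative labels
y_pred = np.array([2.5,  0.0, 2.0, 8.0])  # illustrative predictions

# MSE: the average of the squared prediction errors.
mse = np.mean((y_true - y_pred) ** 2)
print(mse)  # 0.375
```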
One of the main issues in ML is the so-called bias/variance trade-off. It is essentially a consequence of the fact that the mean square error can be written as the sum of the variance and the squared bias (plus an irreducible noise term). Since the total error of a model is bounded below, decreasing one of the two typically increases the other.
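For labels generated as y = f(x) + ε with noise variance σ², the decomposition of the expected squared error of an estimator f̂ reads (the σ² term is the irreducible noise, which can be absorbed into the lower bound on the total error):

```latex
\mathbb{E}\big[(y-\hat f(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat f(x)]-f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\Big[\big(\hat f(x)-\mathbb{E}[\hat f(x)]\big)^2\Big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```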
Balancing the two, while keeping the model well fitted on the training set, is believed to be a good way of making good predictions.
The next major issue in ML modeling is overfitting/underfitting, which is closely tied to bias and variance. Overfitting means a good fit on the in-sample data but a bad fit on the out-of-sample data; underfitting means a bad fit even on the in-sample data. As you can see in the picture, on the left-hand side a polynomial of degree 11 is fitted to the training set and overfits the data: this model has high variance and very low predictive power. In the middle, we have an underfitted linear model, which has high bias. The last model, on the right-hand side, is a polynomial of degree 2 that is neither overfitted nor underfitted; it strikes a good balance, with neither high variance nor high bias.
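The degree-1/degree-2/degree-11 comparison can be reproduced on synthetic quadratic data (all values here are illustrative): the training error keeps falling as the degree grows, while the validation error typically exposes the underfit and overfit models.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy quadratic ground truth: the "right" model has degree 2.
x = np.linspace(-3, 3, 30)
y = 0.5 * x**2 - x + rng.normal(0, 0.5, x.size)
x_val = np.linspace(-2.9, 2.9, 15)
y_val = 0.5 * x_val**2 - x_val + rng.normal(0, 0.5, x_val.size)

results = {}
for degree in (1, 2, 11):
    coefs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    val_mse = np.mean((np.polyval(coefs, x_val) - y_val) ** 2)
    results[degree] = (train_mse, val_mse)
    print(f"degree {degree:2d}: train {train_mse:.3f}  validation {val_mse:.3f}")
```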
A learning curve is any curve that shows how the learning process evolves as some quantity varies. In particular, we are interested in how learning evolves as we increase the number of data points (as discussed above in the bias/variance analysis) or the model complexity. This helps us decide on the optimal amount of data and the optimal complexity. Interestingly, adding data while the complexity stays fixed, or using more complex models while the number of data points does not grow, does not always help: in the former case we face high bias, and in the latter high variance. The optimal situation can be identified by inspecting the learning curves.
One way to see how learning evolves is to look at the evolution of the errors with respect to the data size. As shown in the following picture, increasing the data size decreases the validation error, meaning the model performs better out of sample and makes better predictions. At the same time, the training error increases with the data size, since a flexible model can fit a small training set almost perfectly; indeed, for the complex models the training error starts at zero, which makes sense. A large gap between the two curves is a sign of overfitting. Finally, more complex models decrease the total error, going from the black to the blue lines.
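A sketch of such a learning curve over data size, using a flexible polynomial model on synthetic data (everything here is illustrative): the training error tends to grow with n, while the validation error tends to shrink.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """Noisy samples of a sine curve, standing in for real data."""
    x = rng.uniform(-3, 3, n)
    return x, np.sin(x) + rng.normal(0, 0.3, n)

x_val, y_val = make_data(200)  # a fixed validation set
degree = 8                     # a fairly flexible polynomial model

results = {}
for n in (10, 40, 160):
    x, y = make_data(n)
    coefs = np.polyfit(x, y, degree)
    train_mse = np.mean((np.polyval(coefs, x) - y) ** 2)
    val_mse = np.mean((np.polyval(coefs, x_val) - y_val) ** 2)
    results[n] = (train_mse, val_mse)
    print(f"n = {n:3d}: train {train_mse:.3f}  validation {val_mse:.3f}")
```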
These curves show how bias and variance evolve with model complexity and data size. More complex models decrease the bias but increase the variance, while less complex models decrease the variance but increase the bias. The balance lies in the middle.
This graph shows how complexity, bias/variance, and the chance of overfitting/underfitting interact. Higher complexity (dimension) is mainly associated with higher variance and a greater chance of overfitting, while lower complexity can cause high bias and underfitting.