Let us consider two types of objects in a two-dimensional space: blues (triangles) and reds (circles). A natural question is whether we can design an algorithm that perfectly separates the two groups. There are many ways to do this; for instance, we can use a curvy line (left), a rectangle (middle), or an oval (right). As one can see, the boundary on the left gives a perfect split of the blue and red objects, which might convince us to choose that algorithm. But this is not necessarily useful, because it is only good for the existing objects. The real aim of finding a good algorithm is to successfully classify new entries. For instance, suppose the black triangle is a new entry belonging to the blue objects. Even though the method on the left does a perfect job of splitting the original data, it fails to classify the new object correctly, unlike the other two. One may then ask: if there is a general rule behind the locations of the blues and reds, which algorithm can best exploit that rule to predict the location of new entries?
In this section, we will introduce different methods that classify two groups of objects based on their features, in such a way that the resulting algorithm also makes good predictions on new data.
For a problem with two classes (which covers most classification problems), a popular approach is to label the classes 0 and 1. Although this seems to be a merely nominal change, it is very useful because it moves the classification problem into the realm of probability. If a class, say A, is labeled 1, then one can interpret the value 0 as a 0 percent probability of belonging to A, and 1 as a 100 percent probability of belonging to A. In this way a classifier can be viewed as a function that maps features (predictors) to probabilities. This does, however, raise a conceptual issue: once the model is fitted to the training set, on new data (e.g., the validation or test set) the classifier no longer returns exactly 0 or 1 but a probability. That is why, in addition to the classifier itself, we need a threshold, above which the predicted probability is identified as 1 and below which it is identified as 0.
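The thresholding step above can be sketched in a few lines of Python; the probabilities below are made up for illustration, and the 0.5 default is just the common choice, not a rule.

```python
# A minimal sketch: turning predicted probabilities into hard 0/1 labels
# via a threshold. The probability values here are invented toy data.

def to_labels(probs, threshold=0.5):
    """Map predicted probabilities to class labels 0 and 1."""
    return [1 if p >= threshold else 0 for p in probs]

probs = [0.10, 0.48, 0.55, 0.91]
print(to_labels(probs))        # default threshold 0.5 -> [0, 0, 1, 1]
print(to_labels(probs, 0.7))   # stricter threshold    -> [0, 0, 0, 1]
```

Note that changing the threshold changes which predictions become 1, which is exactly why the threshold is part of the model's specification.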
To obtain a good model, we have seen that we must reduce the MSE by applying methods that reduce bias and variance, for instance by using more data and more complex models. But if we simply want to compare a few models fit to the same data, we need measures for the comparison. While we could again compare the MSE of two models, classification in particular offers many other measures. These measures are designed to serve different objectives: for instance, we may prefer a more precise model or a more sensitive one. To see the difference, let us give two examples.
The first is classifying patients with a particular disease (say cancer). Here we really want the classifier to be sensitive: we would tolerate some false positives (healthy patients diagnosed as ill), but we cannot tolerate false negatives (ill patients diagnosed as healthy). The second example concerns credit cards. A credit card company wants a good pool of clients, so it wants to be precise: it would tolerate some false negatives (clients with good credit who are rejected), but it will not tolerate false positives (low-credit clients classified as good credit).
Clearly, we need an algorithm that makes a "good" classification. Depending on the application, a "good" classifier can be accurate, precise, sensitive, or a balance of these. Such measures are called performance measures and include accuracy, precision, specificity (or selectivity), sensitivity (or recall), the F1 score, and support. There are also charts, including the ROC curve and the gain and lift charts.
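All of these measures derive from the confusion matrix, i.e., the counts of true/false positives and negatives. A minimal sketch, with invented toy labels:

```python
# Compute common performance measures from the confusion matrix.
# y_true and y_pred below are invented toy data for illustration.

def confusion(y_true, y_pred):
    """Return (TP, TN, FP, FN) counts for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
tp, tn, fp, fn = confusion(y_true, y_pred)

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)              # how trustworthy a "1" is
sensitivity = tp / (tp + fn)              # recall: fraction of 1s caught
specificity = tn / (tn + fp)              # selectivity
f1 = 2 * precision * sensitivity / (precision + sensitivity)
```

In the cancer example one would tune the model (or threshold) to raise sensitivity, while the credit card company would favor precision.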
There are many applications that can be framed as classification problems. Examples include recommender systems, credit scoring, trading strategies (e.g., predicting whether returns will be positive), client analysis, fraud detection (e.g., for insurance claims), and many others.
Let's consider an insurance company that wants to use machine learning to make targeted offers and grow its number of clients. This is a well-known problem: we need to find a good classifier and produce the gain and lift charts, which will be discussed later. To better understand the problem, however, we first need to understand how a classifier works. As explained in the introduction, a classifier is a function that is trained to correctly distinguish two classes (positive = 1 or negative = 0) from the input features of the data. In mathematical terms, a classifier is a function from an individual's features (characteristics such as age, salary, gender, etc.) to 0 and 1. In practice, however, a classifier returns a probability, not just 0 or 1. To obtain hard labels we usually round the probability up if it is greater than 0.5 and down otherwise. This means a trained classifier carries much more information: for instance, if for two individuals A and B the classifier returns 0.6 and 0.9, respectively (so both are identified as 1), then B, with probability 0.9, would be considered more likely to become a client.
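This ranking idea is what the gain and lift charts build on: rather than only thresholding, we sort individuals by predicted probability and contact the most promising first. A small sketch with hypothetical names and scores:

```python
# Hypothetical predicted probabilities for four prospective clients.
scores = {"A": 0.6, "B": 0.9, "C": 0.3, "D": 0.75}

# With a 0.5 threshold, A, B, and D are all classified as 1,
# but ranking by probability tells us whom to contact first.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # B first, then D, A, C
```

Gain and lift charts are obtained precisely by walking down such a ranking and tracking how many actual positives are captured.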
Here is an example of a population of 1000 individuals, where the main objective is to predict each individual's conversion status. There are three features, education, employability, and marital status, each of which is either 0 or 1. The label is the conversion status, which is also either 1 or 0. The last column shows how many labels are 1 and how many are 0. In the next few slides, we will show how one can classify this population using trees.
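Since the actual table is not reproduced here, the following sketch builds a stand-in population of the same shape, with three binary features and a binary conversion label drawn at random, and tallies the labels as in the last column. All values are synthetic, not the data from the slide.

```python
import random

# Synthetic stand-in for the slide's population: 1000 individuals,
# three binary features, and a binary conversion label.
random.seed(0)

population = [
    {
        "education":     random.randint(0, 1),
        "employability": random.randint(0, 1),
        "marital":       random.randint(0, 1),
        "conversion":    random.randint(0, 1),
    }
    for _ in range(1000)
]

# Tally of labels, as in the last column of the table.
converted = sum(row["conversion"] for row in population)
print(converted, 1000 - converted)
```

A tree-based classifier would then split this population on the three features, which is the approach taken in the next few slides.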
This example will be used in other sections.