A decision tree is an ML model that splits the samples into sub-populations in order to categorize them better and to give higher-resolution information about the whole population. The splits are defined by rules, that is, by questions about the values of the features. Among all rules that can create a split, the one that carries the most information is prioritized.
The main characteristics of the decision trees are:
There are decision trees for both Classification and Regression
Trees are rule-based and interpretable, which means the splits can be expressed as rules of the form:
If A and B or C ⇒ D
At each step, the population is split on the most significant splitter
Let us see how to form a tree from the example in the introduction of the classification section. At this point, no question has yet been asked to split the population. In the data, there are three characteristics (here, features) on which we can base our rules.
Let us first consider marital status. We proceed as follows. In the first box, we write the number of people, 1000, in brackets, and underneath we show how they split into the two classes using different colors: 900 (black) no-conversion and 100 (red) conversion. Note that we can also read the same example as a credit-risk problem, where no-conversion means creditworthy and conversion means bad credit. According to marital status, the population splits into two groups: married and unmarried. The number of married individuals, 700, is shown in the box on the left, and the number of unmarried individuals, 300, in the box on the right. We then repeat the first step for each box (sub-group): we write the number of people in brackets, (700) married and (300) unmarried, and show the class counts underneath. For instance, 650 of the married individuals (left box) do not convert, which we write in black, and the remaining 50 convert, which we write in red.
We repeat the same process for the boxes of married and unmarried people, this time asking about their employability. We then have four boxes, and in each box the numbers of no-converts and converts are written in black and red, respectively. Now we can record in a table the probability of no-conversion and conversion associated with each final box. For instance, consider the box on the far left, which contains the people who are married and employed; there are 400 of them. Among them, 10 will convert and 390 will not. Therefore, for people who are married and employed, the probability of conversion is 1/40 and the probability of no-conversion is 39/40.
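As a minimal sketch in Python, the probabilities in the table can be read off directly from the class counts; only the "married and employed" leaf from this example is shown, and the other leaves would be added in the same way.

```python
# Class counts (no-convert, convert) per final box (leaf) of the depth-two tree.
# Only the "married & employed" leaf uses counts from the example above.
leaf_counts = {
    "married & employed": (390, 10),
}

for leaf, (no_conv, conv) in leaf_counts.items():
    total = no_conv + conv
    print(f"{leaf}: P(convert) = {conv / total:.4f}, P(no-convert) = {no_conv / total:.4f}")
# married & employed: P(convert) = 0.0250 (= 1/40), P(no-convert) = 0.9750 (= 39/40)
```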
There are a few points that we must consider.
It is best to always keep the same direction for the ‘Yes’ and ‘No’ answers. Here, moving to the left means a ‘Yes’ answer and moving to the right means a ‘No’ answer.
The first box is called the root, then we have branches, and finally, we end up with leaves (the final boxes).
This tree is of depth two since it only asks two questions.
There are three main questions that we need to ask ourselves now:
How can we use this tree to make a prediction and classification?
Could we have started from a different question, for instance education or employability instead of marital status, and obtained a split that is more accurate and carries more information?
Is it worth carrying on and having a deeper tree?
First, to answer the first question, let us see how we can use the tree to make predictions and classifications. Assume that we want to focus on the no-convert status; that is, we are interested in whether somebody’s status is no-convert or not. For instance, if we see a new customer who is married and employed, do we predict that he or she converts or not? Looking at the table, we first need to find the probabilities of convert and no-convert for such a person, which from the tree is
P(no-convert)=39/40.
It seems this person is very unlikely to convert, but without a probability threshold we cannot say whether this probability really means no-convert. So let us take a probability threshold of 0.5: if the probability of no-convert is above 0.5, then the prediction is no-convert (or 1).
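A minimal sketch of this thresholding rule (the 0.5 threshold is the one chosen above):

```python
def predict_no_convert(p_no_convert, threshold=0.5):
    """Predict 1 (no-convert) if the leaf probability exceeds the threshold, else 0."""
    return 1 if p_no_convert > threshold else 0

print(predict_no_convert(39 / 40))  # 1 -> the married & employed customer is predicted not to convert
```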
Answering the second and third questions is not so easy. For instance, consider cases in which we want to expand the tree further and produce a larger tree with more leaves. A tree with more branches clearly fits the training data more accurately. However, it is not clear how much more information this new tree carries compared with the previous one. Is it worth taking this step, or even going further?
Here, we have a fully developed tree where all information is used. The question is, do we really need to move forward and grow the tree to this stage?
Just to recall, apart from predictive performance, there are two main questions that we need to answer when building a tree:
Which question do we need to ask next?
How deep do we need to go?
Here we introduce two criteria for splitting any leaf in a tree; they measure how useful it would be to continue by asking a new question. The first measure is called the Gini index and the second is called entropy. When we split a leaf by asking a question, the split is a good one if the probabilities it produces (like the ones computed in the previous slides) are either very close to 1 or very close to 0, since this shows the split is very efficient and strongly classifies the population. Both measures quantify this. For instance, the Gini index is a sum of the products p(1-p) over the classes of each branch of the split (there might be more than two branches, unlike in our example), and it needs to be close to 0, since that shows that either p or 1-p is very close to 0. So, if we compare the Gini indexes of two different splits, the one that gives the smaller Gini index is preferred. This suggests the following way of growing a tree: find the question that gives the smallest Gini index, form the resulting leaves, and repeat the same process for each leaf. We continue this process until adding more leaves no longer improves the tree significantly.
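A small Python sketch of the two measures, using the standard definitions (Gini index as the sum of p(1-p) over the classes, entropy as -sum of p log2 p):

```python
import math

def gini(counts):
    """Gini index of a leaf from its class counts: sum_i p_i * (1 - p_i)."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

def entropy(counts):
    """Entropy of a leaf from its class counts: -sum_i p_i * log2(p_i)."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Probabilities close to 0 or 1 give values close to 0 (a strong split) ...
print(gini([390, 10]), entropy([390, 10]))
# ... while a 50/50 leaf gives the worst possible values (0.5 and 1.0).
print(gini([200, 200]), entropy([200, 200]))
```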
To use this as an algorithm, we proceed as follows (a code sketch is given after the list):
Start with a feature.
Use the feature to classify the data.
Find its Gini (or entropy) index.
Choose the feature that gives the lowest Gini and set it as the root of the tree.
Repeat the process at each leaf of the resulting tree, and continue.
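The following sketch illustrates the root-selection step. The marital-status counts are taken from the example above; the counts for the other two splits are hypothetical placeholders, chosen only so that the class totals still add up to 900 and 100.

```python
def gini(counts):
    """Gini index of a leaf from its class counts."""
    n = sum(counts)
    return sum((c / n) * (1 - c / n) for c in counts)

def weighted_gini(split):
    """Gini index of a split: each leaf's Gini weighted by its share of the samples."""
    total = sum(sum(leaf) for leaf in split)
    return sum(sum(leaf) / total * gini(leaf) for leaf in split)

# Class counts (no-convert, convert) in the two leaves produced by each candidate question.
# The marital split uses the counts from the example; the other two are hypothetical.
candidate_splits = {
    "marital":       [[650, 50], [250, 50]],
    "employability": [[700, 40], [200, 60]],
    "education":     [[720, 20], [180, 80]],
}

root = min(candidate_splits, key=lambda f: weighted_gini(candidate_splits[f]))
print(root)  # "education" gives the lowest weighted Gini among these counts
```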
Here we look again at the example that we studied earlier. Let us assume we only want to develop a tree of depth one. We have three features: marital status, employability, and education. We split the data based on each feature and read the numbers from the table that defines the tree. For each split, we find the Gini index in the way explained in the previous part. It is clear that the smallest Gini index is obtained when we split the population based on education. This is unlike what we did earlier, when we started with marital status.
Now, after picking education as the first feature, we have two new leaves, and we repeat the same process for each of them. As one can see, the best question to ask in both leaves is about employability. However, it is a coincidence that the best feature is the same in both leaves; it could have happened that the best split in one leaf was employability and in the other was marital status.
So the best tree of depth two first asks about education and then about employability. Now we want to compare this tree with the one that we built at the beginning. The best way to do that is to draw and compare the ROC curves of the two trees.
Now let us draw the ROC curve of the efficient tree of depth two. As discussed, the ROC curve consists of points in a two-dimensional space with coordinates
x = FP / N = FP / (FP+TN)
and
y = TP / P = TP / (TP+FN),
for different probability thresholds. However, it is not efficient to do this for all probability thresholds. If we look at the tree, we see that there are only four probabilities that can be used for classification, so it suffices to choose only those four probabilities as thresholds for our analysis. That is why we draw a table whose rows correspond to the four probability thresholds and whose columns correspond to the four leaves, and we find TP, FN, FP, and TN for each threshold.

To fill in a row, we use the tree to see how that probability threshold classifies the samples. For instance, consider the first row, where the probability threshold is set to 60/80. Starting from the box on the bottom left, the probability of conversion there is 3/392 (the red ones), which is much smaller than 60/80. As a result, for an individual with the features of this box (educated and employed), the prediction is 0. We put this number in the first row, in the far-left column, and do the same for the other columns.

Once the first row is filled in, we can find TP, FN, FP, and TN. For instance, to find TP (true positives): starting from the far-left column of the first row, the model does not predict anyone to be 1 (positive), so the contribution to TP is zero, and the same is true for the second and third columns. In the fourth column, however, the prediction is 1 (positive), and the associated box contains 60 people who really are 1 (positive), so TP is 60. Repeating this for the second row: the predictions in the first and second columns are 0, while in the third and fourth columns they are 1. In total, these two boxes contain 17+60=77 positive cases, so TP is 77.
After we find all values for TP, FN, FP, and TN, we can find
(x,y) = (FP / (FP+TN), TP / (TP+FN))
for each row, which gives one point on the ROC curve per threshold. Now we can plot the ROC curve by interpolating these points.
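A sketch of this computation; the four leaf counts below are hypothetical placeholders, chosen only to be consistent with the numbers quoted above (a 3/392 leaf, a 60/80 leaf, a leaf containing 17 positives) and with the 900/100 class totals.

```python
# Each leaf: (positives = converts, negatives = no-converts); the leaf probability of
# conversion is positives / (positives + negatives).
leaves = [(3, 389), (17, 223), (20, 268), (60, 20)]   # hypothetical counts

P = sum(pos for pos, neg in leaves)   # total positives
N = sum(neg for pos, neg in leaves)   # total negatives

roc_points = []
for thr in sorted({pos / (pos + neg) for pos, neg in leaves}, reverse=True):
    # A leaf is predicted positive when its probability is at least the threshold.
    TP = sum(pos for pos, neg in leaves if pos / (pos + neg) >= thr)
    FP = sum(neg for pos, neg in leaves if pos / (pos + neg) >= thr)
    roc_points.append((FP / N, TP / P))   # (x, y) = (FP/(FP+TN), TP/(TP+FN))

print(roc_points)   # the first point corresponds to TP = 60, the second to TP = 77
```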
Now let us do the same thing we did in the previous slide for the less efficient tree that we started with at the beginning of this section. Its ROC curve is plotted here.
Here we put both ROC curves in one plot and observe that the tree with the more efficient splitting has the better ROC curve, as it lies closer to the top-left corner. This is what we expected.
This is an exercise that you can check on your own.
This is the complete tree, where all three features are used for classification, so all the information is used.
There are two ways to reduce the size of a tree for regularization purposes: either we grow the tree, realize that some leaves or branches do not add much information, and remove them; or we put a limit on the depth of the tree from the start.
Both methods are regularization methods. The latter is a bit easier, as the depth of the tree is simply fixed by the decision-maker. The former, which is called pruning, is a little more technical. For pruning, one grows the tree and then goes back and starts removing the leaves and branches that add the least information; that is, one removes the leaves whose split reduces the entropy or Gini index the least. One continues this process until the decision-maker feels that further pruning would start removing crucial information.
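In practice both routes are available in libraries; for instance, scikit-learn limits depth through max_depth and prunes through cost-complexity pruning, which is a slightly different criterion from the leaf-by-leaf Gini/entropy removal described above. A minimal sketch with synthetic placeholder data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic stand-in for the training data (three binary features, binary label).
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(1000, 3))
y_train = (X_train[:, 0] & X_train[:, 1]).astype(int)

# 1) Limit the depth up front (the simpler regularization route).
shallow_tree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)

# 2) Grow the full tree, then prune it back: larger ccp_alpha values remove the
#    branches that contribute the least impurity reduction.
path = DecisionTreeClassifier().cost_complexity_pruning_path(X_train, y_train)
pruned_trees = [DecisionTreeClassifier(ccp_alpha=a).fit(X_train, y_train) for a in path.ccp_alphas]
print(shallow_tree.get_depth(), len(pruned_trees))
```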
Even though a tree is a simple, efficient, and interpretable method, there is always a risk of overfitting when using trees, since all samples are used to grow a single tree. A popular method to deal with this problem is bagging, which reduces overfitting and variance. Bagging can be used with any other ML method, but here we only present it for trees. The idea is as follows: consider a number n′ smaller than n, the size of the training set (for instance n′ = 0.6n). Then, from the n rows of the data set, we randomly pick n′ rows with replacement. We do this M times. For each new data set, we train a tree. When we are given a new data point, we present it to all M trees and record the outputs. If the model is a classifier (like a classification tree), we take a vote over the results and return the class with the most votes. If it is a regression (like the regression tree, which will be explained later), we take the average.
Bagging is bootstrapping on the training set, to reduce variance (σ^2/n) by voting or averaging.
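A from-scratch sketch of bagging for classification trees, using scikit-learn's DecisionTreeClassifier as the base learner and synthetic placeholder data:

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, M=50, frac=0.6, seed=0):
    """Train M trees, each on n' = frac * n rows drawn from the data with replacement."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_prime = int(frac * n)
    trees = []
    for _ in range(M):
        idx = rng.integers(0, n, size=n_prime)           # bootstrap sample of size n'
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, x):
    """Classification: majority vote of the M trees on a single new sample x."""
    votes = [int(tree.predict(x.reshape(1, -1))[0]) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Placeholder data, just to make the sketch runnable.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(1000, 3))
y = (X[:, 0] | X[:, 2]).astype(int)
trees = bagging_fit(X, y)
print(bagging_predict(trees, np.array([1, 0, 1])))
```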
There is another method that improves on the bagging strategy, called the random forest. Like bagging, this method resamples the data, but in addition it randomly picks m ≪ p of the features for each tree, where m is fixed in advance. For instance, in this slide you can see that only 3 features, randomly chosen for each tree, are used for training the trees.
A random forest is bootstrapping over two things: over the training set (as in bagging) and over the features, with m ≪ p. There is no pruning here.
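With scikit-learn this is available directly; note that RandomForestClassifier draws the m random features at every split rather than once per tree, which is the more common variant. A minimal sketch with placeholder data, taking m = 2 out of p = 3 features:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data with p = 3 features; max_features plays the role of m << p.
rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(1000, 3))
y = (X[:, 0] & X[:, 1]).astype(int)

forest = RandomForestClassifier(n_estimators=100, max_features=2).fit(X, y)
print(forest.predict(np.array([[1, 1, 0]])))
```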