Regression Trees

The total length of the videos in this section is approximately 10 minutes. Feel free to do this in multiple sittings! You will also spend time answering short questions while completing this section.

You can also view all the videos in this section at the YouTube playlist linked here.

Introduction to Regression Trees

RegressionTrees.1.Introduction.mp4

Question 1: When the regression tree algorithm is choosing among possible ways to split the units, it calculates the means of the outcomes for the units in each possible pair of nodes. What is the goal?

Choose a split such that the means of the outcomes are as similar as possible for units in the two nodes
Choose a split such that the means of the outcomes are as different as possible for units in the two nodes

Show answer

The second option. The goal is to predict Y as closely as possible, given the value of X. If we choose nodes that are as different as possible on the mean of Y, then the units within each node will be as similar as possible in terms of Y. If we had created nodes with similar means, then the tree wouldn't be much better at making predictions than just using the overall mean of all the data points.
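The idea in this answer can be tried on a tiny example. The sketch below is only an illustration: the data, the helper names (mean, sse, evaluate_split), and the single predictor x are all made up, and it scores each split two ways, by the gap between the node means and by the squared error left over within the nodes. Tree software typically uses a squared-error criterion, which also takes node sizes into account; on this toy data the two criteria pick the same split.

```python
# Hypothetical toy data: one predictor x and a numeric outcome y.
x = [1, 2, 3, 4, 5, 6]
y = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared differences between each value and the node mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def evaluate_split(threshold):
    """Split units into x <= threshold and x > threshold, then summarize both nodes."""
    left = [yi for xi, yi in zip(x, y) if xi <= threshold]
    right = [yi for xi, yi in zip(x, y) if xi > threshold]
    return {
        "threshold": threshold,
        "mean_gap": abs(mean(left) - mean(right)),
        "within_sse": sse(left) + sse(right),
    }

# Try every threshold that leaves at least one unit in each node.
candidates = [evaluate_split(t) for t in x[:-1]]

best_by_gap = max(candidates, key=lambda c: c["mean_gap"])
best_by_sse = min(candidates, key=lambda c: c["within_sse"])

for c in candidates:
    print(c)
print("Largest gap between node means at threshold", best_by_gap["threshold"])
print("Smallest within-node squared error at threshold", best_by_sse["threshold"])
```

Splitting between x = 3 and x = 4 separates the low outcomes from the high ones, so the node means are far apart and each node's mean is a close prediction for the units inside it.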

Tree details

RegressionTrees.2.Details.mp4

This video uses the word "residuals," which refers to the difference between a unit's Y value and the value of Y that the model would predict for future similar units. For example, if a person in the data set lives for 10 months in a medical study, and the mean for others in the same node of the tree is 8 months, the residual for that person is 10-8 = 2. Most modeling techniques, parametric or non-parametric, attempt to make the residuals as small as possible overall. This relates to the definition of variance, which is the mean of the squared residuals when the prediction is the mean.
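As a small illustration of this definition, the sketch below computes residuals for one node of a tree. The survival times are made-up numbers chosen so that the node mean is 8 months, and for simplicity the prediction is the mean of all units in the node, including the person being predicted.

```python
import statistics

# Hypothetical survival times (months) for the units in one node of a tree.
node_outcomes = [10.0, 8.0, 6.0, 9.0, 7.0]

# The node's prediction is the mean outcome of its units.
prediction = sum(node_outcomes) / len(node_outcomes)

# Residual = observed outcome minus predicted outcome.
residuals = [outcome - prediction for outcome in node_outcomes]

# The mean of the squared residuals matches the (population) variance.
mean_squared_residual = sum(r ** 2 for r in residuals) / len(residuals)

print("prediction:", prediction)
print("residuals:", residuals)
print("mean squared residual:", mean_squared_residual)
print("population variance:", statistics.pvariance(node_outcomes))
```

The first person's residual is 10 - 8 = 2, as in the example above, and the last two printed values agree.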

Question 2: Suppose that three people who are similar on background characteristics (e.g., Wellesley alums from the same graduating class who majored in data science and have not gone to graduate school) have incomes of $70,000, $80,000, and $100,000. What income would you predict for a new person with the same background characteristics, if your goal is to minimize the maximum absolute residual?

Show answer

The answer is $85,000.

As soon as you have a predicted value, you can calculate what the residuals would be for the data you've seen so far. In this question, if the prediction is 70 thousand, and the three data points are 70, 80, and 100 thousand, then the residuals are 70-70=0, 80-70=10, and 100-70=30. None of these is negative, so the absolute values are the same. So, the max absolute residual is 30.

Suppose that, instead, our prediction is 85 thousand. Then the absolute residuals are abs(70-85) = 15, abs(80-85) = 5, and abs(100-85) = 15. The max is 15. If our goal is to find the smallest max abs residual, then 85 is a better prediction than 70 was, because 15 < 30.

We can see that it's not possible for the max abs residual to be smaller than 15, because 85 is exactly between 70 and 100. If we choose something bigger than 85, it'll be further from 70. If we choose something smaller than 85, it'll be further from 100.

Again, the absolute residuals of the first and third data points would be $15,000 (and the absolute residual of the middle data point would be $5,000). If we choose a larger prediction, the first absolute residual will be bigger than $15,000, and if we choose a smaller prediction, the third absolute residual will be bigger than $15,000. Therefore, $85,000 must be the prediction that minimizes the maximum absolute residual.
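The reasoning above can be checked by brute force. The sketch below tries every whole-thousand prediction from 70 to 100 and keeps the one with the smallest maximum absolute residual; the function name max_abs_residual and the choice of candidate grid are just for illustration.

```python
# Incomes from Question 2, in thousands of dollars.
incomes = [70, 80, 100]

def max_abs_residual(prediction):
    """Largest absolute difference between any observed income and the prediction."""
    return max(abs(income - prediction) for income in incomes)

# Candidate predictions: every whole thousand between the smallest and largest income.
candidates = range(min(incomes), max(incomes) + 1)
best = min(candidates, key=max_abs_residual)

print("prediction 70 -> max abs residual", max_abs_residual(70))
print("best prediction:", best, "-> max abs residual", max_abs_residual(best))

# The winner is the midpoint of the smallest and largest values.
midpoint = (min(incomes) + max(incomes)) / 2
print("midpoint of min and max:", midpoint)
```

The search lands on 85, the midpoint of 70 and 100, with a maximum absolute residual of 15.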

It is worth noting that most statistical methods do not try to minimize the maximum absolute residual, but rather the sum of the squared residuals. There are reasons for this, as we will see, but I want you to understand that we could choose our predictions based on non-standard criteria for the residuals, such as minimizing the maximum absolute residual.
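To see that the two criteria can disagree, the sketch below scores two predictions for the incomes in Question 2: the mean of the three incomes and the $85,000 answer. The helper names are hypothetical, and the incomes are in thousands of dollars.

```python
# Incomes from Question 2, in thousands of dollars.
incomes = [70, 80, 100]

def sum_squared_residuals(prediction):
    return sum((income - prediction) ** 2 for income in incomes)

def max_abs_residual(prediction):
    return max(abs(income - prediction) for income in incomes)

mean_prediction = sum(incomes) / len(incomes)  # about 83.3
minimax_prediction = 85                        # answer to Question 2

for label, prediction in [("mean", mean_prediction), ("minimax", minimax_prediction)]:
    print(
        label,
        "prediction:", round(prediction, 2),
        "| sum of squared residuals:", round(sum_squared_residuals(prediction), 2),
        "| max abs residual:", round(max_abs_residual(prediction), 2),
    )
```

The mean gives the smaller sum of squared residuals, while 85 gives the smaller maximum absolute residual, so the best prediction depends on which criterion you choose.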

Classification trees

RegressionTrees.3.Classification.mp4

Question 3: Suppose your outcome variable is race. Would you use a regression tree or a classification tree?

Show answer

Classification tree, because race is a categorical variable. Regression trees are for numeric outcomes, and classification trees are for categorical outcomes.
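The distinction can be sketched in a few lines. This is only an illustration under simple assumptions: the function names are made up, a regression tree's node is assumed to predict the mean outcome, and a classification tree's node is assumed to predict the most common category, which is a common convention.

```python
from collections import Counter
from numbers import Number

def tree_type(outcomes):
    """Pick a tree type from the outcome values.

    Assumption: categories are stored as text. Categories coded as numbers
    (for example 1, 2, 3) would need to be flagged as categorical by hand.
    """
    numeric = all(isinstance(value, Number) for value in outcomes)
    return "regression" if numeric else "classification"

def node_prediction(outcomes):
    """Prediction for a single node: mean if numeric, most common category otherwise."""
    if tree_type(outcomes) == "regression":
        return sum(outcomes) / len(outcomes)
    return Counter(outcomes).most_common(1)[0][0]

# Hypothetical outcomes for the units in one node.
incomes = [70.0, 80.0, 100.0]
categories = ["group A", "group B", "group A"]

print(tree_type(incomes), node_prediction(incomes))
print(tree_type(categories), node_prediction(categories))
```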

And that is all.

During this tutorial you learned:

How to understand regression tree and classification tree models
Reasons why predictor variables might not be included in a final regression tree model
The definition of residual

Terms and concepts:

Regression tree, classification tree, nodes, residuals