Decision trees are an important machine learning algorithm. A decision tree is a flowchart-like structure in which each internal node tests a feature, each branch corresponds to an outcome of that test, and each leaf assigns a class label. To classify an instance, the algorithm follows the branches that match the instance's feature values until it reaches a leaf.
To build a decision tree, we need a way to quantify how good a particular split is. We use Gini impurity: the probability that a randomly chosen element from the node would be mislabeled if it were labeled at random according to the node's class distribution. Nodes with low Gini impurity are good. To start, we evaluate candidate splits on each feature and choose the one that produces the purest children as the "root" node. Then we repeat the process on each child node until the tree is complete.
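As a rough sketch, the Gini impurity of a node with class proportions p_i is 1 minus the sum of the squared proportions. Here's a small illustrative function (the names are mine, not from the original post):

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

# A pure node has impurity 0; a 50/50 node has the two-class maximum of 0.5.
print(gini_impurity([1, 1, 1, 1]))   # 0.0
print(gini_impurity([0, 1, 0, 1]))   # 0.5
```

When comparing splits, the tree-building algorithm picks the one whose children have the lowest weighted average impurity.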
Doing this by hand is very time consuming. Thankfully, the scikit-learn library makes it easy with its built-in DecisionTreeClassifier.
A CSV file of Titanic passengers is provided in the GitHub repository. We will try to predict whether a passenger survived this tragic incident based on their background. The selected features are age, passenger class, number of siblings aboard, and sex, and the algorithm attempts to classify each passenger as a survivor or a casualty.
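Loading the data looks roughly like this; the file name and column names below are my assumptions and may differ from the CSV in the repository:

```python
import pandas as pd

# File and column names are assumptions; adjust to match the repository's CSV.
df = pd.read_csv("titanic.csv")

features = ["Age", "Pclass", "Siblings/Spouses Aboard", "Sex"]
X = df[features].copy()
y = df["Survived"]

# scikit-learn's trees need numeric inputs, so encode sex as 0/1.
X["Sex"] = (X["Sex"] == "female").astype(int)
```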
The first problem I ran into was what to do with null values, since some passengers had unknown ages. A decision tree can't use rows with missing feature values directly (at least not without extra handling), so I removed those passengers from the sample. Next, I split the data into a training set and a test set with a 70/30 split. After fitting the tree on the training set, I got a massive, sprawling tree.
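Concretely, the preprocessing and training steps look roughly like this (continuing from the earlier snippet; variable names and the random seed are my own choices):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
import matplotlib.pyplot as plt

# Drop rows with any missing feature values (here, mostly unknown ages).
mask = X.notna().all(axis=1)
X_clean, y_clean = X[mask], y[mask]

# 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X_clean, y_clean, test_size=0.3, random_state=0
)

# Fit an unconstrained tree -- this is the one that grows huge.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

# Visualize the fitted tree.
plt.figure(figsize=(20, 10))
plot_tree(tree, feature_names=list(X_clean.columns),
          class_names=["died", "survived"], filled=True)
plt.show()
```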
This complicated tree seemed too good to be true; I suspected overfitting. Overfitting is when a machine learning model fits the training data extremely well but generalizes poorly to data it hasn't seen. Once I compared the tree's predictions on the test set to the actual outcomes, my suspicions were confirmed.
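One way to check for overfitting is to compare accuracy on the training set with accuracy on the test set (continuing from the snippet above):

```python
from sklearn.metrics import accuracy_score

train_acc = accuracy_score(y_train, tree.predict(X_train))
test_acc = accuracy_score(y_test, tree.predict(X_test))

# A large gap between these two scores is the classic sign of overfitting.
print(f"train accuracy: {train_acc:.3f}")
print(f"test accuracy:  {test_acc:.3f}")
```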
Overfitting is often caused by a model that is too complex, so I decided to make the tree smaller by limiting the number of leaf nodes. I tested a variety of tree sizes and plotted the number of test samples correctly classified against tree size.
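Here's roughly how that sweep can be done with scikit-learn's max_leaf_nodes parameter, continuing from the earlier snippets; the exact range of sizes I tried isn't stated above, so the range below is illustrative:

```python
sizes = range(2, 51)
scores = []
for n_leaves in sizes:
    # Cap the tree at n_leaves leaf nodes and score it on the test set.
    pruned = DecisionTreeClassifier(max_leaf_nodes=n_leaves, random_state=0)
    pruned.fit(X_train, y_train)
    scores.append((pruned.predict(X_test) == y_test).sum())

plt.plot(list(sizes), scores)
plt.xlabel("maximum number of leaf nodes")
plt.ylabel("test samples correctly classified")
plt.show()
```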
Believe it or not, the most accurate tree was the one with just three leaf nodes!
That best tree made only two splits: sex first, then passenger class. If you were male, the tree predicted death. If you were female in first or second class, it predicted survival. If you were female in third class, it predicted death. Simplicity is best!
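To see those rules in text form, you can refit the winning tree and print it with scikit-learn's export_text (again continuing from the earlier snippets):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

best = DecisionTreeClassifier(max_leaf_nodes=3, random_state=0)
best.fit(X_train, y_train)

# With only three leaves, the tree can make just two splits;
# the write-up above found sex first, then passenger class.
print(export_text(best, feature_names=list(X_clean.columns)))
```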