Image taken from http://www.clipartbest.com/clipart-KinXggxkT
Date published: 21st April, 2024
Machine Learning | Decision Trees | Random Forests | Classification | Regression | Scikit-Learn | Python
Understanding Decision Trees through Scikit-learn’s Methods
Decision trees are among the simplest and most intuitive of all the machine learning algorithms. They can be used for both classification and regression problems. Despite being simple, they are extremely useful and have many advantages over other popular algorithms like linear/logistic regression and SVMs. The algorithm is non-parametric and supervised, and the built model is in the form of a tree (surprise surprise!), consisting of nodes, namely the root, the leaves and the intermediate nodes, and branches (or edges) connecting these nodes.
The following are some of the advantages of using decision trees.
Highly inexpensive in terms of computational power
Easy to interpret if the built tree is relatively small
Fast in both training and prediction
Can handle any type of data, and makes no assumptions about the relationships among the input variables or between the inputs and the output
Missing values need not be handled prior to feeding the data to a decision tree
Quite robust to outliers in the input space
Scaling of data not required while feeding the data to this algorithm
Irrelevant features, even though they add to the computational time, do not really affect the output of a decision tree
Scikit-learn, one of the most popular libraries used for machine learning, provides classes to build decision tree classifiers as well as regressors. The goal of this write-up is to understand the concepts of decision trees as we walk through the arguments of these classes. The decision tree classes in scikit-learn are defined as follows:
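(A sketch of the constructor signatures; the defaults shown are from recent scikit-learn releases, and the exact parameter set and default values can differ slightly between versions.)

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Constructor sketch; defaults are from recent scikit-learn releases
# and may vary slightly between versions.
clf = DecisionTreeClassifier(
    criterion="gini",              # split-quality metric: "gini", "entropy", "log_loss"
    splitter="best",               # split-search strategy: "best" or "random"
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features=None,
    random_state=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    class_weight=None,
    ccp_alpha=0.0,
)

reg = DecisionTreeRegressor(
    criterion="squared_error",     # e.g. "squared_error", "friedman_mse", "absolute_error"
    splitter="best",
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features=None,
    random_state=None,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    ccp_alpha=0.0,
)
```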
It can be noted that both classes accept the same set of arguments, while the accepted values and defaults can vary based on the type of the decision tree.
Now, let’s understand how a decision tree is built. As mentioned earlier, a decision tree consists of a root node, intermediary nodes, and leaf nodes. The data flow starts from the root node, passes through the intermediaries, and ends at leaf nodes, or leaves. Each node, except the leaf nodes, is split into two or more child nodes based on a condition derived from one of the input features.
The foremost step in building a decision tree is to identify the split condition at the root node. This is done based on the quality of the split produced by each candidate feature condition, i.e., how well it separates the target space. Split quality can be quantified using metrics like Gini impurity and entropy in case of classification, and squared error in case of regression; this choice of metric is made through the criterion argument of the scikit-learn classes. The classes support two strategies to carry out this search – “best” and “random” – selected through the splitter argument. The “best” splitter goes over each of the features and picks the split condition that scores best according to the chosen criterion, at the computational cost of evaluating all the features before choosing. The “random” splitter uses only a subset of the features (chosen randomly, with the size controlled by max_features) to determine the split.
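As a concrete illustration of how split quality can be scored, the sketch below computes the Gini impurity of a candidate split by hand (plain NumPy, not scikit-learn's internal implementation); the split whose weighted child impurity is lowest, i.e., whose impurity decrease is largest, wins.

```python
import numpy as np

def gini(y):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_quality(y_left, y_right):
    """Weighted Gini impurity of a candidate split (lower is better)."""
    n_left, n_right = len(y_left), len(y_right)
    n = n_left + n_right
    return (n_left / n) * gini(y_left) + (n_right / n) * gini(y_right)

# Toy example: a split that separates the classes perfectly scores 0.
print(gini(np.array([0, 0, 0, 1, 1, 1])))                      # 0.5 before the split
print(split_quality(np.array([0, 0, 0]), np.array([1, 1, 1])))  # 0.0 after the split
```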
Once the split condition at the root node is chosen, the split is made, and the tree grows to a depth of one. The algorithm then repeats the same process of identifying a split condition for each of the child nodes, and the tree thus grows its branches.
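One quick way to see the result of this recursive growth is to fit a small tree and print its learned structure with scikit-learn's export_text (a minimal sketch on the iris data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fit a shallow tree and print its structure to see the root,
# intermediate nodes and leaves produced by the recursive splitting.
iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(clf, feature_names=iris.feature_names))
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```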
The target value predicted by a node is derived from the data points that fall into it: by majority vote (or class probabilities) in case of classification, and by averaging the targets in case of regression.
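In scikit-learn this corresponds to predict (majority class or mean target of the leaf a sample lands in) and, for classifiers, predict_proba (class fractions in that leaf). A minimal sketch on toy data:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])
y_reg = np.array([1.0, 1.2, 0.8, 9.5, 10.5, 10.0])

clf = DecisionTreeClassifier(max_depth=1).fit(X, y_class)
print(clf.predict([[2.5]]))        # majority class of the leaf -> [0]
print(clf.predict_proba([[2.5]]))  # class fractions in that leaf

reg = DecisionTreeRegressor(max_depth=1).fit(X, y_reg)
print(reg.predict([[11.0]]))       # mean of the leaf's targets -> [10.0]
```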
Left unchecked, a node stops splitting only when all the data points in it belong to the same target class. Hence decision trees are highly prone to overfitting, i.e., they might not generalize well to unseen data.
There are methods designed to overcome this major drawback. For instance, one can stop the tree from growing beyond a max_depth, or specify min_samples_split so that a node with fewer than the specified number of data points is not split any further. Similarly, min_samples_leaf and min_weight_fraction_leaf prevent the tree from having leaves with fewer than a given number of data points or less than a given weighted fraction of the total input weights, respectively. The max_features argument, in addition to being used during the split search, also acts as a way to curb overfitting. The min_impurity_decrease argument regulates the growth of the tree by skipping any split that reduces the impurity by less than the specified value. The max_leaf_nodes parameter lets a tree grow with only up to a given number of leaf nodes, in best-first fashion, i.e., expanding the nodes with the largest impurity reduction first.
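Putting a few of these knobs together (the values below are arbitrary choices for illustration, not recommendations; in practice they are usually tuned, e.g. with cross-validation):

```python
from sklearn.tree import DecisionTreeClassifier

# Arbitrary illustrative values, not recommendations.
clf = DecisionTreeClassifier(
    max_depth=5,                 # cap on tree depth
    min_samples_split=20,        # a node needs at least 20 samples to split
    min_samples_leaf=10,         # every leaf keeps at least 10 samples
    max_features="sqrt",         # consider a random subset of features per split
    min_impurity_decrease=1e-3,  # ignore splits that barely reduce impurity
    max_leaf_nodes=32,           # grow best-first up to 32 leaves
    random_state=0,
)
```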
Additionally, (tree) pruning is a method that reduces the tree’s size by removing nodes or rules that are relatively weak and contribute little to predicting the target. Pruning can happen both while building the tree and after it is built. In scikit-learn, ccp_alpha is the complexity parameter that controls minimal cost-complexity post-pruning.
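A common workflow is to ask the estimator for the candidate alphas via cost_complexity_pruning_path and pick one, for example by cross-validation. A sketch (dataset and cv settings are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate effective alphas for minimal cost-complexity pruning.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Score a pruned tree for each candidate alpha and keep the best one.
scores = [
    cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[scores.index(max(scores))]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
```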
Even after trying out all these methods, decision trees may still overfit and fail to generalize well to unseen data; they are notorious for their high variance with respect to new data points. A powerful method that retains all the good features of decision trees while generalizing better is to use an ensemble of trees, i.e., a forest.
Even though ensembling can be applied to most algorithms, decision trees are particularly well suited for it, as they are computationally inexpensive and have high variance – two properties that largely determine how much an ensemble helps.
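A random forest is the canonical example: it averages many trees, each trained on a bootstrap sample and with a random subset of features considered at every split. A minimal sketch (dataset and hyperparameters are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 200 trees, each grown on a bootstrap sample with a random feature
# subset per split; their predictions are averaged.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))
```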
There are many powerful algorithms based on decision tree ensembles with a wide variety of use cases. Hence understanding the building block of these ensembles – the decision tree – is crucial. It is like planting actual trees to reap the enormous benefits of forests in the future.
Relevant Links:
Scikit-Learn's Documentation on Trees
Scikit-Learn's Documentation on Decision Tree Regressors