Decision trees are one of the most commonly used supervised machine learning algorithms in industry. They can be employed for both classification and regression tasks, and their popularity comes from their simplicity, interpretability, and ability to handle both numerical and categorical data. Unlike Naive Bayes and many other classifiers, decision trees are not linear classifiers/separators. Another important point is that decision trees are non-parametric models, i.e. they cannot be expressed as parametric equations, unlike linear regression.
In a decision tree structure, there are three different types of nodes.
Root Node
Parent Node/Internal Node
Leaf Node/Terminal Node
The root node is the top node and is also called the base node. It has no parent node/incoming edge but has zero or more outgoing edges. A node that has both an incoming edge and outgoing edges is called a parent/internal node. A node that has only an incoming edge and no outgoing edges is called a leaf node/terminal node.
Decision trees can be utilized for various purposes across multiple fields. Some of the common applications of Decision Trees include:
Fraud detection
Customer churn prediction
Medical diagnosis
Product recommendation
The GINI index measures the impurity of a set of data points. It is 0 when all elements belong to a single class and reaches its maximum when the elements are distributed equally among the classes; for a binary problem that maximum is 0.5 (more generally 1 - 1/k for k classes).
Entropy is another measure of impurity in a dataset: it quantifies the disorder or randomness in the data. A lower entropy value indicates less disorder, while a higher value indicates more disorder. For a binary classification problem, entropy ranges from 0 to 1, where 0 means all elements belong to a single class (the node is pure) and 1 indicates maximum impurity (an even split between the classes).
Information gain measures the reduction in entropy (or GINI index) achieved by splitting a dataset on a particular feature, and it is the criterion used to select the feature for a node split: features that produce high information gain are preferred over features that produce low information gain.
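To make these measures concrete, here is a minimal Python sketch (illustrative only, not the project code) that computes the GINI index, entropy, and the information gain of a split from lists of class labels; the example counts mirror the worked example later in this section:

```python
from collections import Counter
from math import log2

def gini(labels):
    """GINI index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children, impurity=entropy):
    """Impurity of the parent minus the weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# Hypothetical counts matching the worked example later in this section:
# 9 records, 5 with salary > 100K and 4 without.
parent = ["yes"] * 5 + ["no"] * 4
left = ["yes"] + ["no"] * 3          # e.g. data analysts: 1 yes, 3 no
right = ["yes"] * 4 + ["no"]         # e.g. data scientists: 4 yes, 1 no
print(round(gini(parent), 3))                                    # 0.494
print(round(entropy(parent), 3))                                 # 0.991
print(round(information_gain(parent, [left, right], gini), 3))   # 0.149
```

Passing entropy instead of gini as the impurity function reproduces the entropy-based information gain shown later in this section.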
Node splitting is a critical process in decision tree construction, pivotal for effectively partitioning data into homogeneous subsets. Two primary methods for node splitting in decision trees are Information Gain (Entropy) and Gini Impurity. With Information Gain, nodes are split based on entropy, a measure of randomness or disorder within the data. The algorithm selects features that maximize information gain, which is calculated by comparing the entropy before and after the split. Higher information gain signifies a more optimal split, leading to more homogeneous child nodes.
Conversely, Gini Impurity evaluates the purity of nodes by measuring the probability of misclassifying a randomly chosen element based on the distribution of labels. The algorithm selects features that minimize Gini impurity, aiming for purer child nodes with lower probabilities of misclassification. Both methods contribute to building decision trees that accurately classify or regress data by iteratively selecting features and splitting nodes to enhance homogeneity within the resulting subsets. The choice between these methods often depends on the dataset characteristics and the desired behavior of the decision tree algorithm.
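The selection step itself can be sketched as follows. This is a simplified illustration assuming categorical features and one branch per category; rows is assumed to be a list of dictionaries and labels a parallel list of class values (the names are illustrative, not taken from the project code):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the node minus the weighted entropy of the child nodes
    obtained by splitting on `feature` (one branch per category)."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - weighted

def best_feature(rows, labels, features):
    """Select the feature whose split maximizes information gain."""
    return max(features, key=lambda f: information_gain(rows, labels, f))
```

Replacing entropy with the GINI index in the same structure gives the Gini-impurity variant of the selection step.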
Consider the simple dataset below, which illustrates how a decision tree is built. A decision tree operates much like the human brain does when it looks for a pattern: it divides the data into successively smaller partitions. In this example dataset, there are four columns:
Role
City
Education Level
Salary > 100K
This dataset compares the salaries of data scientists and data analysts across various cities. Asked to determine whose salaries are higher than 100K, a person would look for a pattern by partitioning the data, for example by city within each role, or by education level. Decision trees work in a similar way.
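The original table is not reproduced here, but as a purely hypothetical stand-in, a dataset with these four columns could be set up as follows. Only the Role and Salary > 100K values are chosen to match the counts used in the calculations later in this section; the City and Education Level entries are invented placeholders:

```python
import pandas as pd

# Hypothetical stand-in for the example dataset. Only the Role and
# Salary > 100K columns follow the counts used in the worked example
# (4 data analysts, 1 of them above 100K; 5 data scientists, 4 above 100K);
# City and Education Level are placeholder values.
data = pd.DataFrame({
    "Role": ["Data Analyst"] * 4 + ["Data Scientist"] * 5,
    "City": ["New York", "Chicago", "New York", "Chicago",
             "New York", "Chicago", "New York", "Chicago", "New York"],
    "Education Level": ["Bachelors", "Bachelors", "Masters", "Bachelors",
                        "Masters", "Masters", "Masters", "Bachelors", "Bachelors"],
    "Salary > 100K": ["Yes", "No", "No", "No",
                      "Yes", "Yes", "Yes", "Yes", "No"],
})
print(data)
```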
A decision tree can be built from a set of attributes in many different ways. However, determining the most effective decision tree is a highly demanding task, because the search space grows exponentially with the number of attributes, making an exhaustive search time-consuming and complex.
To find good trees efficiently, heuristic techniques such as Hunt's algorithm are used; they grow the tree greedily, one split at a time, and yield locally optimized solutions.
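As a rough illustration of that greedy, recursive idea, here is a simplified sketch in Python; it is not Hunt's algorithm in full, which also handles empty partitions and other edge cases:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def build_tree(rows, labels, features, depth=0, max_depth=3):
    """Hunt-style greedy recursion: stop when a node is pure, no features
    remain, or the depth limit is reached; otherwise split on the feature
    with the lowest weighted GINI impurity and recurse into each branch."""
    if len(set(labels)) == 1 or not features or depth == max_depth:
        return Counter(labels).most_common(1)[0][0]        # majority-class leaf

    def weighted_gini(feature):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[feature], []).append(label)
        return sum(len(g) / len(labels) * gini(g) for g in groups.values())

    best = min(features, key=weighted_gini)                # greedy, local choice
    node = {"feature": best, "branches": {}}
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = map(list, zip(*subset))
        node["branches"][value] = build_tree(
            sub_rows, sub_labels,
            [f for f in features if f != best],
            depth + 1, max_depth)
    return node
```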
The root node of the tree is selected so that the split yields the maximum information gain. In this example, let's assume Role is selected as the root node; the data is then divided as follows:
Here's an illustration of a decision tree derived from our dataset. At the very top sits the root node, which acts as the starting point for the tree. In this scenario, it's a binary decision: "Is the individual classified as a data scientist or not?" Based on the answer (yes or no), the tree branches out into parent nodes, representing subsequent questions or conditions.
For instance, one of these parent nodes might inquire about the individual's education level, asking whether it is at the bachelor's level or not. Depending on the response, the tree further splits into leaf nodes, which serve as the endpoints of the tree. These leaf nodes provide the final classification, such as whether the salary is higher than 100K or not, based on the criteria set by the parent nodes. This is an example of a decision tree of depth 3, i.e. the tree has three layers.
Python code was used to construct the optimized decision trees. In the optimized tree, Role is selected as the root node, which implies that splitting on role produces the maximum information gain. In these graphs the categories are binary-encoded: the role Data Scientist, the city New York, and the degree Masters are represented as 1, and the other values as 0.
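The exact script is not reproduced here, but a minimal scikit-learn sketch using the encoding just described could look like the following. The feature matrix is a hypothetical stand-in: only the Role and Salary > 100K columns follow the counts used in the worked example below, so the lower levels of the plotted trees need not match the figure described above:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Binary-encoded features: Role (Data Scientist = 1), City (New York = 1),
# Education Level (Masters = 1). City and Education values are placeholders.
X = pd.DataFrame({
    "Role": [0, 0, 0, 0, 1, 1, 1, 1, 1],
    "City": [1, 0, 1, 0, 1, 0, 1, 0, 1],
    "Education Level": [0, 0, 1, 0, 1, 1, 1, 0, 0],
})
y = ["Yes", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "No"]  # Salary > 100K

# Fit and plot one tree per splitting criterion (GINI and entropy).
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=3, random_state=0)
    clf.fit(X, y)
    plt.figure()
    plot_tree(clf, feature_names=list(X.columns),
              class_names=list(clf.classes_), filled=True)
    plt.title(f"Decision tree using the {criterion} criterion")
plt.show()
```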
Using GINI Method
Using Entropy Method
For GINI:
The root node splits on whether the role is data analyst or not.
The Gini for this is calculated by:
No. of records with salary > 100K = 5
No. of records without salary > 100K = 4
For the root node, by the GINI formula:
GINI = 1 - (4/9)^2 - (5/9)^2
= 0.494
which matches our graph.
Similarly for the left parent node:
No. of data analysts having salary > 100K = 1
No. of data analysts not having a salary > 100K = 3
GINI = 1 - (1/4)^2 - (3/4)^2
= 0.375
Similarly for the right parent node:
No. of non-data analysts having a salary > 100K = 4
No. of non data analysts not having a salary > 100K = 1
GINI = 1 - (1/5)^2 - (4/5)^2
= 0.32
Information Gain = 0.494 - (4/9)*0.375 - (5/9)*0.32
= 0.494 - 0.167 - 0.178
= 0.149
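These values can be double-checked with a few lines of Python:

```python
def gini(yes, no):
    p, q = yes / (yes + no), no / (yes + no)
    return 1 - p ** 2 - q ** 2

root, left, right = gini(5, 4), gini(1, 3), gini(4, 1)
print(round(root, 3), round(left, 3), round(right, 3))    # 0.494 0.375 0.32
print(round(root - (4 / 9) * left - (5 / 9) * right, 3))  # 0.149
```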
For Entropy:
The root node splits on whether the role is data analyst or not.
The Entropy for this is calculated by:
No. of records with salary > 100K = 5
No. of records without salary > 100K = 4
For the root node, by the entropy formula:
Entropy = - [(4/9)log2(4/9) + (5/9)log2(5/9)]
= 0.991
which matches our graph.
Similarly for the left parent node:
No. of data analysts having salary > 100K = 1
No. of data analysts not having a salary > 100K = 3
Entropy = - [(1/4)log2(1/4) + (3/4)log2(3/4)]
= 0.811
Similarly for the right parent node:
No. of non-data analysts having a salary > 100K = 4
No. of non data analysts not having a salary > 100K = 1
Entropy = - [(1/5)log2(1/5) + (4/5)log2(4/5)]
= 0.722
Information Gain = 0.991 - (4/9)*0.811 - (5/9)*0.722
= 0.991 - 0.361 - 0.401
= 0.229
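Again, a quick Python check confirms the numbers:

```python
from math import log2

def entropy(yes, no):
    p, q = yes / (yes + no), no / (yes + no)
    return -p * log2(p) - q * log2(q)

root, left, right = entropy(5, 4), entropy(1, 3), entropy(4, 1)
print(round(root, 3), round(left, 3), round(right, 3))    # 0.991 0.811 0.722
print(round(root - (4 / 9) * left - (5 / 9) * right, 3))  # 0.229
```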
In this project, decision trees are pivotal for predicting the duration of flight delays stemming from weather conditions. Using factors such as date, month, origin state, destination state, origin temperature, and destination temperature as inputs, the decision tree model is trained to classify delays as either short or long. Because the dataset mixes categorical and numerical features, traditional libraries like scikit-learn were not well suited, prompting the use of R for constructing the decision trees. With their inherent interpretability, decision trees help reveal the underlying decision-making process, shedding light on the factors that most influence delay durations. Ultimately, this approach aims to capture nuanced relationships between the features, facilitating accurate classification of flight delays based on weather conditions.
Decision trees are easy to understand and interpret.
Decision trees can handle both categorical and continuous data, and they are non-parametric models, meaning no assumptions are made about the distribution of the data.
A single decision tree struggles to capture complex relationships between features.
Decision trees tend to overfit and can perform poorly when classes are imbalanced.