Decision trees are a type of supervised learning algorithm used for classification and regression analysis. The basic idea behind decision trees is to create a model that predicts the value of a target variable based on several input variables. It is called a tree because it starts with a single node, which represents the root of the tree, and then branches out to different nodes that represent different decisions based on the input variables.
The root node of the decision tree represents the entire dataset, and each subsequent node represents a subset of the data. The branches that extend from each node represent the possible outcomes of a decision based on a specific input variable. The leaf nodes, which are the final nodes in the tree, represent the outcome or prediction.
Decision trees can be used for both classification and regression problems. In classification, the target variable is categorical, and the goal is to predict the class or category that a new observation belongs to. In regression, the target variable is continuous, and the goal is to predict a numerical value.
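To make the two settings concrete, here is a minimal sketch using scikit-learn's DecisionTreeClassifier and DecisionTreeRegressor on its bundled datasets; the library choice and datasets are assumptions for illustration, not something prescribed by the discussion above.

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict a categorical target (iris species).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("classification accuracy:", clf.score(X_test, y_test))

# Regression: predict a continuous target (disease progression score).
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print("regression R^2:", reg.score(X_test, y_test))
```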
One of the main advantages of decision trees is that they are easy to understand and interpret. They can work with both categorical and numerical data, and many implementations can also cope with missing values. Another advantage is that decision trees can be used to identify the features or variables that contribute most to the target variable.
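As a quick illustration of that last point, here is a small sketch (again assuming scikit-learn, which the text itself does not name) that reads the impurity-based feature_importances_ of a fitted tree; the dataset is only for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# feature_importances_ is the normalized total impurity reduction
# contributed by each feature across all splits in the fitted tree.
for name, importance in zip(data.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```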
To create a decision tree, we need to determine the best split at each node, which is based on a measure of impurity or information gain. Common measures of impurity include Gini impurity and entropy. Information gain is used to measure the reduction in impurity that results from a split.
One challenge with decision trees is that it is possible to create an infinite number of trees, each with different combinations of input variables and splits. To address this, techniques such as pruning and setting a maximum depth or minimum number of samples at a leaf node can be used to simplify the tree and reduce overfitting.
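As a sketch of how those controls are typically exposed (assuming scikit-learn), the snippet below compares an unconstrained tree with one limited by max_depth and min_samples_leaf and post-pruned via ccp_alpha; the specific values are arbitrary and would normally be tuned.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree keeps splitting until every leaf is pure,
# which often overfits the training data.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constraining depth and leaf size and applying cost-complexity
# pruning keeps the tree smaller and usually generalizes better.
pruned = DecisionTreeClassifier(
    max_depth=4,           # cap the number of split levels
    min_samples_leaf=10,   # require at least 10 samples in each leaf
    ccp_alpha=0.001,       # post-prune weak branches
    random_state=0,
).fit(X_train, y_train)

print("full tree   train/test:", full.score(X_train, y_train), full.score(X_test, y_test))
print("pruned tree train/test:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))
```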
At each node of a decision tree, a measure of impurity, that is, the randomness of the class labels at that node, is used to determine the best split. Gini impurity and entropy are the most commonly used impurity measures, and information gain quantifies how much a candidate split reduces them.
Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labelled if it were randomly labelled according to the distribution of labels in the subset. It is 0 when the set is completely pure (all elements belong to the same class) and reaches its maximum when the elements are spread evenly across the classes; for a binary problem that maximum is 0.5, and in general it is 1 - 1/k for k classes.
Entropy is another measure of the randomness of the data at a node. For a binary classification problem it ranges from 0 to 1, where 0 indicates that the set is completely pure (all elements belong to the same class) and 1 indicates that the two classes are equally represented; with k classes the maximum is log2(k).
Information gain is the reduction in entropy (or Gini impurity) that results from a split. The split that produces the highest information gain is considered the best split.
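These three quantities can be written down directly. Below is a small, self-contained sketch of Gini impurity, entropy, and information gain for lists of class labels; the helper names are made up for illustration.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(parent, children, impurity=entropy):
    """Impurity of the parent minus the weighted impurity of the child nodes."""
    n = len(parent)
    weighted = sum(len(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted

# A split that separates the classes perfectly has maximal gain.
parent = [0, 0, 1, 1]
print(information_gain(parent, [[0, 0], [1, 1]]))  # 1.0 with entropy
print(information_gain(parent, [[0, 1], [0, 1]]))  # 0.0 - the split tells us nothing
```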
For example, suppose we want to build a decision tree that predicts whether a person will make a purchase based on three input variables: age, gender, and income. Consider a toy sample of five observations, and suppose we start at the root node and evaluate a split on the age variable at a threshold of 30. Using the class proportions in each resulting child node, we can calculate the Gini impurity and entropy for this split as follows:
Gini impurity:
Left child (age < 30): p(0) = 1/2, p(1) = 1/2
Gini impurity = 1 - (1/2)^2 - (1/2)^2 = 0.5
Right child (age >= 30): p(0) = 1/3, p(1) = 2/3
Gini impurity = 1 - (1/3)^2 - (2/3)^2 ≈ 0.44
Overall Gini impurity for split = (2/5)*0.50 + (3/5)*0.44 ≈ 0.47
Entropy:
Left child (age < 30): p(0) = 1/2, p(1) = 1/2
Entropy = -1/2 * log2(1/2) - 1/2 * log2(1/2) = 1
Right child (age >= 30): p(0) = 1/3, p(1) = 2/3
Entropy = -1/3 * log2(1/3) - 2/3 * log2(2/3) ≈ 0.92
Overall entropy for split = (2/5)*1 + (3/5)*0.92 ≈ 0.95
Repeating the same calculation for each candidate attribute lets us compare their information gains. In this example, the split on the "gender" attribute turns out to have the highest information gain (0.02), compared with "age" (0.002) and "income" (0.0004), so it is the better split: partitioning the data by gender separates the classes more effectively than the other attributes.
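To double-check the arithmetic for the age split, the snippet below reproduces it in code, using hypothetical labels chosen only to match the proportions stated above (two observations with age < 30, three with age >= 30).

```python
from collections import Counter
from math import log2

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Hypothetical labels matching the stated proportions (1 = purchase, 0 = no purchase).
left = [0, 1]       # age < 30: one non-buyer, one buyer
right = [0, 1, 1]   # age >= 30: one non-buyer, two buyers

n = len(left) + len(right)
weighted_gini = len(left) / n * gini(left) + len(right) / n * gini(right)
weighted_entropy = len(left) / n * entropy(left) + len(right) / n * entropy(right)

print(f"weighted Gini impurity: {weighted_gini:.3f}")   # ~0.47
print(f"weighted entropy:       {weighted_entropy:.3f}") # ~0.95
```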
It is generally possible to create a practically infinite number of decision trees because the data can be split in many different ways at each level, and the number of candidate splits can be very large. For example, a dataset with 10 attributes already admits 2^10 - 1 = 1023 non-empty combinations of attributes to consider, and each numerical attribute adds many possible thresholds on top of that. If we continue to split the data at subsequent levels, the number of possible trees grows combinatorially.
Moreover, there are different ways to measure the quality of a split, such as the Gini index, entropy, and information gain. Each measure can lead to a different tree structure, which means that multiple trees can be generated from the same dataset. Additionally, decision trees can be extended with techniques such as pruning and ensemble methods like random forests, which further increases the number of possible tree structures.
Therefore, the number of possible decision trees is practically infinite, although some trees may be more effective and accurate than others, depending on the data and the specific problem at hand.
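For completeness, here is a brief sketch (assuming scikit-learn) of the ensemble idea mentioned above: a random forest averages many trees trained on bootstrap samples, which typically generalizes better than any single tree. The dataset and settings are illustrative only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Compare a single tree with an ensemble of 100 trees via 5-fold cross-validation.
tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("single tree  :", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())
```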