A Decision Tree is a flowchart-like model used in supervised machine learning to make decisions or predictions based on input features. It works by asking a series of questions (like "Is stress level > 6?") and following the path of answers down the tree until a final decision or prediction is reached at a leaf node. Each internal node in the tree represents a decision rule based on a feature, and each branch represents the outcome of that rule. The tree is built by recursively selecting the most informative features that best split the data into target classes. Decision Trees are intuitive, easy to visualize, and mimic human decision-making, which makes them especially useful for interpreting how a model reaches its conclusions.
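To make this concrete, here is a minimal sketch of fitting such a tree with scikit-learn. The tiny dataset, the feature names (sleep_hours, stress_level), and the depth limit are invented for illustration and are not taken from the text above.

```python
# Minimal sketch: train a small decision tree and print its flowchart-like rules.
# The data below is made up purely so the example runs end to end.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[5, 8], [6, 7], [8, 3], [7, 4], [4, 9], [9, 2]]   # [sleep_hours, stress_level]
y = ["Yes", "Yes", "No", "No", "Yes", "No"]            # High Anxiety label

clf = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
clf.fit(X, y)

# The learned rules print as an indented series of "feature <= threshold" questions,
# which is exactly the flowchart structure described above.
print(export_text(clf, feature_names=["sleep_hours", "stress_level"]))

# Predict for a new student who sleeps 5 hours with stress level 7.
print(clf.predict([[5, 7]]))
```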
To decide how to split the data at each node, Decision Trees use impurity measures such as Gini impurity and Entropy. These measures evaluate how mixed the classes are in a given node: the lower the impurity, the more "pure" the node is. Entropy comes from information theory and quantifies the disorder in a dataset. Information Gain is then calculated as the reduction in entropy after the dataset is split on a feature: the greater the gain, the more informative the feature is. Gini impurity is a simpler alternative that measures the probability of incorrectly classifying a randomly chosen element if it were labeled according to the class distribution in the node. Both methods aim to find the most effective features to split on, guiding the tree toward more accurate predictions.
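As a rough sketch of how these two measures look in code (the helper functions below are assumptions for illustration, not part of any particular library), both can be computed directly from the class labels in a node:

```python
# Sketch of the two impurity measures, computed from a list of class labels in a node.
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in the node."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity = 1 - sum(p_i^2): the chance of mislabeling a random element."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

node = ["Yes"] * 6 + ["No"] * 4    # the 10-student node used in the example below
print(round(entropy(node), 3))     # 0.971
print(round(gini(node), 3))        # 0.48
```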
Let’s say we have a dataset of 10 students, and we want to predict if they have High Anxiety (Yes or No) based on their Sleep Hours.
Initial Data:
6 students have High Anxiety → "Yes"
4 students do not have High Anxiety → "No"
Entropy of the entire dataset:
Entropy = - (6/10) * log2(6/10) - (4/10) * log2(4/10)
Entropy ≈ 0.6 * 0.737 + 0.4 * 1.322
Entropy ≈ 0.971
Now we split on "Sleep Hours":
Group 1 (Sleep ≤ 6 hrs): 5 students → 4 Yes, 1 No
Group 2 (Sleep > 6 hrs): 5 students → 2 Yes, 3 No
Entropy of Group 1:
= - (4/5) * log2(4/5) - (1/5) * log2(1/5) ≈ 0.722
Entropy of Group 2:
= - (2/5) * log2(2/5) - (3/5) * log2(3/5) ≈ 0.971
Information Gain:
Info Gain = Entropy before split - Weighted Entropy after split
= 0.971 - [0.5 * 0.722 + 0.5 * 0.971]
= 0.971 - 0.847 = 0.124
So, splitting on "Sleep Hours" gives an Information Gain of about 0.124, meaning this feature provides some, though modest, separation of the target classes.
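The same arithmetic can be reproduced in a few lines of Python. This sketch simply re-implements the entropy and weighted-entropy calculations above; it is not tied to any library.

```python
# Reproduce the worked example: entropy before the split, per-group entropy, and information gain.
from math import log2

def entropy(counts):
    """Entropy from raw class counts, e.g. (6, 4) for 6 Yes / 4 No."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

parent = entropy((6, 4))          # whole dataset: 6 Yes, 4 No  -> ~0.971
group1 = entropy((4, 1))          # Sleep <= 6 hrs: 4 Yes, 1 No -> ~0.722
group2 = entropy((2, 3))          # Sleep > 6 hrs:  2 Yes, 3 No -> ~0.971

weighted = (5 / 10) * group1 + (5 / 10) * group2
info_gain = parent - weighted     # ~0.125 (0.124 in the text comes from rounding 0.847 first)

print(round(parent, 3), round(group1, 3), round(group2, 3), round(info_gain, 3))
```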
It’s possible to create an infinite number of decision trees because:
You can use different combinations and orders of features for splits.
Continuous features can be split at an infinite number of threshold values (like splitting at 6.0, 6.1, 6.2, etc.).
Trees can grow very deep without constraints, especially when there’s no rule for stopping (like minimum samples or maximum depth).
Even a small change in training data can result in a totally different tree structure.
Unless we set specific limits (like max depth, pruning, or minimum samples), the number of possible trees is virtually endless.
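In practice, those limits are ordinary hyperparameters. As a sketch only (the dataset is random and the particular values are arbitrary), this is how scikit-learn exposes maximum depth, minimum samples per leaf, and cost-complexity pruning:

```python
# Sketch: constraining tree growth so the space of possible trees stays manageable.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                   # 3 made-up features
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)  # noisy binary target

unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)
constrained = DecisionTreeClassifier(
    max_depth=3,            # cap how deep the tree can grow
    min_samples_leaf=10,    # every leaf must cover at least 10 samples
    ccp_alpha=0.01,         # cost-complexity pruning strength
    random_state=0,
).fit(X, y)

print("unconstrained depth:", unconstrained.get_depth(), "leaves:", unconstrained.get_n_leaves())
print("constrained depth:  ", constrained.get_depth(), "leaves:", constrained.get_n_leaves())
```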