Decision Trees (DTs) are a popular machine learning method used for classification and regression tasks. They work by splitting the dataset into branches based on feature values, with each branch representing a decision rule. At the end of the tree, leaf nodes hold class labels (for classification) or continuous values (for regression). Decision Trees are particularly valued for their interpretability and their ability to model non-linear relationships in data.
Applications of Decision Trees
Classification: Predicting labels for categorical data, such as identifying spam emails or determining product categories.
Regression: Predicting continuous outputs, like house prices or sales forecasts.
Feature Selection: Identifying the most significant features influencing an outcome.
How Are Decision Trees Trained?
Feature Selection for Splitting: At each step, the tree chooses the feature that best splits the data. The "goodness" of a split is determined using metrics such as GINI, entropy, or information gain.
Recursive Partitioning: The data is repeatedly divided into smaller subsets based on the selected feature until stopping conditions are met.
Leaf Node Assignment: At the end of each path, the algorithm assigns a class label or regression value based on the majority class (classification) or mean/median value (regression) in the subset.
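To make these steps concrete, here is a minimal sketch of a greedy split search driven by GINI impurity. It is illustrative only: the NumPy arrays `X` and `y` and the helper names are assumptions, and libraries such as scikit-learn implement heavily optimized versions of the same idea.

```python
# A minimal greedy split search: try every feature/threshold pair and keep
# the one with the lowest size-weighted GINI impurity of the two children.
import numpy as np

def gini(y):
    """GINI impurity of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Return the (feature index, threshold) pair that best splits (X, y)."""
    n_samples, n_features = X.shape
    best_feature, best_threshold, best_score = None, None, np.inf
    for j in range(n_features):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue  # a valid split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n_samples
            if score < best_score:
                best_feature, best_threshold, best_score = j, t, score
    return best_feature, best_threshold
```

Recursive partitioning then amounts to calling `best_split` on each subset until a stopping condition (pure node, minimum size, or maximum depth) is met.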
How Do Decision Trees Make Predictions?
Once trained, a Decision Tree predicts an output for a given input by:
Starting at the root node.
Evaluating the relevant feature of the input against the split condition at each internal node.
Following the branch corresponding to the condition until reaching a leaf node.
Assigning the output value stored in the leaf node.
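The traversal itself is simple to sketch. The nested-dictionary tree below is a hypothetical representation (libraries store trees in packed arrays), but it shows the root-to-leaf walk just described.

```python
# Predict by walking from the root to a leaf, following split conditions.
def predict_one(node, x):
    while "leaf" not in node:  # internal node: test the split condition
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["leaf"]  # the class label or value stored in the leaf

# Example: a one-split tree (a "stump") on feature 0 with threshold 2.5.
stump = {"feature": 0, "threshold": 2.5,
         "left": {"leaf": "A"}, "right": {"leaf": "B"}}
print(predict_one(stump, [1.0]))  # -> A
print(predict_one(stump, [3.0]))  # -> B
```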
GINI Impurity
What is it?: A measure of impurity in a node, computed as GINI = 1 − Σ p_i^2, where p_i is the proportion of samples in class i.
A lower GINI value represents purer nodes. Splits aim to minimize GINI to achieve homogeneous classes.
Why is it Used?: It is computationally efficient and works well for classification problems by aiming to create splits where nodes are dominated by a single class.
Entropy
What is it?: A measure of uncertainty or randomness in the data, computed as Entropy = −Σ p_i log2(p_i).
Like GINI, lower entropy values indicate purer splits.
Why is it Used?: Entropy quantifies uncertainty, and minimizing entropy ensures splits produce purer subsets.
Information Gain
What is it?: The reduction in impurity after a split, calculated as the difference between the parent node's impurity and the weighted sum of the child nodes' impurity:
IG = Impurity_Parent − Σ (n_i / N) × Impurity_Child_i
where n_i is the size of each child node, and N is the total number of samples.
Why is it Used?: Information Gain selects splits that maximize the reduction in impurity, leading to better predictions.
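These three metrics translate directly into code. The helpers below are a minimal sketch assuming class proportions and child-node sizes are already known; the function names are illustrative.

```python
# Impurity metrics matching the formulas above.
import numpy as np

def gini(proportions):
    """GINI = 1 - sum(p_i^2) over class proportions."""
    p = np.asarray(proportions, dtype=float)
    return 1.0 - np.sum(p ** 2)

def entropy(proportions):
    """Entropy = -sum(p_i * log2(p_i)); zero-probability classes contribute 0."""
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(parent_impurity, child_sizes, child_impurities):
    """Parent impurity minus the size-weighted average of child impurities."""
    sizes = np.asarray(child_sizes, dtype=float)
    weights = sizes / sizes.sum()
    return parent_impurity - np.sum(weights * np.asarray(child_impurities))

print(gini([0.50, 0.30, 0.20]))  # ~0.62, as in the example below
print(entropy([0.5, 0.5]))       # 1.0 bit: a maximally uncertain binary node
```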
Small Example Using GINI and Information Gain
Parent Node:
A dataset with 100 samples distributed across 3 classes:
Class A: 50 samples
Class B: 30 samples
Class C: 20 samples
Step 1: Calculate GINI for the Parent Node:
GINI_Parent=1−(0.50^2+0.30^2+0.20^2)=0.62
Step 2: Perform a Split:
Left Node: 40 samples (30 from Class A, 10 from Class B)
Right Node: 60 samples (20 from Class A, 20 from Class B, 20 from Class C)
Step 3: Calculate GINI for Child Nodes:
Left Node: GINI_Left=1−(0.75^2+0.25^2+0^2)=0.375
Right Node: GINI_Right=1−(0.33^2+0.33^2+0.33^2)=0.667
Step 4: Calculate Weighted Average GINI for Child Nodes:
GINI_Children=(40/100)×0.375+(60/100)×0.667=0.55
Step 5: Calculate Information Gain:
IG=GINI_Parent−GINI_Children=0.62−0.55=0.07
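As a quick sanity check, Steps 1 through 5 can be reproduced with the class counts given above:

```python
# Reproduce the worked example with plain arithmetic.
parent = 1 - (0.50**2 + 0.30**2 + 0.20**2)     # Step 1: GINI_Parent ~ 0.62
left = 1 - ((30/40)**2 + (10/40)**2)           # Step 3: GINI_Left = 0.375
right = 1 - 3 * (20/60)**2                     # Step 3: GINI_Right ~ 0.667
children = (40/100) * left + (60/100) * right  # Step 4: weighted GINI ~ 0.55
print(round(parent - children, 2))             # Step 5: information gain ~ 0.07
```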
Why Infinite Trees Are Possible
Left unconstrained, a Decision Tree can keep splitting until every leaf is pure, often down to individual samples, by:
Finding ever-finer thresholds or new feature combinations that separate the remaining samples.
Pursuing splits even when the reduction in impurity is negligible.
This can lead to overfitting, where the tree memorizes the training data instead of generalizing. Practical trees limit growth using parameters like max_depth, min_samples_split, or min_samples_leaf.
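A minimal sketch of how these limits are applied in practice, assuming scikit-learn (whose parameter names are the ones quoted above); the specific values are illustrative:

```python
# Constrain tree growth to guard against overfitting.
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,           # cap the number of levels in the tree
    min_samples_split=20,  # a node needs at least 20 samples to be split
    min_samples_leaf=10,   # every leaf must keep at least 10 samples
    random_state=42,       # reproducible tie-breaking between equal splits
)
# tree.fit(X_train, y_train) then grows a bounded, regularized tree.
```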
The dataset used in this analysis contains important features related to coal shipments, such as:
ash-content: A measure of the ash residue in coal.
heat-content: The heat energy produced when coal is burned.
price: The price of coal shipments.
quantity: The amount of coal shipped.
sulfur-content: A measure of sulfur levels in coal.
A new label, carbon_intensity, was created to classify shipments into three categories: High, Medium, or Low. The classification is based on specific thresholds for sulfur content, ash content, and shipment quantity, reflecting the carbon intensity of the shipments:
High: Sulfur-content > 2.0, ash-content > 10.0, and quantity > 1,000,000.
Medium: Sulfur-content > 1.0, ash-content > 5.0, and quantity > 500,000.
Low: All other shipments.
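A sketch of how this label can be derived with pandas, assuming a DataFrame `df` with the columns listed above; the helper name `label_carbon_intensity` is illustrative. Note that the High rule must be checked before the Medium rule, since any shipment meeting the High thresholds also meets the Medium ones.

```python
# Derive the carbon_intensity label from the documented thresholds.
import pandas as pd

def label_carbon_intensity(row: pd.Series) -> str:
    if row["sulfur-content"] > 2.0 and row["ash-content"] > 10.0 and row["quantity"] > 1_000_000:
        return "High"
    if row["sulfur-content"] > 1.0 and row["ash-content"] > 5.0 and row["quantity"] > 500_000:
        return "Medium"
    return "Low"

df["carbon_intensity"] = df.apply(label_carbon_intensity, axis=1)
```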
Snapshot of the dataset after processing
Training and Testing Split
The dataset was divided into training and testing subsets to enable unbiased evaluation of the models, using a split ratio of 70% for training and 30% for testing. Training on one portion of the data and evaluating on an unseen portion makes overfitting detectable and provides an honest measure of performance.
Training and Testing Dataset Statistics
Training Set: 70% of the data
Testing Set: 30% of the data
Features: ash-content, heat-content, price, quantity, sulfur-content
Label: carbon_intensity
The train-test split ensures the model is evaluated on data it has not seen during training, which exposes overfitting and shows how well the model generalizes to new, unseen data. Keeping the sets disjoint preserves the integrity of the evaluation process.
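A minimal sketch of the split, assuming scikit-learn and the labeled DataFrame `df` from the previous step; the stratification argument is an added assumption rather than something stated in the original analysis:

```python
# 70/30 train-test split over the documented features and label.
from sklearn.model_selection import train_test_split

features = ["ash-content", "heat-content", "price", "quantity", "sulfur-content"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["carbon_intensity"],
    test_size=0.30,                   # 30% held out for testing
    random_state=42,                  # reproducible split
    stratify=df["carbon_intensity"],  # keep class proportions in both sets
)
```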
1. Overview
The Decision Tree algorithm was employed on the coal shipment dataset to classify shipments into three carbon intensity categories: Low, Medium, and High. The algorithm splits data points at decision boundaries defined by features such as sulfur content, ash content, and quantity. Decision Trees are inherently interpretable and allow visualization of the decision-making process. The depth of the tree was varied to study its effect on the model's classification accuracy and complexity, providing insights into the trade-off between interpretability and performance.
2. Model Configurations
To evaluate the impact of tree depth, the Decision Tree algorithm was trained and tested under three configurations; a minimal training sketch follows the list:
Max Depth = 3: A shallow tree to enhance interpretability and limit overfitting while potentially sacrificing accuracy.
Max Depth = 5: A balanced depth aimed at optimizing both classification performance and complexity.
Max Depth = 7: A deeper tree to capture intricate decision boundaries at the cost of reduced interpretability.
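A sketch of this experiment, assuming the train/test variables from the split above; the `random_state` value is illustrative:

```python
# Train and evaluate one Decision Tree per depth configuration.
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

for depth in (3, 5, 7):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"max_depth={depth}: test accuracy = {accuracy:.2%}")
```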
DECISION TREE (MAX DEPTH = 3)
Accuracy: 93.62%
Analysis:
The model achieved high accuracy, classifying the majority of samples correctly.
Misclassifications were observed, particularly in the "Low" and "High" carbon intensity categories.
The limited depth led to a simpler tree that is easier to interpret but insufficient for capturing nuanced patterns in the dataset.
DECISION TREE (MAX DEPTH = 5)
Accuracy: 100.0%
Analysis:
Perfect accuracy was achieved, with all categories classified correctly.
The additional depth allowed the model to capture more intricate patterns in the data, effectively differentiating between "Low," "Medium," and "High" categories.
This configuration represents the optimal balance between performance and interpretability.
DECISION TREE (MAX DEPTH = 7)
Accuracy: 100.0%
Analysis:
Accuracy remained at 100%, indicating that the additional complexity did not improve performance.
The tree became more complex and less interpretable due to deeper splits that added unnecessary detail without contributing to classification accuracy.
1. Key Learnings
The application of Decision Tree modeling provided valuable insights into the classification of carbon intensity levels within the coal shipment dataset. The analysis revealed several important findings:
Feature Importance:
Features such as sulfur content, ash content, and quantity emerged as the most critical predictors of carbon intensity levels. These features consistently appeared as decision nodes at the higher levels of the tree, regardless of the depth configuration.
This indicates that these variables are highly influential in determining whether a shipment falls into the "Low," "Medium," or "High" carbon intensity category.
Model Performance:
The decision trees performed exceptionally well in classifying the data. The model with Max Depth = 3 achieved 93.62% accuracy, while deeper trees with Max Depth = 5 and 7 achieved perfect accuracy (100%).
The progression in depth demonstrated how additional splits allowed the model to capture more nuanced decision boundaries, leading to improved performance.
Role of Tree Depth:
Max Depth = 3: While this depth produced a simple, interpretable tree, it was unable to perfectly classify the data due to limited granularity in the splits.
Max Depth = 5: Achieved optimal performance by striking a balance between complexity and interpretability. This configuration captured sufficient patterns without overcomplicating the decision-making process.
Max Depth = 7: The tree retained 100% accuracy but introduced unnecessary complexity, which could lead to overfitting in larger datasets.
Classification Insights:
The model effectively separated the "Medium" carbon intensity category, with very few misclassifications observed in the confusion matrices.
Misclassifications at lower depths (e.g., Depth = 3) highlighted overlapping feature distributions between the "Low" and "High" categories, which were resolved with deeper splits.
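For reference, a hedged sketch of how such a confusion matrix can be produced with scikit-learn, assuming a fitted `model` and the test split from earlier:

```python
# Confusion matrix: rows are true classes, columns are predicted classes.
from sklearn.metrics import confusion_matrix

labels = ["Low", "Medium", "High"]
cm = confusion_matrix(y_test, model.predict(X_test), labels=labels)
print(labels)
print(cm)
```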
2. Predictive Applications
The Decision Tree models demonstrated their potential to predict carbon intensity levels for future coal shipments based on shipment properties. These predictions have practical implications:
Operational Optimization:
Energy companies can use the model to proactively identify high-carbon shipments and implement measures to mitigate their environmental impact, such as adjusting shipment schedules or sourcing coal from cleaner suppliers.
Policy Implications:
Insights into key predictors can inform policy decisions aimed at reducing carbon emissions from coal shipments by incentivizing the use of lower-carbon materials.
Supply Chain Planning:
By understanding the factors driving carbon intensity, stakeholders can optimize supply chains to prioritize low-carbon shipments, enhancing sustainability and compliance with environmental regulations.
3. Limitations
While the Decision Tree models provided high accuracy and valuable insights, the analysis highlighted some limitations:
Model Overfitting:
Increasing tree depth beyond 5 did not improve accuracy, suggesting that additional complexity may overfit the model to the training data.
Feature Engineering:
The model's performance relied heavily on the selected features. Additional features, such as geographic location, coal type, or supplier information, could enhance the model's predictive power.
Scalability:
While the current dataset size was manageable, deeper trees may become computationally expensive and less interpretable with larger datasets.
4. Future Work
To build on the findings of this analysis, the following steps are recommended:
Testing Ensemble Methods:
Models like Random Forests and Gradient Boosting could be tested to improve classification accuracy and mitigate overfitting, though at some cost to the interpretability of a single tree; a brief sketch follows this list.
Incorporating Additional Features:
Including external variables such as supplier data, transport routes, or regional policies may uncover new insights and improve the model’s generalizability.
Exploring Non-linear Relationships:
Advanced techniques such as feature transformations or interaction terms can help capture complex patterns in the data that simple splits might miss.
Validation on External Data:
Testing the model on a separate, unseen dataset would provide a better understanding of its real-world applicability and robustness.
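As one illustration of the first recommendation, a Random Forest could be fit on the same split; the hyperparameter values below are assumptions rather than tuned results:

```python
# A possible next step: an ensemble of trees over the same features.
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=200,  # number of trees in the ensemble
    max_depth=5,       # reuse the depth that balanced the single tree
    random_state=42,
)
forest.fit(X_train, y_train)
print(f"test accuracy: {forest.score(X_test, y_test):.2%}")
```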
5. Final Thoughts
The Decision Tree analysis highlighted the importance of balancing model interpretability and complexity. While deeper trees captured intricate patterns and achieved perfect accuracy, the model with Max Depth = 5 struck an optimal balance, providing high accuracy and interpretable results. This balance is crucial when applying models in real-world scenarios where decision-makers must understand and trust the outputs.
By identifying critical features such as sulfur content and quantity, this study underscores the potential for data-driven approaches to reduce carbon emissions in coal shipments. The results provide a foundation for actionable strategies in sustainability and operational efficiency, paving the way for more advanced analyses and real-world applications.