Decision Trees (DTs) are a popular supervised learning algorithm used for both classification and regression tasks. They are powerful and versatile tools that can handle both categorical and numerical data, making them suitable for a wide range of applications in various domains, including but not limited to finance, healthcare, and environmental science.
A Decision Tree is a hierarchical tree-like structure composed of nodes and branches. At each node, the algorithm makes a decision based on the value of a certain feature, splitting the data into subsets. This process continues recursively until a stopping criterion is met, such as reaching a maximum depth or no further improvement in purity. The final nodes of the tree, called leaf nodes, represent the predicted outcome or class label.
Figure 1 showcases the basic structure of a Decision Tree:
1. Root Node: The root node is the topmost node of the Decision Tree. It represents the starting point of the decision-making process. In terms of the tree's structure, the root node is where the initial decision is made based on the value of a selected feature. It serves as the entry point for the data and sets the stage for subsequent branching.
2. Branch: A branch represents a decision path or a split in the data based on the value of a feature at a particular node. Each branch extends from a node and leads to further nodes or leaf nodes, depending on the decision criteria. Branches represent the possible outcomes or choices resulting from the decision made at the node. They illustrate the flow of the decision-making process as the algorithm navigates through the tree.
3. Leaf: A leaf node is a terminal node of the Decision Tree that does not split further. It represents the final outcome or prediction made by the algorithm based on the features' values along the decision path. In classification tasks, each leaf node corresponds to a specific class label, while in regression tasks, leaf nodes contain continuous predicted values. Leaf nodes provide the ultimate result of the decision process, indicating the predicted class or value for a given input.
4. Internal Node: An internal node is any node in the Decision Tree that is not a leaf node. These nodes represent decision points where the algorithm evaluates the value of a specific feature and decides which branch to follow based on the feature's value. Internal nodes guide the flow of the decision-making process, leading to further nodes or leaf nodes until a stopping criterion is met. They play a crucial role in partitioning the data into subsets based on the feature values, ultimately contributing to the accurate prediction or classification of the input data.
- The root node is positioned at the top of the tree, symbolizing the beginning of the decision-making process.
- Branches extend from the root node, representing decision paths based on the value of a selected feature.
- Internal nodes are located along the branches, indicating decision points where the algorithm evaluates feature values and determines the next steps.
- Leaf nodes are situated at the ends of the branches, depicting the final outcomes or predictions made by the algorithm.
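To make this structure more tangible, the short sketch below (an illustrative example only, assuming scikit-learn and its built-in iris dataset rather than our project data) fits a small tree and prints it, so the root node, internal nodes, and leaf nodes described above can be seen directly.

```python
# Illustrative sketch: fit a small Decision Tree and print its structure.
# The iris dataset stands in for real data; it is not our project dataset.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=2, random_state=42)
clf.fit(iris.data, iris.target)

# In the printed tree, the first split is the root node, indented splits are
# internal nodes, and the "class: ..." lines are the leaf nodes.
print(export_text(clf, feature_names=list(iris.feature_names)))
```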
For this project, we plan to use Decision Trees to classify news articles on climate change into different categories or topics. By doing so, we aim to gain insights into the prevailing themes and topics discussed in the media coverage of climate change. Specifically, we intend to classify articles based on their content into categories such as "Policy and Regulation," "Scientific Research," "Environmental Activism," "Climate-related Events," and "Climate Change Impacts."
This classification will inform our topic by providing a structured and organized overview of the diverse perspectives and narratives surrounding climate change in the media. We expect that Decision Trees will accurately classify the articles into relevant categories based on the distinctive features present in the text, such as keywords, phrases, and linguistic patterns associated with each topic. By analyzing the classified data, we aim to identify emerging trends, areas of consensus or contention, and key issues in public discourse on climate change, ultimately informing decision-making processes, communication strategies, and policy development initiatives in addressing this critical global challenge.
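As a rough sketch of how this could be implemented (assuming scikit-learn and pandas, and using a hypothetical file `climate_articles.csv` with placeholder columns `article_text` and `category` rather than our actual data layout), the articles can be vectorized with TF-IDF and passed to a Decision Tree; the train/test split discussed below is omitted here for brevity.

```python
# Hypothetical sketch: classify news articles into climate-change topics
# with a TF-IDF + Decision Tree pipeline. The file name and column names
# are placeholders, not the project's actual schema.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("climate_articles.csv")      # assumed labeled article data
X, y = df["article_text"], df["category"]     # e.g. "Policy and Regulation", ...

topic_clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_features=5000)),
    ("tree", DecisionTreeClassifier(random_state=42)),
])
topic_clf.fit(X, y)

# Predict the topic of a new, unseen article
print(topic_clf.predict(["New emissions regulations were announced today."]))
```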
Supervised modeling, including Decision Trees, relies on labeled data to train predictive models effectively. Labeled data consists of input features paired with corresponding target labels or outcomes. These labels serve as the ground truth, providing the algorithm with examples of how the input features should be classified or predicted.
Once we have labeled data, the next step is to split it into two subsets: the Training Set and the Testing Set. The Training Set is used to train or build the model, while the Testing Set is used to evaluate the model's performance and accuracy.
The Training Set:
- The Training Set comprises a portion of the labeled data, typically the majority of the dataset, such as 70-80%.
- This subset is used to teach the algorithm to recognize patterns and relationships between the input features and their corresponding labels.
- During the training process, the algorithm adjusts its internal parameters based on the labeled examples in the Training Set, optimizing its ability to make accurate predictions.
- The Training Set serves as the foundation for building a predictive model that can generalize well to unseen data.
The Testing Set:
- The Testing Set consists of the remaining portion of the labeled data, typically 20-30% of the dataset.
- This subset is kept separate from the Training Set and is used to evaluate the model's performance.
- The Testing Set allows us to assess how well the trained model generalizes to new, unseen data.
- By evaluating the model on the Testing Set, we can measure its accuracy, precision, recall, and other performance metrics (a short computation sketch follows this list).
- The Testing Set simulates real-world scenarios where the model encounters new data during deployment and provides insights into its effectiveness in making predictions.
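To make these metrics concrete, the minimal sketch below shows how they might be computed with scikit-learn; it assumes a trained classifier `clf` and a held-out Testing Set `X_test`, `y_test` produced by a split such as the one described next.

```python
# Sketch: evaluate a trained classifier on the held-out Testing Set.
# Assumes clf, X_test, and y_test already exist from an earlier split.
from sklearn.metrics import accuracy_score, classification_report

y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
# Per-class precision, recall, and F1-score
print(classification_report(y_test, y_pred))
```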
Creating the Training and Testing Sets:
1. Random Splitting: The labeled dataset is randomly partitioned into two subsets, ensuring that each data point has an equal chance of being included in either the Training or Testing Set.
2. Stratified Splitting (Optional): In classification tasks with imbalanced class distributions, we may employ stratified splitting to ensure that the proportion of each class is preserved in both the Training and Testing Sets (see the splitting sketch below).
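A minimal splitting sketch, assuming scikit-learn and the hypothetical `X`, `y` from the earlier example, is shown below; the 80/20 ratio is one common choice, not a requirement.

```python
# Sketch: split the labeled data into disjoint Training and Testing Sets.
from sklearn.model_selection import train_test_split

# 1. Random split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Stratified split (optional): preserve each class's proportion in both
#    subsets, which matters when the label distribution is imbalanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```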
Ensuring that the Training and Testing Sets are disjoint is crucial for the reliable evaluation of machine learning models. Disjointness means that no data points are shared between these sets, creating a clear boundary between the data used for model training and that used for evaluation. This approach simulates real-world scenarios where models encounter new, unseen data during deployment. By evaluating the model on disjoint testing data, we obtain a more accurate assessment of its ability to generalize beyond the training data.
Disjoint sets are paramount for preventing the model from simply memorizing the training data without truly learning the underlying patterns. If the same data points were used for both training and testing, the model might perform well on the testing set due to familiarity with similar instances seen during training, rather than genuine generalization to unseen data. This phenomenon, known as overfitting, can lead to inflated performance metrics and poor performance on new data. Disjoint sets mitigate the risk of overfitting by ensuring that the model is evaluated on genuinely novel data.
Moreover, disjoint sets guard against data leakage, a scenario where information from the testing set inadvertently influences the model training process. Data leakage can result in artificially inflated performance metrics, leading to overestimated model performance. By keeping the training and testing sets completely separate, disjointness prevents any unintentional incorporation of testing set information into the training process, preserving the integrity of the evaluation process.
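For text data, one common way to enforce this separation (a hedged sketch, reusing the hypothetical variables from the splitting example above) is to fit the vectorizer on the Training Set only, so that no vocabulary or document statistics from the Testing Set influence training.

```python
# Sketch: learn the text representation from the Training Set only, then
# apply the same transformation to the Testing Set to avoid data leakage.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier

vectorizer = TfidfVectorizer(stop_words="english")
X_train_vec = vectorizer.fit_transform(X_train)  # vocabulary/IDF from training text only
X_test_vec = vectorizer.transform(X_test)        # same mapping applied to test text

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train_vec, y_train)
print(tree.score(X_test_vec, y_test))            # accuracy on genuinely unseen data
```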
Figure 7 shows the default Decision Tree, a model that assigns a sentiment label (negative, neutral, or positive) to given text snippets based on the decision criteria learned during training.
The figure includes two example text snippets: the first discusses staff coordination with eco groups and an issue involving the Biden administration, while the second mentions a conservative commentator being ordered to pay $1M for defaming a climate scientist.
Figures 8 and 9 show two further Decision Trees, produced by varying the pruning and maximum-depth parameters, respectively.
Figures 8 and 9 also include two text snippets that were evaluated by these trees: the first discusses ecological transformation and issues involving the Biden administration related to John Kerry’s staff coordination with eco groups, and the second mentions a DC jury ordering a conservative commentator to pay for defaming a climate scientist.
The pruned and limited-depth Decision Trees are powerful tools for understanding how the model makes its predictions: they provide a visual representation of the decision-making process, showing how different conditions lead to different outcomes.
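For reference, these two variants correspond to standard Decision Tree parameters; the sketch below (with illustrative values, not the ones actually used to produce Figures 8 and 9) shows how a limited-depth tree and a cost-complexity-pruned tree might be configured, reusing the hypothetical training matrices from the earlier sketches.

```python
# Sketch: two common ways to simplify a Decision Tree (illustrative values).
from sklearn.tree import DecisionTreeClassifier

# Limited-depth tree: growth stops once the tree reaches max_depth levels
limited_depth_tree = DecisionTreeClassifier(max_depth=3, random_state=42)

# Pruned tree: cost-complexity pruning removes branches whose contribution
# does not justify their complexity (larger ccp_alpha => heavier pruning)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42)

limited_depth_tree.fit(X_train_vec, y_train)
pruned_tree.fit(X_train_vec, y_train)
```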
Lastly, Figure 10 showcases the multi-split Decision Tree. Like the default DT (Figure 7), this tree was moderately helpful in determining how the sentiments split across our dataset. As shown, it evaluates the same parameters and features as the default DT; nonetheless, it helped us assess whether the DT algorithm is suitable for our dataset.
After conducting an analysis of our dataset, several conclusions can be drawn regarding the effectiveness of Decision Trees (DTs) in addressing the problem at hand. While Decision Trees are often considered a versatile and interpretable machine learning algorithm, their performance on our dataset did not meet expectations: despite achieving a reasonable accuracy, they did not exhibit the level of performance anticipated for our specific dataset.
One of the notable observations from our analysis is that while the Decision Trees achieved a decent accuracy rate, they did not effectively capture the underlying patterns and complexities present in the data. This indicates that the Decision Trees may have struggled to generalize well to unseen data points or to accurately classify instances with more nuanced features. Consequently, relying solely on Decision Trees as the primary predictive model for our dataset may not be the optimal approach.
Furthermore, it became evident that the dataset used for our analysis may not have been well-suited for the Decision Tree algorithm. Decision Trees tend to perform better on datasets with clear and distinct decision boundaries, whereas our dataset may have contained more intricate relationships between the features and the target variable. As a result, the Decision Trees may have faced challenges in accurately partitioning the feature space to effectively separate the different classes of the target variable.
While the Decision Trees provided some insights into the data and helped in understanding the importance of various features, it is apparent that they may not be the most suitable algorithm for our specific dataset.