Decision Trees
Overview
Decision Trees are a popular and interpretable form of supervised machine learning that can be used for both classification and regression problems. They mimic human decision-making by breaking a complex problem down into a series of simple decisions. Each internal node in a decision tree tests an attribute, each branch represents an outcome of that test, and each leaf node holds a class label or predicted value. The tree structure is easy to visualize, which makes the rationale behind the model's predictions easy to understand. For classification tasks such as determining whether a movie is a commercial success, decision trees can readily learn patterns from audience ratings, box office earnings, genres, and other features. A further appeal of decision trees is that they can handle both numeric and categorical data, tolerate missing values, and do not require feature scaling.
One of decision trees' greatest strengths is the ability to evaluate candidate splits with a measure of impurity or disorder. The most widely used measure is the Gini Index, which estimates the probability of misclassifying a randomly chosen element if it were labeled randomly according to the label distribution in the node. The lower the Gini Index, the purer the node. When selecting a split, the algorithm chooses the attribute that yields the greatest reduction in Gini Index from the parent node to its children, which builds a tree that classifies examples with growing confidence at each level. The Gini Index is computed as Gini = 1 − Σ_j p_j², where p_j is the probability of the jth class label.
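To make the formula concrete, a minimal Python sketch of the Gini calculation (using hypothetical class labels, not values from the movie dataset) might look like this:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini = 1 - sum(p_j^2), where p_j is the share of class j in `labels`."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has Gini = 0; a 50/50 binary node has Gini = 0.5.
print(gini_impurity(["Yes"] * 10))              # 0.0
print(gini_impurity(["Yes"] * 5 + ["No"] * 5))  # 0.5
```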
Another widely used measure is Entropy, a quantification of the randomness or disorder in a dataset. The higher the entropy, the more impure (mixed) the classes in the dataset; the lower the entropy, the purer it is. When a split occurs, the Information Gain is the reduction in entropy from before to after the split. Entropy is computed as H = −Σ_i p_i log2(p_i), where p_i is the probability of class i occurring in the dataset.
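A corresponding sketch of the entropy calculation, again on hypothetical labels, could be:

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p_i * log2(p_i)), where p_i is the share of class i in `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# A pure node has entropy 0; a 50/50 binary node has entropy 1 bit.
print(entropy(["Yes"] * 4 + ["No"] * 4))  # 1.0
```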
Splitting in a decision tree is therefore driven by the Information Gain metric, evaluated for each candidate attribute at a node. Information Gain measures the reduction in entropy achieved by splitting the dataset on a specific attribute: IG(S, A) = H(S) − Σ_v (|S_v| / |S|) · H(S_v), where S_v is the subset of S for which attribute A takes value v. It quantifies how much "useful information" a feature provides about the class label. A split that yields purer (lower-entropy) child nodes produces a larger information gain. The decision tree algorithm selects the feature with the highest Information Gain to produce the best split and move closer to a precise and efficient classification.
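Building on the entropy helper sketched above, a minimal illustration of Information Gain for a candidate split might look like this (representing a split simply as a list of child label groups is an assumption made for the example):

```python
def information_gain(parent_labels, child_label_groups):
    """IG = H(parent) minus the size-weighted average entropy of the children."""
    n = len(parent_labels)
    weighted_child_entropy = sum(
        (len(child) / n) * entropy(child) for child in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy
```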
For example, in the given decision tree, entropy is used to quantify the impurity or disorder at each decision node. The dataset initially contains a mix of "Yes" and "No" responses, leading to high entropy. Using the attribute "Age," the first split divides the data into two branches: ≤30 and >30. Additional attributes, like "Car type" or "# Children," further distinguish each branch. At each node, the entropy of the subset of data that reaches it is computed, and the feature that yields the greatest Information Gain, i.e., the greatest decrease in entropy, is chosen. Entropy decreases as we move down the tree and create more homogeneous (pure) groups; in the leaf nodes, where each sample belongs to a single class (either all "Yes" or all "No"), it ideally reaches zero. Following this process produces a tree that classifies the data with the least amount of uncertainty.
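As a purely hypothetical numeric illustration of the Age split described above (the counts are invented for the example, not taken from the data), the helpers sketched earlier could be applied like this:

```python
# Hypothetical root node: 5 "Yes" and 5 "No" responses (entropy = 1 bit).
root = ["Yes"] * 5 + ["No"] * 5

# Hypothetical outcome of the Age split: the <=30 branch is mostly "No",
# the >30 branch is mostly "Yes" -- each child is purer than the root.
age_le_30 = ["No"] * 4 + ["Yes"]
age_gt_30 = ["Yes"] * 4 + ["No"]

print(entropy(root))                                   # 1.0
print(information_gain(root, [age_le_30, age_gt_30]))  # ~0.278
```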
It is in principle possible to grow an unlimited number of decision trees, because a tree can keep splitting the data until every data point is perfectly labeled. In doing so, however, it overfits, memorizing the training data instead of generalizing from it. This is why techniques such as pruning, a maximum depth, or a minimum number of samples per leaf node have to be employed to control tree complexity. A minor change in the training data can produce an extremely different tree, so decision trees are prone to high variance. Nevertheless, their intuitive structure, coupled with good metrics for evaluating splits, makes decision trees a very effective first approach to most classification problems, especially in exploratory research where model interpretability is valuable.
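In scikit-learn, which is used later in this analysis, these complexity controls are typically expressed as hyperparameters of DecisionTreeClassifier; the values below are illustrative assumptions, not the settings used in our models:

```python
from sklearn.tree import DecisionTreeClassifier

# Illustrative complexity controls: cap the depth, require a minimum number
# of samples per leaf, and optionally apply cost-complexity pruning.
regularized_tree = DecisionTreeClassifier(
    max_depth=4,          # stop growing beyond four levels of splits
    min_samples_leaf=20,  # every leaf must contain at least 20 samples
    ccp_alpha=0.001,      # strength of cost-complexity (post-)pruning
    random_state=42,
)
```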
To apply Decision Tree modeling appropriately, the data must first be prepared with purely numeric features and an encoded target label. That is why we reused the same cleaned data from the Multinomial Naive Bayes analysis. The target variable MOVIE_CATEGORY was label-encoded to convert string labels (like "Successful", "Failure", and "Mediocre") into numeric values. The variables provided for model training are DURATION, RATING, VOTES, BUDGET, GROSSWORLDWIDE, NOMINATIONS, and OSCARS. These are numeric variables that help identify trends related to whether a film is successful or not. After cleaning, the data was split into training and test sets using the train_test_split() function from scikit-learn.
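A sketch of this preparation step is shown below; the DataFrame name movies, the file path, and the random seed are assumptions made for illustration, while the column names match those listed above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Placeholder path: load the cleaned dataset reused from the Naive Bayes analysis.
movies = pd.read_csv("cleaned_movies.csv")

feature_columns = ["DURATION", "RATING", "VOTES", "BUDGET",
                   "GROSSWORLDWIDE", "NOMINATIONS", "OSCARS"]
X = movies[feature_columns]

# Encode the string target ("Successful", "Failure", "Mediocre") as integers.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(movies["MOVIE_CATEGORY"])

# Disjoint 80/20 train/test split, as described below.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```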
The training set holds 80% of the data and is used to build the decision tree, while the testing set, which makes up the remaining 20%, is used to assess the model's generalizability. The two subsets are disjoint, i.e., no data from the training set is reused in testing. This disjoint property ensures that there is no data leakage, maintaining the integrity of the test. Without this separation, the model might memorize patterns instead of learning them, resulting in implausibly high accuracy scores that would not transfer to new data. Both the train and test sets reflect the variation of movie records across the three movie categories.
Each subset includes samples with varying levels of votes, budgets, and gross revenues, among other attributes. A preview of each set is also shown in the notebook to visually verify the integrity and quality of the split. This separation between the training and test sets must be maintained for any supervised machine learning algorithm to yield sound and correct conclusions. This is the setup on which the Decision Tree is trained and tested for its accuracy in predicting whether a movie will succeed.
Results and Conclusion
The first tree was built using the Gini impurity measure and a maximum depth of 3. The tree began with a root split on the attribute "RATING" and continued with further splits on "GROSSWORLDWIDE," "OSCARS," and "BUDGET." The model's confusion matrix shows that the tree correctly predicted 1967 "Failure" and 767 "Successful" movies. The model was 86.25% accurate overall. It was extremely poor, however, on the "Mediocre" class, predicting zero instances of it, which was reflected in a precision and recall of 0 for this class.
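A hedged sketch of how this first tree could be trained and evaluated, reusing the split and label encoder from the preparation sketch above, is shown below; the exact settings in the notebook may differ:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

# Tree 1: Gini impurity with a maximum depth of 3.
tree_gini = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree_gini.fit(X_train, y_train)

y_pred = tree_gini.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=label_encoder.classes_))
```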
The second tree used Entropy as the splitting criterion, with the maximum depth set to 4. The resulting splits were similar, but features were chosen by maximizing information gain computed from entropy rather than the reduction in Gini impurity. The confusion matrix shows performance slightly superior to the Gini tree, with 2016, 195, and 743 correct classifications of the "Failure," "Mediocre," and "Successful" classes respectively. The accuracy was marginally better at 87.03%. The most significant improvement was in the classification of the "Mediocre" category, although precision and recall for that class remained low overall.
The third tree used Gini impurity again, with a maximum depth of 4 and a requirement of at least 50 samples to split a node. This regularization helped avoid overfitting by preventing splits on very small groups of samples. The model predicted the "Failure" (1967) and "Successful" (770) classes correctly but, as with the first model, failed to classify any "Mediocre" cases correctly. The accuracy of this tree was 86.34%, slightly higher than the first tree but slightly lower than the Entropy-based model. Nevertheless, it had a balanced distribution of nodes and remained less complex than deeper trees.
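The three configurations described above could be compared side by side with a sketch like the following; the hyperparameter values are inferred from the descriptions and may not match the notebook exactly:

```python
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

configs = {
    "Tree 1: Gini, depth 3": DecisionTreeClassifier(
        criterion="gini", max_depth=3, random_state=42),
    "Tree 2: Entropy, depth 4": DecisionTreeClassifier(
        criterion="entropy", max_depth=4, random_state=42),
    "Tree 3: Gini, depth 4, min 50 samples to split": DecisionTreeClassifier(
        criterion="gini", max_depth=4, min_samples_split=50, random_state=42),
}

for name, model in configs.items():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {accuracy:.4f}")
```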
Comparing all three trees, the Entropy-based tree (Tree 2) had the highest accuracy and macro-average results, particularly because it was slightly more attuned to the "Mediocre" class. Both Gini-based trees (Tree 1 and Tree 3) were on par with each other in detecting "Failure" and "Successful" films but were weak at predicting "Mediocre." Tree 3, with its minimum-samples-to-split constraint, gave the most generalizable and cleanest model with the least risk of overfitting, offering a good balance between performance and simplicity. If sensitivity to the "Mediocre" class is a priority, however, the Entropy-based tree is the better choice.
Overall, the decision tree models support clear conclusions about movie performance prediction. The consistent selection of attributes such as "RATING," "GROSSWORLDWIDE," "OSCARS," and "BUDGET" across all trees highlights their strong predictive power for classifying a movie's success category. The ability of decision trees to separate high-performing and low-performing films with high accuracy shows that well-structured numeric data can support strong distinctions in box-office success. The difficulty of predicting the "Mediocre" category, however, suggests that additional features or finer thresholds are needed for greater granularity. These results confirm the potential of decision trees for classification and for discovering comprehensible rules in movie analytics.