XGBoost
Overview
Ensemble learning is a machine learning technique in which multiple models (typically referred to as "weak learners") are combined to form a stronger predictive model. The underlying assumption is that, although individual models make errors, their collective predictions tend to cancel those errors out, yielding results that are more accurate and robust overall. Ensemble methods come in several varieties, such as bagging (e.g., Random Forest), boosting (e.g., XGBoost, AdaBoost), and stacking, which differ in how they build and combine their constituent models. These methods are widely used to improve model performance, reduce variance and bias, and help models generalize to new data.
XGBoost, short for Extreme Gradient Boosting, is a fast and scalable implementation of the gradient boosting algorithm. It builds an ensemble of decision trees sequentially, with each new tree attempting to correct the errors made by the previous ones. Unlike simpler boosting algorithms, XGBoost includes regularization to prevent overfitting, built-in handling of missing values, optimized support for sparse data and parallel computation, and strong performance on structured (tabular) data. This suitability for tabular data, combined with its speed, makes it a good choice for this task.
In this project, XGBoost was applied to the movie dataset after feature selection, label encoding of the target variable (MOVIE_CATEGORY), and standardization of the feature values. The features DURATION, RATING, VOTES, BUDGET, GROSSWORLDWIDE, NOMINATIONS, and OSCARS were standardized to zero mean and unit variance. The dataset was then split into training and test sets using stratified sampling so that the class distribution was preserved. The XGBoost classifier was trained with hyperparameters such as max_depth=6, learning_rate=0.1, and n_estimators=200. The model achieved a very high accuracy of 98.5%, highlighting its ability to detect fine-grained patterns and correlations in the data and to categorize movies into the Failure, Mediocre, and Successful groups with very high precision and recall.
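A minimal sketch of how a classifier with these reported hyperparameters might be configured is shown below; the other settings (objective, eval_metric, random_state) are assumptions, since the original code is not reproduced here, and training uses the scaled, stratified split sketched in the next section.

```python
from xgboost import XGBClassifier

# Hypothetical reconstruction of the classifier configuration reported above;
# only max_depth, learning_rate, and n_estimators are stated in the text.
model = XGBClassifier(
    max_depth=6,                 # reported maximum tree depth
    learning_rate=0.1,           # reported shrinkage per boosting round
    n_estimators=200,            # reported number of boosting rounds (trees)
    objective="multi:softprob",  # assumed: three movie categories
    eval_metric="mlogloss",      # assumed evaluation metric
    random_state=42,             # assumed seed for reproducibility
)
```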
Data Prep and Code
The pre-processed dataset includes seven carefully chosen numeric features: DURATION, RATING, VOTES, BUDGET, GROSSWORLDWIDE, NOMINATIONS, and OSCARS, plus the MOVIE_CATEGORY target. At this point the features were raw, unnormalized numbers with large magnitudes, especially the monetary features BUDGET and GROSSWORLDWIDE. Although label encoding of MOVIE_CATEGORY had already been completed (the target was already numerical), the features themselves were still on their natural scales. At this stage the dataset had only been selected and encoded; it was not yet normalized or ready for machine learning algorithms that are sensitive to feature scale.
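A small sketch of this selection and encoding step is given below, assuming the data sit in a pandas DataFrame loaded from a hypothetical movies.csv (the actual file name and loading code are not shown in the original):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical file name; the original loading code is not shown.
df = pd.read_csv("movies.csv")

FEATURES = ["DURATION", "RATING", "VOTES", "BUDGET",
            "GROSSWORLDWIDE", "NOMINATIONS", "OSCARS"]
TARGET = "MOVIE_CATEGORY"

X = df[FEATURES]                              # raw, unscaled numeric features
y = LabelEncoder().fit_transform(df[TARGET])  # Failure/Mediocre/Successful -> 0/1/2
```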
After preparation, the column structure of the dataset remains the same, but the most important transformation was scaling the feature values with a standardization method (e.g., StandardScaler). This puts all features on a similar scale, typically mean 0 and standard deviation 1, so that XGBoost and other machine learning algorithms can perform better and converge more quickly. Even though tree-based models such as XGBoost can operate on raw values internally, this preprocessing step was performed so that the large differences in magnitude between features (such as financial figures versus ratings) would not introduce bias or instability during model training.
Before training the XGBoost model, the data was standardized with a StandardScaler so that every numerical feature column has mean 0 and standard deviation 1. This standardization step lets each feature contribute equally to model learning, without being dominated by differences in value scale. After scaling, the training and test sets have approximately zero means for features such as DURATION, RATING, VOTES, BUDGET, GROSSWORLDWIDE, NOMINATIONS, and OSCARS. Standardization supports fast convergence and better performance of the XGBoost classifier, allowing it to choose optimal splits without bias toward features with large value ranges.
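Continuing the sketch above (with X, y, and model as defined earlier), the scaling and stratified split might look like the following; the test-set proportion is an assumption, since the original split ratio is not stated:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Standardize every feature to zero mean and unit variance, as described above.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Stratified split so the Failure/Mediocre/Successful proportions are preserved.
# test_size is assumed; the original split ratio is not reported.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, stratify=y, random_state=42
)

model.fit(X_train, y_train)                  # model from the Overview sketch
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```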
Results and Conclusion
The classification report confirms XGBoost's strong performance, with an overall accuracy of 99%. Precision, recall, and F1-scores are very high for all classes: the "Failure" and "Successful" classes both reach near-perfect scores of 0.99, while the "Mediocre" class reaches a strong 0.92 on all three metrics. The macro and weighted averages are both well above 0.95, indicating balanced performance even in the presence of class imbalance (the Mediocre class being the minority). This balance shows that the model generalizes well and does not skew its predictions toward the majority classes.
The confusion matrix compares the XGBoost model's predictions with the actual labels of the three film classes: Failure, Mediocre, and Successful. The matrix shows strong diagonal dominance, with 2081 Failures, 217 Mediocres, and 825 Successes predicted correctly, indicating very good classification ability across all three classes. Only a few misclassifications occur, with minimal confusion between classes, demonstrating accurate and stable predictive performance even for the relatively smaller Mediocre class.
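The report and matrix above could be produced with scikit-learn's standard evaluation utilities, sketched here with model, X_test, and y_test as in the earlier snippets; the class-name ordering is assumed to follow the alphabetical label encoding:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

class_names = ["Failure", "Mediocre", "Successful"]  # assumed label-encoder order

y_pred = model.predict(X_test)

# Per-class precision, recall, and F1, plus macro/weighted averages.
print(classification_report(y_test, y_pred, target_names=class_names))

# Confusion matrix with true labels on the rows and predictions on the columns.
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm, display_labels=class_names).plot(cmap="Blues")
plt.show()
```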
The XGBoost Tree 0 visualization shows the first decision tree generated by the model. It illustrates how the model splits the data on different features (labelled f0, f1, f4, etc.) and their respective thresholds. Red arrows indicate splits leading toward positive leaf values, and blue arrows indicate splits leading toward negative leaf values. Internal nodes are condition nodes on the features, while terminal (leaf) nodes carry the scores that are combined into the final class prediction. The depth and width of the tree reflect XGBoost's ability to capture non-linear patterns in the data, which is needed given the rich diversity of movie performance values.
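A sketch of how such a tree plot can be rendered with xgboost's built-in plotting helper (which requires the graphviz package) is shown below; num_trees=0 selects the first boosted tree, and the figure size is an arbitrary choice:

```python
import matplotlib.pyplot as plt
import xgboost as xgb

# Render the first boosted tree of the trained model; feature names default
# to f0, f1, ... when none are supplied, matching the labels described above.
fig, ax = plt.subplots(figsize=(20, 10))
xgb.plot_tree(model, num_trees=0, ax=ax)
plt.show()
```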
Overall, XGBoost outperformed all of the SVM models tried by a large margin, achieving near-perfect accuracy and consistent performance across all movie categories. Its ability to model complex, non-linear patterns while handling class imbalance made it the best-performing classifier for this dataset. Whereas the SVM models struggled with class 1 ("Mediocre"), which had low recall, XGBoost not only represented the "Failure" and "Successful" classes well but also handled "Mediocre" with good precision and recall. For this task, XGBoost is therefore the most accurate and consistent model.