Overview
Train-test splitting is a fundamental step in preparing a dataset for machine learning. In this project, the dataset was divided in an 80/20 proportion between the train and test sets: 80% of the records were used to train the model and the remaining 20% to evaluate its performance. This verifies how the model behaves on unseen data. It is essential that the training and test sets be disjoint, i.e., that they share no records, to avoid any leakage of information. Shared examples between the two sets can lead to artificially high accuracy and poor generalization to real tasks.
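A minimal sketch of such a split using scikit-learn's `train_test_split`; the array shapes here are hypothetical stand-ins for the project's data, and `random_state` is fixed only so the split is reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (hypothetical shape: 50 rows, 2 features).
X = np.arange(100).reshape(50, 2)
y = np.array([0, 1] * 25)

# 80/20 split; test_size=0.2 reserves 20% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The two sets are disjoint: no row appears in both.
train_rows = {tuple(r) for r in X_train}
test_rows = {tuple(r) for r in X_test}
assert train_rows.isdisjoint(test_rows)
```

Because each row lands in exactly one of the two sets, the disjointness check above always passes for a row-wise split of unique records.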
Application in Models
For the Naive Bayes (NB) models, the train-test split was performed before any transformations. Specifically, for MultinomialNB and CategoricalNB, the transformations (e.g., binning or encoding) were fitted on the training set and then applied unchanged to the test set. This prevents the model from learning anything about the test-set distribution and keeps the disjoint property of the split intact. The same split was reused for every Naive Bayes implementation to maintain consistency.
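A sketch of this fit-on-train, apply-to-test pattern, assuming quantile binning as the transformation (the data here is synthetic; the binning strategy and bin count are illustrative assumptions, not the project's exact settings):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # hypothetical numeric features
y = rng.integers(0, 2, size=200)

# Split first, before fitting any transformation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the binning transform on the training data only...
binner = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_train_b = binner.fit_transform(X_train)
# ...then apply the already-fitted bin edges to the test data.
X_test_b = binner.transform(X_test)

# Ordinal bin indices are non-negative counts, which MultinomialNB accepts.
model = MultinomialNB().fit(X_train_b, y_train)
preds = model.predict(X_test_b)
```

The key point is the asymmetry: `fit_transform` on the training set, but only `transform` on the test set, so the bin edges never see test data.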
For Decision Trees (DT), the same train-test split strategy was used. The model was trained on 80% of the dataset, using features such as ratings, budget, and revenue to build the tree structure. No additional transformations were applied, so the raw (or scaled) numeric values were used directly. It was ensured that the train and test sets did not overlap, allowing a proper validation of model performance.
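A sketch of this setup with synthetic stand-ins for the three features; `max_depth=4` is an illustrative hyperparameter, not the project's tuned value. Trees split on thresholds, which is why the raw numeric values need no scaling:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
# Hypothetical stand-ins for ratings, budget, and revenue.
X = rng.uniform(size=(200, 3))
y = (X[:, 0] > 0.5).astype(int)        # synthetic target for illustration

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Threshold-based splits make trees insensitive to feature scale,
# so no StandardScaler step is needed here.
tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
acc = tree.score(X_test, y_test)
```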
For Logistic Regression, the same disjoint train-test split was performed, with the additional step of feature standardization via StandardScaler. The scaler was fitted only on the training set to learn the mean and standard deviation, which were then used to transform both the training and test sets. This ensured that no statistical information from the test set crept into the training process.
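A sketch of scaling done this way (synthetic data; the feature count is an assumption). The scaler's statistics come from the training rows only, and the same statistics are reused on the test rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 4))  # hypothetical features
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Learn mean and standard deviation from the training set only.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)    # reuse the training statistics

clf = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
test_preds = clf.predict(X_test_s)
```

After this, the scaled training features have mean 0 and unit variance by construction, while the scaled test features only approximately do, since their statistics were never used.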
For SVM, the movie dataset was first reduced to numerical features such as duration, ratings, votes, budget, and awards. These features were standardized using StandardScaler so that all values lie on the same scale, which is important for SVM performance because the kernel's distance computation is sensitive to feature magnitudes. The data was then split into a training set (80%) and a disjoint test set (20%) to train and fairly evaluate the SVM models.
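A sketch of this pipeline on synthetic stand-ins for the five features, with deliberately mismatched magnitudes to show why scaling matters; the RBF kernel choice is an illustrative assumption:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(3)
# Hypothetical stand-ins for duration, ratings, votes, budget, awards,
# on wildly different scales (minutes vs. dollars vs. counts).
X = rng.normal(size=(200, 5)) * np.array([100, 2, 1e5, 1e7, 3])
y = (X[:, 1] > 0).astype(int)          # synthetic target for illustration

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Without scaling, the large-magnitude features (votes, budget) would
# dominate the RBF kernel's distance computation.
scaler = StandardScaler().fit(X_train)
svm = SVC(kernel="rbf").fit(scaler.transform(X_train), y_train)
acc = svm.score(scaler.transform(X_test), y_test)
```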
For XGBoost, the same numerical features were used as for SVM, with label encoding applied after standardization to convert the movie categories into numerical classes. A stratified 80/20 train-test split was used so that both sets contained every category in proportion. This preprocessed data was then used to train a multi-class XGBoost model to predict movie success.
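A sketch of the label-encoding and stratified-split steps; the category names ("flop", "average", "hit") and class proportions are hypothetical placeholders, and the final fitting call to `xgboost.XGBClassifier` is noted in a comment rather than executed here:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))          # hypothetical standardized features
# Hypothetical string labels for movie-success categories.
labels = rng.choice(["flop", "average", "hit"], size=200, p=[0.5, 0.3, 0.2])

# Encode string categories as integer classes, as XGBoost expects.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# stratify=y keeps each class's proportion (roughly) equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# The encoded arrays would then be passed to xgboost.XGBClassifier
# with objective="multi:softmax" for multi-class prediction.
```

`encoder.inverse_transform` can map predicted integers back to the original category names after inference.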
Conclusion
For all algorithms, the same splitting strategy was employed to keep the evaluation consistent and fair. Disjoint splits prevent data leakage, give honest estimates of performance, and simulate how the model would actually perform on genuinely unseen data.