We cleaned and prepared the dataset by removing unnecessary columns such as ID and converting categorical variables to numeric form using label encoding. A new binary target column was created to classify anxiety severity as high or low. The data was then split into training and testing sets using an 80-20 ratio. The features and labels for each split were saved separately, and we also saved a combined version of the dataset with a column indicating whether each row belongs to the training or testing set.
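The preparation steps above can be sketched as follows. This is a minimal illustration, not the exact pipeline: the column names ("ID", "Occupation", "Severity", "HighAnxiety") and the severity threshold of 5 are assumptions chosen for the example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Toy data with hypothetical column names standing in for the real dataset.
df = pd.DataFrame({
    "ID": range(1, 11),
    "Occupation": ["Student", "Engineer", "Student", "Doctor", "Engineer",
                   "Student", "Doctor", "Engineer", "Student", "Doctor"],
    "Severity": [7, 2, 8, 3, 9, 1, 6, 2, 8, 4],
})

df = df.drop(columns=["ID"])  # remove unnecessary identifier columns
df["Occupation"] = LabelEncoder().fit_transform(df["Occupation"])  # label encoding
df["HighAnxiety"] = (df["Severity"] >= 5).astype(int)  # assumed high/low cutoff

X = df.drop(columns=["HighAnxiety"])
y = df["HighAnxiety"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Save each part of the transformed data separately.
X_train.to_csv("X_train.csv", index=False)
X_test.to_csv("X_test.csv", index=False)
y_train.to_csv("y_train.csv", index=False)
y_test.to_csv("y_test.csv", index=False)

# Combined copy with a column marking each row's split membership.
combined = df.copy()
combined["split"] = ["train" if i in X_train.index else "test" for i in df.index]
combined.to_csv("dataset_with_split.csv", index=False)
```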
How the Train-Test Split Was Created
To evaluate the performance of the XGBoost model, the dataset was divided into two parts: a training set and a testing set. The train_test_split() function from scikit-learn was used, with 80% of the data allocated for training and 20% for testing. The training set lets the model learn the underlying patterns in the data, while the testing set is used to evaluate how well the model performs on new, unseen data. The split was done randomly so that both sets remain representative of the overall data distribution, especially for the target variable. We also performed hyperparameter tuning using randomized search to improve the model's accuracy.
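A sketch of this split-and-tune procedure is shown below. The data here is synthetic, the hyperparameter ranges are illustrative only, and scikit-learn's GradientBoostingClassifier stands in for xgboost's XGBClassifier so the snippet stays self-contained; the stratify=y argument is one common way to keep the target distribution similar across both sets.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the prepared features and binary target.
X, y = make_classification(n_samples=200, n_features=8,
                           weights=[0.7, 0.3], random_state=0)

# 80-20 split; stratify=y preserves the class proportions in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Randomized search over a small, illustrative hyperparameter space.
# (GradientBoostingClassifier is a stand-in here for XGBClassifier.)
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 200),
        "max_depth": randint(2, 6),
        "learning_rate": [0.01, 0.05, 0.1],
    },
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)
print("test accuracy:", round(search.score(X_test, y_test), 3))
```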
Is the Train-Test Split Different or the Same?
The train-test split used for XGBoost was 80-20: 80% of the data was allocated for training and 20% for testing. This split was chosen to give the model sufficient data for learning while leaving enough unseen data to evaluate its performance effectively. Unlike the other models, which used a 70-30 split, the 80-20 split for XGBoost provides more training data, which matters for boosting models since they tend to perform better with larger training sets.
Why the Training and Testing Data Must Be Disjoint
It is critical that the training and testing sets are completely disjoint, meaning they do not contain any overlapping rows. A disjoint split ensures that the model is evaluated on data it has never seen before, providing a realistic assessment of its ability to generalize to new inputs. If the same data were present in both sets, the model might simply memorize the answers instead of learning the underlying patterns, leading to misleadingly high accuracy during testing. This phenomenon, known as data leakage, can result in poor real-world performance. Therefore, maintaining a clean separation between training and testing data is essential for building trustworthy and robust models.
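The disjointness property described above can be checked directly: a correct split means the intersection of the two sets' row indices is empty. A minimal sketch, using a toy DataFrame in place of the real dataset:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame; the index uniquely identifies rows, so any overlap is detectable.
df = pd.DataFrame({"feature": range(100), "label": [i % 2 for i in range(100)]})
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# The two index sets must share no rows; overlap would mean data leakage.
overlap = set(train_df.index) & set(test_df.index)
assert not overlap, "data leakage: rows appear in both splits"
print("train:", len(train_df), "test:", len(test_df), "overlap:", len(overlap))
```

Because train_test_split partitions the rows rather than sampling them independently, this check passes by construction; it becomes a genuine safeguard when splits are built by hand or merged from multiple sources.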
Code & Results