How the Train-Test Split Was Created
To evaluate the performance of the decision tree, the dataset was divided into two parts: a training set and a testing set. This was done using the train_test_split() function from the scikit-learn library, with 70% of the data allocated for training and the remaining 30% for testing. The training set is used to fit the model, while the testing set is used to measure how well the model performs on new, unseen data. Rows were assigned to the two sets at random so that both are likely to reflect the overall data distribution, including the distribution of the target variable. A purely random split does not guarantee matching class proportions; train_test_split() can enforce them through its stratify parameter.
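The split described above can be sketched as follows. This is a minimal illustration, not the original code: the iris dataset stands in for the real data, and the random_state value and the use of stratify are assumptions added for reproducibility and balanced classes.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: iris is a stand-in for the dataset used in this report.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the rows for testing; the remaining 70% are used for training.
# random_state fixes the shuffle so the split is reproducible (assumed value).
# stratify=y keeps the class proportions the same in both sets (assumed option).
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.30,
    random_state=42,
    stratify=y,
)

# Fit the decision tree on the training rows only.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Accuracy on rows the model did not see during fitting.
test_accuracy = model.score(X_test, y_test)

print(f"training rows: {len(X_train)}, testing rows: {len(X_test)}")
print(f"test accuracy: {test_accuracy:.3f}")
```

With the 150-row stand-in dataset, this yields 105 training rows and 45 testing rows. Omitting stratify gives a plain random split, in which the class proportions of the two sets may differ slightly.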
Why the Training and Testing Data Must Be Disjoint
It is critical that the training and testing sets are disjoint, meaning that no row appears in both. A disjoint split ensures that the model is evaluated on data it has never seen before, providing a realistic assessment of its ability to generalize to new inputs. If the same rows were present in both sets, the model could score well simply by memorizing them rather than learning the underlying patterns, producing misleadingly high test accuracy. This overlap is a form of data leakage, and a model judged on such inflated scores can perform poorly in real-world use. Therefore, maintaining a clean separation between training and testing data is essential for building trustworthy and robust models.
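One way to check the disjointness requirement is to compare row identifiers from the two sets. The sketch below uses only the standard library. The helper name overlapping_ids and the identifier lists are hypothetical and serve only as illustration.

```python
def overlapping_ids(train_ids, test_ids):
    """Return the sorted row identifiers that appear in both splits."""
    return sorted(set(train_ids) & set(test_ids))


# Hypothetical row identifiers for a 10-row dataset split 70/30.
train_ids = [0, 1, 2, 3, 4, 5, 6]
test_ids = [7, 8, 9]

# A faulty split in which rows 5 and 6 were also placed in the test set.
leaky_test_ids = [5, 6, 9]

clean_overlap = overlapping_ids(train_ids, test_ids)
leaky_overlap = overlapping_ids(train_ids, leaky_test_ids)

print(clean_overlap)  # [] : the sets are disjoint
print(leaky_overlap)  # [5, 6] : these rows leak from training into testing
```

An empty result means that no row identifier is shared. Note that train_test_split() assigns each row position to exactly one set, so overlap of this kind usually comes from duplicate records in the source data or from manual splitting, which a check on unique record identifiers can detect.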