Naive Bayes requires labeled data, meaning each instance in the dataset must have a known target class or category. This is essential because Naive Bayes is a supervised learning algorithm, which learns the relationship between input features and the target labels during training. The model uses these labeled examples to estimate the prior probabilities of each class and the likelihoods of features given those classes. Without labeled data, the algorithm cannot calculate the necessary probabilities to perform classification. Therefore, a properly labeled training set is critical for building an accurate and effective Naive Bayes model.
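As a minimal sketch of this idea, the following uses a hypothetical toy dataset (the data values and variable names are illustrative, not from the original text) to show that fitting a Naive Bayes model requires both the feature matrix and the labels, and that the fitted model exposes the class priors it estimated from those labels:

```python
# Sketch with hypothetical toy data: Naive Bayes needs labeled examples
# so it can estimate the class priors P(y) and feature likelihoods P(x | y).
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy labeled dataset: 6 instances, 2 features, binary target.
X = np.array([[1.0, 2.0], [1.2, 1.8], [0.9, 2.1],
              [3.0, 0.5], [3.2, 0.4], [2.9, 0.6]])
y = np.array([0, 0, 0, 1, 1, 1])  # the labels are what make this supervised

model = GaussianNB()
model.fit(X, y)  # priors and per-class feature statistics are estimated here

print(model.class_prior_)          # estimated P(y) for each class
print(model.predict([[1.1, 2.0]]))  # classify a new, unseen instance
```

Without the `y` array, the call to `fit` is impossible: there would be no classes over which to estimate priors or likelihoods.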
How the Train Test Split Was Created
To evaluate the performance of the Naive Bayes model, the dataset was divided into two parts: a training set and a testing set. This was done using the train_test_split() function from the scikit-learn library, with 70% of the data allocated for training and the remaining 30% for testing. The training set is used to teach the model the patterns within the data, while the testing set is used to measure how well the model performs on new, unseen data. The rows were shuffled randomly before splitting so that both sets reflect the overall data distribution; stratifying on the target variable additionally guarantees that the class proportions are preserved in both sets.
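The split described above can be sketched as follows (the feature matrix, labels, and random_state value are hypothetical stand-ins for the actual dataset):

```python
# Sketch of a 70/30 train/test split with hypothetical data.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)              # 10 example rows, 2 features
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,   # 30% held out for testing
    random_state=42,  # fixes the shuffle so the split is reproducible
    stratify=y,       # keeps the class ratio the same in both sets
)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```

Passing `stratify=y` is what ensures the target variable's class proportions carry over into both subsets; without it, a small dataset can end up with a skewed class balance in the test set purely by chance.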
Why the Training and Testing Data Must Be Disjoint
It is critical that the training and testing sets are completely disjoint, meaning they do not contain any overlapping rows. A disjoint split ensures that the model is evaluated on data it has never seen before, providing a realistic assessment of its ability to generalize to new inputs. If the same data were present in both sets, the model might simply memorize the answers instead of learning the underlying patterns, leading to misleadingly high accuracy during testing. This phenomenon, known as data leakage, can result in poor real-world performance. Therefore, maintaining a clean separation between training and testing data is essential for building trustworthy and robust models.
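One quick way to confirm the separation holds in practice is to check that no row index appears in both subsets. A small sketch, using hypothetical row identifiers rather than the actual dataset:

```python
# Sanity check with hypothetical row indices: after splitting, the
# intersection of the training and testing sets should be empty.
import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(100)  # stand-in identifiers for 100 rows
train_idx, test_idx = train_test_split(indices, test_size=0.30, random_state=0)

overlap = set(train_idx) & set(test_idx)  # rows present in both sets
print(len(overlap))  # 0 -> the sets are disjoint

# Every row is assigned to exactly one of the two sets.
assert len(train_idx) + len(test_idx) == len(indices)
```

An empty intersection confirms the split is disjoint: every row ends up in exactly one set, so the accuracy measured on the test set reflects genuine generalization rather than memorization.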