Every model and methodology demands a particular data format. Naive Bayes is a supervised machine learning algorithm, so the data must be labeled. It is often applied to numeric data, but the data in this case is mixed, so the modeling is done in R, which works well for mixed data.
Initial data before processing for Naive Bayes
For Naive Bayes, both Python and R are employed. Python is used first for data preparation, ensuring the dataset is appropriately formatted. R then fits the Naive Bayes model to predict the flight delay category from the given input data. The algorithm applies Bayes' theorem, under the assumption that the features are conditionally independent given the class, to calculate the posterior probability of each class, and it predicts the class with the highest posterior.
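As a reminder of the posterior calculation mentioned above, the standard Naive Bayes decision rule can be written as follows. The symbols are introduced here for illustration and are not taken from the project: $C$ is the delay category and $x_1, \dots, x_n$ are the feature values of one flight.

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C),
\qquad
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```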
The dataset initially consisted of many columns, but the features considered relevant for analysis are origin_state, destination_state, origin_temperature, destination_temperature, and total_weather_delay. The total_weather_delay column is a derived column that sums the origin weather delay and the destination weather delay. It is then converted into a label by a function that assigns each flight to one of two categories, Short Delay or Extended Delay, based on the hours of delay. Short Delay covers delays of less than 1 hour, and Extended Delay covers delays of more than 1 hour.
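The labeling step described above could look roughly like the following. This is a minimal sketch in plain Python, not the project's actual code: the column names origin_weather_delay and destination_weather_delay, the assumption that delays are stored in hours, the treatment of a delay of exactly 1 hour as extended, and the sample rows are all assumptions made for illustration.

```python
# Assumption: delays are stored in hours; the threshold is the 1 hour cut-off.
THRESHOLD_HOURS = 1.0


def label_delay(total_delay_hours):
    """Map a total weather delay to one of the two class labels."""
    # Assumption: a delay of exactly 1 hour is treated as extended.
    if total_delay_hours < THRESHOLD_HOURS:
        return "Short Delay"
    return "Extended Delay"


def add_delay_label(row):
    """Return a copy of the row with the summed delay and its category."""
    total = row["origin_weather_delay"] + row["destination_weather_delay"]
    return {**row, "total_weather_delay": total, "delay_category": label_delay(total)}


# Made-up rows, for illustration only.
rows = [
    {"origin_state": "TX", "destination_state": "CO",
     "origin_temperature": 71.0, "destination_temperature": 45.0,
     "origin_weather_delay": 0.25, "destination_weather_delay": 0.5},
    {"origin_state": "CA", "destination_state": "WA",
     "origin_temperature": 64.5, "destination_temperature": 52.3,
     "origin_weather_delay": 0.5, "destination_weather_delay": 1.0},
    {"origin_state": "NY", "destination_state": "FL",
     "origin_temperature": 38.2, "destination_temperature": 80.1,
     "origin_weather_delay": 0.0, "destination_weather_delay": 2.0},
]

labeled = [add_delay_label(row) for row in rows]
for row in labeled:
    print(row["total_weather_delay"], row["delay_category"])
```

Keeping the threshold in a named constant makes it easy to change the cut-off if the delay columns turn out to be recorded in minutes rather than hours.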
The dataset is divided into two disjoint subsets: 70% for the training set and 30% for the testing set. The subsets must not share any rows, because evaluating the model on data it was trained on makes it appear more accurate than it really is.
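A disjoint 70/30 split of this kind can be sketched as below. The function name, the fixed seed, and the stand-in records are assumptions for illustration; the project may have performed the split differently, for example with a library helper.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows and cut it into disjoint train and test lists."""
    shuffled = list(rows)                  # copy so the input order is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    # Each row lands in exactly one of the two slices, so they cannot overlap.
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in records (row ids), for illustration only.
records = list(range(10))
train, test = split_train_test(records)

print(len(train), len(test))
print(set(train).isdisjoint(test))
```

Because both subsets are slices of the same shuffled list, no row can appear in both, which is the property the evaluation depends on.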
Final input data after processing for Naive Bayes
Sample train data after processing for Naive Bayes
Sample test data after processing for Naive Bayes
Another important check before modeling with Naive Bayes is class imbalance. The number of rows in each of the two categories was compared, and since the ratio between them is close to one, class imbalance is not a concern and the data can be used for modeling.
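One simple way to run the class balance check described above is to count the labels and compare the smaller class to the larger one. The helper name and the label counts below are made up for illustration and are not the project's actual figures.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class counts and the ratio of the rarest to the most common class."""
    counts = Counter(labels)
    ratio = min(counts.values()) / max(counts.values())
    return counts, ratio


# Made-up label column, for illustration only.
labels = ["Short Delay"] * 52 + ["Extended Delay"] * 48
counts, ratio = class_balance(labels)

print(dict(counts))
print(round(ratio, 3))
```

A ratio close to 1 indicates balanced classes; a ratio far below 1 would suggest resampling or class weighting before modeling.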
Checking the class imbalance of categories
With all these steps verified, the data can now be modeled using the Naive Bayes algorithm.