Every model and methodology demands a particular data format. Decision trees are a supervised machine learning algorithm, so the data must be labeled. Many implementations also work best when the features are numeric. Because the data in this case is mixed, Random Forest was chosen, as it handles mixed data types well; another option is to use R, which also works well with mixed data.
Initial data before processing for Decision Trees
Python is used for the Decision Tree work. It first handles data preparation, ensuring the dataset is appropriately formatted, and then models the decision trees, discovers the partition rules, and creates visualizations of the trees. Given the mixed nature of the data, Random Forest was chosen for modeling: an ensemble of multiple decision trees in which the final classification is determined by the majority vote of the individual trees.
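The majority-vote idea can be illustrated with a minimal sketch. The function name `majority_vote` and the three example votes are hypothetical and only show how per-tree predictions combine into one label; this is not the modeling code used for the analysis.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees.

    Hypothetical helper for illustration. On a tie, the label that
    appears first in the input wins (Counter preserves insertion order).
    """
    counts = Counter(tree_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Illustrative votes from three trees for a single flight (made-up values).
votes = ["Short Delay", "Extended Delay", "Short Delay"]
forest_prediction = majority_vote(votes)
print(forest_prediction)  # prints "Short Delay"
```

Library implementations may combine trees differently. For example, scikit-learn's `RandomForestClassifier` weights each tree's vote by its predicted class probabilities rather than counting hard votes.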
The dataset initially consisted of many columns, but the features considered relevant for analysis are origin_latitude, origin_longitude, destination_latitude, destination_longitude, origin_temperature, destination_temperature, and total_weather_delay. The total_weather_delay column is a derived column that sums the origin weather delay and the destination weather delay. It is then converted into a label by a function that assigns one of two categories based on the hours of delay: Short Delay for delays of less than 1 hour, and Extended Delay for delays of more than 1 hour.
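The derived column and the labeling function described above could look roughly like the following sketch. Several details are assumptions, not taken from the original analysis: the delays are assumed to be recorded in minutes, the function and parameter names are hypothetical, and a delay of exactly 1 hour is assigned to Extended Delay because the text does not say which category it belongs to.

```python
def total_weather_delay(origin_delay_minutes, destination_delay_minutes):
    """Sum the origin and destination weather delays (assumed to be in minutes)."""
    return origin_delay_minutes + destination_delay_minutes


def label_delay(total_delay_minutes, threshold_minutes=60):
    """Map a total delay to one of the two categories.

    Assumption: exactly 60 minutes counts as an Extended Delay.
    """
    if total_delay_minutes < threshold_minutes:
        return "Short Delay"
    return "Extended Delay"


# Made-up example values, for illustration only.
short = label_delay(total_weather_delay(10, 25))    # 35 minutes in total
extended = label_delay(total_weather_delay(40, 50))  # 90 minutes in total
print(short, "|", extended)  # prints "Short Delay | Extended Delay"
```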
The dataset is divided into two disjoint subsets: 70% for the training set and 30% for the testing set. The subsets must not share any rows, because evaluating the model on data it was trained on makes it appear more accurate than it really is.
Final input data after processing for Decision Trees
Sample train data after processing for Decision Trees
Sample test data after processing for Decision Trees
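A 70/30 disjoint split can be sketched with the standard library alone. The helper `split_train_test`, the seed, and the ten placeholder rows are assumptions for illustration; the original analysis may have used a library routine instead.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle row indices and cut them into disjoint train and test sets.

    Hypothetical helper: every row lands in exactly one of the two subsets.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed for a reproducible split
    n_test = round(len(rows) * test_fraction)
    test_idx = set(indices[:n_test])
    train = [row for i, row in enumerate(rows) if i not in test_idx]
    test = [row for i, row in enumerate(rows) if i in test_idx]
    return train, test


# Ten placeholder rows identified only by a made-up row_id.
rows = [{"row_id": i} for i in range(10)]
train, test = split_train_test(rows)

train_ids = {row["row_id"] for row in train}
test_ids = {row["row_id"] for row in test}
print(len(train), len(test), train_ids.isdisjoint(test_ids))  # prints "7 3 True"
```

In practice, scikit-learn's `train_test_split` with `test_size=0.3` is a common way to do the same thing, and its `stratify` argument keeps the class proportions equal in both subsets.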
Another important check before modeling decision trees is class imbalance. The counts of the two categories were compared, and since their ratio is close to one, class imbalance is not a concern and the data can be used for modeling.
Checking the class imbalance of categories
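The class balance check amounts to counting the labels and comparing the counts. The sketch below uses a hypothetical helper `class_ratio` and made-up label counts; they do not reproduce the actual figures from the dataset.

```python
from collections import Counter


def class_ratio(labels):
    """Return per-class counts and the minority-to-majority ratio.

    A ratio near 1.0 means the classes are roughly balanced.
    """
    counts = Counter(labels)
    ratio = min(counts.values()) / max(counts.values())
    return counts, ratio


# Made-up labels for illustration: 52 short delays and 48 extended delays.
labels = ["Short Delay"] * 52 + ["Extended Delay"] * 48
counts, ratio = class_ratio(labels)
print(dict(counts), round(ratio, 2))
```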
Once all these steps are verified, the data can be modeled using the Decision Tree algorithm.