Since the probability of a flight being delayed or on time was chosen as the target variable, two prediction algorithms were selected to build predictive models: decision tree and random forest. The models were built in Google Colab and RapidMiner to compare performance metrics such as accuracy and precision. We investigated performance using three train-test ratios, 70:30, 50:50, and 30:70, for both models. In the end, one of the models will be chosen for deployment.
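The ratio comparison above can be sketched in scikit-learn (an assumed stand-in for the Colab notebook; the dataset below is synthetic, not the project's flight data):

```python
# Sketch of comparing the three train/test ratios with scikit-learn.
# The synthetic data is a placeholder for the actual flight dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Placeholder data with 7 features, mirroring the 7 attributes used.
X, y = make_classification(n_samples=1000, n_features=7, random_state=42)

for test_size in (0.30, 0.50, 0.70):  # 70:30, 50:50, 30:70 splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=42)
    model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"test_size={test_size}: accuracy={acc:.3f}")
```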
Initially, we tried three predictive models: decision tree, random forest, and logistic regression. However, we found that logistic regression did not improve in accuracy after tuning, so we kept only two models. To obtain better performance, we prepared around seven differently cleaned samples of the data and picked the one with the best performance (PFBI cleaned using FILTER and replace5).
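A minimal sketch of the filter-and-replace style of cleaning, using pandas as an assumed analogue of the RapidMiner operators (the column names and values here are illustrative placeholders, not the project's actual schema):

```python
# Illustrative filter-and-replace cleaning step in pandas.
import pandas as pd

df = pd.DataFrame({
    "DELAY": [0, 1, None, 0, 1],          # target: 0 = on time, 1 = delay
    "ACF_VERSION": [8, 25, 12, None, 9],  # example numeric attribute
})

# Filter: drop rows where the target label is missing.
df = df[df["DELAY"].notna()]

# Replace: fill remaining missing numeric values with the column median.
df["ACF_VERSION"] = df["ACF_VERSION"].fillna(df["ACF_VERSION"].median())
print(df)
```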
Decision Tree
The figure below shows the steps of building the decision tree model in RapidMiner.
Running the process displays the Apply Model table and the performance vector, including accuracy and precision. The model predicts whether a flight is delayed or on time based on the seven attributes.
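The apply-model and performance-vector steps correspond roughly to the following scikit-learn sketch (an assumed analogue of the RapidMiner operators; the data is synthetic):

```python
# "Apply model" then "performance vector" (accuracy, precision) in
# scikit-learn terms, on synthetic placeholder data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score

X, y = make_classification(n_samples=600, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

pred = tree.predict(X_test)  # apply model: 0 = on time, 1 = delay
print("accuracy :", accuracy_score(y_test, pred))
print("precision:", precision_score(y_test, pred))
```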
Decision Tree
The root of the decision tree is "ASIA". If "ASIA" is 1 and "ACF_VERSION" is less than or equal to 8, the flight is probably on time.
If "ASIA" is 0, "BUSINESS_CLASS" is 1, and "ACF_VERSION" is less than or equal to 25, the flight is probably delayed.
Random Forest
The figure below shows the steps of building the random forest model in RapidMiner.
Running the process displays the Apply Model table and the performance vector, including accuracy and precision.
The model predicts a total of 2401 flights as on time (0) and 1832 flights as delayed (1).
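A random forest equivalent of the same step can be sketched as follows (scikit-learn as an assumed stand-in for RapidMiner; the predicted counts here come from synthetic data, not the report's 2401/1832):

```python
# Random forest on synthetic placeholder data, then tallying how many
# flights the model predicts as on time (0) vs delayed (1).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=7, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_train, y_train)

pred = forest.predict(X_test)
on_time, delayed = np.bincount(pred)  # counts for class 0 and class 1
print(f"predicted on time: {on_time}, predicted delayed: {delayed}")
```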
The figure below shows one of the trees in the random forest.
Hyperparameter tuning
The purpose of hyperparameter tuning is to find the combination of hyperparameters that minimizes a predefined loss function and thereby improves the model's accuracy. The figure below shows the design of hyperparameter tuning for the decision tree:
The Optimize Parameters (Grid) operator is used to find the optimal values of the selected parameters for the operators in its subprocess. For example, maximal_depth and minimal_leaf_size each use 10 steps. With three parameters selected, 242 combinations are evaluated.
The Optimize Parameters (Grid) operator is used again: maximal_depth uses 20 steps, minimal_leaf_size uses 10 steps, and criterion selects among info_gain, gini_index, and accuracy. With these parameters selected, there are 63 combinations.
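The grid search above maps roughly onto scikit-learn's GridSearchCV (an assumed analogue of RapidMiner's Optimize Parameters (Grid); parameter names differ, with max_depth, min_samples_leaf, and criterion corresponding to maximal_depth, minimal_leaf_size, and criterion, and the value ranges below are illustrative):

```python
# Grid search over decision tree hyperparameters on synthetic data.
# 11 depths x 11 leaf sizes x 2 criteria = 242 combinations.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=7, random_state=2)

param_grid = {
    "max_depth": list(range(2, 13)),         # illustrative range, 11 values
    "min_samples_leaf": list(range(1, 12)),  # illustrative range, 11 values
    "criterion": ["gini", "entropy"],        # sklearn's gini_index/info_gain
}
search = GridSearchCV(DecisionTreeClassifier(random_state=2), param_grid, cv=3)
search.fit(X, y)
print("best params:", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```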
Result after tuning
The tuned decision tree model predicts 711 rows of data as 0 (flight on time) and 569 rows as 1 (flight delayed), achieving 70.56% accuracy.
The tuned random forest model predicts 780 rows of data as 0 (flight on time) and 568 rows as 1 (flight delayed), achieving 74.31% accuracy.
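For reference, the accuracy figures come from comparing predictions against the true labels, i.e. accuracy = correct predictions / total predictions; the predicted counts alone do not determine it. A small sketch with an illustrative confusion matrix (not the report's actual matrix):

```python
# Accuracy from a confusion matrix: (TN + TP) / total.
# The cell values below are illustrative only.
import numpy as np

# Rows: true class (0 = on time, 1 = delay); columns: predicted class.
cm = np.array([[650, 120],
               [130, 380]])

accuracy = np.trace(cm) / cm.sum()  # diagonal = correct predictions
print(f"accuracy = {accuracy:.4f}")
```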
Based on the results of hyperparameter tuning in RapidMiner, the random forest model performs better than the decision tree model. The numbers of predicted delayed flights (1) are almost the same for the two models.
Click this link to view the Google Colab notebook: