Once the processed data is available in Python, a Random Forest classifier is used to model it. The data is split into two disjoint sets in a 70:30 ratio (70% for training, 30% for testing). The split uses stratified sampling, meaning the class ratio is kept the same in both the training and testing sets. The parameters set for the model are listed below, followed by a sketch of this setup:
criterion: entropy and gini
number of estimators: 100
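A minimal sketch of this setup, assuming scikit-learn and a prepared DataFrame df; the column names (including the target total_weather_delay) are placeholders, not confirmed by the original text:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical feature and target names; substitute the actual columns.
features = ["origin_latitude", "origin_longitude", "origin_temperature",
            "destination_latitude", "destination_longitude",
            "destination_temperature"]
X = df[features]
y = df["total_weather_delay"]  # "extended delay" vs. "short delay"

# 70/30 split; stratify=y keeps the class ratio the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# One forest of 100 trees per splitting criterion.
models = {}
for criterion in ("entropy", "gini"):
    clf = RandomForestClassifier(n_estimators=100, criterion=criterion,
                                 random_state=42)
    clf.fit(X_train, y_train)
    models[criterion] = clf
```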
The results of the model are shown below.
Once the model is built, it is useful to visualize its feature importances. This plot shows how much weight the model assigns to each feature relative to the others.
From this graph, it is clear that the model gives the most importance to destination_temperature and origin_temperature, and the least importance to the remaining features, such as the origin and destination latitudes and longitudes.
Feature Importance of Random Forest Classifier
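A plot like the one above can be produced from the fitted forest's feature_importances_ attribute; this sketch reuses the models and features names from the earlier snippet:

```python
import matplotlib.pyplot as plt
import pandas as pd

clf = models["entropy"]  # fitted forest from the earlier sketch
importances = pd.Series(clf.feature_importances_, index=features)
importances.sort_values().plot.barh()
plt.xlabel("Importance")
plt.title("Feature Importance of Random Forest Classifier")
plt.tight_layout()
plt.show()
```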
The model we have built consists of 100 different decision trees. One of these trees, selected at random, is visualized below. It uses entropy as the splitting criterion, and the plot shows how each node splits and how the data is partitioned. In this tree, destination_longitude <= -155.079 is used as the root node, and samples are split further based on different conditions.
Random Decision Tree estimator using entropy criterion and number of estimators = 100 and maximum depth = 3
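One way to draw such a tree is to pull a random estimator out of the fitted forest and plot it with scikit-learn's plot_tree. The original text does not say whether the depth limit of 3 was applied during training or only for display, so this sketch applies it at drawing time:

```python
import random

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

tree = random.choice(models["entropy"].estimators_)  # from the earlier sketch
plt.figure(figsize=(16, 8))
plot_tree(tree, feature_names=features,
          class_names=["extended delay", "short delay"],
          max_depth=3, filled=True, fontsize=8)
plt.show()
```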
Below is another randomly selected decision tree from the model, this one using gini as the splitting criterion; again, the plot shows how each node splits and how the data is partitioned. In this tree as well, destination_longitude <= -155.079 is used as the root node, and samples are split further based on different conditions.
Random Decision Tree estimator using gini criterion and number of estimators = 100 and maximum depth = 3
This is another randomly selected decision tree from the model, visualized below. Unlike the two trees above, the maximum depth parameter is not set here, so the tree keeps growing until all points are partitioned and every leaf (terminal) node has entropy 0. In this tree, destination_temperature <= 62.483 is used as the root node, and samples are split further based on different conditions.
Whole view of Random Decision Tree estimator using entropy criterion and number of estimators = 100
Zoomed view of Random Decision Tree estimator using entropy criterion and number of estimators = 100
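A full-depth tree is usually far too large for plot_tree to render legibly, so one option is to export it to Graphviz and render it to an image file. This sketch assumes the graphviz Python package and the Graphviz binaries are installed:

```python
import graphviz
from sklearn.tree import export_graphviz

# Export one entropy tree without a depth limit; out_file=None
# returns the DOT source as a string.
dot = export_graphviz(models["entropy"].estimators_[0],
                      feature_names=features,
                      class_names=["extended delay", "short delay"],
                      filled=True)
graphviz.Source(dot).render("full_tree", format="png")
```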
The confusion matrix shows how well the classification model has performed. The columns represent the actual values of the target variable (total weather delay), and the rows represent the values predicted by the model.
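Continuing the earlier sketch, the matrix can be computed on the held-out test set. Note that scikit-learn's confusion_matrix puts actual labels on the rows and predictions on the columns, so the transpose matches the layout described above:

```python
from sklearn.metrics import confusion_matrix

clf = models["entropy"]  # fitted forest from the earlier sketch
y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred,
                      labels=["extended delay", "short delay"])
print(cm.T)  # columns = actual, rows = predicted, as described above
```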
From the confusion matrix, it is evident that the model correctly predicted 14304 instances of the extended delay class but wrongly predicted 8369 instances as extended delay when they actually belonged to the short delay class. Similarly, it correctly predicted 9892 instances of the short delay class but wrongly predicted 9704 instances as short delay when they actually belonged to the extended delay class.
Overall, the model doesn't perform well in identifying total weather delays.
The classification report summarizes how well the model performed on the classification task, reporting metrics such as accuracy, precision, recall, and F1 score.
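A sketch of producing the report with scikit-learn, reusing y_test and y_pred from the confusion-matrix snippet:

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 score, plus overall accuracy.
print(classification_report(y_test, y_pred))
```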
From this report, we can see that the accuracy of the model is just 57%, and the F1 scores of the two classes are 0.61 and 0.52, which is decent but not strong.
Overall, it can be concluded that this Random Forest model does not perform well in identifying total weather delays.