This project is submitted in partial fulfillment of the Post Graduate Program in Data Science and Engineering offered by Great Lakes Institute of Management through the Great Learning platform.
Submitted on : Feb 2022
The airline industry encompasses a wide range of businesses, called airlines, which offer air transport services to paying customers or business partners. These services are provided for both passengers and cargo, most commonly via jets, although some airlines also operate helicopters.
Flights incur a large percentage of their delays on the ground during the departure process, between the scheduled departure from the gate and takeoff. Because of the large uncertainties associated with them, these delays are difficult to predict and account for, hindering effective management of the Air Traffic Control (ATC) system. This project presents an effort to improve the accuracy of estimating the taxi-out time. The method is to identify the main factors that affect the taxi-out time and build a regression model with TAXI_OUT as the target variable.
This file contains data about flights leaving from JFK airport between Nov 2019 and Dec 2020. Taxi-out prediction is an important problem, as it helps in calculating runway time and directly impacts the cost of a flight.
The data used in this project was retrieved from Kaggle and is used to predict taxi-out time in the aviation industry. The dataset contains information on flights leaving from JFK airport between Nov 2019 and Dec 2020, and comprises 23 features and about 27,000 data points.
Renaming Columns :
For better understanding
Null Value Imputation :
2 null values were found in the 'Wind' column. They were replaced with the category of maximum frequency (W).
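As a sketch, this mode imputation can be done in pandas; the sample values below are illustrative, not from the actual dataset:

```python
import pandas as pd

# Illustrative stand-in for the Wind column; the real data has ~27,000 rows
df = pd.DataFrame({"WIND": ["W", "NW", None, "W", None, "SE", "W"]})

# Replace missing values with the most frequent category
mode_wind = df["WIND"].mode()[0]
df["WIND"] = df["WIND"].fillna(mode_wind)
```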
Data type conversion :
The feature `Dew Point` was misinterpreted as the "object" data type. It is converted into the "integer" data type.
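A minimal sketch of this conversion, assuming the column was read in as strings:

```python
import pandas as pd

# 'Dew Point' parsed as strings (object dtype); toy values for illustration
df = pd.DataFrame({"DEW_POINT": ["32", "28", "30"]})

# Convert to a numeric (integer) dtype
df["DEW_POINT"] = pd.to_numeric(df["DEW_POINT"]).astype(int)
```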
Removal of Insignificant features :
FLIGHT NO : Large number of categories
CARRIER CODE : Same average Taxi-Out time for all categories
FLT_SCH_ARRIVAL & FLT_SCH_DEPARTURE : Combined into a new feature called 'TOTAL_SCHEDULED'; the two original columns are then dropped.
ACTUAL_DEP_TIME : DEP_DELAY is calculated as
[DEP_DELAY = ACTUAL_DEP_TIME - SCHEDULED_DEPARTURE_TIME], so one of the two can be dropped.
DISTANCE : Replaced by a new feature 'AVG_SPEED' (DISTANCE / SCHEDULED_DURATION).
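The feature engineering above can be sketched as follows; the column names follow the renamed schema, but the sample values and units (minutes, miles) are assumptions for illustration:

```python
import pandas as pd

# Two toy rows; values and units are assumed for illustration only
df = pd.DataFrame({
    "FLT_SCH_ARRIVAL":          [12, 30],
    "FLT_SCH_DEPARTURE":        [15, 25],
    "ACTUAL_DEP_TIME":          [625, 910],   # minutes since midnight (assumed)
    "SCHEDULED_DEPARTURE_TIME": [615, 900],
    "DISTANCE":                 [760, 2475],  # miles (assumed)
    "SCHEDULED_DURATION":       [140, 360],   # minutes (assumed)
})

# Combine scheduled arrivals and departures into one feature
df["TOTAL_SCHEDULED"] = df["FLT_SCH_ARRIVAL"] + df["FLT_SCH_DEPARTURE"]

# Departure delay; ACTUAL_DEP_TIME then becomes redundant
df["DEP_DELAY"] = df["ACTUAL_DEP_TIME"] - df["SCHEDULED_DEPARTURE_TIME"]

# Average speed replaces raw distance
df["AVG_SPEED"] = df["DISTANCE"] / df["SCHEDULED_DURATION"]

df = df.drop(columns=["FLT_SCH_ARRIVAL", "FLT_SCH_DEPARTURE",
                      "ACTUAL_DEP_TIME", "DISTANCE"])
```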
5 features were transformed to reduce skewness and bring the distributions closer to normal.
Data points lying more than 1.5 times the IQR below the 1st quartile or above the 3rd quartile are removed.
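A minimal sketch of the 1.5 x IQR rule on a single numeric column (toy values):

```python
import pandas as pd

def remove_iqr_outliers(s: pd.Series) -> pd.Series:
    """Keep only points within 1.5 * IQR of the 1st and 3rd quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]

# Toy taxi-out values in minutes; 95 is an obvious outlier
taxi_out = pd.Series([14, 15, 16, 17, 18, 19, 20, 95])
clean = remove_iqr_outliers(taxi_out)
```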
StandardScaler has been used to scale the numeric features, so that the scaled data has mean 0 and variance 1.
MONTH : The available data spans November to January, so min-max scaling is applied:
November as 0, December as 0.5 and January as 1.
DAY_OF_MONTH & DAY_OF_WEEK : These are ordinal. In order to preserve the order in the data, min-max scaling is applied to transform the variables.
WIND_GUST : The variable has just 2 unique values, so one-hot encoding is applied.
DESTINATION, WIND & CONDITION : Frequency encoding has been done.
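The three encoding schemes above can be sketched together; the category levels shown (e.g. the WIND_GUST values) are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "MONTH":       [11, 12, 1, 12, 11],                 # Nov, Dec, Jan
    "WIND_GUST":   ["Yes", "No", "No", "Yes", "No"],    # assumed two levels
    "DESTINATION": ["LAX", "SFO", "LAX", "BOS", "LAX"], # assumed codes
})

# MONTH: min-max over the ordered months Nov -> Dec -> Jan
df["MONTH"] = df["MONTH"].map({11: 0.0, 12: 0.5, 1: 1.0})

# WIND_GUST: two levels, so a single 0/1 indicator suffices
df["WIND_GUST"] = (df["WIND_GUST"] == "Yes").astype(int)

# DESTINATION: frequency encoding (share of rows per category)
freq = df["DESTINATION"].value_counts(normalize=True)
df["DESTINATION"] = df["DESTINATION"].map(freq)
```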
The data is split into train and test sets in an 80:20 ratio.
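A sketch of the split with scikit-learn, on stand-in arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix and target; the real data has 23 features
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# 80:20 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```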
Data Type of Target Variable -
The target variable is numeric, so multiple linear regression can be used.
Multi Collinearity -
The condition number is 61.6, which indicates no serious multicollinearity among the independent variables.
Linear Relationship between Dependent and Independent Variable -
From the plots below, none of the features shows a systematic pattern against the residuals; hence we may conclude that the variables are linearly related to the dependent variable.
Autocorrelation -
Using the Durbin-Watson test, the value of the test statistic was found to be close to 2, which indicates no autocorrelation.
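The Durbin-Watson statistic is simple enough to compute directly: values near 2 indicate no autocorrelation, values toward 0 positive and toward 4 negative autocorrelation. A sketch on simulated residuals:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); ~2 means no autocorrelation."""
    resid = np.asarray(resid)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

# White-noise residuals stand in for the model's actual residuals
rng = np.random.default_rng(0)
dw = durbin_watson(rng.standard_normal(5000))
```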
Heteroskedasticity -
If the residuals have constant variance across the range of predicted values, this is known as homoskedasticity; its absence is known as heteroskedasticity. One of the assumptions of linear regression is that heteroskedasticity is not present.
The Breusch-Pagan test is one method for detecting heteroskedasticity in the residuals.
We observe that the p-value is less than 0.05; thus, we conclude that there is heteroskedasticity present in the data.
Tests for Normality - Jarque-Bera test.
It is suited to samples with more than 5,000 rows. It uses skewness and kurtosis to assess normality: a normal distribution has skewness 0 and kurtosis 3, and a p-value > 0.05 means normality cannot be rejected.
We observe that the p-value is less than 0.05; thus, we conclude that the data is not normally distributed.
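The Jarque-Bera statistic combines sample skewness and kurtosis as JB = n/6 * (S^2 + (K - 3)^2 / 4); large values reject normality. A sketch on simulated data:

```python
import numpy as np

def jarque_bera(x):
    """JB = n/6 * (S^2 + (K - 3)^2 / 4), from sample skewness S and kurtosis K."""
    x = np.asarray(x)
    z = (x - x.mean()) / x.std()
    s = np.mean(z ** 3)          # skewness (0 for a normal distribution)
    k = np.mean(z ** 4)          # kurtosis (3 for a normal distribution)
    return x.size / 6.0 * (s ** 2 + (k - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(1)
jb_normal = jarque_bera(rng.standard_normal(10000))   # small: looks normal
jb_skewed = jarque_bera(rng.exponential(size=10000))  # large: clearly non-normal
```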
Linear Regression - Basic Model
Linear Regression - With Significant Features
Stochastic Gradient Descent (SGD)
Lasso Regularization Model
Ridge Regularization Model
Elastic Net Regularization Model
All the linear regression models perform poorly on this dataset, due to the violation of the homoskedasticity and normality assumptions of the linear model.
Lasso regression, with an R-squared value of 0.071, gives the best performance among these models, but the score does not meet deployment requirements.
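A sketch of the Lasso fit with scikit-learn; the synthetic data here (one weak predictor buried in noise) only stands in for the engineered features and deliberately yields a modest R-squared:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in: a weak signal with heavy noise
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = 2.0 * X[:, 0] + 3.0 * rng.standard_normal(1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = Lasso(alpha=0.1).fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)   # modest R-squared, as with the real data
```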
Logistic Regression
KNN Classifier
Decision Tree Classifier
Random Forest Classifier
Ada Boost Classifier
Gradient Boosting Classifier
XGBoost Classifier
Voting Classifier
Stacking Classifier
The target variable (Taxi-Out) is divided into 2 classes: less than 20 minutes and 20 minutes or more, labelled 'Low' and 'High' respectively.
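The binning itself is a one-liner; note that assigning the 20-minute boundary to 'High' is an assumption, since the report does not state it:

```python
import numpy as np
import pandas as pd

taxi_out = pd.Series([8, 15, 19, 20, 27, 41])   # toy taxi-out values in minutes
label = np.where(taxi_out < 20, "Low", "High")  # 20 itself treated as 'High'
```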
Count Plot of 2 Classes
Performance of Binomial Classification (2 Classes) model
The target variable (Taxi-Out) is divided into 3 classes: less than 17 minutes, 17 to 23 minutes, and more than 23 minutes, labelled 'Low', 'Medium' and 'High' respectively.
Count Plot of 3 Classes
Performance of Multiclass Classification (3 Classes) model
The target variable (Taxi-Out) is divided into 4 classes: less than 15 minutes, 15 to 20 minutes, 20 to 25 minutes, and more than 25 minutes, labelled 'Low', 'Medium Low', 'Medium High' and 'High' respectively.
Count Plot of 4 Classes
Performance of Multiclass Classification (4 Classes) model
The performance of the classification models degrades as the number of classes increases, because each class then has less data for the machine learning model to learn from. Hence binomial classification is chosen for taxi-out prediction.
Among the different machine learning models for binomial classification, the Gradient Boosting Classifier with optimized hyperparameters (n_estimators = 250, learning_rate = 0.2, max_depth = 4) performs best.
It has a weighted F1 score of 0.68 with comparatively low bias of 0.33 (no signs of underfitting) and low variance of 0.007 (no signs of overfitting).
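A sketch of that classifier with the reported hyperparameters, fit on synthetic two-class data standing in for the Low/High labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary data in place of the real Low/High taxi-out labels
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Hyperparameters reported above
clf = GradientBoostingClassifier(n_estimators=250, learning_rate=0.2,
                                 max_depth=4, random_state=0)
clf.fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te), average="weighted")
```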
A new approach was tried to improve the performance of the regression models. It follows a divide-and-conquer principle: divide the dataset and train a separate model for each subset of the data.
The training dataset is divided into subsets based on the target (Taxi-Out) values, and a regression model is trained separately on each subset. A classifier predicts the class of each test data point, and the corresponding regression model is then applied to predict the taxi-out time.
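The routing logic can be sketched as below, using an assumed 20-minute bin threshold and plain linear regressors as stand-ins for the actual sub-models:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic taxi-out-like data for illustration
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
y = 20 + 5 * X[:, 0] + rng.standard_normal(2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 1. Bin the training target and fit a classifier on the bins
bins_tr = (y_tr >= 20).astype(int)            # 0 = Low, 1 = High (assumed cut)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, bins_tr)

# 2. Fit one regressor per bin on that bin's rows only
regs = {b: LinearRegression().fit(X_tr[bins_tr == b], y_tr[bins_tr == b])
        for b in (0, 1)}

# 3. Route each test point to the regressor of its predicted bin
bins_te = clf.predict(X_te)
y_pred = np.array([regs[b].predict(row.reshape(1, -1))[0]
                   for b, row in zip(bins_te, X_te)])
```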
Hybrid Model - Flow Chart
There are two approaches
Binning Approach
Clustering Approach
The dataset is divided into 2 subsets based on the target (Taxi-Out) values, and each subset is modelled separately.
Class 0 : Taxi-Out less than 20 minutes
Class 1 : Taxi-Out greater than or equal to 20 minutes
Expected Performance
Classification Accuracy : 1.00
R Square Value : 0.66
RMSE : 0.58189
MAPE : 167.42
Actual Performance
Classification Accuracy : 0.68
R Square Value : 0.06
RMSE : 1.02953
MAPE : 222.74
The performance of this hybrid modelling (separation by bins on the target) is highly dependent on the accuracy of the classification model used to select the sub-regression model.
If the classifier achieves high accuracy, the regression prediction can be improved by this method; otherwise, the performance does not increase.
Dendrogram - For forming clusters
From the dendrogram, the optimal number of clusters is 5.
Using the cophenetic distance on the y-axis, a 2-cluster division is seen to be nearly as effective as a 5-cluster division; hence both the 2-cluster and 5-cluster models are studied.
Actual Performance
Classification Accuracy : 0.97
R Square Value : 0.071
RMSE : 0.96353
MAPE : 117.67
Actual Performance
Classification Accuracy : 0.94
R Square Value : 0.083
RMSE : 0.95706
MAPE : 119.29
For learning non-linear patterns in the data.
Optimal Hyper parameter : alpha = 0.1 , hidden layer size = (550, )
Performance
R Square Value : 0.172
RMSE : 0.90441
MAPE : 165.08
Optimal Hyper parameter : alpha = 0.05 , hidden layer size = (550, )
The MLP Regressor performs much better than the other regression models on this dataset, but its R-squared score of 0.172 is still not sufficient for deployment.
The MLP Classifier, with an accuracy of 0.64, does not outperform the Gradient Boosting Classifier.
Among the regression models, the Lasso regularization model with optimized hyperparameters performs best, with R-squared values of 0.084 (train) and 0.071 (test).
However, these R-squared values are too low to qualify the regression approach; the dataset is therefore labelled as not regression-friendly.
Among the classification models with a binomial target, the Gradient Boosting Classifier performs best, with F1 scores of 0.78 (train) and 0.68 (test). The XGBoost model appears to have overfitted.
Among the classification models with a multiclass target variable, the Gradient Boosting Classifier with tuned hyperparameters performs best, with F1 scores of 0.70 (train) and 0.41 (test).
Comparing the binomial and multiclass models, a higher number of classes in the target reduces the performance of the classification models.
The performance of the hybrid modelling (separation by bins on the target) is highly dependent on the accuracy of the classification model used to select the sub-regression model.
If the classifier achieves high accuracy, the regression prediction can be improved by the hybrid method; otherwise, the performance does not increase.
Multi-Layer Perceptron (neural network) models perform better at regression, but still not well enough for the requirements.
The MLP Classifier does not increase the classification accuracy.
The classification approach with a binomial target variable is chosen as the best approach to predict the taxi-out time of flights.
Gradient Boosting classification with optimized hyperparameters performs best on the JFK airport taxi-out dataset.
Using the taxi-out runtime dataset, our objective was to predict whether the taxi-out time would be high or low, in order to optimize flight cost. It was found that features differ in importance depending on the taxi-out time, and that some features are not required for predicting it.
So we built a model to classify taxi-out as high or low. This demonstrates that machine learning algorithms, in this case the Gradient Boosting Classifier with tuned hyperparameters, are a good technique for taxi-out prediction (binomial classification: high or low).
Model development revealed that features carried different weights and importance according to the taxi-out time, so we tried to build a model that satisfies all these criteria.