This project is submitted in partial fulfillment of the Post Graduate Program in Data Science and Engineering offered by Great Lakes Institute of Management through the Great Learning platform.
Submitted on : Feb 2022
The airline industry encompasses a wide range of businesses, called airlines, which offer air transport services to paying customers or business partners. These services are provided for both passengers and cargo, most commonly via jets, although some airlines also operate helicopters.
Flights incur a large percentage of their delays on the ground during the departure process, between the scheduled departure from the gate and takeoff. Because of the large uncertainties associated with them, these delays are difficult to predict and account for, hindering effective management of the Air Traffic Control (ATC) system. This project presents an effort to improve the accuracy of estimating the taxi-out time. The method is to identify the main factors that affect the taxi-out time and build a regression model with TAXI_OUT as the target variable.
This file contains data about flights leaving from JFK airport between Nov 2019 and Dec 2020. Taxi-out prediction is an important problem, as it helps in calculating runway time and directly impacts the cost of a flight.
The data used in this project was retrieved from Kaggle and is used to predict taxi-out time in the aviation industry. The dataset contains information on flights leaving from JFK airport between Nov 2019 and Dec 2020, and comprises 23 features and about 27,000 data points.
Renaming Columns :
For better understanding
Null Value Imputation :
2 null values were found in the 'Wind' column. They were replaced with the category of maximum frequency (W).
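As a sketch, this mode imputation can be done in pandas; the sample values below are illustrative, not from the actual dataset:

```python
import pandas as pd

# Illustrative stand-in for the Wind column; the real data has ~27,000 rows
df = pd.DataFrame({"WIND": ["W", "NW", None, "W", None, "SE", "W"]})

# Replace missing values with the most frequent category
mode_wind = df["WIND"].mode()[0]
df["WIND"] = df["WIND"].fillna(mode_wind)
```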
Data type conversion :
The feature `Dew Point` was misinterpreted as the "object" data type. It is converted into the "integer" data type.
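A minimal sketch of this conversion, assuming the column was read in as strings:

```python
import pandas as pd

# 'Dew Point' parsed as strings (object dtype); toy values for illustration
df = pd.DataFrame({"DEW_POINT": ["32", "28", "30"]})

# Convert to a numeric (integer) dtype
df["DEW_POINT"] = pd.to_numeric(df["DEW_POINT"]).astype(int)
```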
Removal of Insignificant features :
FLIGHT NO : Large number of categories
CARRIER CODE : Same average Taxi-Out time for all categories
FLT_SCH_ARRIVAL & FLT_SCH_DEPARTURE : Combined into a new feature called 'TOTAL_SCHEDULED'; the two original columns are then dropped.
ACTUAL_DEP_TIME : DEP_DELAY is calculated as
[DEP_DELAY = ACTUAL_DEP_TIME - SCHEDULED_DEPARTURE_TIME], so one of the two can be dropped.
DISTANCE : Replaced by a new feature 'AVG_SPEED' (DISTANCE / SCHEDULED_DURATION).
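The feature engineering above can be sketched as follows; the column names follow the renamed schema, but the sample values and units (minutes, miles) are assumptions for illustration:

```python
import pandas as pd

# Two toy rows; values and units are assumed for illustration only
df = pd.DataFrame({
    "FLT_SCH_ARRIVAL":          [12, 30],
    "FLT_SCH_DEPARTURE":        [15, 25],
    "ACTUAL_DEP_TIME":          [625, 910],   # minutes since midnight (assumed)
    "SCHEDULED_DEPARTURE_TIME": [615, 900],
    "DISTANCE":                 [760, 2475],  # miles (assumed)
    "SCHEDULED_DURATION":       [140, 360],   # minutes (assumed)
})

# Combine scheduled arrivals and departures into one feature
df["TOTAL_SCHEDULED"] = df["FLT_SCH_ARRIVAL"] + df["FLT_SCH_DEPARTURE"]

# Departure delay; ACTUAL_DEP_TIME then becomes redundant
df["DEP_DELAY"] = df["ACTUAL_DEP_TIME"] - df["SCHEDULED_DEPARTURE_TIME"]

# Average speed replaces raw distance
df["AVG_SPEED"] = df["DISTANCE"] / df["SCHEDULED_DURATION"]

df = df.drop(columns=["FLT_SCH_ARRIVAL", "FLT_SCH_DEPARTURE",
                      "ACTUAL_DEP_TIME", "DISTANCE"])
```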
5 features were transformed to reduce skewness and bring the distributions closer to normal.
Data points lying more than 1.5 times the IQR below the 1st quartile or above the 3rd quartile are removed.
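A minimal sketch of the 1.5 x IQR rule on a single numeric column (toy values):

```python
import pandas as pd

def remove_iqr_outliers(s: pd.Series) -> pd.Series:
    """Keep only points within 1.5 * IQR of the 1st and 3rd quartiles."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s >= q1 - 1.5 * iqr) & (s <= q3 + 1.5 * iqr)]

# Toy taxi-out values in minutes; 95 is an obvious outlier
taxi_out = pd.Series([14, 15, 16, 17, 18, 19, 20, 95])
clean = remove_iqr_outliers(taxi_out)
```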
StandardScaler has been used to scale the numeric features, so that the scaled data has mean 0 and variance 1.
MONTH : The available data spans November to January, so min-max scaling is applied:
November as 0, December as 0.5 and January as 1.
DAY_OF_MONTH & DAY_OF_WEEK : These are ordinal. In order to preserve the order in the data, min-max scaling is applied to transform the variables.
WIND_GUST : The variable has just 2 unique values, so one-hot encoding is applied.
DESTINATION, WIND & CONDITION : Frequency encoding has been done.
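The three encoding schemes above can be sketched together; the category levels shown (e.g. the WIND_GUST values) are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "MONTH":       [11, 12, 1, 12, 11],                 # Nov, Dec, Jan
    "WIND_GUST":   ["Yes", "No", "No", "Yes", "No"],    # assumed two levels
    "DESTINATION": ["LAX", "SFO", "LAX", "BOS", "LAX"], # assumed codes
})

# MONTH: min-max over the ordered months Nov -> Dec -> Jan
df["MONTH"] = df["MONTH"].map({11: 0.0, 12: 0.5, 1: 1.0})

# WIND_GUST: two levels, so a single 0/1 indicator suffices
df["WIND_GUST"] = (df["WIND_GUST"] == "Yes").astype(int)

# DESTINATION: frequency encoding (share of rows per category)
freq = df["DESTINATION"].value_counts(normalize=True)
df["DESTINATION"] = df["DESTINATION"].map(freq)
```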
The data is split into train and test sets in an 80:20 ratio.
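A sketch of the split with scikit-learn, on stand-in arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix and target; the real data has 23 features
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# 80:20 train/test split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```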
Data Type of Target Variable -
The target variable is numeric, so multiple linear regression can be used.
Multi Collinearity -
The condition number is 61.6, which indicates no serious multicollinearity among the independent variables.
Linear Relationship between Dependent and Independent Variable -
From the plots below, none of the features shows a systematic pattern against the residuals; hence we may conclude that the variables are linearly related to the dependent variable.
Autocorrelation -
Using the Durbin-Watson test, the value of the test statistic was found to be close to 2, which indicates no autocorrelation.
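The Durbin-Watson statistic is simple enough to compute directly: values near 2 indicate no autocorrelation, values toward 0 positive and toward 4 negative autocorrelation. A sketch on simulated residuals:

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); ~2 means no autocorrelation."""
    resid = np.asarray(resid)
    return float(np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2))

# White-noise residuals stand in for the model's actual residuals
rng = np.random.default_rng(0)
dw = durbin_watson(rng.standard_normal(5000))
```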
Heteroskedasticity -
If the residuals have constant variance across the range of predicted values, this is known as homoskedasticity; its absence is known as heteroskedasticity. One of the assumptions of linear regression is that heteroskedasticity is not present.
The Breusch-Pagan test is one method for detecting heteroskedasticity in the residuals.
We observe that the p-value is less than 0.05; thus, we conclude that there is heteroskedasticity present in the data.
Tests for Normality - Jarque-Bera test.
It is suited to samples with more than 5,000 rows. It uses skewness and kurtosis to assess normality: a normal distribution has skewness 0 and kurtosis 3, and a p-value > 0.05 means normality cannot be rejected.
We observe that the p-value is less than 0.05; thus, we conclude that the data is not normally distributed.
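The Jarque-Bera statistic combines sample skewness and kurtosis as JB = n/6 * (S^2 + (K - 3)^2 / 4); large values reject normality. A sketch on simulated data:

```python
import numpy as np

def jarque_bera(x):
    """JB = n/6 * (S^2 + (K - 3)^2 / 4), from sample skewness S and kurtosis K."""
    x = np.asarray(x)
    z = (x - x.mean()) / x.std()
    s = np.mean(z ** 3)          # skewness (0 for a normal distribution)
    k = np.mean(z ** 4)          # kurtosis (3 for a normal distribution)
    return x.size / 6.0 * (s ** 2 + (k - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(1)
jb_normal = jarque_bera(rng.standard_normal(10000))   # small: looks normal
jb_skewed = jarque_bera(rng.exponential(size=10000))  # large: clearly non-normal
```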
Linear Regression - Basic Model
Linear Regression - With Significant Features
Stochastic Gradient Descent (SGD)
Lasso Regularization Model
Ridge Regularization Model
Elastic Net Regularization Model
All the linear regression models perform poorly on this dataset, due to the violation of the homoskedasticity and normality assumptions of the linear model.
Lasso regression, with an R-squared value of 0.071, gives the best performance among these models, but the score does not meet deployment requirements.
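A sketch of the Lasso fit with scikit-learn; the synthetic data here (one weak predictor buried in noise) only stands in for the engineered features and deliberately yields a modest R-squared:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in: a weak signal with heavy noise
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = 2.0 * X[:, 0] + 3.0 * rng.standard_normal(1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = Lasso(alpha=0.1).fit(X_tr, y_tr)
r2 = model.score(X_te, y_te)   # modest R-squared, as with the real data
```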
Logistic Regression
KNN Classifier
Decision Tree Classifier
Random Forest Classifier
Ada Boost Classifier
Gradient Boosting Classifier
XGBoost Classifier
Voting Classifier
Stacking Classifier
The target variable (Taxi-Out) is divided into 2 classes: less than 20 minutes and 20 minutes or more, labelled 'Low' and 'High' respectively.
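The binning itself is a one-liner; note that assigning the 20-minute boundary to 'High' is an assumption, since the report does not state it:

```python
import numpy as np
import pandas as pd

taxi_out = pd.Series([8, 15, 19, 20, 27, 41])   # toy taxi-out values in minutes
label = np.where(taxi_out < 20, "Low", "High")  # 20 itself treated as 'High'
```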
Count Plot of 2 Classes
Performance of Binomial Classification (2 Classes) model
The target variable (Taxi-Out) is divided into 3 classes: less than 17 minutes, 17 to 23 minutes, and more than 23 minutes, labelled 'Low', 'Medium' and 'High' respectively.
Count Plot of 3 Classes
Performance of Multiclass Classification (3 Classes) model
The target variable (Taxi-Out) is divided into 4 classes: less than 15 minutes, 15 to 20 minutes, 20 to 25 minutes, and more than 25 minutes, labelled 'Low', 'Medium Low', 'Medium High' and 'High' respectively.
Count Plot of 4 Classes
Performance of Multiclass Classification (4 Classes) model
The performance of the classification models degrades as the number of classes increases, because each class then has less data for the machine learning model to learn from. Hence binomial classification is chosen for taxi-out prediction.
Among the different machine learning models for binomial classification, the Gradient Boosting Classifier with optimized hyperparameters (n_estimators = 250, learning_rate = 0.2, max_depth = 4) performs best.
It has a weighted F1 score of 0.68 with comparatively low bias of 0.33 (no signs of underfitting) and low variance of 0.007 (no signs of overfitting).
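A sketch of that classifier with the reported hyperparameters, fit on synthetic two-class data standing in for the Low/High labels:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic binary data in place of the real Low/High taxi-out labels
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Hyperparameters reported above
clf = GradientBoostingClassifier(n_estimators=250, learning_rate=0.2,
                                 max_depth=4, random_state=0)
clf.fit(X_tr, y_tr)
f1 = f1_score(y_te, clf.predict(X_te), average="weighted")
```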
A new approach was tried to improve the performance of the regression models. It follows a divide-and-conquer principle: divide the dataset and train a separate model for each subset of the data.
The training dataset is divided into subsets based on the target (Taxi-Out) values, and a regression model is trained separately on each subset. A classifier predicts the class of each test data point, and the corresponding regression model is then applied to predict the taxi-out time.
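The routing logic can be sketched as below, using an assumed 20-minute bin threshold and plain linear regressors as stand-ins for the actual sub-models:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic taxi-out-like data for illustration
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
y = 20 + 5 * X[:, 0] + rng.standard_normal(2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# 1. Bin the training target and fit a classifier on the bins
bins_tr = (y_tr >= 20).astype(int)            # 0 = Low, 1 = High (assumed cut)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, bins_tr)

# 2. Fit one regressor per bin on that bin's rows only
regs = {b: LinearRegression().fit(X_tr[bins_tr == b], y_tr[bins_tr == b])
        for b in (0, 1)}

# 3. Route each test point to the regressor of its predicted bin
bins_te = clf.predict(X_te)
y_pred = np.array([regs[b].predict(row.reshape(1, -1))[0]
                   for b, row in zip(bins_te, X_te)])
```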
Hybrid Model - Flow Chart
There are two approaches
Binning Approach
Clustering Approach
The dataset is divided into 2 subsets based on the target (Taxi-Out) values, and each subset is modelled separately.
Class 0 : Taxi-Out less than 20 minutes
Class 1 : Taxi-Out greater than or equal to 20 minutes
Expected Performance
Classification Accuracy : 1.00
R Square Value : 0.66
RMSE : 0.58189
MAPE : 167.42
Actual Performance
Classification Accuracy : 0.68
R Square Value : 0.06
RMSE : 1.02953
MAPE : 222.74
The performance of this hybrid modelling (separation by bins on the target) is highly dependent on the accuracy of the classification model used to select the sub-regression model.
If the classifier achieves high accuracy, the regression prediction can be improved by this method; otherwise, the performance does not increase.
Dendrogram - For forming clusters
From the dendrogram, the optimal number of clusters is 5.
Using the cophenetic distance on the y-axis, a 2-cluster division is seen to be nearly as effective as a 5-cluster division; hence both the 2-cluster and 5-cluster models are studied.
Actual Performance
Classification Accuracy : 0.97
R Square Value : 0.071
RMSE : 0.96353
MAPE : 117.67
Actual Performance
Classification Accuracy : 0.94
R Square Value : 0.083
RMSE : 0.95706
MAPE : 119.29
For learning non-linear patterns in the data.
Optimal Hyper parameter : alpha = 0.1 , hidden layer size = (550, )
Performance
R Square Value : 0.172
RMSE : 0.90441
MAPE : 165.08
Optimal Hyper parameter : alpha = 0.05 , hidden layer size = (550, )
The MLP Regressor performs much better than the other regression models on this dataset, but its R-squared score of 0.172 is still not sufficient for deployment.
The MLP Classifier, with an accuracy of 0.64, does not outperform the Gradient Boosting Classifier.
Among the regression models, the Lasso regularization model with optimized hyperparameters performs best, with R-squared values of 0.084 (train) and 0.071 (test).
However, these R-squared values are too low to qualify the regression approach; the dataset is therefore labelled as not regression-friendly.
Among the classification models with a binomial target, the Gradient Boosting Classifier performs best, with F1 scores of 0.78 (train) and 0.68 (test). The XGBoost model appears to have overfitted.
Among the classification models with a multiclass target variable, the Gradient Boosting Classifier with tuned hyperparameters performs best, with F1 scores of 0.70 (train) and 0.41 (test).
Comparing the binomial and multiclass models, a higher number of classes in the target reduces the performance of the classification models.
The performance of the hybrid modelling (separation by bins on the target) is highly dependent on the accuracy of the classification model used to select the sub-regression model.
If the classifier achieves high accuracy, the regression prediction can be improved by the hybrid method; otherwise, the performance does not increase.
Multi-Layer Perceptron (neural network) models perform better at regression, but still not well enough for the requirements.
The MLP Classifier does not increase the classification accuracy.
The classification approach with a binomial target variable is chosen as the best approach to predict the taxi-out time of flights.
Gradient Boosting classification with optimized hyperparameters performs best on the JFK airport taxi-out dataset.
Using the taxi-out runtime dataset, our objective was to predict whether the taxi-out time would be high or low, in order to optimize flight cost. It was found that features differ in importance depending on the taxi-out time, and that some features are not required for predicting it.
So we built a model to classify taxi-out as high or low. This demonstrates that machine learning algorithms, in this case the Gradient Boosting Classifier with tuned hyperparameters, are a good technique for taxi-out prediction (binomial classification: high or low).
Model development revealed that features carried different weights and importance according to the taxi-out time, so we tried to build a model that satisfies all these criteria.