Project Introduction
Overview
Over the last two decades, the popularity of air travel has increased significantly, mostly because of its speed in comparison to other modes of transportation. This has led to an increase in traffic in the air and on the ground, which in turn has resulted in massive levels of aircraft delays [1]. Flight delays cost billions of dollars and have a huge impact on the US economy, putting a strain on the air travel system, passengers and society [2].
In this Capstone project, my aim is to apply machine learning algorithms such as decision tree, random forest and logistic regression to predict flight delays. I will train the models on one dataset and test them on another to check how accurately they predict flight delays. The research question is: can my model predict which flights will be delayed?
Image Credit: Ian Cassidy [8]
Data Description
The datasets I am using were acquired from the website of the United States Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS). BTS tracks the on-time performance of domestic flights operated by large air carriers. It provides datasets that are focused on the number of on-time, delayed, canceled and diverted flights that appear in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end. The website contains data from June 2003 through November 2020 [3].
For my analysis, I will use 5 datasets. All 5 contain the same fields but represent different time slices: the month of January for each year from 2015 to 2019. I will use 4 datasets in Phases 1 and 2 for Exploratory Data Analysis and for training my models. In Phase 3, I will use the 5th dataset to test my models and check their accuracy. After concatenating the January 2015 – January 2018 datasets into one dataframe, the data for my Phase 1 and 2 analysis includes 1,935,930 rows and 51 columns. After further exploration of the data on the BTS website, I decided to use 29 of the 51 columns for my analysis [4]:
1. YEAR – Year
2. DAY_OF_MONTH – Day of the Month
3. DAY_OF_WEEK – Day of the Week
4. FL_DATE – Flight Date (yyyymmdd)
5. OP_UNIQUE_CARRIER – Reporting Airline
6. ORIGIN – Origin Airport
7. ORIGIN_WAC – Origin Airport, World Area Code
8. DEST – Destination Airport
9. DEST_WAC – Destination Airport, World Area Code
10. DEP_TIME – Actual Departure Time (local time: hhmm)
11. DEP_DELAY – Difference in minutes between scheduled and actual departure time. Early departures show negative numbers
12. DEP_DEL15 – Departure Delay Indicator, 15 Minutes or More (1=Yes)
13. TAXI_OUT – Taxi out Time, in Minutes
14. TAXI_IN – Taxi in Time, in Minutes
15. WHEELS_OFF – Wheels Off Time (local time: hhmm)
16. WHEELS_ON – Wheels on Time (local time: hhmm)
17. ARR_TIME – Actual Arrival Time (local time: hhmm)
18. ARR_DELAY – Difference in minutes between scheduled and actual arrival time. Early arrivals show negative numbers
19. ARR_DEL15 – Arrival Delay Indicator, 15 Minutes or More (1=Yes)
20. CANCELLED – Cancelled Flight Indicator (1=Yes)
21. CANCELLATION_CODE – Specifies the Reason for Cancellation
22. DIVERTED – Diverted Flight Indicator (1=Yes)
23. AIR_TIME – Flight Time, in Minutes
24. DISTANCE – Distance between airports (miles)
25. CARRIER_DELAY – Carrier Delay, in Minutes
26. WEATHER_DELAY – Weather Delay, in Minutes
27. NAS_DELAY – National Air System Delay, in Minutes
28. SECURITY_DELAY – Security Delay, in Minutes
29. LATE_AIRCRAFT_DELAY – Late Aircraft Delay, in Minutes
Image Credit: Assist [9]
By the next delivery deadline, I plan to dive deeper into the questions I have about the dataset and complete my Exploratory Data Analysis.
Implementation Details
The aim of this Capstone project is to predict flight delays using machine learning algorithms such as decision tree, random forest and logistic regression. Thus, I chose 29 of the 51 features, some of which are usually known in advance, such as day, day of the week, carrier, origin airport, destination airport, scheduled departure, departure delay, taxi-out/in, distance and scheduled arrival.
My plan is to train the models on one dataset and test on another, to check the accuracy of the models in predicting flight delays.
Data Cleaning and Exploratory Data Analysis (EDA) –
I will use Pandas, Matplotlib, NumPy and Seaborn libraries, to name a few for my initial EDA. I will also use Tableau to create a data story with interactive visuals that provide useful insights to anyone viewing them.
Machine Learning –
I will use decision tree, random forest and logistic regression to predict flight delays. To apply the algorithms, I will use the scikit-learn library in Python (imported as sklearn), which provides the classification algorithms as well as the utilities for training and testing. After training my models on the January 2015 – January 2018 dataset, I will test them on the January 2019 dataset. Finally, I can also use a confusion matrix to check the accuracy, since a confusion matrix tabulates the number of correct and incorrect classifications for each class.
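As a rough illustration of the confusion matrix idea, the sketch below uses scikit-learn's `confusion_matrix` on a handful of made-up labels. The label values are hypothetical and are not taken from the BTS data; 1 stands for a flight delayed by 15 minutes or more.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 1 = delayed by 15+ minutes, 0 = on time.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)
print(cm)
print(f"misclassified: {fp + fn}, accuracy: {accuracy:.2f}")
```

The off-diagonal cells (false positives and false negatives) are the misclassifications, so accuracy is the diagonal total divided by the number of flights.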
Image Credit: Shutter Stock [10]
Related Work
Many algorithms have been introduced to predict flight delays. Growing delays threaten the competitiveness of the U.S. in the world economy by limiting the ability of the air transport system to serve the needs of the U.S. economy. In addition to improving business performance generally, air transport impacts the economy through the jobs and revenue it directly creates in air transport-related industries, the expenditures of air travelers on auxiliary goods and services, and the secondary impacts that result as these dollars recycle throughout the economy [5].
There have been many studies in this area, and several researchers have built machine learning models to predict delays by extracting important characteristics and the most relevant features. However, due to the massive volumes of data, most of the methods are not accurate. One recent example is Yazdi, Kamel, Chabok and Kheirabadi, who used deep learning and the Levenberg-Marquart algorithm to predict flight delays. Their results showed that the proposed model had greater accuracy in forecasting flight delays on both imbalanced and balanced datasets than the previous RNN model [6]. Another example is Herbas, who used Random Forest and Deep Neural Networks and reported 86% accuracy in predicting whether a flight will be delayed, with 84% recall and 86% precision [7].
Presentation - Phase 1
References
Kuhn, N., & Jamadagni, N. (2017). Application of Machine Learning Algorithms to Predict Flight Arrival Delays: http://cs229.stanford.edu/proj2017/final-reports/5243248.pdf
Michael Ball, C. B. (2010). A Comprehensive Assessment of the Costs and Impacts of Flight Delay in the United States. The National Center of Excellence: https://cpb-us-e1.wpmucdn.com/blog.umd.edu/dist/9/604/files/2019/09/TDI_Report_Final_11_03_10.pdf
NEXTOR. (2010). Annual U.S. Impact of Flight Delays. Airline for America: https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
Protection, O. o. (2003, June). Bureau of Transportation Statistics. Retrieved from United States Department of Transportation: https://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp
Protection, O. o. (2015-2019, January). Bureau of Transportation Statistics. Retrieved from United States Department of Transportation: https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
Yazdi, M. F., Kamel, S. R., Chabok, S. J., & Kheirabadi, M. (2020). Flight delay prediction based on deep learning and Levenberg-Marquart algorithm. Journal of Big Data: https://journalofbigdata.springeropen.com/articles/10.1186/s40537-020-00380-z
Herbas, J. (2020). Using Machine Learning to Predict Flight Delays. Analytics Vidhya: https://medium.com/analytics-vidhya/using-machine-learning-to-predict-flight-delays-e8a50b0bb64c
Ian Cassidy: https://engineering.upside.com/applying-predictive-analytics-to-flight-delays-85413ca4939f
Assist: https://atassist.com/blog/data-collection-the-good-the-bad-and-the-value
Shutter Stock: https://www.shutterstock.com/search/implementation+plan
Exploratory Data Analysis & Machine Learning
Exploratory Data Analysis
- Data Loading - Data Cleaning - Data Visualizations
Data Loading
For this part, I used a Google Colaboratory (Colab) notebook and wrote the script in Python. I downloaded 4 separate csv files, which contain the same fields but represent different time slices, i.e., January 2015, January 2016, January 2017 and January 2018. After importing the libraries, the initial challenge was loading these files into Colab, as each file was approximately 135 MB or more in size. Uploading them took more than 1.5 hours every time I wanted to work on my notebook.
After loading the csv files, I concatenated all four into one dataframe, which resulted in 1,935,930 rows and 51 columns. Further exploration of the Bureau of Transportation Statistics (BTS) website led me to choose 29 of the 51 columns for my analysis.
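The loading and concatenation step can be sketched with pandas as follows. This is a minimal illustration rather than the notebook's actual code: the in-memory CSV text stands in for the downloaded files, and the file contents and the short `keep_columns` list are assumptions made for the example.

```python
import io

import pandas as pd

# Stand-ins for the monthly CSV files; in practice these would be file
# paths such as "jan_2015.csv" (hypothetical names).
csv_2015 = io.StringIO(
    "YEAR,ORIGIN,DEST,DEP_DELAY,EXTRA\n2015,ATL,DFW,5,x\n2015,ORD,ATL,-3,y\n"
)
csv_2016 = io.StringIO("YEAR,ORIGIN,DEST,DEP_DELAY,EXTRA\n2016,DFW,ORD,20,z\n")

# A small subset standing in for the 29 selected columns.
keep_columns = ["YEAR", "ORIGIN", "DEST", "DEP_DELAY"]

# usecols skips unneeded columns at read time, which lowers memory use.
frames = [pd.read_csv(src, usecols=keep_columns) for src in (csv_2015, csv_2016)]

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
flights = pd.concat(frames, ignore_index=True)
print(flights.shape)
```

Selecting the columns while reading, instead of after concatenating, is one possible way to reduce the memory pressure of large files.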
Data Cleaning
In this section, I explored and cleaned the data for later modeling. In the image above left, we can see that the majority of the columns have missing values. Simply deleting all of these rows would take away a major chunk of the data, so I decided to replace the missing values where that made sense.
Since the ‘Delay’ columns will be a major part of my analysis and a major chunk of the data in these columns is missing, I converted these missing values to 0.
It does not make sense to replace missing values for ‘Taxi in/out’ and ‘Wheels on/off’ with 0, so I replaced them with the mean values of the respective columns.
I removed the ‘Cancellation Code’ column, as 90% of its values are missing.
Lastly, I was left with the ‘Arrival’, ‘Delay’ and ‘Air-Time’ columns. It does not make sense to replace their missing values with 0, so I dropped the rows where they were missing.
Result – The image above right shows a heat map used to check whether any missing values are left. The dataset is now clean and ready for further exploration and modeling. The concatenated dataset now has 1,881,656 rows and 28 columns for my analysis. I performed the same data cleaning steps on the January 2019 file, which will be used later for testing my models.
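The cleaning steps above can be sketched in pandas roughly as follows. The toy dataframe and the grouping of columns into `delay_cols`, `mean_cols` and `drop_row_cols` are assumptions for illustration; the real dataset has more columns in each group.

```python
import numpy as np
import pandas as pd

# Toy data with the same kinds of gaps as the BTS files (values are made up).
df = pd.DataFrame({
    "CARRIER_DELAY": [np.nan, 12.0, np.nan, 0.0],
    "WEATHER_DELAY": [np.nan, 0.0, np.nan, 30.0],
    "TAXI_OUT": [10.0, np.nan, 20.0, 30.0],
    "CANCELLATION_CODE": [np.nan, np.nan, "B", np.nan],
    "ARR_DELAY": [-4.0, 12.0, np.nan, 30.0],
    "AIR_TIME": [95.0, 120.0, np.nan, 101.0],
})

delay_cols = ["CARRIER_DELAY", "WEATHER_DELAY"]  # missing means no such delay
mean_cols = ["TAXI_OUT"]                         # filled with the column mean
drop_row_cols = ["ARR_DELAY", "AIR_TIME"]        # rows with gaps are dropped

df[delay_cols] = df[delay_cols].fillna(0)
df[mean_cols] = df[mean_cols].fillna(df[mean_cols].mean())
df = df.drop(columns=["CANCELLATION_CODE"])
df = df.dropna(subset=drop_row_cols).reset_index(drop=True)

# Total count of remaining missing values across the frame.
print(df.isna().sum().sum())
```

A total of zero here plays the same role as the empty heat map: it confirms that no missing values remain.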
Data Visualization
For visualizations, I used Seaborn, a Python data visualization library based on matplotlib, to plot graphs and charts. I also used Tableau to create interactive visuals.
Figure 1
In this graph, a count plot comparing the years, we observe that 2018 had the highest number of flights with a count of 551,700, followed by 2015 with the second highest. 2016 had the lowest flight count of the four years.
Figure 2
This graph compares flight counts between the carriers, i.e., the airlines. We see that WN (Southwest) had the highest flight count in all four years, out of the 19 unique carriers in our dataset. We also see that the carriers on the extreme right of the graph only show flight counts for 2018, which can mean either that no data was recorded for the other years or that there were no flights by those airlines before 2018.
Figure 3
Figure 4
The main topic of my project is flight delays. In the pie chart above, we can see each airline's percentage of delays over 2015 – 2018. The three airlines with the largest percentage of delays are WN (Southwest), DL (Delta Airlines) and OO (SkyWest Airlines). In Figure 4, we see the count of flight delays each year. We observe an alternating pattern: from 2015 to 2016 there is a decrease in flight delays, then the delays increase in 2017 and decrease a little again in 2018.
Figure 5
Above is a pairplot showing the relationship between ‘late aircraft delay’ and the reasons for the delays. The graphs for departure and arrival delays suggest that they are positively related to late aircraft delay.
There is also a strong correlation between air-time and distance. However, the plots contain too much data to read clearly, so this needs further exploration.
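One way to put a number on a relationship like the one between air-time and distance is a Pearson correlation matrix, which is easier to read than a dense pairplot. The sketch below uses a few made-up rows, not values from the BTS files.

```python
import pandas as pd

# Made-up sample rows; the real analysis would use the cleaned dataframe
# (or a random sample of it to keep plots readable).
sample = pd.DataFrame({
    "DISTANCE": [200, 500, 750, 1000, 1500, 2500],
    "AIR_TIME": [45, 80, 110, 140, 200, 320],
    "LATE_AIRCRAFT_DELAY": [0, 35, 0, 10, 0, 5],
})

# Pairwise Pearson correlation between all numeric columns.
corr = sample.corr(method="pearson")
r_distance_airtime = corr.loc["DISTANCE", "AIR_TIME"]

print(corr.round(2))
```

A coefficient close to 1 indicates a strong positive linear relationship, while values near 0 indicate little linear relationship.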
Figure 6
Figure 7
I also created a Tableau dashboard with interactive visuals. On my dashboard, along with some plots similar to the ones I plotted in Python, you can also see the airports ranked in order of their flight count. ATL, ORD and DFW were the busiest airports, with the highest flight counts over 2015 – 2018 according to the data. We can also see the cumulative departure delay for the top 3 airports (ATL, ORD and DFW). I also plotted a comparison of departure vs arrival delay per airline. Lastly, there are graphs showing the reasons for flight delay per day, month and year.
Dataset Challenges
In the earlier Data Loading section, I explained the challenge of loading the csv files into Google Colaboratory. As each file was approximately 135 MB or more in size, loading the files into Colab took more than 1.5 hours every time I wanted to work on my notebook.
Further, concatenating the four csv files into one dataframe resulted in 1,935,930 rows. I needed this file to train my models, but because it was extremely large, it took a very long time to load, and the session crashed after using all available RAM every time I tried loading the data. Thus, I was not able to load the full file into the Colab notebook at all. For Phase 3, I plan on buying a one-month Colab subscription or trying an Anaconda Jupyter notebook.
Since I was unable to use the concatenated dataframe, I decided to cut the data in half for easier uploading. I couldn’t just slice the data down the middle, as it was sorted by year, i.e., the first 50% of the dataframe had 2015-2016 data and the latter 50% had 2017-2018 data. Therefore, I created a new dataframe from a random 50% sample of the concatenated data for my initial machine learning analysis.
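The difference between slicing the first half and drawing a random half can be sketched as follows. The toy dataframe and the `random_state` value are assumptions for illustration.

```python
import pandas as pd

# Toy frame sorted by year, like the concatenated data.
flights = pd.DataFrame({
    "YEAR": [2015] * 4 + [2016] * 4 + [2017] * 4 + [2018] * 4,
    "DEP_DEL15": [0, 1, 0, 0] * 4,
})

# Positional slicing keeps only the earliest years.
first_half = flights.iloc[: len(flights) // 2]

# Random sampling draws rows from the whole frame; random_state makes the
# draw repeatable between notebook sessions.
sampled = flights.sample(frac=0.5, random_state=42)

print(sorted(first_half["YEAR"].unique()))
print(len(sampled), sorted(sampled["YEAR"].unique()))
```

Both halves have the same size, but only the random sample can contain flights from any of the four years.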
I also faced issues while creating the Tableau dashboard. This was my first time creating a dashboard on Tableau, so I had to self-learn, research and then create the dashboard. It was a slow but rewarding process. The other challenge I faced was that the server sometimes crashed and that deleted my progress halfway, as I had not saved the sheets with the graphs at that point.
Initial Machine Learning Analysis
I trained on the random sample drawn from the concatenated file and tested on the cleaned January 2019 data. I applied two machine learning algorithms, Logistic Regression and SGD classifier. Both models produced very high accuracies: 0.98 and 0.99 respectively. This might be due to the models overfitting, because the testing data is larger than usual relative to the training data. Since I was unable to use the full concatenated 2015-2018 data and instead had to use 50% of it, the testing data is almost half the size of the training data, whereas it should be about 25% of the training data. In simple words, I need more training data for better results in the next phase of the project.
In Phase 3 of the project, I will apply further supervised learning models and neural networks, along with decision tree and random forest.
Presentation - Phase 2
About image references: All images in Phase 2, Phase 3 and the notebooks of this project delivery were created by me.
Machine Learning
Using Machine Learning (ML) algorithms, we can try to predict whether a flight will be delayed. While using different algorithms, I faced real challenges, and the accuracy I could reach depended on the data the models were fed. In this phase, I looked at different ML techniques/algorithms to try to predict whether a flight will be delayed. Along with aiming for the highest accuracy, my results focus on the top 3 airports and top 3 airlines during January 2015 – 2019.
Methodology
Data Preprocessing
In Phase 2, I explained the RAM issues I faced. After I consulted my professor, Dr. Simsek, who suggested narrowing down my data by focusing on certain airports, I decided to narrow my training data to the top 3 airports in my dataset according to my Exploratory Data Analysis (EDA), i.e., ‘ATL’, ‘ORD’ and ‘DFW’. These were the airports with the highest flight count during January 2015 – 2018. From 1.9M rows, my training data was now down to 277,372 rows.
I further narrowed down my testing data to focus on the top 3 airlines according to my EDA, i.e., ‘WN’ (Southwest), ‘DL’ (Delta) and ‘OO’ (SkyWest), to get more focused results.
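Narrowing a dataframe to a fixed set of airports or airlines can be sketched with `Series.isin`. The toy rows are made up, and filtering on the `ORIGIN` column (rather than origin or destination) is an assumption for this example.

```python
import pandas as pd

# Made-up rows using the BTS column names.
flights = pd.DataFrame({
    "ORIGIN": ["ATL", "LAX", "ORD", "DFW", "SEA", "ATL"],
    "OP_UNIQUE_CARRIER": ["DL", "WN", "UA", "AA", "OO", "WN"],
    "ARR_DEL15": [0, 1, 1, 0, 0, 1],
})

top_airports = ["ATL", "ORD", "DFW"]
top_airlines = ["WN", "DL", "OO"]

# Boolean masks keep only the rows whose value is in the given list.
train_subset = flights[flights["ORIGIN"].isin(top_airports)]
test_subset = flights[flights["OP_UNIQUE_CARRIER"].isin(top_airlines)]

print(len(train_subset), len(test_subset))
```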
Machine Learning Algorithms
Approach 1 –
My first approach was to train and test on the same dataset, i.e., the January 2015 – 2018 concatenated data. After preprocessing the data, I used a train/test split and applied 3 machine learning algorithms: Logistic Regression, GaussianNB and Random Forest. Along with their accuracies, I also plotted their precision, recall and f1 score.
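A minimal sketch of this kind of comparison is shown below. It runs on synthetic data from `make_classification` rather than the flight data, and the split ratio, class weights and hyperparameters are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced stand-in for the flight data (about 20% positives).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, weights=[0.8, 0.2], random_state=0
)

# Hold out 25% of the rows; stratify keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "GaussianNB": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }
    print(name, {metric: round(value, 2) for metric, value in scores[name].items()})
```

On imbalanced data, precision, recall and f1 say more about how well the delayed class is predicted than accuracy alone.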
Approach 2 –
In my second approach, I kept the training and testing data separate, i.e., I trained on the January 2015 – 2018 data and tested on the January 2019 data, and applied 4 machine learning algorithms: Logistic Regression, GaussianNB, Random Forest and Decision Tree.
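Training on one dataframe and testing on a separate one could look roughly like the sketch below. The `make_month` helper, the two features and the choice of `ARR_DEL15` as the target are all hypothetical; the toy data only mimics the shape of the workflow, not the BTS files.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)


def make_month(n):
    """Build a toy month of flights (hypothetical stand-in for a BTS file)."""
    distance = rng.integers(100, 2500, size=n)
    taxi_out = rng.integers(5, 40, size=n)
    # Toy rule: a long taxi-out makes a delay more likely.
    delayed = (taxi_out + rng.normal(0, 5, size=n) > 25).astype(int)
    return pd.DataFrame(
        {"DISTANCE": distance, "TAXI_OUT": taxi_out, "ARR_DEL15": delayed}
    )


train_df = make_month(400)  # stands in for January 2015 - 2018
test_df = make_month(150)   # stands in for January 2019

features = ["DISTANCE", "TAXI_OUT"]
target = "ARR_DEL15"

models = {
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression()),
    "GaussianNB": GaussianNB(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    # Fit only on the training frame, score only on the held-out frame.
    model.fit(train_df[features], train_df[target])
    accuracies[name] = model.score(test_df[features], test_df[target])
    print(f"{name}: {accuracies[name]:.2f}")
```

Using the same `features` list for both frames keeps the column order identical between training and testing.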
Approach 3 –
Now that I knew which model performed best on my dataset, in my third approach I focused the test data on the top 3 airlines and applied the best model, i.e., Logistic Regression, to predict which of the top 3 airlines has the most delays.
Approach 4 –
In my fourth approach, I applied Logistic Regression to predict delays for one airline, i.e., DL (Delta Airlines), for a particular origin, i.e., ATL (Atlanta International Airport), and a particular destination, i.e., DFW (Dallas/Fort Worth International Airport).
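Once a model has produced predictions for the test set, the per-airline view of approach 3 and the single-route view of approach 4 can both be obtained by grouping or filtering the results. The sketch below assumes a hypothetical `results` dataframe with made-up `actual` and `predicted` columns.

```python
import pandas as pd

# Hypothetical test-set results: actual and predicted delay flags per flight.
results = pd.DataFrame({
    "OP_UNIQUE_CARRIER": ["WN", "WN", "DL", "DL", "DL", "OO", "OO", "DL"],
    "ORIGIN": ["ATL", "DFW", "ATL", "ATL", "ORD", "ORD", "DFW", "ATL"],
    "DEST": ["ORD", "ATL", "DFW", "DFW", "ATL", "DFW", "ORD", "DFW"],
    "actual": [1, 0, 0, 1, 0, 1, 0, 0],
    "predicted": [1, 1, 0, 1, 0, 0, 0, 1],
})
results["correct"] = results["actual"] == results["predicted"]

# Accuracy per airline: the mean of a boolean column is the share of True.
per_airline = results.groupby("OP_UNIQUE_CARRIER")["correct"].mean()

# Accuracy for one airline on one route.
route = results[
    (results["OP_UNIQUE_CARRIER"] == "DL")
    & (results["ORIGIN"] == "ATL")
    & (results["DEST"] == "DFW")
]
route_accuracy = route["correct"].mean()

print(per_airline)
print(f"DL, ATL to DFW: {route_accuracy:.2f}")
```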
Image Credit: Techgrabyte [2]
Results
Exploratory Data Analysis Results (January 2015 – 2018)
The maximum number of flights was in 2018
WN, DL and OO were the top 3 airlines with the highest flight count each year
WN, DL and OO had the maximum delay % as well
There is a strong correlation between distance and airtime, as well as between late aircraft delay and departure/arrival delay
ATL, ORD and DFW were the busiest airports in this dataset
Approach 1 Result
My models gave the following scores in predicting flight delays when trained and tested on the same dataset. The very high scores suggest that the models are overfitting when testing and training on the same dataset.
Logistic Regression - accuracy 99.007%
GaussianNB - accuracy 0.97, precision 0.85, recall 0.81, f1 0.83
Random Forest - accuracy 1.0, precision 0.85, recall 0.81, f1 0.83
Approach 2 Result
My models are giving the following accuracy in predicting flight delays on different testing and training data.
Logistic Regression - 0.83
GaussianNB - 0.83
Random Forest - 0.75
Decision Tree - 0.78
Approach 3 Result
Since we can see above that Logistic Regression performed well on my data, in approach 3 I used it to predict delays for the top 3 airlines, again with different testing and training data. The accuracies were:
Southwest Airlines - 80.80%
Delta Airlines - 87.73%
SkyWest Airlines - 72.20%
Approach 4 Result
I tried Logistic Regression on one airline with a particular origin and a particular destination:
DL, ATL to DFW - 82.27%
Neural Networks
I applied a fully connected neural network, first with 3 dense layers and then with 5 dense layers. The deep neural network produced high accuracies in flight delay prediction, which suggests that the model is overfitting. Furthermore, application of this method drops memory space and time during the training.
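One common way to check for overfitting is to compare accuracy on the training data with accuracy on held-out data. The sketch below is not the network used in this project: it substitutes scikit-learn's `MLPClassifier` for a deep learning framework, runs on synthetic data, and assumes that "3 dense" and "5 dense" layers correspond to three and five hidden layers with made-up widths.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the flight features and delay labels.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=1
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

gaps = {}
for hidden in [(32, 16, 8), (64, 32, 16, 8, 4)]:
    # Scaling the inputs first generally helps the network converge.
    net = make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=hidden, max_iter=500, random_state=1),
    )
    net.fit(X_train, y_train)
    train_acc = net.score(X_train, y_train)
    test_acc = net.score(X_test, y_test)
    # A large positive gap means the network fits the training data much
    # better than unseen data, which is a sign of overfitting.
    gaps[len(hidden)] = train_acc - test_acc
    print(f"{len(hidden)} hidden layers: train={train_acc:.2f}, test={test_acc:.2f}")
```

High accuracy on its own does not show overfitting; the gap between training and held-out accuracy is the more direct signal.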
Conclusion
Predicting flight delays was a challenging but interesting capstone research topic. My research focused on developing, refining and comparing models in order to increase the precision and accuracy of flight delay prediction. Since on-time flights are very important, flight delay prediction models must have high precision and accuracy [1].
This project was done in three parts: project introduction, exploratory data analysis and machine learning. In Phase 3, i.e., machine learning, I experimented with 4 approaches using machine learning algorithms such as Logistic Regression, Decision Tree, Random Forest and GaussianNB.
Comparing the four models through the various approaches, I conclude that Logistic Regression gives the best accuracy for flight delays on my chosen dataset. The best I was able to achieve with the logistic regression model on separate training and testing data was 83% accuracy. After further exploration and narrowing down the data, the logistic regression model reached 87.73% accuracy in predicting Delta Airlines delays. Lastly, the same model gives 82.27% accuracy when focused on one airline with a particular origin and destination, i.e., DL from ATL to DFW.
Challenges and Later Work
Challenges
In my first approach, I trained and tested on the same dataset and received high accuracies, which probably means the models are overfitting. This can be because the models have too many terms for the number of observations.
Even the Deep Neural Network results had high accuracies, which suggests the model is overfitting. Additionally, application of this method drops memory space and time during the training. I also wasn’t able to apply a 1D CNN, as the kernel failed due to an input dimension error, and when I tried to correct it, it failed due to a memory error every time.
Later Work
The model works well with simple classification, but it starts overfitting with complex classification. Thus, in later work I will keep the model simple but focus more on data cleaning to reduce variance by taking into account fewer variables and parameters, thereby removing some of the noise in the training data. I will also use regularization techniques that penalize certain model parameters if they are likely to cause overfitting. The goal is a better model that avoids overfitting on this dataset.
I will further explore and fix the error I faced while applying the 1D CNN. I also want to research more on how artificial neural networks are being used, and can be used, to help airlines and passengers make informed decisions. There is a lot of research out there to explore and learn from.
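As one possible form of regularization, scikit-learn's `LogisticRegression` applies an L2 penalty by default, with the parameter `C` as the inverse of the regularization strength. The sketch below uses synthetic data and arbitrary `C` values to show how a stronger penalty shrinks the coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data with many uninformative features (a source of overfitting).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=4, random_state=2
)
X = StandardScaler().fit_transform(X)

norms = {}
for C in (100.0, 1.0, 0.01):
    # Smaller C means a stronger L2 penalty on the coefficients.
    clf = LogisticRegression(C=C, max_iter=2000)
    clf.fit(X, y)
    norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C}: coefficient norm {norms[C]:.3f}")
```

In practice, a value of `C` would usually be chosen by cross-validation rather than fixed by hand.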
Presentation - Phase 3
References
Yazdi, M. F., Kamel, S. R., Chabok, S. J., & Kheirabadi, M. (2020, November 26). Flight delay prediction based on deep learning and Levenberg-Marquart algorithm. Retrieved from https://link.springer.com/article/10.1186/s40537-020-00380-z
Techgrabyte: https://www.vectorstock.com/royalty-free-vector/data-processing-concept-in-circle-vector-22616168
Herbas, J. (2020, October 17). Using Machine Learning to Predict Flight Delays. Retrieved from Analytics Vidhya: https://medium.com/analytics-vidhya/using-machine-learning-to-predict-flight-delays-e8a50b0bb64c