The notebook included 4 datasets, one for each year of flight data from 2019 to 2022
There is a dataset for 2018, but it is not included in this project due to significant missing data
~29 million flights reported total (that's a lot of data!)
A subset of columns from the 4 datasets was concatenated into one dataframe
FlightDate: The date of the flight
Airline: The name of the airline operating the flight
Origin: The code or name of the origin airport (IATA 3-letter airport code, e.g., LAX)
Dest: The code or name of the destination airport
DepDelayMinutes: The delay in departure time in minutes
DepDel15: Whether the departure was delayed by 15 minutes or more (1 for yes, 0 for no)
DayOfWeek: The day of the week of the flight (1-7 corresponds to Monday-Sunday, 9 for unknown)
Cancelled: Whether the flight was cancelled (True for yes, False for no)
CarrierDelay: Carrier Delay, in Minutes
WeatherDelay: Weather Delay, in Minutes
NASDelay: National Airspace System Delay, in Minutes
SecurityDelay: Security Delay, in Minutes
LateAircraftDelay: Late Aircraft Delay, in Minutes
TextDayOfWeek: Day of week of the flight (strings)
DepAirportGeoID: GeoID for Departing Airport (for geospatial visualization)
ArrAirportGeoID: GeoID for Arriving Airport
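The concatenation step described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the notebook's actual code: the helper name `load_subset`, the shortened `KEEP` list, the extra `Tail_Number` column, and the in-memory sample rows are all assumptions made for the example.

```python
import csv
import io

# Columns to keep; a shortened stand-in for the full list above.
KEEP = ["FlightDate", "Airline", "Origin", "Dest", "DepDel15", "Cancelled"]

def load_subset(sources, columns):
    """Read each CSV source and return one combined list of rows,
    keeping only the requested columns."""
    combined = []
    for source in sources:
        for row in csv.DictReader(source):
            combined.append({col: row[col] for col in columns})
    return combined

# Two tiny in-memory "yearly files" with made-up rows stand in for the real data.
y2019 = io.StringIO(
    "FlightDate,Airline,Origin,Dest,DepDel15,Cancelled,Tail_Number\n"
    "2019-01-01,Example Air,LAX,JFK,1,False,N123\n"
)
y2020 = io.StringIO(
    "FlightDate,Airline,Origin,Dest,DepDel15,Cancelled,Tail_Number\n"
    "2020-01-01,Example Air,JFK,LAX,0,False,N456\n"
)

flights = load_subset([y2019, y2020], KEEP)
```

Dropping unused columns while reading, rather than after combining, keeps memory use lower, which matters at roughly 29 million rows.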
This data was pulled from the United States Bureau of Transportation Statistics, and reporting carriers (airlines) are either required to or can voluntarily report data. It is unclear which airlines are subject to required reporting.
Of all scheduled flights were cancelled.
Of all scheduled flights were delayed by 15 minutes or more.
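The two shares above (cancelled flights and flights delayed by 15 minutes or more) are both simple proportions of all scheduled flights. A minimal sketch, assuming each flight is a dictionary with the `Cancelled` and `DepDel15` columns described earlier; the helper `share_true` and the sample rows are hypothetical.

```python
def share_true(rows, column):
    """Fraction of rows whose flag in `column` is set (1 or True)."""
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if row[column] in (1, True))
    return hits / len(rows)

# Made-up sample: four scheduled flights, one cancelled, two delayed.
# Assumption: a cancelled flight has no departure-delay flag (None).
sample = [
    {"Cancelled": False, "DepDel15": 1},
    {"Cancelled": False, "DepDel15": 1},
    {"Cancelled": False, "DepDel15": 0},
    {"Cancelled": True, "DepDel15": None},
]

cancelled_share = share_true(sample, "Cancelled")
delayed_share = share_true(sample, "DepDel15")
```

Here the denominator is every scheduled flight, including cancelled ones. Dividing by flights that actually departed instead would give a slightly higher delay share.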
Attention passengers of flight DSCI304, this is your captain speaking...
Welcome aboard my flight data exploration project! Whether traveling for business or pleasure, finding out your flight has been delayed is at best annoying and at worst destructive to future plans or travel arrangements. Airlines, airports, and passengers alike would benefit from a delay-free travel experience, and understanding the data could help achieve that goal. From comparing causes of delays to visualizing delay-prone airport locations, join me on this data-driven journey!
Cape Cod Gateway Airport in Massachusetts has the highest share of delayed departures (a delayed departure being one that leaves 15 minutes or more late), while Lewiston–Nez Perce County Airport in Idaho has the lowest.
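A ranking like this one comes from dividing each airport's delayed departures by its total departures. The sketch below shows one way to do that; the function name `delay_rate_by` and the airport codes `AAA` and `BBB` are placeholders, not values from the dataset.

```python
from collections import defaultdict

def delay_rate_by(rows, key):
    """Delayed departures divided by total departures, per value of `key`."""
    departures = defaultdict(int)
    delayed = defaultdict(int)
    for row in rows:
        departures[row[key]] += 1
        if row["DepDel15"] == 1:
            delayed[row[key]] += 1
    return {group: delayed[group] / departures[group] for group in departures}

# Made-up sample: AAA has 2 of 3 departures delayed, BBB has 1 of 4.
sample = [
    {"Origin": "AAA", "DepDel15": 1},
    {"Origin": "AAA", "DepDel15": 1},
    {"Origin": "AAA", "DepDel15": 0},
    {"Origin": "BBB", "DepDel15": 0},
    {"Origin": "BBB", "DepDel15": 0},
    {"Origin": "BBB", "DepDel15": 0},
    {"Origin": "BBB", "DepDel15": 1},
]

rates = delay_rate_by(sample, "Origin")
worst = max(rates, key=rates.get)  # highest delay rate
best = min(rates, key=rates.get)   # lowest delay rate
```

Airports with very few departures can land at either extreme of such a ranking by chance, so a minimum-departures threshold may be worth considering when reading the result.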
Time Analysis: Delays vary by the day of the week. While these differences are influenced by the total number of departures in the dataset per day, the normalized graph shows a similar pattern.
Friday has the highest cumulative number of delays over the period of 2019-2022, so it seems to be the worst day for delays. On the other hand, Tuesday appears to be the best day for on-time flights!
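To illustrate the difference between the cumulative and the normalized view, the sketch below reports both the raw delay count and the delay rate for each day. It assumes the `DayOfWeek` coding described earlier (1 to 7 for Monday to Sunday); the function name and sample rows are made up.

```python
from collections import Counter

DAY_NAMES = {1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
             5: "Friday", 6: "Saturday", 7: "Sunday"}

def delays_by_day(rows):
    """Map each day name to (delay count, delays per departure)."""
    total = Counter(row["DayOfWeek"] for row in rows)
    delayed = Counter(row["DayOfWeek"] for row in rows if row["DepDel15"] == 1)
    return {
        DAY_NAMES.get(day, "Unknown"): (delayed[day], delayed[day] / total[day])
        for day in sorted(total)
    }

# Made-up sample: Friday has 4 delays in 10 departures, Tuesday 1 in 5.
sample = (
    [{"DayOfWeek": 5, "DepDel15": 1}] * 4
    + [{"DayOfWeek": 5, "DepDel15": 0}] * 6
    + [{"DayOfWeek": 2, "DepDel15": 1}] * 1
    + [{"DayOfWeek": 2, "DepDel15": 0}] * 4
)

by_day = delays_by_day(sample)
```

In this toy data Friday leads on both the raw count and the rate, which mirrors the situation described above where normalizing does not change the pattern.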
This data is normalized to number of delays per airline, so small and large airlines can be compared fairly.
Weather Delay
Carrier Delay
NAS Delay
Late Aircraft Delay
Late arrival of the aircraft was the leading cause of delays, with National Airspace System delays (such as heavy traffic volume) and Carrier Delays (such as aircraft cleaning, fueling, or cargo loading) following closely.
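One way to compare causes is to total the minutes in each delay column and express each as a share of all delay minutes. This is only a sketch under that assumption: the original comparison may instead have counted delayed flights per cause, and the function name and sample values here are invented.

```python
CAUSES = ["CarrierDelay", "WeatherDelay", "NASDelay",
          "SecurityDelay", "LateAircraftDelay"]

def cause_shares(rows):
    """Share of total delay minutes attributed to each cause column."""
    totals = {cause: 0.0 for cause in CAUSES}
    for row in rows:
        for cause in CAUSES:
            totals[cause] += row.get(cause) or 0.0  # missing or None counts as 0
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {cause: 0.0 for cause in CAUSES}
    return {cause: minutes / grand_total for cause, minutes in totals.items()}

# Made-up sample: 60 delay minutes in total across three flights.
sample = [
    {"LateAircraftDelay": 30, "CarrierDelay": 10},
    {"NASDelay": 20, "WeatherDelay": None},
    {},  # an on-time flight with no cause minutes recorded
]

shares = cause_shares(sample)
leading_cause = max(shares, key=shares.get)
```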
A significant dip in flights per day can be seen in early 2020, corresponding with the beginning of the pandemic. As of 2022, total flights per day had not returned to pre-COVID-19 levels (as defined by average flights per day in 2019).
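The comparison against the 2019 baseline can be sketched by counting flights on each date and averaging those daily counts within each year. The sketch assumes `FlightDate` is an ISO-style string such as `2019-01-01`; the function name and sample rows are hypothetical.

```python
from collections import Counter

def avg_flights_per_day_by_year(rows):
    """Average number of flights per day, keyed by year ("YYYY")."""
    per_day = Counter(row["FlightDate"] for row in rows)  # flights on each date
    flights = Counter()
    days = Counter()
    for date, count in per_day.items():
        year = date[:4]
        flights[year] += count
        days[year] += 1  # only dates with at least one flight are counted
    return {year: flights[year] / days[year] for year in flights}

# Made-up sample: 2019 averages 3 flights per day, 2022 averages 2.
sample = (
    [{"FlightDate": "2019-01-01"}] * 4
    + [{"FlightDate": "2019-01-02"}] * 2
    + [{"FlightDate": "2022-01-01"}] * 2
    + [{"FlightDate": "2022-01-02"}] * 2
)

averages = avg_flights_per_day_by_year(sample)
recovered = averages["2022"] >= averages["2019"]
```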
Now that we have a better understanding of delays (their causes, problem airports and airlines, and busy times for flying), I would like to see if these factors can predict whether a flight will be delayed. However, the cause-of-delay exploration showed that NAS and Carrier Delays, which make up 70% of all delays, have little to do with the information contained in the dataset I used.
Overall, I think that this information could be useful to a customer when booking a flight, as they could choose an optimal day of the week, airline, and possibly airport in order to try to avoid delays.
Flight Status Prediction from Kaggle
US Shapefile from Maps package in R
Airport GeoIDs from Data.Humdata.org