When analyzing a dataset, drawing box plots to find outliers is a necessary step. The box plot of temperatures shows that some values are outliers. Can they be removed? No, at least not in this case, because they are rare but real occurrences rather than errors. The box plot also gives the gist of how temperatures are distributed, along with the five-number summary (minimum, first quartile, median, third quartile, maximum).
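The five-number summary and the outlier flags that a box plot draws can also be computed directly. The following is a minimal sketch, assuming the conventional 1.5 x IQR whisker rule; the temperature values are made up for illustration and the function names are hypothetical, not taken from the original analysis.

```python
import numpy as np


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a 1-D sequence of numbers."""
    arr = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(arr, [25, 50, 75])
    return arr.min(), q1, median, q3, arr.max()


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    arr = np.asarray(values, dtype=float)
    _, q1, _, q3, _ = five_number_summary(arr)
    iqr = q3 - q1
    mask = (arr < q1 - k * iqr) | (arr > q3 + k * iqr)
    return arr[mask]


# Made-up temperatures in degrees Fahrenheit (illustrative only)
temps = [-12, 35, 48, 55, 60, 62, 68, 71, 75, 80, 118]

summary = five_number_summary(temps)
flagged = iqr_outliers(temps)  # flagged for inspection, not for deletion
```

Flagging the points this way makes it easy to inspect them individually before deciding, as above, that they are genuine readings and should stay in the data.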
The delays also need to be checked for outliers. Here weather_delay, arrival_delay, and departure_delay are plotted as box plots. Many outliers can be observed, and further analysis shows that some flights were delayed by up to 40 hours because of harsh weather. These are rare but genuine occurrences, so the outliers cannot be removed.
A histogram is used to assess the counts within the temperature categories. It reveals an uneven distribution across categories. Temperatures below 0°F are infrequent, probably because such temperatures are rare in the United States. The categories spanning 50-85°F are the most common, which is consistent with typical temperatures in most U.S. states falling within this range.
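The category counts behind such a histogram can be sketched with pandas. Only the "below 0", "0-32" and "50-85" ranges are mentioned in this report; the remaining bin edges, the labels, and the sample temperatures below are assumptions made for illustration.

```python
import pandas as pd

# Made-up temperatures in degrees Fahrenheit (illustrative only)
temps = pd.Series([-5, 10, 31, 45, 52, 60, 71, 84, 90, 101], name="temperature")

# Assumed bin edges; each bin is closed on the left, e.g. [0, 32)
edges = [-float("inf"), 0, 32, 50, 85, float("inf")]
labels = ["below 0", "0-32", "32-50", "50-85", "above 85"]

categories = pd.cut(temps, bins=edges, labels=labels, right=False)

# sort=False keeps the categories in temperature order instead of by count
counts = categories.value_counts(sort=False)
```

Calling `counts.plot(kind="bar")` on the result would draw the corresponding bar-style histogram, provided matplotlib is installed.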
Violin plots, unlike box plots, show the full shape of a distribution rather than only its summary statistics. Here, weather delays are plotted against temperature ranges to show how weather delays are distributed within each range.
Line plots offer a straightforward way to observe trends over time, which makes them a suitable first approach for analyzing average departure and arrival delays across months. The plot shows that arrival delays consistently exceed departure delays. The only pattern-like observation is that delays increase toward the end of the period; apart from this, there are no significant patterns or trends in delays across the months.
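The monthly averages behind such a line plot can be sketched as a group-by. The column names departure_delay and arrival_delay follow the ones used earlier in this report; the month column and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up flight records (illustrative only); "month" is an assumed column name
flights = pd.DataFrame({
    "month": [1, 1, 2, 2, 3, 3],
    "departure_delay": [10, 20, 5, 15, 30, 40],
    "arrival_delay": [12, 26, 9, 17, 33, 45],
})

# Average delay per month, one column per delay type
monthly = flights.groupby("month")[["departure_delay", "arrival_delay"]].mean()

# Positive values mean arrival delays exceed departure delays in that month
gap = monthly["arrival_delay"] - monthly["departure_delay"]
```

Checking `(gap > 0).all()` is a quick way to confirm numerically that one line sits above the other in every month, rather than reading it off the plot.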
Donut charts are used to show the proportions of the different delay types and to identify the most prevalent ones. The chart reveals that weather delays are the most frequent and security delays the least common.
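One way to obtain the proportions for such a chart is to count how many flights have a non-zero delay of each type. This is a sketch under that assumption; the report does not state whether the shares were based on occurrences or on total minutes, and the security_delay column name and all values are hypothetical.

```python
import pandas as pd

# Made-up delay minutes per flight (illustrative only)
delays = pd.DataFrame({
    "weather_delay": [30, 0, 45, 10, 0],
    "NAS_delay": [0, 15, 0, 5, 0],
    "security_delay": [0, 0, 0, 0, 8],  # assumed column name
})

# Number of flights affected by each delay type
occurrences = (delays > 0).sum()

# Share of each delay type among all recorded delay occurrences
shares = occurrences / occurrences.sum()
```

Replacing `(delays > 0).sum()` with `delays.sum()` would give shares of total delay minutes instead, which can rank the delay types differently.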
A heatmap is effective for examining correlations among numerical variables. Here, the correlation matrix of the different delay types is computed and plotted to look for relationships. The results suggest that there is no discernible correlation among any of the delays.
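The matrix shown in a correlation heatmap can be computed as below. This is a minimal sketch on made-up values; pandas' `DataFrame.corr` defaults to the Pearson coefficient, and whether the original analysis used that method is an assumption.

```python
import pandas as pd

# Made-up delay minutes (illustrative only)
delays = pd.DataFrame({
    "weather_delay": [0, 10, 20, 30, 40],
    "arrival_delay": [5, 25, 45, 65, 85],
    "departure_delay": [9, 3, 7, 1, 5],
})

# Pairwise Pearson correlation between all numeric columns
corr = delays.corr()
```

Pearson correlation measures only linear association, so values near zero rule out a linear relationship but not every kind of dependence between two delay types.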
The top 10 states were selected by ranking all states by their average weather delay, and the result was visualized as a horizontal bar graph. North Dakota has the greatest average weather delay, closely followed by Idaho. On closer examination, the top four of these ten states have nearly identical average delays.
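The selection step can be sketched as a group-by followed by a ranking. The state column name, the helper name top_states, and the values are assumptions for illustration and do not reproduce the actual figures.

```python
import pandas as pd


def top_states(df, n=10):
    """Return the n states with the highest mean weather_delay, largest first."""
    return df.groupby("state")["weather_delay"].mean().nlargest(n)


# Made-up records (illustrative only); "state" is an assumed column name
flights = pd.DataFrame({
    "state": ["ND", "ND", "ID", "ID", "TX", "TX", "CA"],
    "weather_delay": [50, 70, 55, 60, 10, 20, 5],
})

top = top_states(flights, n=2)
```

With the real data, `top_states(flights)` would return ten rows, and `.plot(kind="barh")` on the result would draw them as horizontal bars.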
A stacked bar chart was created to illustrate the proportion of different delays across temperature ranges. The graph indicates that weather delays dominate across all temperature ranges. Moreover, delays are notably higher at temperatures below 0°F. Surprisingly, NAS_delay is higher in the 0°F - 32°F range than below 0°F, which would be a good topic for further investigation.
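The proportions stacked in such a chart can be computed by summing each delay type per temperature range and normalizing each row. The sketch below assumes a temp_range column holding the category label; that name and all values are hypothetical.

```python
import pandas as pd

# Made-up delay minutes (illustrative only); "temp_range" is an assumed column
df = pd.DataFrame({
    "temp_range": ["below 0", "below 0", "0-32", "0-32"],
    "weather_delay": [60, 40, 30, 30],
    "NAS_delay": [10, 10, 25, 15],
})

# Total minutes of each delay type within each temperature range
totals = df.groupby("temp_range", sort=False)[["weather_delay", "NAS_delay"]].sum()

# Divide each row by its own sum so every temperature range adds up to 1
proportions = totals.div(totals.sum(axis=1), axis=0)
```

Comparing a single column across rows, for example `proportions["NAS_delay"]`, is a direct way to follow up on the NAS_delay difference between the two coldest ranges.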
A grouped bar chart was used to illustrate the impact of delays on different airlines. Delta Air Lines and SkyWest Airlines were the most affected by weather delays, while Hawaiian Airlines was the least affected. Various factors, such as coverage and operational considerations, may contribute to these differences among airlines.
A bubble plot, which can compare three parameters at once, was used. Latitude and longitude were plotted on the axes and weather delay was represented by bubble size, while color denoted the different U.S. states. No discernible pattern emerged that would indicate a significant relationship among these three parameters.