The historical hurricane database includes data collected worldwide and is available for download in subsets filtered by “basin”. The available basins are North Atlantic, Eastern North Pacific, Western North Pacific, North Indian, South Indian, Southern Pacific, and South Atlantic. Because the research focuses on Florida, the North Atlantic basin data set was selected as the preliminary data source. As a result, a non-trivial number of columns were completely empty, since they hold measurement reports from weather organizations that monitor other basins. These columns were reviewed to confirm they were not needed and then removed from the data set. Several other columns hold redundant values, including position and wind data. Among these redundant fields, those with empty values were prioritized for removal once it was verified that the complete data existed in other retained fields.
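As an illustration of the empty-column cleanup described above, the following is a minimal pandas sketch rather than the team's actual code. The frame and its column names (name, usa_wind, tokyo_wind) are hypothetical stand-ins for the North Atlantic subset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the North Atlantic subset; all column names are hypothetical.
storms = pd.DataFrame({
    "name": ["STORM_A", "STORM_B", "STORM_C"],
    "usa_wind": [35.0, 50.0, np.nan],          # partially populated, so it is kept
    "tokyo_wind": [np.nan, np.nan, np.nan],    # agency that monitors another basin
})

# List the columns in which every value is missing, so they can be reviewed first.
empty_cols = storms.columns[storms.isna().all()].tolist()

# Drop only the columns that are entirely empty.
cleaned = storms.dropna(axis="columns", how="all")
```

Listing the empty columns before dropping them mirrors the review step in the text: the list can be inspected by hand before anything is removed.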
Head of hurricane data pre-cleaning.
During preliminary data analysis, it was identified that the "basin" field abbreviates the North Atlantic as "NA". This abbreviation was subsequently interpreted as "NaN". To avoid treatment as a null value and other interpretive confusion, the field was transformed to use the abbreviation "NATL" in lieu of "NA" for North Atlantic. Additionally, as imported, the first row in the data set was a list of parameter units for the applicable fields. That row was extracted to its own data frame and then removed from the primary data frame.
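The two import issues above can be reproduced on a small example. This is a sketch under assumptions: the inline CSV and its column names (sid, basin, wind) are made up, and turning off the default missing-value parsing with keep_default_na=False is one possible way to keep "NA" as text, not necessarily the approach the team used.

```python
import io
import pandas as pd

# Hypothetical miniature of the raw file: a header, a units row, then data rows.
text = (
    "sid,basin,wind\n"
    ",,kts\n"
    "A1,NA,35\n"
    "A2,NA,50\n"
)

# With default settings, pandas parses the string "NA" as a missing value.
default = pd.read_csv(io.StringIO(text))

# Disabling the default NA strings keeps "NA" as ordinary text.
raw = pd.read_csv(io.StringIO(text), keep_default_na=False)

# Move the units row into its own frame, then drop it from the main frame.
units = raw.iloc[[0]]
data = raw.iloc[1:].reset_index(drop=True)

# Relabel the basin so it can no longer be confused with a null marker.
data["basin"] = data["basin"].replace("NA", "NATL")
```

Because the units row is text, every column in `data` is still read as a string at this point and needs a separate type conversion.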
Head of hurricane data post-cleaning.
Many of the fields holding numerical values were not imported in numerical formats. Those fields were identified by evaluating their data types and were then batch converted to the appropriate numerical format.
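A batch conversion of this kind might look like the sketch below. The column list is hypothetical, and coercing unparseable entries to NaN with errors="coerce" is an assumption about how blanks should be handled, not something the report states.

```python
import pandas as pd

# Toy frame in which numeric fields arrived as strings; names are hypothetical.
storms = pd.DataFrame({
    "name": ["STORM_A", "STORM_B"],
    "lat": ["25.1", "26.4"],
    "wind": ["35", ""],   # blank entry for a missing measurement
})

# Inspect the imported types; string columns show up with dtype "object".
imported_types = storms.dtypes

# Columns judged to be numeric after reviewing the types (assumed list).
numeric_cols = ["lat", "wind"]

# Convert all of them in one pass; values that cannot be parsed become NaN.
storms[numeric_cols] = storms[numeric_cols].apply(pd.to_numeric, errors="coerce")
```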
As this data set goes back to the year 1851, there are some portions of the data set that are not fully populated due to fewer measurements and less documentation available for older seasons. In general, the data set for recent years appears to be largely complete, so this is not a significant concern. The gas price data set begins in 2022, so the hurricane weather data was further filtered to match the years for which gas data is available.
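The year filter can be sketched as follows. The timestamp column name iso_time and the sample rows are assumptions for illustration; only the 2022 start year comes from the text.

```python
import pandas as pd

# Toy storm records; the column name "iso_time" and the rows are hypothetical.
storms = pd.DataFrame({
    "name": ["STORM_A", "STORM_B", "STORM_C"],
    "iso_time": ["1851-06-25 00:00:00", "2022-09-01 18:00:00", "2023-08-15 12:00:00"],
})
storms["iso_time"] = pd.to_datetime(storms["iso_time"])

# First year for which gas price data is available, per the text.
first_gas_year = 2022

# Keep only observations from that year onward.
recent = storms[storms["iso_time"].dt.year >= first_gas_year].reset_index(drop=True)
```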
Handling Issues / Noise in Data
When acquired, the Florida AAA gas price data had some gaps in dates for each metro; however, each row included a label with either "Today's Avg.", "Yesterday's Avg.", "Week Ago Avg.", "Month Ago Avg.", or "Year Ago Avg.". This information was leveraged to fill in missing dates. The team took "Yesterday's Avg." and "Week Ago Avg." and created new rows with their correct dates. This created some duplicate rows that were deleted.
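The label-based back-fill can be sketched as below. This is an illustrative reconstruction, not the team's code: the column names (date, metro, label, regular) and the prices are made up, and treating "Today's Avg." as a zero-day offset is an assumption.

```python
import pandas as pd

# Toy scrape: each snapshot date carries several labelled rows per metro.
scraped = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-28"] * 3 + ["2022-01-29"] * 2),
    "metro": ["Miami"] * 5,
    "label": ["Today's Avg.", "Yesterday's Avg.", "Week Ago Avg.",
              "Today's Avg.", "Yesterday's Avg."],
    "regular": [3.30, 3.29, 3.25, 3.31, 3.30],
})

# Days to subtract from the snapshot date for each label that is used.
offsets = {"Today's Avg.": 0, "Yesterday's Avg.": 1, "Week Ago Avg.": 7}

known = scraped[scraped["label"].isin(offsets)].copy()
known["date"] = known["date"] - pd.to_timedelta(known["label"].map(offsets), unit="D")

# Shifted rows can collide with rows scraped directly on that day, so keep one.
daily = (
    known.drop(columns="label")
         .drop_duplicates(subset=["date", "metro"])
         .sort_values(["metro", "date"])
         .reset_index(drop=True)
)
```

In this toy input, the "Yesterday's Avg." row from the January 29th snapshot lands on January 28th, which already has a row of its own. That is the kind of duplicate the text describes deleting.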
The gas price dataset still had some gaps in dates. Since a hurricane typically runs its course within two weeks, the team wanted the dataset to include an estimate for every day to get the most detailed data for model training. The date gaps were at most seven days, so the team felt it appropriate to use linear interpolation to fill the rest of the dataset with approximations of gas prices.
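A minimal sketch of the daily gap fill follows, assuming hypothetical column names (date, metro, regular) and made-up prices. Each metro is expanded to a full daily calendar, and the missing prices are then interpolated linearly between the nearest known days.

```python
import pandas as pd

# Toy prices with missing days; values and column names are hypothetical.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-21", "2022-01-25", "2022-01-27"]),
    "metro": ["Miami"] * 3,
    "regular": [3.20, 3.28, 3.30],
})

def fill_daily(group: pd.DataFrame) -> pd.DataFrame:
    """Expand one metro to a full daily calendar and interpolate prices."""
    g = group.set_index("date")
    full_range = pd.date_range(g.index.min(), g.index.max(), freq="D", name="date")
    g = g.reindex(full_range)             # inserted days start out as NaN
    g["metro"] = g["metro"].ffill()       # carry the metro label into new rows
    g["regular"] = g["regular"].interpolate(method="linear")
    return g.reset_index()

# Interpolate within each metro so prices never blend across metros.
filled = pd.concat(
    [fill_daily(g) for _, g in prices.groupby("metro")],
    ignore_index=True,
)
```

With evenly spaced daily rows, linear interpolation places each missing price on the straight line between its two neighbors, so the three missing days between 3.20 and 3.28 come out near 3.22, 3.24, and 3.26.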
Duplicates were evaluated and removed. There were no outliers.
Understanding Data
Data types (post-processed)
date -> date object
metro -> string
regular, mid, premium, diesel, and reg_delt -> numeric
lat, long -> numeric
Summary Statistics for the gas price data
Heat map showing correlations of different types of gas and their price.
This QQ plot shows that the percent change in regular gas prices is not normally distributed. The deviation from a straight line, with the upper quantiles curving sharply, indicates skew. This aligns with expectations, as gas and fuel price shocks are often asymmetric due to downward price rigidity: shocks to supply or demand for essential goods like gas more often result in a price increase than a decrease.
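The quantities behind a QQ plot can be computed without plotting, as in the sketch below. The percent changes are made-up values chosen to be right-skewed, and the plotting positions (i + 0.5) / n are one common convention, not necessarily the one used for the figure.

```python
from statistics import NormalDist

import pandas as pd

# Made-up daily percent changes with one large upward move (right skew).
pct = pd.Series([-0.5, -0.3, -0.2, -0.1, 0.0, 0.0, 0.1, 0.2, 0.4, 2.5])
n = len(pct)

# Sample quantiles: sorted values, standardized to mean 0 and unit variance.
sample_q = ((pct.sort_values() - pct.mean()) / pct.std()).to_numpy()

# Theoretical quantiles of a standard normal at matching plotting positions.
theory_q = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# A positive value means the upper tail is longer than the lower one.
skewness = pct.skew()

# How far the largest observation sits above the normal reference line.
upper_gap = sample_q[-1] - theory_q[-1]
```

Plotting sample_q against theory_q gives the QQ plot. A positive upper_gap corresponds to the upper quantiles curving away from the straight line.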
An excerpt of the data scrape. The image on the left is a screenshot of the Wayback Machine's snapshot of the AAA gas price website for January 28th, 2022 (arbitrarily chosen). The image above shows the corresponding .csv rows produced by the scraper, illustrating how the raw information is organized in the .csv.
The post-processed data illustrates the added attributes and the data rearrangement that were done to prepare the data for robust analysis.
Blue: The lat and lng columns were combined from the usmetros data frame (discussed in the data overview).
Orange: The reg_delt column displays the percent change in the price of regular gas from the previous day.
Green: The date column reflects the rearrangement and interpolation of data for days not captured by the Wayback Machine.
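The reg_delt attribute described above could be derived along the lines below. This is a sketch under assumptions: the prices are made up, and the report does not say whether the value is stored as a fraction or in percent, so the multiplication by 100 is a guess.

```python
import pandas as pd

# Toy daily prices for two metros; all values are hypothetical.
prices = pd.DataFrame({
    "date": pd.to_datetime(["2022-01-27", "2022-01-28"] * 2),
    "metro": ["Miami", "Miami", "Tampa", "Tampa"],
    "regular": [3.00, 3.03, 3.20, 3.20],
}).sort_values(["metro", "date"]).reset_index(drop=True)

# Day-over-day change within each metro, expressed in percent (assumed unit).
prices["reg_delt"] = prices.groupby("metro")["regular"].pct_change() * 100
```

Grouping by metro keeps the first day of one metro from being compared against the last day of another. The first row of each metro has no previous day and is left as NaN.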