This is a countrywide dataset of car accidents in the United States, covering 49 states. The accident data were collected from February 2016 to December 2021 using multiple APIs that provide streaming traffic incident (or event) data. These APIs broadcast traffic data captured by various entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road networks. The dataset currently contains about 2.8 million accident records (link to Data Source).
The first step of the analysis is preprocessing, which prepares the data so that the later stages of the study produce more reliable results.
First is data sampling. As mentioned previously, the dataset is large (about 2.8 million records), so working with a sample is more convenient. A 20% sample was selected for the analysis.
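The sampling step can be sketched as follows. This is a minimal illustration, not the procedure actually used: the source does not say how the 20% was drawn, so simple random sampling without replacement is assumed, and the function name `sample_fraction`, the seed, and the toy `accidents` list are hypothetical.

```python
import random

def sample_fraction(records, fraction=0.2, seed=42):
    """Return a simple random sample holding `fraction` of the records."""
    rng = random.Random(seed)            # fixed seed keeps the sample reproducible
    k = round(len(records) * fraction)   # number of rows to keep
    return rng.sample(records, k)        # sampling without replacement

# Hypothetical stand-in for the accident table: 1,000 toy rows.
accidents = [{"ID": f"A-{i}"} for i in range(1000)]
sample = sample_fraction(accidents)
print(len(sample))  # 200
```

Fixing the seed means the same 20% is drawn on every run, which keeps later results comparable.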
Second is data cleaning. The dataset contains missing values, and handling them is necessary for a reliable analysis. Only three attributes (number, precipitation, and wind chill) have a high share of NAs, so they were removed.
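One way to find and drop attributes with a high share of missing values is sketched below. The 50% threshold, the helper names, and the toy rows are assumptions made for illustration. The source does not state which cutoff counted as "high".

```python
def missing_share(rows, column):
    """Fraction of rows in which `column` is missing (None or empty string)."""
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return missing / len(rows)

def drop_sparse_columns(rows, threshold=0.5):
    """Drop every column whose missing share exceeds `threshold` (assumed cutoff)."""
    columns = list(rows[0].keys())
    sparse = [c for c in columns if missing_share(rows, c) > threshold]
    cleaned = [{k: v for k, v in row.items() if k not in sparse} for row in rows]
    return cleaned, sparse

# Hypothetical rows; column names are illustrative only.
rows = [
    {"Severity": 2, "Number": None, "Wind_Chill": None, "City": "Dayton"},
    {"Severity": 3, "Number": None, "Wind_Chill": 36.0, "City": "Austin"},
    {"Severity": 2, "Number": 12, "Wind_Chill": None, "City": "Miami"},
    {"Severity": 1, "Number": None, "Wind_Chill": None, "City": None},
]
cleaned, dropped = drop_sparse_columns(rows)
print(dropped)  # ['Number', 'Wind_Chill']
```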
Third is data selection. Dropping uninformative attributes keeps the analysis focused on the variables that matter. The ID variable only represents the record sequence, so it is not worth keeping. Longitude and latitude represent the geographic location, but the city and state are already available, so they are not essential. Zip code does not add anything relevant. Country is uninformative because all accidents occurred in the United States. Airport code likewise does not add anything relevant, and the weather timestamp duplicates the start time of the accident. All of these variables were therefore omitted.
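The selection step amounts to removing a fixed set of columns from each record. A minimal sketch follows; the column names in `DROPPED` are hypothetical spellings of the attributes named above and may not match the names in the actual file.

```python
# Assumed column names for the omitted attributes (illustrative only).
DROPPED = {
    "ID", "Latitude", "Longitude", "Zipcode",
    "Country", "Airport_Code", "Weather_Timestamp",
}

def select_columns(row, dropped=DROPPED):
    """Return a copy of `row` without the columns listed in `dropped`."""
    return {key: value for key, value in row.items() if key not in dropped}

# Hypothetical record.
row = {
    "ID": "A-1", "City": "Dayton", "State": "OH", "Country": "US",
    "Zipcode": "45424", "Start_Time": "2016-02-08 05:46:00",
}
print(select_columns(row))
# {'City': 'Dayton', 'State': 'OH', 'Start_Time': '2016-02-08 05:46:00'}
```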
Fourth is data transformation. The nominal attributes (severity, city, state, weather condition, sunrise-sunset, side, and street) were stored as character strings, so they were converted to factors, which makes the distribution of each attribute easier to examine. The start time of the accident was also converted to a date format.
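Both conversions can be illustrated with a small sketch. The helper `to_factor` is a hypothetical stand-in for a factor type: it stores the distinct levels once and replaces each string with an integer code. The timestamp format is also an assumption about how the start time is written.

```python
from datetime import datetime

def to_factor(values):
    """Encode strings as integer codes plus a sorted list of levels."""
    levels = sorted(set(values))
    index = {level: code for code, level in enumerate(levels)}
    return [index[v] for v in values], levels

# Hypothetical state column.
states = ["OH", "CA", "OH", "TX"]
codes, levels = to_factor(states)
print(levels)  # ['CA', 'OH', 'TX']
print(codes)   # [1, 0, 1, 2]

# Assumed timestamp layout: "YYYY-MM-DD HH:MM:SS".
start = datetime.strptime("2016-02-08 05:46:00", "%Y-%m-%d %H:%M:%S")
print(start.date())  # 2016-02-08
```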
Fifth is feature engineering, one of the most important steps, because new attributes are created from existing ones, which provides more insight into the data and supports more decisive conclusions. The first new column is Road type, with a value of City or Highway, extracted from the street column. In addition, Year, Month, Day, and Hour were extracted from the start time column.
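The derived columns can be sketched as below. The source does not give the rule used to separate City from Highway, so the keyword pattern here is purely an assumption, as are the function names, the timestamp format, and the reading of "day" as day of the month.

```python
import re
from datetime import datetime

# Assumed highway markers in a street name; the actual rule is not stated.
HIGHWAY_PATTERN = re.compile(
    r"\b(I-\d+|US-\d+|Hwy|Highway|Fwy|Freeway|Expy|Route)\b", re.IGNORECASE
)

def road_type(street):
    """Classify a street name as 'Highway' or 'City' by keyword match."""
    return "Highway" if HIGHWAY_PATTERN.search(street) else "City"

def time_parts(start_time):
    """Split a 'YYYY-MM-DD HH:MM:SS' start time into calendar features."""
    ts = datetime.strptime(start_time, "%Y-%m-%d %H:%M:%S")
    return {"Year": ts.year, "Month": ts.month, "Day": ts.day, "Hour": ts.hour}

print(road_type("I-70 E"))    # Highway
print(road_type("Brice Rd"))  # City
print(time_parts("2016-02-08 05:46:00"))
# {'Year': 2016, 'Month': 2, 'Day': 8, 'Hour': 5}
```

A keyword rule like this is easy to audit but will misclassify streets whose names do not follow the assumed conventions, so the pattern should be checked against a sample of real street names.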