Every year, car accidents cause thousands of deaths, millions of injuries, and billions of dollars in losses in this country and around the world. America’s roadways are a huge source of human and financial loss, one that we have unfortunately grown accustomed to and treat as an externality of our lifestyles and current modes of transportation. The fact is, car accidents cause far too much damage to ignore, and efforts should be, and are being, made to engineer safer cars and roads. For my part, I am working with an extensive collection of accident data to predict accident severity from available information and to uncover the biggest risk factors that lead to accidents.
The specific question I am asking has two parts. First, I want to predict the severity of an accident, as soon as it is reported, from geospatial, time, and weather data. The data I am using rates severity on a 1-4 scale, determined by how much of a delay the accident causes. Severity becomes apparent shortly after an accident takes place, based on the ensuing backup, but it is usually not known immediately. I am seeking to predict, from the information available in the very first accident report, how severe an accident is, so that an accurate estimate of the delay can be reported to drivers on the road right away. There will be an emphasis on accurately predicting the most severe accidents, as these cause the longest delays. In application, this model could be embedded within mapping software. Second, I want to determine which factors are the biggest determinants of these severe accidents. The importance of this is self-evident: cars could be engineered more safely, road conditions could be improved, more streetlights could be built, and so on. There are many possible ways to increase safety based on whatever is discovered.
For my methodology, I want to first conduct a thorough exploratory data analysis once the data is cleaned and ready for use. The geographic nature of this data offers a great deal of informative visualization potential, as well as the opportunity to conduct spatial statistical tests on certain locales of interest. I intend to use factor analysis methods to uncover the biggest determinants of severe crashes and to fit the best multi-class model I can to predict severity.
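To make the plan concrete, here is a minimal sketch of the intended pipeline, with synthetic placeholder data standing in for the cleaned accident features and a basic multi-class model standing in for whatever model is ultimately chosen:

```python
# Minimal sketch of the planned pipeline: factor analysis to condense the
# weather/road/time features, then a multi-class model on the factor scores.
# The random arrays below are placeholders for the real cleaned data.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(5_000, 20))     # placeholder feature matrix
y = rng.integers(1, 5, size=5_000)   # placeholder 1-4 severity labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Reduce correlated features to a handful of latent factors
fa = FactorAnalysis(n_components=6, random_state=42)
X_train_fa = fa.fit_transform(X_train)
X_test_fa = fa.transform(X_test)

# Baseline multi-class model fit on the factor scores
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_fa, y_train)
print(clf.score(X_test_fa, y_test))
```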
References
• Fivethirtyeight.com
• Kaggle.com
• Asirt.org
With the advances in technology that have led to modern assisted and autonomous driving systems, there has been a great deal of research interest in reducing traffic accidents. The “Race to Zero”, as in zero accidents, zero emissions, and zero congestion, was a previously unthinkable goal that is now the benchmark every player in the mobility business is striving for. Reducing the number of severe collisions is a huge component of this, and it is the topic of the literature I have reviewed. Both Jianfeng et al. and Garrido et al. carried out studies attempting to predict accident severity from a combination of vehicle, driver, environment, and accident information. My goal fundamentally differs from theirs in that I am seeking to predict accident severity based solely on information available immediately upon accident reporting, in keeping with the use case of a GPS program immediately notifying motorists. Also, severity in my data is defined by the delay an accident causes, rather than by vehicle damage or injuries sustained. Nevertheless, the previous research proved very insightful for my goals.
Jianfeng et al. and Garrido et al. both wrote very favorably about the effectiveness of ordinal logistic regression and tree models on this kind of data, which is the path I will pursue given that the features I plan to use are very similar to theirs. Jianfeng et al.'s factor analysis methodology was also very insightful and was reflected in an increase in model performance. In addition, even though my research does not focus on it, given the predicted emergence of autonomous cars, it was prudent to find what research existed on AV collisions. While data here is extremely limited, a 2019 study of 114 AV accident records from California can serve as a starting point for research into the topic. In general, it found that the majority of severe accidents occurred during autonomous driving mode and were the fault of the AV system, while nearly every minor accident was caused by human error.
In my exploratory data analysis, I found accident severity to be highly bimodal, with most observations classified as either a 2 or a 3 out of 4. Roadway type appears to have the strongest influence on this, and I may have to do more creative feature engineering to extract additional location information, given that Garrido et al. found that more severe accidents occurred in rural areas. Two interesting outliers were also found: West Virginia and Arkansas both had statistically significantly higher percentages of severe accidents, with satisfactorily large sample sizes.
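The state-level comparison is the kind of check a two-proportion test handles: a state's share of severe accidents against the national share. The counts below are illustrative placeholders, not the actual figures from the data.

```python
# Sketch of a state-vs-national two-proportion z-test for severe accidents.
# All counts are hypothetical placeholders for illustration only.
from statsmodels.stats.proportion import proportions_ztest

state_severe, state_total = 420, 9_000                 # hypothetical Arkansas counts
national_severe, national_total = 25_000, 1_500_000    # hypothetical national counts

stat, pval = proportions_ztest(
    count=[state_severe, national_severe],
    nobs=[state_total, national_total],
)
print(f"z = {stat:.2f}, p = {pval:.4f}")
```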
Since the EDA and data exploration stages of the project, quite a lot has been added. All major trends in the data have been identified, and major steps have been made toward the main goal of predictive analysis. However, some interesting findings from data exploration also merited further analysis. My intent was to further explore any geographic anomalies that were found, and West Virginia and Arkansas turned out to have highly disproportionate rates of very severe accidents. While data was available on traffic volumes in these states, it was very difficult to find any hard data on the specific roadways within them that were prone to these accidents. Given this, any conclusions are difficult to assert firmly, given the lack of quantitative evidence available. West Virginia was stranger still: it had none of the possible explanatory factors that Arkansas had, so its severe accident rate remains unexplained.
With what information was available, I dug into Arkansas. Almost all of the level 4 accidents occurred on the two major interstate highways that run through the state, I-40 and I-30. These highways carry a very high volume of traffic, with I-40 receiving a daily average of 3,698,290 cars and I-30 receiving 3,859,140; these figures are from 2017, the most recent year for which data was available. Following these are I-49 with 3,011,310 daily cars and Route 71 with 2,602,570. I-40 and I-30 are also among the busiest freight routes in the U.S., with rough estimates of at least 50,000 trucks on these routes per day (2017). However, traffic volume alone is not always correlated with more severe accidents, as many states and localities with notoriously high traffic volumes did not register abnormally high numbers of severe accidents.
This is where I ran into a lack of quantitative information and had to turn to qualitative sources. Numerous news reports were easy to find regarding road-widening projects for different stretches of I-40 and I-30, with a general opinion that these efforts were much needed and likely overdue. The presence of large amounts of national forest in the state also appeared to contribute to narrower highways and high traffic volumes. While it is impossible to draw firm conclusions given the lack of quantitative backing, it can be hypothesized that the number of severe accidents in Arkansas is attributable to a combination of high traffic and freight volumes and roadways poorly equipped to handle them.
After exploring these outliers, modeling could begin to predict severity. First, features with very little explanatory power were removed. Numerous features were then added based on information available from the accident descriptions that would be appropriate to use when an accident is first reported. These included whether it was a multi-vehicle crash, whether a road or ramp was immediately blocked, whether it occurred in a construction zone, and several more. While more useful features were added, the data itself presented a challenge for model construction. The distribution of severity is extremely imbalanced, with 99.6% of accidents classified as either a 2 or a 3. Levels 1 and 4 combined make up only 0.4% of all accidents, which raises the question of how exactly severity is defined, given that so few accidents fit the criteria for 1 or 4. In fact, there are no exact criteria for severity, just a general statement that it represents the delay caused by the accident. Given the extreme imbalance and the lack of clarity on the differentiation between classes, I considered lumping 1s in with 2s and 4s in with 3s and turning this into a binary classification problem. Ultimately, I decided to go forward with it as a multi-class problem.
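A minimal sketch of the imbalance check and of the binary collapse that was considered but not adopted; the toy counts only roughly mirror the reported 99.6% / 0.4% split:

```python
# Inspect the severity distribution and show the considered class merge.
# The toy DataFrame stands in for the cleaned accident data.
import pandas as pd

df = pd.DataFrame({"Severity": [2] * 700 + [3] * 296 + [1] * 1 + [4] * 3})

# Confirms the extreme imbalance between levels 2/3 and levels 1/4
print(df["Severity"].value_counts(normalize=True))

# Option considered but not taken: collapse to a binary problem
# by merging 1s into 2s and 4s into 3s
df["Severity_binary"] = df["Severity"].replace({1: 2, 4: 3})
```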
Initially, I ran an untuned random forest model with roughly balanced classes via resampling of minority instances. The results were accuracies of over 92% for level 2 and 3 accidents, but 0% for level 1 and 4 accidents: the model was completely ignoring minority instances. At this crossroads, fine-tuning the model would not fundamentally fix the issue. Among the many possible options, I chose to synthetically generate minority instances rather than resample them. This made a large difference, as training on this data improved accuracy on level 4 accidents from 0% to 53.3%. It came at the cost of increased variance, as level 3 accuracy dropped from 92% to 73%, and level 1 accidents remained ignored. While this method showed promise, more work was necessary. Level 4 accidents need to be differentiated more from level 3 via their attributes, so I performed text analysis of the descriptions to this end. The results were interesting and will be highly useful, as numerous words appear at far different frequencies in level 4 descriptions versus level 3, and these can be engineered into features. In the final phase, my intent is to finalize a robust dataset with very precisely chosen features and to fine-tune the synthetic data generation and random forest algorithms.
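The report does not name the specific synthetic-generation technique; the sketch below uses SMOTE from imbalanced-learn as a stand-in, with synthetic data in place of the real accident features:

```python
# Sketch: oversample rare severity classes synthetically (SMOTE as a stand-in),
# then train a random forest. Synthetic data replaces the real features.
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Toy stand-in for the highly imbalanced four-class severity data
X, y = make_classification(
    n_samples=20_000, n_classes=4, n_informative=8,
    weights=[0.002, 0.60, 0.394, 0.004], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Generate synthetic minority instances only in the training split
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_res, y_res)
print(classification_report(y_test, rf.predict(X_test)))
```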
While results were significantly improved, there was still much work to be done. More features were necessary to differentiate level 4 accidents, so I conducted text analysis of the descriptions of level 3 and 4 accidents to see if more information could be extracted. Several words were found to be disproportionately more or less present in level 4 descriptions, and these were engineered into new features. “Near”, “ramp”, “slow”, and “center” appeared significantly less often in level 4 descriptions, while “lanes”, “closed”, “emergency”, and “detour”, among others, appeared significantly more often. Ultimately, I chose to include features indicating an earlier accident, whether the roads had been reopened, whether there was an immediate detour, whether emergency vehicles were immediately on the scene, whether the accident was near another event, and a flag for weekday versus weekend. I also added a measure of population density in the area of the accident. This was not done programmatically; rather, I loaded the data into GIS software, added population density based on the county in which the accident was located, and exported the data back out. Several extremely sparse features were also removed.
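A sketch of the description analysis, assuming a DataFrame with Severity and Description columns (the two toy rows below are invented for illustration): compare word rates between level 3 and level 4 descriptions, then turn over-represented words into binary features.

```python
# Sketch: compare word frequencies between level 3 and level 4 descriptions,
# then encode over-represented words as binary features. Toy rows only.
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "Severity": [3, 4],
    "Description": [
        "Accident on ramp near exit, right lane blocked, traffic slow.",
        "Accident on I-40, all lanes closed, emergency crews on scene, detour in place.",
    ],
})

def word_rates(texts):
    """Relative frequency of each lowercase word across a set of descriptions."""
    counts = Counter(w.strip(".,").lower() for t in texts for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

rates3 = word_rates(df.loc[df["Severity"] == 3, "Description"])
rates4 = word_rates(df.loc[df["Severity"] == 4, "Description"])

# Words far more common in level 4 descriptions become candidate binary features
for word in ("lanes", "closed", "emergency", "detour"):
    df[f"desc_has_{word}"] = df["Description"].str.lower().str.contains(word)
```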
With a greater array of features now included in the data, it needed to be determined conclusively which ones were relevant. A chi-squared test was performed to assess the statistical significance of each categorical feature based on its distribution across severity levels, and all were shown to be significant. With all prudent preprocessing and feature engineering steps completed, a good model could now be fleshed out.
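A minimal sketch of that test, applying scipy's chi2_contingency to a toy contingency table of one categorical feature against severity:

```python
# Sketch: chi-squared test of independence between a categorical feature
# and severity level. The toy DataFrame stands in for the real data.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.DataFrame({
    "Severity":   [2, 2, 3, 3, 4, 4, 2, 3],
    "Interstate": [0, 1, 1, 1, 1, 1, 0, 0],
})

table = pd.crosstab(df["Interstate"], df["Severity"])
chi2, pval, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {pval:.4f}")
```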
Along with the challenge of data imbalance, the sheer size of the dataset was a challenge in and of itself. For my modeling I worked with a sample of approximately 500,000 data points, and even this reduced amount required extra steps for code to run reasonably quickly. Scikit-learn performed extremely slowly on a dataset of that size, so I turned to H2O, an ML library with built-in parallel processing capabilities, which sped up modeling dramatically and was especially useful for grid searches.
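A sketch of the H2O workflow, assuming a hypothetical CSV export of the sample (the file name, column name, and hyperparameter grid are placeholders, not the actual values used):

```python
# Sketch: H2O random forest with a grid search over a small hyperparameter grid.
# "accidents_sample.csv" and the "Severity" column name are assumptions.
import h2o
from h2o.estimators import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch

h2o.init()

frame = h2o.import_file("accidents_sample.csv")
frame["Severity"] = frame["Severity"].asfactor()  # treat severity as a class label
train, test = frame.split_frame(ratios=[0.8], seed=42)

hyper_params = {"ntrees": [100, 300], "max_depth": [10, 20]}  # placeholder grid
grid = H2OGridSearch(
    model=H2ORandomForestEstimator(seed=42),
    hyper_params=hyper_params,
)
grid.train(
    x=[c for c in frame.columns if c != "Severity"],
    y="Severity",
    training_frame=train,
    validation_frame=test,
)

# Pick the model with the lowest mean per-class error on validation data
best = grid.get_grid(sort_by="mean_per_class_error", decreasing=False).models[0]
print(best.model_performance(test))
```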
The best random forest model produced mixed yet decent results that held up on test data. As before, the model is very robust at classifying level 2 accidents, but worse at levels 3 and 4. Mean per-class accuracy hovers around 77%, and overall accuracy is around 86%. A strong ROC AUC score was buoyed by the model's robustness in predicting level 2, and to an extent level 3. On the flip side, a weak F1 score was caused by the lack of precision in predicting level 4 accidents. A one-vs-rest support vector machine approach was also tried, due to penalized SVMs' strength at detecting minority classes, but the results were not good.
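For reference, a sketch of the one-vs-rest penalized SVM idea, here approximated with a class-weighted LinearSVC on synthetic imbalanced data; the exact SVM variant and penalty used in the project are not specified.

```python
# Sketch: one-vs-rest SVM with class_weight="balanced", which penalizes
# mistakes on rare classes more heavily. Synthetic data stands in for
# the real features; scaling is included since SVMs are scale-sensitive.
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(
    n_samples=20_000, n_classes=4, n_informative=8,
    weights=[0.002, 0.60, 0.394, 0.004], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

svm = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LinearSVC(class_weight="balanced", max_iter=5000)),
)
svm.fit(X_train, y_train)
print(classification_report(y_test, svm.predict(X_test)))
```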
Ultimately, there appears to be a ceiling on how effective a given model can be with this specific data and task, which is something I suspected coming in. Overrepresenting severe accidents increases the number of them that are correctly classified, but with very low precision and at the cost of accuracy on level 3 predictions. However, the results are still reasonably good. Predicting such a small minority class is reminiscent of modeling fraudulent financial transactions, a setting in which it can be prudent to trade precision for recall. Traffic delays are somewhat less critical to alert to, which once again raises the question of how the accident severities are distributed. In context, a model like mine would function as a “warning system”, indicating a probable moderate delay with a chance of being severe whenever a level 4 accident is predicted. However, the main sticking points for this type of work going forward are the data and features. Many of the features were very sparse and did not provide much reduction in Gini impurity when used for splitting, as can be seen from the low optimal maximum depths and the overall model performance. Whether or not the accident occurred on an interstate highway was by far the most important feature, speaking again to the importance of location and road information. Vehicle and specific road data were difficult to find and could have improved the analysis had they been available. Ultimately, accurately predicting delays from immediately known variables is a very achievable task with the toolbox and modeling I have established. However, several important questions about the somewhat black-box data provided by the GPS services must first be clarified.