Conclusion

Results

We briefly reiterate the results of our research.

  1. Distance to the nearest streetlight seems to be a significant predictor (different for day and night time) for Force and Property crimes, and possibly for Public crimes. The sign of our t-statistic suggests that Force crimes tend to occur nearer streetlights, while Property crimes tend to occur away from streetlights. Public crimes also tend to occur away from streetlights.
  2. Of the seven measures of inequality we began with, median income and total value of property within a 200-meter radius of the crime were not very important; all other predictors (Gini coefficient for the census tract in which the crime was committed, percentage of people in high-income housing, percentage of people with low education, percentage of people with high education, percentage of people in new housing, percentage of people in old housing, and percentage of people in poverty) were important. The occurrences of all types of crimes tended to increase with every measure of inequality, but we did not find anything to suggest that any type of crime increased with inequality more severely than the others.
  3. The importance ranking of the predictors from our random forest models suggested that the predictors associated with time -- such as HOUR, DAY OF WEEK, MONTH, and YEAR -- were most important in predicting which type of crime will occur.
  4. Our highest test accuracy was 56.6%. Since optimized random forest and neural network models are among the more sophisticated models available, this suggests that predicting the type of crime is very challenging. Much of the variation in the data is unpredictable, since human behavior is itself unpredictable.

Strengths

The classification accuracy that we achieved was relatively good given the challenge of distinguishing between seven different classes. It was as high as the accuracy reported in the Almanie paper, which also attempted to predict the quality of different types of crime.

Another strength of our research is the consistency of results among our various complex models -- neural network, random forest, and optimized random forest all had similar peak test accuracy. The presence of a “glass ceiling” or upper bound on our accuracy, regardless of model, suggests that for future approaches, improvement on our existing models will require not a more sophisticated model on the same data, but the introduction of completely new data (such as temporal data or time series predictions) and completely new methods.

The random forest models that we used gave transparent and easy-to-interpret importance rankings that allowed us to evaluate which predictors were most important. We were able to replicate results from multiple papers showing the importance of temporal data in predicting crime. In addition, some of the more ‘creative’ metrics that we used, such as the minimum distance to certain landmarks, turned out to be useful predictors.

Our model and further analysis led us to insightful answers to the lower-level questions posed at the beginning of this project. Specifically, we were able to determine that streetlights have a significant effect only on certain types of crimes, and that income inequality affected crime rates but not the proportion of different crime types.

Challenges

One of the main challenges of our project was that the distribution of crime categories was quite unbalanced. For example, the proportion of crimes in the deaths category was almost vanishingly small -- so small that our neural network model did not predict any deaths on the testing data. We included deaths because we thought murders and manslaughters were an intrinsically serious (and therefore important) category, but this approach could likely be improved. We discuss possible improvements to the selection of crime categories, through the method of “clustering,” below.

Another challenge of our project was the complexity of using so many predictors. Using many predictors made our models slow to run (for example, our optimized random forest model took over an hour to run), and it can sometimes obscure the effect of any one of them. We partially solved this problem with importance ranking of the predictors in the random forest models, but an importance ranking is difficult to construct for other models, such as neural networks.
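One model-agnostic way to rank predictors for a model without built-in importances, such as a neural network, is permutation importance: shuffle one predictor's column and measure how much accuracy drops. This was not part of our project. The sketch below is only a minimal illustration, and the toy model, the rows, and the feature layout (hour, distance to streetlight) are all hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the given label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, seed=0):
    """Drop in accuracy after shuffling the values of one feature column."""
    rng = random.Random(seed)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    # Rebuild each row with the shuffled value in place of the original one.
    shuffled_rows = [
        r[:feature] + (v,) + r[feature + 1:] for r, v in zip(rows, column)
    ]
    return accuracy(model, rows, labels) - accuracy(model, shuffled_rows, labels)


def toy_model(row):
    """Hypothetical stand-in for a trained model: it looks only at the hour."""
    return "Property" if row[0] >= 18 else "Force"


# Made-up rows of (hour, distance to nearest streetlight in meters).
rows = [(2, 5.0), (23, 80.0), (4, 12.0), (21, 60.0),
        (9, 7.0), (19, 95.0), (11, 30.0), (22, 40.0)]
labels = [toy_model(r) for r in rows]

hour_importance = permutation_importance(toy_model, rows, labels, feature=0)
distance_importance = permutation_importance(toy_model, rows, labels, feature=1)
```

Because the toy model ignores distance, shuffling that column cannot change its predictions, so its importance is zero. In practice one would average over several shuffles and evaluate on held-out data.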

Finally, we might have improved the power of our model if we had used dynamic data. For example, criminals might not commit crimes if it were too cold or snowing, or if certain politicians were elected. Or perhaps the past occurrence of a certain kind of crime could be a significant predictor of future occurrences. We outline a future direction of work in this area, using time series, below.

Future Work

For a tool to be useful to law enforcement officials or city policy makers, there needs to be a certain level of confidence in the predictions. Therefore, a potential future approach would be to reverse engineer a model that would have high predictive accuracy. We propose using unsupervised learning techniques to separate the crimes into well-defined clusters based on their feature values. Then we can see whether the crimes in each of the clusters fall within the same or similar categories. We can define a category for each cluster that can be predicted with high accuracy and, we hope, would prove useful for officials.
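The clustering idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an implementation we ran: it uses plain k-means seeded with the first k points, made-up two-feature rows (hour, distance to streetlight), and a simple "purity" score, which is our own choice of measure for how well clusters line up with the existing crime categories.

```python
from collections import Counter


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (illustrative only)."""
    centroids = [list(p) for p in points[:k]]
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


def cluster_purity(assignments, labels):
    """Fraction of points whose label is the majority label of their cluster."""
    by_cluster = {}
    for a, label in zip(assignments, labels):
        by_cluster.setdefault(a, Counter())[label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return majority_total / len(labels)


# Made-up rows of (hour, distance to nearest streetlight) with known categories.
points = [(2.0, 5.0), (22.0, 80.0), (3.0, 8.0),
          (23.0, 75.0), (1.0, 6.0), (21.0, 90.0)]
labels = ["Force", "Property", "Force", "Property", "Force", "Property"]

assignments = k_means(points, k=2)
purity = cluster_purity(assignments, labels)
```

A purity near 1 would mean each cluster maps cleanly onto one existing category; low purity would suggest defining new categories from the clusters themselves. Real features would need scaling first, since k-means is sensitive to the units of each predictor.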

Another future direction of work could be to include more sophisticated time-series analysis of the existing crime data. Perhaps the temporal occurrences of some crimes are not random; for example, maybe drug trading happens on a regular basis. We found that HOUR was our most important predictor; perhaps we can visually determine the trends with HOUR for each type of crime. Data with periodic trends can be analyzed with generalizations of the SARIMA framework, in this case extended to spatial data. This seems to be a difficult problem not just for our data set but in statistics generally. However, if there are nontrivial periodicities, this kind of analysis could help police departments in tracking and predicting crimes, since they would need to expend additional resources only when they know certain crimes are more likely to occur.
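Before fitting a seasonal model such as SARIMA, one simple check for periodicity is the sample autocorrelation at a candidate seasonal lag. The sketch below is a hedged illustration on synthetic data: the hourly counts are made up, with a daily cycle built in, and do not come from our data set.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series)
    covariance = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return covariance / variance


# Synthetic hourly counts over 10 days with a built-in 24-hour cycle.
hourly_counts = [10 + 5 * math.sin(2 * math.pi * t / 24) for t in range(240)]

daily = autocorrelation(hourly_counts, 24)     # strongly positive: same phase
half_day = autocorrelation(hourly_counts, 12)  # strongly negative: opposite phase
```

A large positive value at lag 24 (or at lag 168 for a weekly cycle) would be a first indication that a seasonal term is worth modeling for a given crime type.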