After synthesizing and cleaning the data, we ended up with a dataset containing 33 predictors and 191,255 observations. A brief summary of the performance of our baseline model and our improved models is below.
First, we answer why we used multiple logistic regression as a baseline model. Since we are trying to predict the occurrences of types of crime, the indicator of whether a certain crime falls in the Death category, for example, can take only two values, 0 or 1. A meaningful prediction of a type of crime, however, must be a probability, which lies between 0 and 1. The simplest model is a linear regression, but it generally gives invalid predictions (below 0 or above 1) because a line with nonzero slope is unbounded. Instead, we use the well-known logistic regression, whose logistic transform maps the real line onto the interval (0, 1). Logistic regression is therefore the simplest model that gives sensible probabilities for the occurrence of each category.
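A minimal sketch of such a baseline with scikit-learn (the synthetic data merely stands in for our real predictors; its shape mirrors our 33 predictors and seven crime categories):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our data: 33 predictors, 7 crime categories.
X, y = make_classification(n_samples=5000, n_features=33, n_informative=10,
                           n_classes=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Multinomial logistic regression: a softmax (multi-class logistic) transform
# maps unbounded linear scores to class probabilities in (0, 1).
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))
```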
Next, we answer why we thought a random forest model could improve the results from the multiple logistic regression. A random forest is an ensemble of many decision trees (in our case, 100), with a random subset of the predictors considered for each tree. We used many predictors, not all of which are important; the random forest accounts for this by splitting first on the predictors that are most important. In other words, tree models are easy to interpret, since the predictors that are split on first tend to be more important. Another reason we thought a tree-based model makes sense is that it resembles, in some way, how real people (who might commit crimes) think. For example, a criminal might first see if the time of day is right to commit a crime, then check the weather, then check whether they are close to a streetlight, etc. We chose a random forest rather than a single tree because averaging over many trees reduces variance and gives finer-grained probability estimates than a single tree can.
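A hedged sketch of such a forest with scikit-learn, reusing the synthetic split from the baseline sketch above (the depth of 18 matches the hand-chosen value discussed later):

```python
from sklearn.ensemble import RandomForestClassifier

# 100 trees, each grown on a bootstrap sample; at every split only a random
# subset of the predictors is considered (max_features="sqrt" by default).
forest = RandomForestClassifier(n_estimators=100, max_depth=18, random_state=0)
forest.fit(X_train, y_train)
print("forest accuracy:", forest.score(X_test, y_test))
```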
Now, we answer why we thought a neural network could also improve the results from the multiple logistic regression. Since our data lives in a high-dimensional space, we expect the fit to be truly nonlinear, unlike the logistic regression model, which is essentially a linear fit to a logit-transformed response. Neural networks are manifestly nonlinear; the basic logistic and ReLU activation functions are already nonlinear, so a complex combination of them is highly nonlinear. We fit several different network topologies to the data, using the Keras API of TensorFlow, and found that adding more layers did not significantly increase the accuracy of the fit. Thus, we settled on a neural network with just two hidden layers.
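A minimal sketch of the kind of two-hidden-layer topology we settled on, again on the synthetic stand-in data; the layer widths and epoch count here are illustrative, not our tuned values:

```python
import tensorflow as tf

# Two hidden ReLU layers followed by a softmax output,
# one output unit per crime category.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(33,)),
    tf.keras.layers.Dense(64, activation="relu"),    # second hidden layer
    tf.keras.layers.Dense(7, activation="softmax"),  # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=256, verbose=0)
print("network accuracy:", model.evaluate(X_test, y_test, verbose=0)[1])
```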
Finally, we answer why we chose to optimize the random forest model over its hyperparameter space. In our original random forest model, we specified the tree parameters by hand. For example, we specified a maximum tree depth of 18, but who knows if this is the best depth? A depth that is too small would not capture all the information; a depth that is too large would overfit. Therefore, we wrote code to search over the following random forest hyperparameters, as sketched below: depth of the trees, number of trees in the forest, and number of predictors considered at each split. We used cross-validation within the training set to prevent overfitting. Unsurprisingly, the calculation took over an hour. The performance of this optimized random forest on the test data was, indeed, somewhat higher than that of the hand-specified random forest.
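A sketch of how such a search can be run with scikit-learn's RandomizedSearchCV, assuming the same kind of training data as above; the ranges are illustrative:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "max_depth":    randint(5, 40),    # depth of trees
    "n_estimators": randint(50, 500),  # number of trees in the forest
    "max_features": randint(2, 20),    # predictors considered at each split
}
# Cross-validation within the training set guards against
# overfitting to any single train/validation split.
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=20, cv=5, n_jobs=-1,
                            random_state=0)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```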
Interestingly, fitting the same model with all the additional predictors removed (keeping only the time data: month, day of the week, etc.) led to a predictive accuracy of 50%.
This observation was consistent with the finding in Rumi’s paper (“Crime event prediction with dynamic features,” EPJ Data Sci. 7, 43) that including additional predictors beyond the basic time and location data led to small yet significant increases in accuracy.
Intuitively, we expect crimes to be committed away from streetlights. Is this really the case?
To make our preliminary analyses easier to interpret, we used stripped-down models with relatively few predictors to discern the effect of streetlights, if any. We expect the presence of streetlights to matter only at night and, further, to matter most during the hours in which most people are asleep. Therefore, we investigate the effect of streetlights by themselves on the different types of crime, as well as the interaction between streetlights and time-related predictors such as the HOUR in which the crime was committed.
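As an illustration of the kind of stripped-down model we mean, here is a hedged sketch using statsmodels; the DataFrame df and the column names (force, dist_light, night) are hypothetical stand-ins for our actual variables:

```python
import statsmodels.formula.api as smf

# 'force' is a hypothetical 0/1 indicator for a Force crime, 'dist_light'
# the distance to the nearest streetlight, 'night' a 0/1 nighttime flag.
# The dist_light:night interaction lets the streetlight effect differ
# between day and night.
model = smf.logit("force ~ dist_light + night + dist_light:night",
                  data=df).fit()
print(model.summary())  # sign of the dist_light terms: nearer vs. farther
```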
Results in a nutshell: Property and Public crimes tend to occur away from streetlights, while Force crimes tend to occur close to streetlights. The other types of crimes did not show significant deviations with respect to their distance from streetlights.
Here are the models we created, along with a summary of the results:
We now summarize our findings. Distance to the nearest streetlight seems to be a significant predictor (different for day and night time) for Force and Property crimes, and possibly for Public crimes. The sign of our t-statistic suggests that Force crimes tend to occur nearer streetlights, while Property crimes tend to occur away from them; Public crimes also tend to occur away from streetlights.
Although we didn’t include other (possibly confounding) data, such as distance to the nearest school, in this elementary analysis, we do not expect any of the other variables to correlate significantly with distance to nearest streetlight or with HOUR of day the crime was committed. Therefore, we are reasonably confident that our simple models give us valuable information.
Inequality is a famously difficult idea to measure numerically. As proxy measures of inequality, we used the following predictors: Gini coefficient for the census tract in which the crime was committed, median income in the census tract in which the crime was committed, and total value of property within a 200-meter radius of the crime. We also used percentage of people in high-income housing, percentage of people with low education, percentage of people with high education, percentage of people in new housing, percentage of people in old housing, and percentage of people in poverty.
We make some implicit assumptions here. For example, it’s not immediately clear that income is a good measure of inequality -- what if everyone in the area has exactly the same income, so there is perfect equality?
We think it is reasonable to assume that there are always people at the lower end of the income spectrum, regardless of whether there are also rich people. Under this assumption, the richer the area, the more unequal it tends to be. This is why we use the median income and total value of property within a 200-meter radius as predictors.
Results in a nutshell: Of the predictors we began with, median income and total value of property within a 200-meter radius of the crime were not very important; all the other predictors were important. The occurrences of all types of crimes tended to increase with every measure of inequality, but we did not find anything to suggest that any one type of crime increased with inequality more severely than the others.
Here are the models we created, along with a summary of the results:
We can summarize our results as follows: according to the penalized logistic regression (lasso), the least effective measures of inequality were income and log-transformed property value, since their coefficients tended to be driven to zero by the LASSO (L1) penalty.
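A minimal sketch of the penalized fit with scikit-learn, using the names from the earlier sketches; standardizing first matters because the L1 penalty treats all coefficients on the same scale:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# An L1 (lasso) penalty drives the coefficients of uninformative predictors
# exactly to zero, which is how predictors like income can drop out.
lasso = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000),
)
lasso.fit(X_train, y_train)
coefs = lasso.named_steps["logisticregression"].coef_
print((coefs == 0).sum(), "coefficients regularized away")
```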
Now that we’ve investigated each measure of inequality and eliminated the ones which did not give much information, we would like to investigate which crimes they are correlated with. To do so, we flip the predictor-response paradigm of inequality measures predicting crimes, and instead use the occurrences of crimes to predict inequality measures! We do this for two reasons. First, the dummy variables for the occurrences of crimes can only be 0 or 1, so we can comfortably compare the regression coefficients on different types of crimes. (The scales of the Gini coefficient and the race data are not the same, and even if we normalized the different predictors, it still would not be clear that we could compare their numerical values, or the values of their associated slopes.) A higher slope then means a stronger correlation between a type of crime and a certain measure of inequality. Second, flipping responses and predictors does not change whether two variables are correlated, so the flipped regression answers the same question.
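A hedged sketch of one such flipped regression with statsmodels; crime_dummies (a DataFrame of 0/1 columns, one per crime type) and gini (the tract Gini coefficient) are hypothetical stand-ins for our actual variables:

```python
import statsmodels.api as sm

# Because every predictor is on the same 0/1 scale, the fitted slopes are
# directly comparable: a larger slope means a stronger association between
# that crime type and this measure of inequality.
X = sm.add_constant(crime_dummies)
flipped = sm.OLS(gini, X).fit()
print(flipped.params.sort_values(ascending=False))
```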
We found that the slopes of the crime-type predictors were always positive, meaning that an increase in the rate of any type of crime is associated with an increase in every measure of inequality, which makes sense. Below, we list the relative strengths of the correlations between crime types and each measure of inequality:
No single category of crime stood out as being associated with all or most measures of inequality more strongly than the others. We conclude that the occurrences of all types of crimes tend to increase with any measure of inequality, but that no one type of crime tends to increase more than the others.
In this section, we outline some of the modeling decisions we made, both when answering the lower-level questions and when working on the large model for predicting types of crime.
In the lower-level question about streetlights, we realized that the geographic distribution of anything relative to a fairly uniformly spaced grid (such as that of streetlights) is uniform in area but not in distance. That is because the derivative of the area of a circle with respect to its radius increases linearly with the radius. Lengths, however, are easy to interpret, so we decided that since streetlights should not matter for crimes committed during the day, we could use daytime crimes as a “null” or “placebo” group against which to compare crimes committed at night, for which distance to the nearest streetlight might in fact matter.
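To make the geometric point explicit: if crimes were scattered uniformly in area, the amount of area available at distance $r$ from a light grows as

$$\frac{d}{dr}\bigl(\pi r^{2}\bigr) = 2\pi r,$$

so even under this uniform null, crime counts would increase linearly with distance rather than staying flat; the daytime “placebo” group gives us an empirical baseline that sidesteps this geometric artifact.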
In the lower-level question about inequality, we had difficulty distinguishing how important each predictor was to the different types of crime. This is because the different predictors of inequality (for example, percentage of people in high-income housing and property values) are not directly comparable, even if we normalize them. For example, it is very reasonable for the percentage of people in high-income housing to approach zero in very poor neighborhoods; however, it is not reasonable for the percentage of low-education residents to approach zero in any neighborhood. We resolved this problem in a creative way by flipping the predictors with the responses and using the occurrences of different types of crimes to predict the measures of inequality. The probability that a certain crime occurs is directly comparable with the probability that another crime occurs, and flipping responses and predictors does not change whether they are correlated.
The trajectory of the project changed slightly when we decided to change the categories of crimes that we were predicting; we eliminated No_offense and Other as crime categories. This allowed us to train better and more interesting models, since the incidents (some not even crimes) that fell into the No_offense and Other categories had little in common with one another, even in the types of offenses we had chosen to put there. In contrast, the kinds of crimes we included in Death, for example, were at least superficially similar to each other. Indeed, we got higher accuracy after we eliminated the two “umbrella” categories.
In addition, in an attempt to improve upon the perceived poor performance of our baseline model, we decided to include many more predictors than originally intended. We added data on the racial makeup of the census tracts as well as more predictors of inequality, such as the percentage of people in high-income housing. We therefore had to add another section to our answer to the lower-level question about inequality: which measures of inequality matter?
Our best model was an optimized random forest, with an accuracy on the test set of 56.6%. We used a random grid search with cross-validation to find the best hyperparameters. The best parameters found led to only a very marginal improvement of the random forest model, from 56.3% to 56.6%.
We can evaluate our model by looking at a confusion matrix and ROC scores.
We plotted the ROC curve and its AUC for each class of crime, as well as the overall average. Here the AUC is calculated for detecting each class relative to all of the other classes combined (one-vs-rest). Class 0 is the Property class, which is by far the largest; our model had the most success correctly classifying Property crimes. Class 5, which hugs the lower corner, was the class for crimes involving death. This was by far the smallest class, with only 53 occurrences in the test set, and our model struggled to predict it given its rarity; however, the model was sophisticated enough to detect some true positives. On average, the model fell below the accuracy of a naive classifier, but this does not take into account the fact that we are trying to predict between seven different classes, not just two.
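A sketch of how these per-class one-vs-rest AUC scores can be computed, continuing with the forest and synthetic split from the earlier sketches:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.preprocessing import label_binarize

proba = forest.predict_proba(X_test)
y_bin = label_binarize(y_test, classes=list(range(7)))

# AUC for detecting each class relative to all other classes combined.
for k in range(7):
    print("class", k, "AUC:", roc_auc_score(y_bin[:, k], proba[:, k]))

print("macro-average AUC:", roc_auc_score(y_test, proba, multi_class="ovr"))
print(confusion_matrix(y_test, forest.predict(X_test)))
```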
From the random forest model we can also extract a ranking of feature importance. We combined the importance scores for all of the dummy-encoded day-of-the-week variables into a single score; this combined predictor was ranked second in importance. The top predictors were all temporal predictors derived from the original dataset, which makes sense given the importance of temporal predictors seen in our literature review. The next most important predictors are spatial predictors that we derived from the precise location of the crime. The AV_total and mean prop value predictors were derived from the conditions within a certain radius of the crime, and the distance metrics also appear to be important. None of the demographic predictors from the census data were particularly important. This may be because the census tracts are not granular enough, or because these are very general features that do not affect the types of crimes committed. We expect that these predictors could be important for predicting overall crime rates within a census tract. The extremely low importance of population can be explained by the fact that census tracts are deliberately drawn to have approximately the same population.
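A sketch of how the dummy-encoded day-of-week importances can be folded into one score; the column prefix DAY_OF_WEEK_ and the feature_names list are hypothetical stand-ins for our actual column names:

```python
import pandas as pd

imp = pd.Series(forest.feature_importances_, index=feature_names)

# Sum the importances of all dummy-encoded day-of-week columns into a
# single combined score, then rank it against the remaining predictors.
is_dow = imp.index.str.startswith("DAY_OF_WEEK_")
combined = imp[~is_dow].copy()
combined["DAY_OF_WEEK (combined)"] = imp[is_dow].sum()
print(combined.sort_values(ascending=False).head(10))
```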
In general, it is important to take systematic approaches to policing, and using data can allow law enforcement agencies to better serve their communities. The problem posed in this project is probably not the one of greatest interest to law enforcement, because it predicts types of crime from police reports in which the type of crime is already known. However, it could prove useful for supplementing officers’ intuition about the situation when responding to a call at a certain location at a certain time. Given that officers are typically experienced and well trained, and are likely to have domain-specific knowledge about recent trends in crime, gang hotspots, etc., we believe the additional value our model would add here is low. Another way our model could be used is for forecasting trends into the future: it could evaluate how the distribution of crimes will change as the city goes through changes in demographic breakdown and geographic features. For example, one could forecast how gentrification could affect the policing needs of a neighborhood.