After synthesizing and cleaning the data, we ended up with a dataset containing 33 predictors and 191,255 observations. A brief summary of the performance of our baseline model and our improved models is below.
First, we answer why we used multiple logistic regression as a baseline model. Since we are trying to predict the occurrences of types of crime, the indicator of whether a certain crime falls in the Death category, for example, can take only two values, 0 or 1. A meaningful prediction of a type of crime, however, must be a probability, which lies between 0 and 1. The simplest model is a linear regression, but it generally gives invalid predictions (below 0 or above 1) because a line with nonzero slope is unbounded. Instead, we use the well-known logistic regression, whose logistic transform maps the real line onto the interval (0, 1). Logistic regression is therefore the simplest model that gives sensible probabilities for the occurrence of each category.
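A minimal sketch of such a baseline with scikit-learn (the synthetic data merely stands in for our real predictors; its shape mirrors our 33 predictors and seven crime categories):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for our data: 33 predictors, 7 crime categories.
X, y = make_classification(n_samples=5000, n_features=33, n_informative=10,
                           n_classes=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Multinomial logistic regression: a softmax (multi-class logistic) transform
# maps unbounded linear scores to class probabilities in (0, 1).
baseline = LogisticRegression(max_iter=1000)
baseline.fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))
```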
Next, we answer why we thought a random forest model could improve the results from the multiple logistic regression. A random forest is an ensemble of many decision trees (in our case, 100), with a random subset of the predictors considered for each tree. We used many predictors, not all of which are important; the random forest accounts for this by splitting first on the predictors that are most important. In other words, tree models are easy to interpret, since the predictors that are split on first tend to be more important. Another reason we thought a tree-based model makes sense is that it resembles, in some way, how real people (who might commit crimes) think. For example, a criminal might first see if the time of day is right to commit a crime, then check the weather, then check whether they are close to a streetlight, etc. We chose a random forest rather than a single tree because averaging over many trees reduces variance and gives finer-grained probability estimates than a single tree can.
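A hedged sketch of such a forest with scikit-learn, reusing the synthetic split from the baseline sketch above (the depth of 18 matches the hand-chosen value discussed later):

```python
from sklearn.ensemble import RandomForestClassifier

# 100 trees, each grown on a bootstrap sample; at every split only a random
# subset of the predictors is considered (max_features="sqrt" by default).
forest = RandomForestClassifier(n_estimators=100, max_depth=18, random_state=0)
forest.fit(X_train, y_train)
print("forest accuracy:", forest.score(X_test, y_test))
```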
Now, we answer why we thought a neural network could also improve the results from the multiple logistic regression. Since our data lives in a high-dimensional space, we expect the fit to be truly nonlinear, unlike the logistic regression model, which is essentially a linear fit to a logit-transformed response. Neural networks are manifestly nonlinear; the basic logistic and ReLU activation functions are already nonlinear, so a complex combination of them is highly nonlinear. We fit several different network topologies to the data, using the Keras API of TensorFlow, and found that adding more layers did not significantly increase the accuracy of the fit. Thus, we settled on a neural network with just two hidden layers.
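A minimal sketch of the kind of two-hidden-layer topology we settled on, again on the synthetic stand-in data; the layer widths and epoch count here are illustrative, not our tuned values:

```python
import tensorflow as tf

# Two hidden ReLU layers followed by a softmax output,
# one output unit per crime category.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(33,)),
    tf.keras.layers.Dense(64, activation="relu"),    # second hidden layer
    tf.keras.layers.Dense(7, activation="softmax"),  # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10, batch_size=256, verbose=0)
print("network accuracy:", model.evaluate(X_test, y_test, verbose=0)[1])
```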
Finally, we answer why we chose to optimize the random forest model over its hyperparameter space. In our original random forest model, we specified the tree parameters by hand. For example, we specified a maximum tree depth of 18, but who knows if this is the best depth? A depth that is too small would not capture all the information; a depth that is too large would overfit. Therefore, we wrote code to search over the following random forest hyperparameters, as sketched below: depth of the trees, number of trees in the forest, and number of predictors considered at each split. We used cross-validation within the training set to prevent overfitting. Unsurprisingly, the calculation took over an hour. The performance of this optimized random forest on the test data was, indeed, somewhat higher than that of the hand-specified random forest.
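A sketch of how such a search can be run with scikit-learn's RandomizedSearchCV, assuming the same kind of training data as above; the ranges are illustrative:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "max_depth":    randint(5, 40),    # depth of trees
    "n_estimators": randint(50, 500),  # number of trees in the forest
    "max_features": randint(2, 20),    # predictors considered at each split
}
# Cross-validation within the training set guards against
# overfitting to any single train/validation split.
search = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                            param_dist, n_iter=20, cv=5, n_jobs=-1,
                            random_state=0)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```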
Interestingly, fitting the same model with all the additional predictors removed (keeping only the time data: month, day of the week, etc.) led to a predictive accuracy of 50%.
This observation was consistent with the finding in Rumi’s paper (“Crime event prediction with dynamic features,” EPJ Data Sci. 7, 43) that including additional predictors beyond the basic time and location data led to small yet significant increases in accuracy.
Intuitively, we expect crimes to be committed away from streetlights. Is this really the case?
To make our preliminary analyses easier to interpret, we used stripped-down models with relatively few predictors to discern the effect of streetlights, if any. We expect the presence of streetlights to matter only at night and, further, to matter most during the hours in which most people are asleep. Therefore, we investigate the effect of streetlights by themselves on the different types of crime, as well as the interaction between streetlights and time-related predictors such as the HOUR in which the crime was committed.
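As an illustration of the kind of stripped-down model we mean, here is a hedged sketch using statsmodels; the DataFrame df and the column names (force, dist_light, night) are hypothetical stand-ins for our actual variables:

```python
import statsmodels.formula.api as smf

# 'force' is a hypothetical 0/1 indicator for a Force crime, 'dist_light'
# the distance to the nearest streetlight, 'night' a 0/1 nighttime flag.
# The dist_light:night interaction lets the streetlight effect differ
# between day and night.
model = smf.logit("force ~ dist_light + night + dist_light:night",
                  data=df).fit()
print(model.summary())  # sign of the dist_light terms: nearer vs. farther
```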
Results in a nutshell: Property and Public crimes tend to occur away from streetlights, while Force crimes tend to occur close to streetlights. The other types of crimes did not show significant deviations with respect to their distance from streetlights.
Here are the models we created, along with a summary of the results:
We now summarize our findings. Distance to the nearest streetlight seems to be a significant predictor (different for day and night time) for Force and Property crimes, and possibly for Public crimes. The sign of our t-statistic suggests that Force crimes tend to occur nearer streetlights, while Property crimes tend to occur away from them; Public crimes also tend to occur away from streetlights.
Although we didn’t include other (possibly confounding) data, such as distance to the nearest school, in this elementary analysis, we do not expect any of the other variables to correlate significantly with distance to nearest streetlight or with HOUR of day the crime was committed. Therefore, we are reasonably confident that our simple models give us valuable information.
Inequality is a famously difficult idea to measure numerically. As proxy measures of inequality, we used the following predictors: Gini coefficient for the census tract in which the crime was committed, median income in the census tract in which the crime was committed, and total value of property within a 200-meter radius of the crime. We also used percentage of people in high-income housing, percentage of people with low education, percentage of people with high education, percentage of people in new housing, percentage of people in old housing, and percentage of people in poverty.
We make some implicit assumptions here. For example, it’s not immediately clear that income is a good measure of inequality -- what if everyone in the area has exactly the same income, so there is perfect equality?
We think it is reasonable to assume that there are always people at the lower end of the income spectrum, regardless of whether there are also rich people. Under this assumption, the richer the area, the more unequal it tends to be. This is why we use the median income and total value of property within a 200-meter radius as predictors.
Results in a nutshell: Of the predictors we began with, median income and total value of property within a 200-meter radius of the crime were not very important; all the other predictors were important. The occurrences of all types of crimes tended to increase with every measure of inequality, but we did not find anything to suggest that any one type of crime increased with inequality more severely than the others.
Here are the models we created, along with a summary of the results:
We can summarize our results as follows: according to the penalized logistic regression (lasso), the least effective measures of inequality were income and log-transformed property value, since their coefficients tended to be driven to zero by the LASSO (L1) penalty.
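A minimal sketch of the penalized fit with scikit-learn, using the names from the earlier sketches; standardizing first matters because the L1 penalty treats all coefficients on the same scale:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# An L1 (lasso) penalty drives the coefficients of uninformative predictors
# exactly to zero, which is how predictors like income can drop out.
lasso = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000),
)
lasso.fit(X_train, y_train)
coefs = lasso.named_steps["logisticregression"].coef_
print((coefs == 0).sum(), "coefficients regularized away")
```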
Now that we’ve investigated each measure of inequality and eliminated the ones which did not give much information, we would like to investigate which crimes they are correlated with. To do so, we flip the predictor-response paradigm of inequality measures predicting crimes, and instead use the occurrences of crimes to predict inequality measures! We do this for two reasons. First, the dummy variables for the occurrences of crimes can only be 0 or 1, so we can comfortably compare the regression coefficients on different types of crimes. (The scales of the Gini coefficient and the race data are not the same, and even if we normalized the different predictors, it still would not be clear that we could compare their numerical values, or the values of their associated slopes.) A higher slope then means a stronger correlation between a type of crime and a certain measure of inequality. Second, flipping responses and predictors does not change whether two variables are correlated, so the flipped regression answers the same question.
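A hedged sketch of one such flipped regression with statsmodels; crime_dummies (a DataFrame of 0/1 columns, one per crime type) and gini (the tract Gini coefficient) are hypothetical stand-ins for our actual variables:

```python
import statsmodels.api as sm

# Because every predictor is on the same 0/1 scale, the fitted slopes are
# directly comparable: a larger slope means a stronger association between
# that crime type and this measure of inequality.
X = sm.add_constant(crime_dummies)
flipped = sm.OLS(gini, X).fit()
print(flipped.params.sort_values(ascending=False))
```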
We found that the slopes of the crime-type predictors were always positive, meaning that an increase in the rate of any type of crime is associated with an increase in every measure of inequality, which makes sense. Below, we list the relative strengths of the correlations between crime types and each measure of inequality:
No single category of crime stood out as being associated with all or most measures of inequality more strongly than the others. We conclude that the occurrences of all types of crimes tend to increase with any measure of inequality, but that no one type of crime tends to increase more than the others.
In this section, we outline some of the modeling decisions we made, both when answering the lower-level questions and when working on the large model for predicting types of crime.
In the lower-level question about streetlights, we realized that the geographic distribution of anything relative to a fairly uniformly spaced grid (such as that of streetlights) is uniform in area but not in distance. That is because the derivative of the area of a circle with respect to its radius increases linearly with the radius. Lengths, however, are easy to interpret, so we decided that since streetlights should not matter for crimes committed during the day, we could use daytime crimes as a “null” or “placebo” group against which to compare crimes committed at night, for which distance to the nearest streetlight might in fact matter.
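To make the geometric point explicit: if crimes were scattered uniformly in area, the amount of area available at distance $r$ from a light grows as

$$\frac{d}{dr}\bigl(\pi r^{2}\bigr) = 2\pi r,$$

so even under this uniform null, crime counts would increase linearly with distance rather than staying flat; the daytime “placebo” group gives us an empirical baseline that sidesteps this geometric artifact.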
In the lower-level question about inequality, we had difficulty distinguishing how important each predictor was to the different types of crime. This is because the different predictors of inequality (for example, percentage of people in high-income housing and property values) are not directly comparable, even if we normalize them. For example, it is very reasonable for the percentage of people in high-income housing to approach zero in very poor neighborhoods; however, it is not reasonable for the percentage of low-education residents to approach zero in any neighborhood. We resolved this problem in a creative way by flipping the predictors with the responses and using the occurrences of different types of crimes to predict the measures of inequality. The probability that a certain crime occurs is directly comparable with the probability that another crime occurs, and flipping responses and predictors does not change whether they are correlated.
The trajectory of the project changed slightly when we decided to change the categories of crimes that we were predicting; we eliminated No_offense and Other as crime categories. This allowed us to train better and more interesting models, since the incidents (some not even crimes) that fell into the No_offense and Other categories had little in common with one another, even in the types of offenses we had chosen to put there. In contrast, the kinds of crimes we included in Death, for example, were at least superficially similar to each other. Indeed, we got higher accuracy after we eliminated the two “umbrella” categories.
In addition, in an attempt to improve upon the perceived poor performance of our baseline model, we decided to include many more predictors than originally intended. We added data on the racial makeup of the census tracts as well as more predictors of inequality, such as the percentage of people in high-income housing. We therefore had to add another section to our answer to the lower-level question about inequality: which measures of inequality matter?
Our best model was an optimized random forest, with an accuracy on the test set of 56.6%. We used a random grid search with cross-validation to find the best hyperparameters. The best parameters found led to only a very marginal improvement of the random forest model, from 56.3% to 56.6%.
We can evaluate our model by looking at a confusion matrix and ROC scores.
We plotted the ROC curve and its AUC for each class of crime, as well as the overall average. Here the AUC is calculated for detecting each class relative to all of the other classes combined (one-vs-rest). Class 0 is the Property class, which is by far the largest; our model had the most success correctly classifying Property crimes. Class 5, which hugs the lower corner, was the class for crimes involving death. This was by far the smallest class, with only 53 occurrences in the test set, and our model struggled to predict it given its rarity; however, the model was sophisticated enough to detect some true positives. On average, the model fell below the accuracy of a naive classifier, but this does not take into account the fact that we are trying to predict between seven different classes, not just two.
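A sketch of how these per-class one-vs-rest AUC scores can be computed, continuing with the forest and synthetic split from the earlier sketches:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.preprocessing import label_binarize

proba = forest.predict_proba(X_test)
y_bin = label_binarize(y_test, classes=list(range(7)))

# AUC for detecting each class relative to all other classes combined.
for k in range(7):
    print("class", k, "AUC:", roc_auc_score(y_bin[:, k], proba[:, k]))

print("macro-average AUC:", roc_auc_score(y_test, proba, multi_class="ovr"))
print(confusion_matrix(y_test, forest.predict(X_test)))
```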
From the random forest model we can also extract a ranking of feature importance. We combined the importance scores for all of the dummy-encoded day-of-the-week variables into a single score; this combined predictor was ranked second in importance. The top predictors were all temporal predictors derived from the original dataset, which makes sense given the importance of temporal predictors seen in our literature review. The next most important predictors are spatial predictors that we derived from the precise location of the crime. The AV_total and mean prop value predictors were derived from the conditions within a certain radius of the crime, and the distance metrics also appear to be important. None of the demographic predictors from the census data were particularly important. This may be because the census tracts are not granular enough, or because these are very general features that do not affect the types of crimes committed. We expect that these predictors could be important for predicting overall crime rates within a census tract. The extremely low importance of population can be explained by the fact that census tracts are deliberately drawn to have approximately the same population.
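A sketch of how the dummy-encoded day-of-week importances can be folded into one score; the column prefix DAY_OF_WEEK_ and the feature_names list are hypothetical stand-ins for our actual column names:

```python
import pandas as pd

imp = pd.Series(forest.feature_importances_, index=feature_names)

# Sum the importances of all dummy-encoded day-of-week columns into a
# single combined score, then rank it against the remaining predictors.
is_dow = imp.index.str.startswith("DAY_OF_WEEK_")
combined = imp[~is_dow].copy()
combined["DAY_OF_WEEK (combined)"] = imp[is_dow].sum()
print(combined.sort_values(ascending=False).head(10))
```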
In general, it is important to take systematic approaches to policing, and using data can allow law enforcement agencies to better serve their communities. The problem posed in this project is probably not the one of greatest interest to law enforcement, because it predicts types of crime from police reports in which the type of crime is already known. However, it could prove useful for supplementing officers’ intuition about the situation when responding to a call at a certain location at a certain time. Given that officers are typically experienced and well trained, and are likely to have domain-specific knowledge about recent trends in crime, gang hotspots, etc., we believe the additional value our model would add here is low. Another way our model could be used is for forecasting trends into the future: it could evaluate how the distribution of crimes will change as the city goes through changes in demographic breakdown and geographic features. For example, one could forecast how gentrification could affect the policing needs of a neighborhood.