Project Report & Overview

1. Problem Statement and Motivation

Whether on a global, national or local scale, we constantly struggle to prevent and reduce crime. Boston, MA has a higher crime rate than the rest of Massachusetts and the United States as a whole, particularly for violent crimes. In this project, we analyze some of Boston’s demographic and geospatial features to determine whether they can help predict the likelihood of different types of crime. While stopping crime is extremely difficult, we hope that identifying some of its strongest underlying causes will help us develop more effective measures to reduce it.

2. Introduction and Description of Data

***For in-depth data exploration, see our Data section.

Our primary dataset covers around 400,000 reported crime occurrences in Boston between 2015 and 2019. We added several other datasets to use as predictor variables. See a list and description of those variables here. Each predictor dataset contains individual landmark points and their corresponding latitude and longitude coordinates.

Beyond our basic data cleaning, the Property Assessment dataset posed a particular challenge. This dataset lists all Boston area properties with their values and addresses, but does not include latitude and longitude data. To deal with this, we used a Boston Street Addresses dataset, which maps Property IDs from the Property Assessment dataset to coordinates. See our process here.
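The ID-to-coordinate lookup described above amounts to a join on the property ID. The following is a minimal sketch only, written with plain dictionaries to stay dependency-free. The field names `property_id`, `value`, `lat` and `lon` are hypothetical and do not come from the actual datasets.

```python
def attach_coordinates(properties, addresses):
    """Join property records to coordinates on a shared property ID.

    `properties` and `addresses` are lists of dicts. The keys used here
    ("property_id", "lat", "lon") are hypothetical placeholders, not the
    real column names of the Boston datasets.
    """
    # Build a lookup table: property ID -> (latitude, longitude).
    coords_by_id = {
        row["property_id"]: (row["lat"], row["lon"]) for row in addresses
    }

    matched, unmatched_ids = [], []
    for prop in properties:
        coords = coords_by_id.get(prop["property_id"])
        if coords is None:
            # Keep track of properties with no known coordinates.
            unmatched_ids.append(prop["property_id"])
            continue
        matched.append({**prop, "lat": coords[0], "lon": coords[1]})
    return matched, unmatched_ids


# Toy records, invented for illustration only.
properties = [
    {"property_id": "A1", "value": 500_000},
    {"property_id": "B2", "value": 750_000},
]
addresses = [{"property_id": "A1", "lat": 42.36, "lon": -71.06}]

matched, unmatched_ids = attach_coordinates(properties, addresses)
```

Returning the unmatched IDs separately makes it easy to check how many properties are lost in the join before they are used as predictors.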

For each predictor dataset we uploaded, we needed to calculate the distance from its landmarks to each crime based on their coordinates. We used a haversine function, adapted to operate on NumPy arrays, to accomplish this. See a full description of those calculations here.
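For reference, a haversine distance over NumPy arrays can be sketched as below. This is an illustration under stated assumptions, not the project's actual function: the function names, the mean Earth radius of 6371 km, and the 0.1 km counting radius are all choices made for this example.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in degrees.

    Accepts scalars or NumPy arrays and follows NumPy broadcasting rules.
    """
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2.0 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def landmark_features(crime_lat, crime_lon, lm_lat, lm_lon, radius_km=0.1):
    """Per crime: distance to the nearest landmark and count within a radius.

    `radius_km` is a hypothetical stand-in for a "one block" radius.
    """
    # Shape (n_crimes, 1) against (1, n_landmarks) broadcasts to a full
    # (n_crimes, n_landmarks) distance matrix.
    crime_lat = np.asarray(crime_lat, dtype=float)[:, None]
    crime_lon = np.asarray(crime_lon, dtype=float)[:, None]
    lm_lat = np.asarray(lm_lat, dtype=float)[None, :]
    lm_lon = np.asarray(lm_lon, dtype=float)[None, :]

    distances = haversine_km(crime_lat, crime_lon, lm_lat, lm_lon)
    nearest = distances.min(axis=1)
    count_within = (distances <= radius_km).sum(axis=1)
    return nearest, count_within


# Toy coordinates, invented for illustration only.
nearest, count_within = landmark_features(
    crime_lat=[42.36, 42.37],
    crime_lon=[-71.06, -71.06],
    lm_lat=[42.36, 42.50],
    lm_lon=[-71.06, -71.06],
)
```

The full distance matrix grows with the number of crimes times the number of landmarks, so for several hundred thousand crimes it would likely need to be computed in chunks of rows.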

3. Literature Review/Related Work

While researching this project, we came across a study called "Exploring the Link between Crime and Socio-Economic Status in Ottawa and Saskatoon: A Small-Area Geographical Analysis," written by Peter Kitchen, Ph.D. of Canada's Department of Justice. This study examined various socioeconomic determinants in neighborhoods throughout Ottawa and Saskatoon, and used them to model how crime rates vary between those neighborhoods.

At first, we considered applying a similar approach to the Boston crime data. We found datasets that outlined demographic details for each of Boston's neighborhoods. However, we abandoned this approach for a number of reasons. First, the Boston neighborhood data gave us a particularly small sample, as there are only around 20 neighborhoods in the dataset. While statistical modelling is still possible with so few observations, the results would be less reliable, and the data would violate many of the underlying assumptions of basic statistical models.

Furthermore, we felt it would be more meaningful to examine each crime individually. While neighborhood data is useful, neighborhoods are relatively large, and do not necessarily provide insight into the smaller regions around crimes. By looking at each individual crime, we could take a more granular approach in our analysis. For example, the total number of streetlights or the average property value in an entire neighborhood might not reflect those statistics in the immediate vicinity of a given crime. We therefore decided to treat each reported crime as an observation, rather than studying crime rates in each of Boston's neighborhoods.

4. Modeling Approach

***For in-depth modelling, see our Models section.

For our original categories as well as our new categories, we took a systematic modelling approach, starting with baseline models and moving on to more complex iterations. We trained and tested the following models on our datasets:

Simple Logistic Regression, classifying most likely crime type based on distance to the nearest streetlight.
Simple Logistic Regression, classifying most likely crime type based on the total number of streetlights within one block.
Multivariate Logistic Regression, classifying most likely crime type based on all of our predictors.
kNN Classification, classifying most likely crime type based on all of our predictors.
Ensemble and Tree Methods, including Decision Tree Classifiers with tree depth chosen by cross-validation, Bagging to bootstrap and aggregate trees, and Random Forest to reduce the issues that stem from highly correlated trees in the Bagging method.
Neural Networks, including a regularized Neural Network on our new categories dataset.
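The progression listed above, from baseline models to tree ensembles, could be sketched with scikit-learn as follows. This is an assumption-laden illustration: the report does not name the library used, the data here is synthetic placeholder data from `make_classification`, and every hyperparameter (number of neighbors, depth grid, number of trees) is chosen only for the example. Neural networks are left out of the sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the predictors and the 5 crime categories.
X, y = make_classification(
    n_samples=600, n_features=6, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)


def evaluate_models(X_train, X_test, y_train, y_test):
    """Fit each model family and return its test accuracy by name."""
    scores = {}

    # Simple logistic regression on a single predictor (column 0 here).
    simple = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    simple.fit(X_train[:, [0]], y_train)
    scores["simple_logistic"] = simple.score(X_test[:, [0]], y_test)

    candidates = {
        "multivariate_logistic": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=15)),
        # Tree depth chosen by 5-fold cross-validation on the training set.
        "decision_tree_cv": GridSearchCV(
            DecisionTreeClassifier(random_state=0),
            {"max_depth": [2, 4, 6, 8]},
            cv=5,
        ),
        "bagging": BaggingClassifier(
            DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
        ),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)
    return scores


scores = evaluate_models(X_train, X_test, y_train, y_test)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

Scaling is included for logistic regression and kNN because both are sensitive to predictor scale, while the tree-based models are not.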

After fitting our models, we calculated model accuracy on our pre-divided test datasets and analyzed variable importances based on the various models.

5. Project Trajectory, Results and Interpretation

***For in-depth results, see our Results section.

We initially built classification models based on 5 different crime types, then later revisited those categories and chose 5 new ones (see our motivation here). We trained and tested all of our models on the "Original" 5 categories and later on the "New" 5 categories.

The "original" categories were:

Violent Crimes
Larceny
Break-Ins
Property Damage
Drug-Related Crimes

The models we developed to predict the original categories showed little increase in accuracy over simply choosing the most common category. The best model, Random Forest, reached just above 51% accuracy, not much higher than the 46.65% baseline set by always predicting the most common category. These poor results added to our motivation to examine new crime type categories.
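The "most common category" baseline can be computed as in the sketch below. The function name is hypothetical and the labels are toy values invented for illustration, so the resulting number is unrelated to the figures reported here.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


# Toy labels, invented for illustration only.
train_labels = ["Larceny"] * 3 + ["Violent Crimes"] * 2
test_labels = ["Larceny", "Larceny", "Break-Ins", "Violent Crimes"]

label, baseline = majority_baseline_accuracy(train_labels, test_labels)
```

Any classifier needs to beat this number on the same test set before its accuracy says anything about the predictors.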

In predicting the original crime categories, police station distance was the most important variable in all models, followed by average property value. Though our models performed relatively poorly, this suggests that further study could help determine whether police station location is a determinant of crime type.

The "new" categories were:

Homicide
Landlord/Tenant Disputes
Verbal Disputes
Prostitution
Aggravated Assault

The models we developed to predict the new categories showed some increase in both test and train accuracy over the "most common category" baseline (59.06%). Our best model, the regularized Neural Network, reached 67.44% accuracy. Using the new categories provided a slight improvement in predictive power versus the original categories.

In predicting the new crime categories, average property value was the most important variable, followed by college and university distance. This could indicate that the prevalent crime types differ with property values, which are in turn an indicator of wealth and income disparity.

6. Conclusions and Future Work

For our original crime categories, our models performed relatively poorly, but they suggest that further study could help determine whether police station location is a determinant of crime type. If so, this could help cities place police stations strategically in order to minimize certain types of crime.

Our improved results for the new categories could indicate that the prevalent crime types differ with property values, which are in turn an indicator of wealth and income disparity. This could lend insight into the effects of income inequality as a whole on prevalent crime types, and could prove particularly useful in determining whether addressing income disparity itself could lead to a reduction in those types of crime. Similarly, differences in crime type near colleges and universities could signal to law enforcement what to prioritize around college campuses.
