For this project we were given three data sets:
Our primary data set was the Crime Incident Reports from the Boston Police Department. It records the incidents that Boston Police responded to, including the type of crime, its location, and when it occurred, and it covers June 2015 to the present.
Our secondary data sets included the Property Assessment data, which describes the value and features of properties within Boston. We used this data set as a proxy for economic inequality. To round out our measure of inequality, we also used data from the US Census.
We also tried to incorporate locational data into our model, beginning with the Street Light Locations data. In addition, we used data detailing the locations of colleges and universities, public and private schools, public libraries, and Hubway stations.
All of our data was taken from the Analyze Boston website with the exception of the census data.
High level question: What geographic and socioeconomic factors are associated with which types of crimes? What are the associations (linear, etc.)?
Low level question 1: What is the relationship, if any, between types of crime and proximity of the crime to the nearest streetlight? What about proximity to other common landmarks such as libraries, schools, and Hubway stations?
Low level question 2: What is the relationship, if any, between inequality and types of crime? We use income, Gini coefficient, and property values as proxy measures of inequality.
To make our results more interpretable, we reduced the number of crime types. The Boston Police originally classify crimes into 68 categories; we collapsed these into seven overall crime categories (Property, Drugs, Private, Public, Force, Money, and Death), plus two residual categories, No_offense and Other.
The breakdown of crimes in our collapsed categories is shown to the left. We consider these categories reasonable because most of them are of similar size. We included crimes involving death even though their count was very low relative to the other crimes (fewer than 300 incidents). From a practical standpoint, predicting the rare crimes that result in a death is a worthwhile goal, given that this is the most serious type of crime.
To simplify prediction, we removed the no-offense and other categories. We were more interested in predicting the type of crime, given that a crime had occurred, than in predicting whether a crime had occurred at all.
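The collapsing and filtering steps can be sketched as follows. This is a minimal illustration, not our actual mapping: the offense group names, their assignment to categories, and the fallback to Other for unmapped groups are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical mapping from police offense groups to collapsed categories.
# The project mapped 68 groups; only a few illustrative entries are shown,
# and these particular assignments are assumptions.
COLLAPSE = {
    "Larceny": "Property",
    "Vandalism": "Property",
    "Drug Violation": "Drugs",
    "Fraud": "Money",
    "Homicide": "Death",
    "Investigate Person": "No_offense",
}

# Categories excluded before prediction.
DROPPED = {"No_offense", "Other"}


def collapse(offense_group):
    """Return the collapsed category; unmapped groups fall back to 'Other' (assumed)."""
    return COLLAPSE.get(offense_group, "Other")


def prepare_labels(offense_groups):
    """Collapse each offense group, then drop the excluded categories."""
    labels = [collapse(group) for group in offense_groups]
    return [label for label in labels if label not in DROPPED]


incidents = ["Larceny", "Drug Violation", "Investigate Person",
             "Homicide", "Towed", "Fraud", "Vandalism"]
labels = prepare_labels(incidents)
counts = Counter(labels)  # per-category counts, useful for checking class balance
```

Counting the labels after collapsing is a quick way to check how balanced the resulting categories are.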
For our secondary locational data sets, we were primarily concerned with the crime's distance in "degrees" to the nearest place of interest (school, library, Hubway station, etc.). We thought that considering only the nearest place of interest and ignoring all farther places would be reasonable, since most of these places are spaced out by design; thus, we expect only the closest site to affect the type of crime. For example, it is highly unlikely that someone committing a crime would be influenced by being in the vicinity of two schools instead of one. The distance in degrees to the nearest site of interest was usually quite small (perhaps a few thousandths), and the distribution of distances was right-skewed, with a long tail of larger distances. To spread out the small distances and make the distribution closer to normal, we log-transformed all distance variables. For example, see the graphs below:
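A minimal sketch of this feature, assuming a plain Euclidean distance on latitude/longitude pairs (which is what distance in "degrees" suggests). The coordinates are made up, and the small `eps` guard against taking the log of zero is our assumption for the example.

```python
import math


def nearest_distance_deg(point, sites):
    """Euclidean distance, in degrees, from a point to its closest site."""
    lat, lon = point
    return min(math.hypot(lat - s_lat, lon - s_lon) for s_lat, s_lon in sites)


def log_distance(point, sites, eps=1e-6):
    """Log-transformed nearest distance; eps (assumed) avoids log(0)
    when a crime is recorded exactly at a site."""
    return math.log(nearest_distance_deg(point, sites) + eps)


# Made-up coordinates in the Boston area.
crime = (42.3500, -71.0600)
libraries = [(42.3530, -71.0640), (42.3400, -71.0900)]

d = nearest_distance_deg(crime, libraries)   # a few thousandths of a degree
log_d = log_distance(crime, libraries)       # negative, since d is far below 1
```

Treating degrees of latitude and longitude as equal is a simplification, since a degree of longitude is shorter than a degree of latitude at Boston's latitude. For a full data set, a spatial index such as a k-d tree would avoid scanning every site for every crime.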
Our census data gave us the median income, median age, Gini coefficient (a measure of inequality), property values, and population density for each census tract. To use these data, we used an API to map the latitude and longitude of each crime to the corresponding census tract. In this way we were able to tie demographic information about the land and the population living in the area to each crime committed. Since census tracts are designed to hold roughly comparable populations (generally between 1,200 and 8,000 people each), this information is granular enough to be useful.
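The join can be sketched as below. The geocoding call is replaced by a hypothetical `lookup_block` function, and we assume it returns a 15-digit census block code, from which the tract identifier is the first 11 digits. The field names and values are made up for illustration.

```python
def tract_geoid(block_fips):
    """Tract GEOID = first 11 digits of a 15-digit census block code
    (2 state + 3 county + 6 tract digits; the last 4 identify the block)."""
    if len(block_fips) != 15 or not block_fips.isdigit():
        raise ValueError(f"expected a 15-digit block code, got {block_fips!r}")
    return block_fips[:11]


def attach_demographics(crimes, lookup_block, tract_table):
    """Add tract-level fields to each crime record.

    lookup_block(lat, lon) is a hypothetical stand-in for the geocoding API.
    Results are cached per rounded coordinate, so repeated locations cost
    only one lookup.
    """
    cache = {}
    joined = []
    for crime in crimes:
        key = (round(crime["lat"], 5), round(crime["lon"], 5))
        if key not in cache:
            cache[key] = tract_geoid(lookup_block(*key))
        geoid = cache[key]
        # Tracts missing from the table contribute no extra fields.
        joined.append({**crime, "tract": geoid, **tract_table.get(geoid, {})})
    return joined


# Made-up example data; 25025 is the state+county prefix for Suffolk County, MA.
calls = []


def fake_lookup(lat, lon):
    """Stand-in for the API: records the call and returns a fixed block code."""
    calls.append((lat, lon))
    return "250250101001000"


tract_table = {"25025010100": {"median_income": 50000, "gini": 0.45}}
crimes = [
    {"category": "Property", "lat": 42.35, "lon": -71.06},
    {"category": "Drugs", "lat": 42.35, "lon": -71.06},
]
rows = attach_demographics(crimes, fake_lookup, tract_table)
```

Caching matters here because many incidents share the same reported coordinates, so one lookup per distinct location keeps the number of API requests manageable.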
When reconciling the crime data with census tracts, we noticed that some tracts had no median income or property value recorded. We did not correct for this, and we do not expect it to be significant, since it affects only around 1% of the data.
At the end of our data cleaning and feature selection we ended up with 44 predictors and 343,965 observations. Our predictor variables are summarized below:
Here we summarize the interesting results from our EDA.
Certain types of crime did seem to have a relationship with time, namely hour of the day and day of the week. As the graph on the right shows, drug-related crimes seem to occur most frequently around 5:30 PM, while money-related crimes are more common around 12:00 PM. All crime types drop noticeably between 2:00 and 6:00 AM, and death-related crimes are less likely during daylight hours.
Below we see that the likelihood of four types of crime (death, drugs, money, public) changes over the course of the week. Notably, fewer drug, money, and public crimes happen on Saturday and Sunday, while death-related crimes seem to occur less often on Monday and Saturday but more often on Sunday. The frequency of the other types of crime (property, private, force, no offense, other) did not vary by day.
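Comparisons like these depend on normalizing within each category, since the categories differ in size and raw counts would mostly reflect volume. A minimal sketch of that computation, using made-up records and an assumed `(category, timestamp)` record layout:

```python
from collections import Counter, defaultdict
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def share_by(records, bucket_fn):
    """For each category, the fraction of its incidents in each time bucket."""
    counts = defaultdict(Counter)
    for category, timestamp in records:
        counts[category][bucket_fn(timestamp)] += 1
    return {
        category: {bucket: n / sum(c.values()) for bucket, n in c.items()}
        for category, c in counts.items()
    }


# Made-up incidents: (collapsed category, time of occurrence).
records = [
    ("Drugs", datetime(2017, 3, 6, 17, 30)),   # Monday
    ("Drugs", datetime(2017, 3, 7, 17, 45)),   # Tuesday
    ("Drugs", datetime(2017, 3, 11, 9, 0)),    # Saturday
    ("Money", datetime(2017, 3, 8, 12, 0)),    # Wednesday
]

by_hour = share_by(records, lambda t: t.hour)
by_day = share_by(records, lambda t: DAY_NAMES[t.weekday()])
```

Each category's shares sum to one, so the hourly or daily profile of a rare category such as death-related crime can be compared directly with that of a common one.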
Most of our locational data did not show significant differentiation between our 7 crime categories. As seen below, both private crimes and death-related crimes seem to be positively associated with greater distance from the nearest Hubway station as well as from the nearest college.
The majority of our crime categories did not differ significantly based on the racial makeup of an area. Census tracts with a high percentage of Asian residents had very low rates of crime across the board. Death-related crimes seemed more likely in areas with a higher percentage of Hispanic and Black residents. The rate of private crime goes up in areas with large Hispanic populations, and the rate of property crime increases in majority-White areas.
Private crimes and death-related crimes were also associated with areas with lower median income, a lower percentage of high-income households, a lower percentage of highly educated individuals, and, correspondingly, a higher percentage of individuals with little education.
Considering the Gini coefficient, we find that death-related crimes occur more often in tracts with a lower Gini value.
On the right we have a mapping of our crime categories across various census tracts in the city of Boston. Districts of note:
Unfortunately, the majority of our crime categories seem to differ very little from each other on many of our variables. The exceptions are death-related crimes and, to a lesser extent, private crimes, which do show noticeable differences in likelihood across some of our locational and demographic variables. We expect that the time variables might be the best predictors of crime type, as our EDA showed the most variation between categories by hour of the day and day of the week.