We originally grouped multiple crime categories into five broad categories: Violent, Break-Ins, Larceny, Property Damage, and Drug Violations. We then filtered our dataset to include only those crimes.
We also encoded SHOOTING as a binary variable and added a NIGHT feature, which flags every crime committed between 9pm and 4am as a night-time crime:
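As a minimal sketch of these two features (not the code we ran), the snippet below assumes the raw SHOOTING value marks shootings with "Y" or 1 and leaves everything else blank, and that "between 9pm and 4am" means hours 21:00 through 03:59. The `OCCURRED` field name and the sample records are hypothetical.

```python
from datetime import datetime

def encode_shooting(raw):
    """Return 1 if the raw SHOOTING value marks a shooting, else 0.

    Assumption: shootings are marked "Y" (or 1); blank, None, or any
    other value means no shooting.
    """
    return 1 if str(raw).strip().upper() in {"Y", "1"} else 0

def is_night(timestamp):
    """Return 1 for crimes from 9pm up to (not including) 4am, else 0."""
    hour = timestamp.hour
    return 1 if hour >= 21 or hour < 4 else 0

# Hypothetical records standing in for rows of the crime dataset.
records = [
    {"SHOOTING": "Y", "OCCURRED": datetime(2019, 6, 1, 23, 15)},
    {"SHOOTING": None, "OCCURRED": datetime(2019, 6, 2, 14, 0)},
    {"SHOOTING": "", "OCCURRED": datetime(2019, 6, 3, 3, 59)},
]
for row in records:
    row["SHOOTING"] = encode_shooting(row["SHOOTING"])
    row["NIGHT"] = is_night(row["OCCURRED"])
```

Because the night window wraps past midnight, the check is an `or` of two conditions rather than a single range comparison.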
Next, we uploaded our Property Value data, grouped it by property ID (PID), and mapped those PIDs to coordinate data using the Boston Addresses dataset:
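The grouping and mapping step can be sketched roughly as follows. The field names, the sample PIDs, and the choice to average values that share a PID are all assumptions made for illustration; the actual aggregation may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical property-value rows: (PID, assessed value).
property_values = [
    ("0100001000", 450000.0),
    ("0100001000", 470000.0),
    ("0100002000", 820000.0),
    ("0100003000", 300000.0),  # no matching address below
]

# Hypothetical lookup built from the addresses data: PID -> (lat, lon).
address_coords = {
    "0100001000": (42.3601, -71.0589),
    "0100002000": (42.3493, -71.0782),
}

# Group values by PID (averaging duplicates is an assumption).
values_by_pid = defaultdict(list)
for pid, value in property_values:
    values_by_pid[pid].append(value)

# Attach coordinates, keeping only PIDs that appear in the address lookup.
properties = [
    {
        "PID": pid,
        "value": mean(values),
        "lat": address_coords[pid][0],
        "lon": address_coords[pid][1],
    }
    for pid, values in values_by_pid.items()
    if pid in address_coords
]
```

PIDs without a matching address are dropped here, since a property with no coordinates cannot contribute to a distance-based predictor.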
We then uploaded our other predictor datasets. We used a haversine function to calculate the distance between each crime and each predictor location. Each "distance" predictor gives the distance, in meters, from the crime to the nearest instance of that predictor. "Streetlight Density" is the total number of streetlights within an 80-meter (~one block) radius of the crime, and "Property Avg" is the average property value within a 240-meter (~three block) radius of the crime.
For the full calculations we used, see the Distance Calculations page.
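For readers who want the idea without the full page, here is a minimal sketch of the three kinds of predictor described above: nearest distance, count within a radius, and average value within a radius. It is a simple brute-force version, not the code from the Distance Calculations page; the helper names and the mean Earth radius of 6,371,000 m are assumptions.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (assumed value)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    # Clamp to 1 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(min(1.0, a)))

def nearest_distance_m(crime, points):
    """Distance in meters from the crime to the closest point."""
    return min(haversine_m(crime[0], crime[1], lat, lon) for lat, lon in points)

def count_within_m(crime, points, radius_m):
    """Number of points within radius_m of the crime (e.g. streetlights, 80 m)."""
    return sum(
        haversine_m(crime[0], crime[1], lat, lon) <= radius_m for lat, lon in points
    )

def average_value_within_m(crime, valued_points, radius_m):
    """Mean value of (lat, lon, value) points within radius_m, or None if none."""
    nearby = [
        value
        for lat, lon, value in valued_points
        if haversine_m(crime[0], crime[1], lat, lon) <= radius_m
    ]
    return sum(nearby) / len(nearby) if nearby else None

# Hypothetical coordinates for one crime and a few nearby features.
crime = (42.3600, -71.0600)
streetlights = [(42.3605, -71.0600), (42.3610, -71.0600)]
properties = [
    (42.3605, -71.0600, 400000.0),
    (42.3610, -71.0600, 600000.0),
    (42.3700, -71.0600, 1000000.0),
]

light_density = count_within_m(crime, streetlights, 80)
property_avg = average_value_within_m(crime, properties, 240)
```

Comparing every crime against every point is quadratic in the data size; on a large dataset a spatial index or a vectorized computation would likely be needed.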
After calculating all of the relevant distances, we added each predictor to our dataset:
We also split our data into train and test sets and encoded our crime types as integers from 0 to 4.
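A minimal sketch of this step is shown below. The 80/20 split, the random seed, the ordering of the integer codes, and the `crime_type` and `label` field names are all assumptions; only the five category names and the 0 to 4 range come from the description above.

```python
import random

# The five broad categories; the order (and so the codes) is an assumption.
CRIME_TYPES = ["Violent", "Break-Ins", "Larceny", "Property Damage", "Drug Violations"]
LABELS = {name: code for code, name in enumerate(CRIME_TYPES)}

def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical rows standing in for the full feature table.
rows = [{"crime_type": CRIME_TYPES[i % 5], "NIGHT": i % 2} for i in range(50)]
for row in rows:
    row["label"] = LABELS[row["crime_type"]]

train, test = train_test_split_rows(rows)
```

Fixing the seed keeps the split reproducible between runs, which matters when comparing models trained on the same data.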
Data Exploration for Alternative, New Categories