The main goal of this capstone project is to predict what crime took place in a specific location in Baltimore City on a certain date and time in 2019 using predictive algorithms/models. I will conduct additional analysis on the dataset to determine seasonal trends and what areas are prone to crime since all of these trends could affect the algorithm’s predictions. I will create visualizations of the trends found in the data as well as map where and how many crimes took place in Baltimore City. After the algorithm has been fine-tuned and finalized, I will use it on the Chicago crime dataset in order to compare how well the algorithm/model did on a different set of data.
Dataset
The primary dataset I will be using is the BPD Part 1 Victim Based Crime Data [1] from data.baltimore.gov. This dataset accounts for reported crimes involving victims from 1963 to the present in the city of Baltimore. This dataset is updated with new information weekly. There are 292k rows and 16 columns. The columns include the crime date, crime time, crime code, location, description, inside/outside, weapon, post, district, neighborhood, longitude, latitude, location 1, premise, vri_name, and total incidents. The second dataset I will be using is the Crimes 2001 - Present [2] from data.cityofchicago.org. This dataset reflects reported incidents of crime, excluding murders, that occurred in Chicago from 2001 to the present. There are 7.06M rows and 22 columns. The columns include the ID, case number, date, block, IUCR, primary type, description, location description, arrest, domestic, beat, district, ward, community area, FBI code, X coordinate, Y coordinate, year, updated on, latitude, longitude, and location.
Methodology
First, I will clean and normalize both the Baltimore and Chicago crime datasets. Then, I will analyze and create visuals of the data trends as well as the crime maps for the Baltimore dataset. Next, I will try the XGBoost, Random Forest, KNN, and multivariate regression algorithms on the Baltimore dataset and record the initial accuracy results. From there, using techniques like GridSearchCV and feature engineering, I will refine the algorithms with the most promise. After fine-tuning, I will finalize which algorithm to use and test it on the Baltimore dataset. Lastly, I will use the algorithm on the Chicago dataset and compare the accuracy results against Baltimore's to see how well it does on a different dataset.
Literature/Industry Research and Outcomes
PredPol is a popular predictive policing software tool in the industry today that is similar to my capstone project. PredPol is a software tool created by scientists at UCLA that detects patterns of criminal behavior and helps police choose where and when to patrol. After generating its predictions, a red box is created on a Google Maps web interface representing the areas at highest risk for crime to occur. PredPol's predictive algorithm is ∂A/∂t = B + (ηD/4)∇²A - ωA + θωδ and uses 5 data points: incident identifier, crime/event type, location of incident, timestamp/time for incident, and record modified date/time for incident. This software is currently used in 60 police departments around the country.
Project Differentiation
PredPol predicts the area where crimes are most likely to occur and when, but they don’t predict what crime will potentially take place. My project will attempt to predict when (date and time), where (latitude and longitude), and what crime (burglary, arson, etc.) will potentially occur in Baltimore City. Also, my algorithm/methodology will differ from PredPol’s because it is a multi-target, multi-variate classification/prediction problem. I will use multiple algorithms at different stages to predict the above targets.
Transformation, Data Cleaning, and Exploratory Data Analysis
I cleaned and transformed both the Baltimore and Chicago crime datasets, but only performed EDA on the Baltimore crime dataset since it is the main focus of my project. In both datasets I handled null entries, data types, data formatting, and dropping columns not relevant to prediction. For EDA, I looked into the distribution of premises where crimes occurred, weapons used in crimes, seasonality trends in crime, and plotted where crime occurred multiple times for each year from 2014-2019. For seasonality trends, crime tends to start low in January, hits its lowest point in February, climbs starting in March, peaks in August, and then drops again around October.
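The monthly seasonality described above can be computed with a simple pandas groupby. This is a minimal sketch on toy data; the column name CrimeDate matches the Baltimore dataset, but the sample rows are illustrative, not real counts.

```python
import pandas as pd

# Hypothetical sample standing in for the cleaned Baltimore data;
# the real dataset has a CrimeDate column parsed as a datetime.
df = pd.DataFrame({
    "CrimeDate": pd.to_datetime(
        ["2019-02-01", "2019-02-15", "2019-08-03", "2019-08-10", "2019-08-21"]
    ),
    "Description": ["BURGLARY", "ARSON", "BURGLARY", "LARCENY", "ARSON"],
})

# Count crimes per calendar month to expose the seasonal pattern
# (February trough vs. August peak in the full dataset)
monthly = df.groupby(df["CrimeDate"].dt.month).size()
print(monthly)  # month 2 -> 2 incidents, month 8 -> 3 incidents
```

The resulting series can be plotted directly (e.g. `monthly.plot(kind="bar")`) to produce the monthly trend charts referenced below.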
To view the interactive versions of the maps and graphs, please see the Jupyter Notebook (Deliverable2.ipynb) in the Deliverable 2 Folder on GitHub.
Annual crime in Baltimore from 2014 - Present that Depicts a Seasonality Trend.
Total monthly crime counts in Baltimore summed from 2014 - Present.
Multiple occurrences of crimes in the same location mapped in Baltimore for 2019.
For the multiple incidents of crime by location (mapped by year), it was clear that crimes tend to cluster around the same areas; the same locations showed higher crime counts year after year. This is a static version of the map, but the interactive version, along with the 2014-2018 maps, is available in the Jupyter Notebook on GitHub (Deliverable2.ipynb).
Capstone Changes
Some major changes have come to my capstone as a result of Deliverable 3. I will no longer be predicting when (CrimeTime) crimes occur, only where and what crimes occur in Baltimore City for 2019. I came to this conclusion after trying multiple methods to classify time and obtaining less than 10% accuracy for almost every method. The various ways I tried to classify time are outlined below:
- As a third output in the Multiclass-Multioutput Random Forest Classification model:
  - Classifying CrimeTime as it was originally recorded
  - Classifying only the hour the crime occurred
  - Classifying a range like "early morning" (12a - 5a); this was the only method that reached an accuracy score of 30%
- With algorithms like KNN, Decision Tree, and SGD, using the Multiclass-Multioutput Random Forest Classification model's predictions for where and what as additional features:
  - Classifying CrimeTime as it was originally recorded
  - Classifying only the hour the crime occurred
- With algorithms like KNN, Decision Tree, and SGD, without the features for where and what:
  - Classifying CrimeTime as it was originally recorded
  - Classifying only the hour the crime occurred
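The hour and range targets above come from binning the raw CrimeTime strings. A minimal sketch, assuming a HH:MM:SS time format; the cut points other than "early morning" (12a - 5a) are illustrative, since the write-up only names that bucket explicitly.

```python
import pandas as pd

def hour_to_period(hour: int) -> str:
    """Bucket an hour (0-23) into a coarse time-of-day label.
    Only the 'early morning' range is taken from the write-up;
    the other boundaries are assumed for illustration."""
    if hour < 6:
        return "early morning"
    elif hour < 12:
        return "morning"
    elif hour < 18:
        return "afternoon"
    return "evening"

# Toy CrimeTime values; the real column comes from the Baltimore dataset
times = pd.Series(["03:15:00", "14:40:00", "23:05:00"])
hours = pd.to_datetime(times, format="%H:%M:%S").dt.hour
print(hours.map(hour_to_period).tolist())
# ['early morning', 'afternoon', 'evening']
```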
Data Preparation
Some of the data preparation steps were brought over from Deliverable 2, such as cleaning the data and handling null values. The remaining steps were splitting and encoding the data. I used the 2014 - 2018 data as the train set and the 2019 data as the test set. Once the data was split, I needed to encode it, since most of the data was text/categorical. I used LabelEncoder from Scikit-Learn to accomplish this; it maps each unique text value in a column to an integer from 0 to n_classes - 1, where n_classes is the number of unique values in that column. Once all my data was encoded, I split the training data only (2014 - 2018) into additional train and test splits, using Scikit-Learn's train_test_split, for model construction purposes. I did this so I could control the random state of the data during model construction and try to avoid overfitting during training.
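The encoding and year-based split can be sketched as follows. The column names and toy rows are stand-ins for the real cleaned dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the cleaned Baltimore data (column names assumed)
df = pd.DataFrame({
    "Year": [2014, 2016, 2018, 2019, 2019],
    "Description": ["BURGLARY", "ARSON", "BURGLARY", "LARCENY", "ARSON"],
    "Neighborhood": ["Downtown", "Fells Point", "Downtown", "Canton", "Downtown"],
})

# LabelEncoder maps each unique category to an integer 0..n_classes-1,
# one fitted encoder per column so labels can be decoded later
encoders = {}
for col in ["Description", "Neighborhood"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Time-based split: 2014-2018 for training, 2019 held out for testing
train = df[df["Year"] <= 2018]
test = df[df["Year"] == 2019]
print(len(train), len(test))  # 3 2
```

Keeping one encoder per column allows `inverse_transform` to recover the original text labels when reporting predictions.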
Model Construction
I needed to construct a model that could handle multiclass and multioutput labels. Multiclass means there are multiple options per label; for example, the crime type label could be burglary, arson, or homicide. Multioutput means there are multiple labels, such as classifying both what type of crime occurred and where it occurred. Scikit-Learn has a model, MultiOutputClassifier, which can handle multiclass-multioutput classification. It takes any classifier as input and produces the multioutput-multiclass classification predictions as output. I decided to use a Random Forest Classifier as the input to the MultiOutputClassifier because it trained very quickly and achieved strong accuracy results without any parameter tuning. A random forest classifier is composed of a large number of decision trees that act as an ensemble. Each decision tree makes a prediction for each label, and the class with the most votes becomes the model's prediction. It is a powerful algorithm because even when some of the trees make an incorrect prediction, the ensemble can still reach the correct conclusion, which improves accuracy.
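The model construction described above can be sketched in a few lines. The features and two-column target here are randomly generated stand-ins for the encoded Baltimore data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Toy encoded features/targets standing in for the real dataset.
# y has two columns -- crime type and neighborhood (multioutput) --
# each with several possible classes (multiclass).
rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = np.column_stack([
    rng.randint(0, 3, 100),   # crime type label
    rng.randint(0, 5, 100),   # neighborhood label
])

# Wrap a Random Forest so one fitted object predicts both targets
model = MultiOutputClassifier(RandomForestClassifier(random_state=42))
model.fit(X, y)
preds = model.predict(X)
print(preds.shape)  # (100, 2) -- one prediction per target per row
```

MultiOutputClassifier fits one clone of the base estimator per target column, which is what makes the what/where predictions come out side by side.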
Model Fine Tuning
I used GridSearchCV to tune the parameters of my Random Forest Classification model and feature engineering to improve the results of my model by adding more features. For GridSearchCV, I supplied three candidate values for each of four hyperparameters: n_estimators, max_depth, min_samples_leaf, and min_samples_split. GridSearchCV searched for the best results across the different combinations of these values. Once GridSearchCV completed, I passed the best hyperparameter values to my Random Forest Classification model. For feature engineering, I added a combined latitude/longitude feature as well as separate month, day, year, hour, minute, and second features. The combined latitude/longitude feature became the 5th most important feature in my model, while the month, day, year, hour, minute, and second features remained the least important. As a result, I kept only the combined latitude/longitude feature in my final model.
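A minimal GridSearchCV sketch over those four hyperparameters follows. The candidate values and the single toy target are assumptions for illustration; the write-up does not list the actual grid values, and in practice the best parameters would be refit on the full multioutput model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Toy data standing in for the encoded training set
rng = np.random.RandomState(0)
X = rng.rand(120, 4)
y = rng.randint(0, 3, 120)  # a single target keeps the sketch simple

# Three candidate values per tuned hyperparameter (values assumed)
param_grid = {
    "n_estimators": [10, 25, 50],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,       # 3-fold cross-validation per combination
    n_jobs=-1,  # evaluate combinations in parallel
)
search.fit(X, y)
print(search.best_params_)  # the winning combination of the 81 tried
```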
Accuracy Measures
Scikit-Learn does not currently provide accuracy measures for multiclass-multioutput classification models, so I had to define my own. For Deliverable 3 I defined three simple accuracy measures: overall accuracy, Hamming loss, and accuracy per label. Overall accuracy sums the correct labels and divides by the total number of labels; I compute this per element/prediction rather than per row, so a row does not need every element correct to count toward the score. Hamming loss is the sum of incorrect labels divided by the total number of labels, again counted per element. Accuracy per label is the number of correct predictions for a single label divided by the total number of predictions for that label, which lets me see how well the model predicts what and where crimes occur separately. For overall accuracy and accuracy per label, values closer to 1 are better; for Hamming loss, values closer to 0 are better.
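The three measures can be implemented directly on the prediction arrays. A minimal sketch, following the per-element definitions above:

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """Fraction of correct label cells over all rows and all targets."""
    return np.mean(y_true == y_pred)

def hamming_loss(y_true, y_pred):
    """Fraction of incorrect label cells; the complement of overall accuracy."""
    return np.mean(y_true != y_pred)

def accuracy_per_label(y_true, y_pred):
    """Accuracy computed separately for each target column."""
    return np.mean(y_true == y_pred, axis=0)

# Two targets (crime type, neighborhood) for four toy rows
y_true = np.array([[1, 0], [2, 1], [0, 1], [1, 2]])
y_pred = np.array([[1, 0], [2, 2], [0, 1], [1, 2]])

print(overall_accuracy(y_true, y_pred))    # 0.875  (7 of 8 cells correct)
print(hamming_loss(y_true, y_pred))        # 0.125
print(accuracy_per_label(y_true, y_pred))  # [1.   0.75]
```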
Trials
I tried many different things throughout Deliverable 3. Not all of them worked, but I learned a lot, and they ultimately led me to create a better model. Originally, I encoded my data with Scikit-Learn's ColumnTransformer using one-hot encoding, which turns each unique label into its own column: a row gets a 1 in that column if it has that feature value and a 0 if it does not. This turned my dataset into a huge sparse matrix, and I was only receiving around 60% accuracy. When I switched to using LabelEncoder alone, I received much better results.
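The difference between the two encodings can be seen on a single toy column. This sketch uses OneHotEncoder directly rather than wrapping it in a ColumnTransformer, which is enough to show the sparse-matrix blow-up described above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

col = pd.DataFrame({"Description": ["BURGLARY", "ARSON", "BURGLARY", "LARCENY"]})

# One-hot: each unique value becomes its own 0/1 column, yielding a
# sparse matrix that grows with the number of unique values
onehot = OneHotEncoder()
sparse = onehot.fit_transform(col)
print(sparse.shape)  # (4, 3) -- one column per unique crime description

# Label encoding: a single integer column, which worked better here
labels = LabelEncoder().fit_transform(col["Description"])
print(labels)  # [1 0 1 2]
```

On the real dataset, columns like neighborhood have hundreds of unique values, so one-hot encoding multiplies the feature count dramatically while label encoding keeps one column per feature.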
Another thing I tried was scaling my data, both when it was a sparse matrix and when it was label encoded only. I used MinMaxScaler on the label encoded data and MaxAbsScaler on the sparse matrix. Scaling neither hurt nor helped my model's results, which is why I decided not to use it.
Model Results
With the parameter tuning and feature engineering, I received the following results for my training and test sets (the full training and test sets, not the train/test split of only the training data).
Train Set (2014 - 2018)
Hamming Loss: 0.0089
Overall Accuracy: 0.9910
Description Accuracy: 0.9999
Neighborhood Accuracy: 0.9820
Test Set (2019)
Hamming Loss: 0.0326
Overall Accuracy: 0.9673
Description Accuracy: 1.0
Neighborhood Accuracy: 0.9347
I am happy with the results I received from my Multiclass-Multioutput Random Forest Classification model. It is easy to see that the neighborhood predictions were overfit on the training set; however, I am satisfied with 93% accuracy for neighborhood on the test set. There is room for improvement, but it is sufficient for now.
Conclusions and Next Steps
Overall, the Multiclass-Multioutput Random Forest Classification model classified the crime types and locations that occurred in Baltimore for 2019 very well. It was flexible and achieved good results on the Chicago dataset as well. However, the Chicago dataset was fairly unbalanced and larger, so some of the accuracy measures were misleading until I analyzed the classification report and confusion matrix. Next steps would be to raise the precision, recall, and f1-scores of the model on the Chicago dataset to achieve even more accurate classifications and better flexibility when applying the model to different and larger datasets.
Some next steps would be to take another look at data that are indicators of crime and use those datasets when training the model. Through doing this a model could be created that predicts the crime type and location for crimes that have not occurred yet. I do not think my current model would do well at this task and a different model would probably need to be created to be successful in this endeavor. A model that understands time would be a good starting point for predicting future crime types and locations.
Final Model Methodology
Clean both the Baltimore and Chicago datasets
Split the data into training sets (2014-2018 data) and test sets (2019 data) for both Baltimore and Chicago
Encode text/categorical data
Train the Multi-class Multi-output Random Forest Classification Model
Pass the trained model to GridSearchCV for parameter tuning
Use the finalized model with the best parameters to predict what crimes occur where in 2019
Assess the results using overall accuracy, hamming loss, accuracy per label, classification reports, and confusion matrices
Model Results for Baltimore and Chicago Datasets
I received the below results for the Baltimore and Chicago datasets classifying what and where crimes occurred in 2019:
Baltimore
Overall Accuracy: 0.9729
Hamming Loss: 0.0270
Crime Type Accuracy: 0.9974
Crime Location Accuracy: 0.9483
Chicago
Overall Accuracy: 0.9584
Hamming Loss: 0.0415
Crime Type Accuracy: 0.9532
Crime Location Accuracy: 0.9636
While I am happy with these results, and they show that my model is flexible enough to succeed on the Chicago dataset, the Chicago accuracy scores are misleading. Looking into the classification report and the confusion matrix helps to better understand the results.
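Because Scikit-Learn's report and matrix utilities expect one target at a time, each output column is evaluated on its own. A minimal sketch with toy single-target labels:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Toy predictions for a single target (e.g. crime type); on the real
# model, each column of the multioutput predictions is evaluated separately
y_true = np.array([0, 1, 1, 2, 2, 2])
y_pred = np.array([0, 1, 2, 2, 2, 1])

# Per-class precision, recall, and f1 -- this exposes weak classes that
# an overall accuracy number hides on an unbalanced dataset
print(classification_report(y_true, y_pred, zero_division=0))

# Rows are true classes, columns are predicted classes; off-diagonal
# cells show which classes the model confuses with one another
print(confusion_matrix(y_true, y_pred))
```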
Crime Maps for 2019 Crime Type and Location Descriptions
A map was generated for both the Baltimore and Chicago crime and location predictions for 2019. The map is colorized by the predicted location. In the notebook, you can hover over the interactive map to see the predicted crime for that location.
Classification Reports for Baltimore and Chicago Crime Type Predictions
Baltimore
Chicago
Overall, the model received high scores across all three metrics for classifying each type of crime. The only labels the model had trouble with were the various robbery labels. This makes sense since they are essentially the same crime with additional specificity as to what kind of robbery took place.
Chicago’s crime type classification report shows the varying levels of how well each label was classified by the model. Understandably, the model had trouble classifying certain crimes since there was very little data to learn from in the unbalanced dataset. Solutions to this would be to find additional indicators of these crimes to include in the training set as well as have a more balanced dataset to train the model on.
Confusion Matrices for Baltimore and Chicago Crime Type Predictions
Baltimore
Chicago
As we can see, the model had the most confusion labeling the different kinds of robberies as seen through the classification report. Otherwise, the model classified the types of crimes very well.
Through the confusion matrix we can see that the model had the most trouble predicting the "other offense" label. This is probably due to the ambiguous nature of this category: the crimes within it vary widely from one another while resembling other crime labels.