Baltimore Crime Prediction
by
GitHub Repository for the capstone project: https://github.com/RahmanMonty/Data606Capstone
Phase 1
An unfortunate reality is the existence and frequency of crime. Baltimore is ranked the fourth most dangerous city in the United States. In 2019, residents had a 1 in 55 chance of becoming the victim of a violent crime (rape, sexual assault, robbery, assault, or murder). Predicting where and when crimes may occur offers the city a potential solution. Crime prediction matters to the community because it allows protective and preventative measures to be implemented, and understanding what crimes may occur, and when, gives government and community leaders a more accurate approach to correcting the root causes of the dilemma. For example, knowing that robberies occur at a greater frequency at a specific location can help the local government understand why it is happening and how to resolve it.
API endpoint from which real-time data will be pulled for the model: https://egis.baltimorecity.gov/egis/rest/services/GeoSpatialized_Tables/Part1_Crime/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json
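A minimal sketch of how this endpoint can be queried from Python. The query parameters mirror the URL above; the `requests` call at the bottom is guarded so the URL-building logic can be inspected on its own, and the `features`/`attributes` keys are the standard ArcGIS FeatureServer response shape:

```python
from urllib.parse import urlencode

BASE = ("https://egis.baltimorecity.gov/egis/rest/services/"
        "GeoSpatialized_Tables/Part1_Crime/FeatureServer/0/query")

def build_query_url(where="1=1", out_fields="*", out_sr=4326, fmt="json"):
    """Build the ArcGIS FeatureServer query URL shown above."""
    params = {"where": where, "outFields": out_fields,
              "outSR": out_sr, "f": fmt}
    return BASE + "?" + urlencode(params)

if __name__ == "__main__":
    import requests
    import pandas as pd

    resp = requests.get(build_query_url(), timeout=30)
    resp.raise_for_status()
    # ArcGIS returns {"features": [{"attributes": {...}, "geometry": {...}}, ...]}
    data = pd.json_normalize(resp.json()["features"])
    print(data.shape)
```

Flattening the nested JSON with `pd.json_normalize` is what produces the `attributes.*` and `geometry.*` column names listed later in this document.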
The data is pulled from Baltimore Open Data (https://data.baltimorecity.gov/), a public data hub that lets users observe, interact with, and analyze data about the city of Baltimore. If the historical data does not suffice to train the model, documented crime data from CSV files may be merged into the original dataset via a SQL-style join. The dataset contains integer, varchar, float, coordinate, and datetime columns.
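The SQL-style merge described above can be done directly in pandas. The tiny frames and column values below are illustrative stand-ins, not real records; note that stacking archives with the same schema is a union (`pd.concat`), while a keyed join would use `merge`:

```python
import pandas as pd

# Stand-in for rows pulled from the live API
api_df = pd.DataFrame({
    "RowID": [1, 2],
    "CrimeCode": ["4E", "6D"],
    "District": ["CENTRAL", "EASTERN"],
})

# Stand-in for a historical CSV export with the same schema
csv_df = pd.DataFrame({
    "RowID": [3, 4],
    "CrimeCode": ["3B", "4E"],
    "District": ["WESTERN", "CENTRAL"],
})

# Same columns -> a SQL UNION, done with concat
combined = pd.concat([api_df, csv_df], ignore_index=True)

# A true SQL-style join would instead key on a shared column, e.g.:
# merged = api_df.merge(lookup_df, on="CrimeCode", how="left")
print(len(combined))  # 4
```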
The unit of analysis for this project will be the individual crime incident: each row in the dataset is one reported encounter.
With Black Lives Matter at the peak of its movement in the United States, data sources and evidence regarding police interactions will be extremely useful in addressing the situations impacting our nation. A machine learning model that predicts and classifies crime occurrences can inform what type of resources to deploy for an interaction. For example, if the model predicts that domestic disputes or drug usage will be the majority occurrence for a month, the police department may add counselors or individuals specialized in conflict resolution. Dispatching a higher number of these specialized team members to an area can be of greater service than an officer who may potentially raise the level of tension. According to the American Community Survey (https://worldpopulationreview.com/us-cities/baltimore-md-population), the Baltimore, Maryland population is approximately 62.35% Black and African American. With Black Lives Matter protests historically and currently occurring in Baltimore, this majority demographic would greatly benefit from the model's results. If successful, this model could be applied to crime data from other cities, counties, and states to help lift the burden that our nation is facing.
Using the features present in the dataset (RowID, CrimeDateTime, CrimeCode, Location, Description, Inside_Outside, Weapon, Post, District, Neighborhood, Latitude, Longitude, GeoLocation, Premise, VRIName, Total_Incidents) I will create a model that can predict crime occurrence, location, and type. If the model is trained on enough data, it may predict the occurrence of a possible crime.
attributes.RowID int64
attributes.CrimeDateTime int64
attributes.CrimeCode object
attributes.Location object
attributes.Description object
attributes.Inside_Outside object
attributes.Weapon object
attributes.Post object
attributes.District object
attributes.Neighborhood object
attributes.Latitude float64
attributes.Longitude float64
attributes.GeoLocation object
attributes.Premise object
attributes.VRIName object
attributes.Total_Incidents int64
geometry.x float64
geometry.y float64
The Open Baltimore dataset API exposes location, date/time, description, indoor/outdoor, weapon, post, district, neighborhood, and geographical coordinates dating back to 2018. The API is constantly updated as new police entries are logged. It was published on March 4, 2021, is public record, and does not require a license for usage.
attributes.RowID Unique ID for encounter
attributes.CrimeDateTime Date and Time of encounter
attributes.CrimeCode Code for encounter
attributes.Location Address of encounter
attributes.Description Description of encounter
attributes.Inside_Outside If encounter occurred in or out
attributes.Weapon Weapon was involved Y/N
attributes.Post Code for location
attributes.District N E S W with respect to Baltimore
attributes.Neighborhood Neighborhood of encounter
attributes.Latitude Geographical Coordinate
attributes.Longitude Geographical Coordinate
attributes.GeoLocation Geographical Coordinate
attributes.Premise Description of encounter area
attributes.VRIName Violence Reduction Initiative area name (mostly NaN)
attributes.Total_Incidents Incident occurrence
geometry.x Geographical Coordinate
geometry.y Geographical Coordinate
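Note that CrimeDateTime arrives from the API as int64 (see the dtype listing above); ArcGIS FeatureServer timestamps are conventionally milliseconds since the Unix epoch, so a conversion is needed before any time-series work. A sketch with hard-coded example timestamps:

```python
import pandas as pd

# Two example epoch-millisecond timestamps (2021-01-01 and 2021-02-01 UTC)
df = pd.DataFrame({"CrimeDateTime": [1609459200000, 1612137600000]})

# Convert from milliseconds since the Unix epoch to pandas datetimes
df["CrimeDateTime"] = pd.to_datetime(df["CrimeDateTime"], unit="ms")
print(df["CrimeDateTime"].dt.year.tolist())  # [2021, 2021]
```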
Exploratory Data Analysis
import matplotlib.pyplot as plt
data['attributes.Description'].value_counts().plot(kind='bar')
plt.show()
Using Python we can see that the most frequent crime in Baltimore is larceny, followed by larceny from auto, then common assault, and so on. Even without any machine learning model, it can be presumed that our model will predict larceny as the primary crime. Next we will focus on the times, occurrence, and frequency of each offense description.
From the crime count with respect to year, we can see that crime levels are relatively balanced across years, and crimes appear to have declined from 2017 to 2020. 2021 is still in progress and is not fully representative of the year.
Taking a deeper look at the top five crimes (LARCENY 75,447; COMMON ASSAULT 58,026; BURGLARY 47,268; LARCENY FROM AUTO 43,748; AGG. ASSAULT 39,021), they all follow the same downward trend over the years.
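The per-year counts behind these observations can be reproduced with a groupby; the handful of synthetic rows below stand in for the real dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "CrimeDateTime": pd.to_datetime(
        ["2017-05-01", "2017-06-01", "2018-03-15", "2020-01-02"]),
    "Description": ["LARCENY", "BURGLARY", "LARCENY", "COMMON ASSAULT"],
})

# Total incidents per year
per_year = df.groupby(df["CrimeDateTime"].dt.year).size()
print(per_year.to_dict())  # {2017: 2, 2018: 1, 2020: 1}

# Incidents per year for one offense
sub = df[df["Description"] == "LARCENY"]
larceny_per_year = sub.groupby(sub["CrimeDateTime"].dt.year).size()
print(larceny_per_year.to_dict())  # {2017: 1, 2018: 1}
```

On the real data, plotting `per_year` for each of the top five descriptions produces the year-over-year trend plots discussed above.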
Where are you most likely to experience / witness a crime?
To dive deeper into my exploratory data analysis, I will plot the occurrence of crimes onto an actual map of Baltimore using geopandas, descartes, matplotlib, and CRS geometry. Because my dataset has latitude and longitude, I can create a GeoDataFrame that can be joined to my crime DataFrame. The merge turns the coordinates into points and polygons that can be mapped onto a shapefile of Baltimore.
I will then assign specific crimes to specific markers and colors to clearly identify in which locations of Baltimore particular crimes occur most frequently. This information will be useful later, when the model predicts where specific crimes are most likely to occur.
%%time
# Important library for many geopython libraries
!apt install gdal-bin python-gdal python3-gdal
# Install rtree - a Geopandas requirement
!apt install python3-rtree
# Install Geopandas (GitHub no longer serves git:// URLs, so use https)
!pip install git+https://github.com/geopandas/geopandas.git
# Install descartes - a Geopandas requirement
!pip install descartes
# Install Folium for Geographic data visualization
!pip install folium
# Install plotlyExpress
!pip install plotly_express
import geopandas as gpd
from shapely.geometry import Point, Polygon
Baltimore_Map = gpd.read_file("/content/Maryland_Baltimore_City_Neighborhoods.zip")
Baltimore_Map.head(50)
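The latitude/longitude join described above can be sketched as follows. The coordinate cleaning is plain pandas; the GeoDataFrame step is wrapped in a try/except since it assumes geopandas and shapely are installed, and the sample rows are illustrative:

```python
import pandas as pd

crimes = pd.DataFrame({
    "Description": ["LARCENY", "BURGLARY", "ROBBERY"],
    "Latitude":  [39.29, 0.0, 39.31],    # 0.0 stands in for a bad record
    "Longitude": [-76.61, 0.0, -76.62],
})

# Drop rows with missing or zero coordinates before mapping
valid = crimes[(crimes["Latitude"] != 0) & (crimes["Longitude"] != 0)].copy()

try:
    import geopandas as gpd
    from shapely.geometry import Point

    # Point takes (x, y) = (longitude, latitude)
    geometry = [Point(xy) for xy in zip(valid["Longitude"], valid["Latitude"])]
    # EPSG:4326 matches the outSR=4326 parameter in the API call
    gdf = gpd.GeoDataFrame(valid, geometry=geometry, crs="EPSG:4326")
    # gdf.plot(ax=Baltimore_Map.plot(), markersize=5) would overlay the points
except ImportError:
    gdf = None  # geopandas not available; mapping step skipped
```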
I was able to partially correct the GeoPandas error from the previous presentation and color-code the unique districts of Baltimore.
To further extend the exploratory data analysis I used a pivot table to visualize the count of unique crimes in Baltimore.
import numpy as np
import pandas as pd
crimes_count_date = df.pivot_table('CrimeCode', aggfunc=np.size, columns='Description', index=df["CrimeDateTime"], fill_value=0)
crimes_count_date.index = pd.DatetimeIndex(crimes_count_date.index)
crimes_count_date = crimes_count_date.sort_index()  # rolling() needs an ordered index
plo = crimes_count_date.rolling(365).sum().plot(figsize=(30, 30), subplots=True, layout=(-1, 3), sharex=False, sharey=False)
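On a small synthetic frame, the pivot above produces a date-by-offense count matrix like this (`aggfunc="count"` counts the non-null CrimeCode entries per cell):

```python
import pandas as pd

df = pd.DataFrame({
    "CrimeDateTime": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "CrimeCode": ["4E", "6D", "4E"],
    "Description": ["LARCENY", "BURGLARY", "LARCENY"],
})

counts = df.pivot_table(values="CrimeCode", aggfunc="count",
                        columns="Description", index="CrimeDateTime",
                        fill_value=0)
counts.index = pd.DatetimeIndex(counts.index)
print(counts)
#             BURGLARY  LARCENY
# 2020-01-01         1        1
# 2020-01-02         0        1
```

A 365-day rolling sum over such a matrix then smooths the daily counts into the annualized curves plotted above.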
I will be using the ARIMA model to generate my crime forecast and prediction.
Why ARIMA?
An ARIMA (autoregressive integrated moving average) model is a statistical model for data points generated at unique and specific times. In our Baltimore crime dataset, the majority of logged crimes are coupled with a DateTime occurrence. These points, together with their district location and offense, can be used to forecast future activity. When sorted into ascending order, a trend can be visualized via the ARIMA forecast plot. An ARIMA model uses four components to describe time-series data: trend (an upward or downward tendency), seasonality (a pattern that repeats over a fixed time period), irregularity (factors outside the dataset that may influence the results), and cyclic movement (longer-term oscillations in the graphed data). An ARIMA model is specified by three key parameters that determine its predictions: p, the number of autoregressive terms; d, the order of differencing needed to make the series stationary (flat); and q, the number of lagged forecast errors in the model.
The two plots above combine the forecast with the actual plot of our time-series data. From the Baltimore Police Department data, the ARIMA model forecasts an increase in crime, as seen in the upward tail of the graph. The forecasted portion of the graph is shaded in gray so it is easily distinguishable from the historic police department data.
Before getting to the model forecasting, I visualized the time series with two orders of differencing.
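The differencing itself can be sketched with pandas `.diff()`; each pass removes one order of trend, which is what the d parameter counts:

```python
import pandas as pd

s = pd.Series([10, 12, 15, 19, 24])  # steadily accelerating series

first = s.diff()           # [NaN, 2, 3, 4, 5] -> still trending upward
second = s.diff().diff()   # [NaN, NaN, 1, 1, 1] -> roughly flat (stationary)
print(second.dropna().tolist())  # [1.0, 1.0, 1.0]
```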
ARIMA Model Results
==============================================================================
Dep. Variable: D.Month No. Observations: 2372
Model: ARIMA(1, 1, 2) Log Likelihood -8079.028
Method: css-mle S.D. of innovations 7.291
Date: Thu, 22 Jul 2021 AIC 16168.055
Time: 00:03:39 BIC 16196.913
Sample: 1 HQIC 16178.560
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
const -0.0046 0.013 -0.365 0.715 -0.029 0.020
ar.L1.D.Month 0.3207 0.155 2.063 0.039 0.016 0.625
ma.L1.D.Month -1.0973 0.161 -6.829 0.000 -1.412 -0.782
ma.L2.D.Month 0.1544 0.145 1.062 0.288 -0.130 0.439
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 3.1180 +0.0000j 3.1180 0.0000
MA.1 1.0735 +0.0000j 1.0735 0.0000
MA.2 6.0347 +0.0000j 6.0347 0.0000
-----------------------------------------------------------------------------
<-- Model Summary
<-- This table is the coefficients table. "Values under the coef column are the weights of the respective terms" (https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/)
Using residuals.plot I was able to visualize the residuals' mean and variance. A mean close to 0 with roughly uniform variance is a good indicator.
I then used model_fit.plot_predict to plot the predicted series together with the actual series. If you look closely you can see a blue line: this is the forecast from our data, which we append to the actual data. The new piece, as seen above, is the forecasted data.
Standardized residual: the residuals float around an average of 0.
Histogram plus estimated density: again, this graph is centered around 0.
Normal Q-Q: the distribution is not skewed, because the blue dots line up with the red line.
Correlogram: also known as an ACF plot; it shows that the residual errors have little autocorrelation, meaning there is no remaining pattern in the data.
From our ARIMA model I was able to determine the forecast of crime activity. Because the CrimeDateTime column is not a complete set for the present year (2021), ending in July, the data appears to show a drop in crime counts per month. This is not because crime in the area is lessening, but because the year was still in progress when the data was pulled from the API. The ARIMA model will therefore predict that crime is trending downward, but this is not the case.
results.plot_predict(2250,2300)
Future Plans: Something I can do in the future to test the ARIMA model's accuracy would be to not test the model against future results, but rather compare it against crimes that have already occurred and see how accurately it predicts those values. For example, I could take months that have already occurred, such as June 2021 to July 2021, remove those dates from the DataFrame, and have the model predict what those levels of crime activity would have been. This way I will have a known versus "unknown" comparison to visualize the model's accuracy.
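This holdout idea can be sketched without the full ARIMA pipeline: hide the last k observations, predict them with any model, and score with mean absolute error. The counts, the choice of k, and the naive last-value baseline below are all illustrative assumptions; the real backtest would swap the baseline for the fitted ARIMA:

```python
import numpy as np

# Illustrative monthly incident counts
monthly_counts = np.array([820, 790, 805, 760, 770, 740, 735, 720], float)

k = 2  # hold out the last two months
train, test = monthly_counts[:-k], monthly_counts[-k:]

# Naive baseline: predict the last observed training value for each held-out month
preds = np.full(k, train[-1])

mae = np.mean(np.abs(preds - test))
print(mae)  # 12.5 incidents per month, on this toy series
```

Any candidate model that cannot beat this naive baseline's MAE on the held-out months is not adding predictive value.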
https://youtu.be/lE-kRzSZloE
https://www.sallerlaw.com/baltimore-crime-statistics/
http://web.engr.oregonstate.edu/~tgd/publications/mlsd-ssspr.pdf
https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
https://www.linkedin.com/in/monty-rahman-543397142/
https://github.com/RahmanMonty
https://github.com/RahmanMonty/Data606Capstone/tree/main