Baltimore Crime Prediction
by
GitHub Repository for the capstone project: https://github.com/RahmanMonty/Data606Capstone
Phase 1
An unfortunate reality is the existence and frequency of crime. Baltimore is ranked the fourth most dangerous city in the United States. In 2019, residents had a 1 in 55 chance of becoming the victim of a violent crime (rape, sexual assault, robbery, assault, or murder). Predicting where and when crimes may occur offers the city a potential solution. Crime prediction matters to the community because it allows protective and preventative measures to be implemented, and understanding what crimes may occur, and when, gives government and community leaders a more accurate approach to correcting the root causes of the dilemma. For example, knowing that robberies occur at a greater frequency at a specific location can help the local government understand why it is happening and how to resolve it.
API endpoint from which real-time data will be pulled for the model: https://egis.baltimorecity.gov/egis/rest/services/GeoSpatialized_Tables/Part1_Crime/FeatureServer/0/query?where=1%3D1&outFields=*&outSR=4326&f=json
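A minimal sketch of how this endpoint can be queried from Python. The query parameters mirror the URL above; the `requests` call at the bottom is guarded so the URL-building logic can be inspected on its own, and the `features`/`attributes` keys are the standard ArcGIS FeatureServer response shape:

```python
from urllib.parse import urlencode

BASE = ("https://egis.baltimorecity.gov/egis/rest/services/"
        "GeoSpatialized_Tables/Part1_Crime/FeatureServer/0/query")

def build_query_url(where="1=1", out_fields="*", out_sr=4326, fmt="json"):
    """Build the ArcGIS FeatureServer query URL shown above."""
    params = {"where": where, "outFields": out_fields,
              "outSR": out_sr, "f": fmt}
    return BASE + "?" + urlencode(params)

if __name__ == "__main__":
    import requests
    import pandas as pd

    resp = requests.get(build_query_url(), timeout=30)
    resp.raise_for_status()
    # ArcGIS returns {"features": [{"attributes": {...}, "geometry": {...}}, ...]}
    data = pd.json_normalize(resp.json()["features"])
    print(data.shape)
```

Flattening the nested JSON with `pd.json_normalize` is what produces the `attributes.*` and `geometry.*` column names listed later in this document.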
The data is pulled from Baltimore Open Data (https://data.baltimorecity.gov/), a public data hub that lets users observe, interact with, and analyze data about the city of Baltimore. If the historical data does not suffice to train the model, documented crime data from CSV files may be merged into the original dataset via a SQL-style join. The dataset contains integer, varchar, float, coordinate, and datetime columns.
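The SQL-style merge described above can be done directly in pandas. The tiny frames and column values below are illustrative stand-ins, not real records; note that stacking archives with the same schema is a union (`pd.concat`), while a keyed join would use `merge`:

```python
import pandas as pd

# Stand-in for rows pulled from the live API
api_df = pd.DataFrame({
    "RowID": [1, 2],
    "CrimeCode": ["4E", "6D"],
    "District": ["CENTRAL", "EASTERN"],
})

# Stand-in for a historical CSV export with the same schema
csv_df = pd.DataFrame({
    "RowID": [3, 4],
    "CrimeCode": ["3B", "4E"],
    "District": ["WESTERN", "CENTRAL"],
})

# Same columns -> a SQL UNION, done with concat
combined = pd.concat([api_df, csv_df], ignore_index=True)

# A true SQL-style join would instead key on a shared column, e.g.:
# merged = api_df.merge(lookup_df, on="CrimeCode", how="left")
print(len(combined))  # 4
```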
The unit of analysis for this project will be the individual crime incident: each row in the dataset is one reported encounter.
With Black Lives Matter at the peak of its movement in the United States, data sources and evidence regarding police interactions will be extremely useful in addressing the situations impacting our nation. A machine learning model that predicts and classifies crime occurrences can inform what type of resources to deploy for an interaction. For example, if the model predicts that domestic disputes or drug usage will be the majority occurrence for a month, the police department may add counselors or individuals specialized in conflict resolution. Dispatching a higher number of these specialized team members to an area can be of greater service than an officer who may potentially raise the level of tension. According to the American Community Survey (https://worldpopulationreview.com/us-cities/baltimore-md-population), the Baltimore, Maryland population is approximately 62.35% Black and African American. With Black Lives Matter protests historically and currently occurring in Baltimore, this majority demographic would greatly benefit from the model's results. If successful, this model could be applied to crime data from other cities, counties, and states to help lift the burden that our nation is facing.
Using the features present in the dataset (RowID, CrimeDateTime, CrimeCode, Location, Description, Inside_Outside, Weapon, Post, District, Neighborhood, Latitude, Longitude, GeoLocation, Premise, VRIName, Total_Incidents) I will create a model that can predict crime occurrence, location, and type. If the model is trained on enough data, it may predict the occurrence of a possible crime.
attributes.RowID int64
attributes.CrimeDateTime int64
attributes.CrimeCode object
attributes.Location object
attributes.Description object
attributes.Inside_Outside object
attributes.Weapon object
attributes.Post object
attributes.District object
attributes.Neighborhood object
attributes.Latitude float64
attributes.Longitude float64
attributes.GeoLocation object
attributes.Premise object
attributes.VRIName object
attributes.Total_Incidents int64
geometry.x float64
geometry.y float64
The Open Baltimore dataset API exposes location, date/time, description, indoor/outdoor, weapon, post, district, neighborhood, and geographical coordinates dating back to 2018. The API is constantly updated as new police entries are logged. It was published on March 4, 2021, is public record, and does not require a license for usage.
attributes.RowID Unique ID for encounter
attributes.CrimeDateTime Date and Time of encounter
attributes.CrimeCode Code for encounter
attributes.Location Address of encounter
attributes.Description Description of encounter
attributes.Inside_Outside If encounter occurred in or out
attributes.Weapon Weapon was involved Y/N
attributes.Post Code for location
attributes.District N E S W with respect to Baltimore
attributes.Neighborhood Neighborhood of encounter
attributes.Latitude Geographical Coordinate
attributes.Longitude Geographical Coordinate
attributes.GeoLocation Geographical Coordinate
attributes.Premise Description of encounter area
attributes.VRIName Violence Reduction Initiative area name (mostly NaN)
attributes.Total_Incidents Incident occurrence
geometry.x Geographical Coordinate
geometry.y Geographical Coordinate
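Note that CrimeDateTime arrives from the API as int64 (see the dtype listing above); ArcGIS FeatureServer timestamps are conventionally milliseconds since the Unix epoch, so a conversion is needed before any time-series work. A sketch with hard-coded example timestamps:

```python
import pandas as pd

# Two example epoch-millisecond timestamps (2021-01-01 and 2021-02-01 UTC)
df = pd.DataFrame({"CrimeDateTime": [1609459200000, 1612137600000]})

# Convert from milliseconds since the Unix epoch to pandas datetimes
df["CrimeDateTime"] = pd.to_datetime(df["CrimeDateTime"], unit="ms")
print(df["CrimeDateTime"].dt.year.tolist())  # [2021, 2021]
```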
Exploratory Data Analysis
import matplotlib.pyplot as plt
data['attributes.Description'].value_counts().plot(kind='bar')
plt.show()
Using Python we can see that the most frequent crime in Baltimore is larceny, followed by larceny from auto, then common assault, and so on. Even without any machine learning model, it can be presumed that our model will predict larceny as the primary crime. Next we will focus on the times, occurrence, and frequency of each offense description.
From the crime count with respect to year, we can see that crime levels are relatively balanced across years, and crimes appear to have declined from 2017 to 2020. 2021 is still in progress and is not fully representative of the year.
Taking a deeper look at the top five crimes (LARCENY 75,447; COMMON ASSAULT 58,026; BURGLARY 47,268; LARCENY FROM AUTO 43,748; AGG. ASSAULT 39,021), they all follow the same downward trend over the years.
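The per-year counts behind these observations can be reproduced with a groupby; the handful of synthetic rows below stand in for the real dataset:

```python
import pandas as pd

df = pd.DataFrame({
    "CrimeDateTime": pd.to_datetime(
        ["2017-05-01", "2017-06-01", "2018-03-15", "2020-01-02"]),
    "Description": ["LARCENY", "BURGLARY", "LARCENY", "COMMON ASSAULT"],
})

# Total incidents per year
per_year = df.groupby(df["CrimeDateTime"].dt.year).size()
print(per_year.to_dict())  # {2017: 2, 2018: 1, 2020: 1}

# Incidents per year for one offense
sub = df[df["Description"] == "LARCENY"]
larceny_per_year = sub.groupby(sub["CrimeDateTime"].dt.year).size()
print(larceny_per_year.to_dict())  # {2017: 1, 2018: 1}
```

On the real data, plotting `per_year` for each of the top five descriptions produces the year-over-year trend plots discussed above.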
Where are you most likely to experience / witness a crime?
To dive deeper into my exploratory data analysis, I will plot the occurrence of crimes onto an actual map of Baltimore using geopandas, descartes, matplotlib, and CRS geometry. Because my dataset has latitude and longitude, I can create a GeoDataFrame that can be joined to my crime DataFrame. The merge turns the coordinates into points and polygons that can be mapped onto a shapefile of Baltimore.
I will then assign specific crimes to specific markers and colors to clearly identify in which locations of Baltimore particular crimes occur most frequently. This information will be useful later, when the model predicts where specific crimes are most likely to occur.
%%time
# Important library for many geopython libraries
!apt install gdal-bin python-gdal python3-gdal
# Install rtree - a Geopandas requirement
!apt install python3-rtree
# Install Geopandas (GitHub no longer serves git:// URLs, so use https)
!pip install git+https://github.com/geopandas/geopandas.git
# Install descartes - a Geopandas requirement
!pip install descartes
# Install Folium for Geographic data visualization
!pip install folium
# Install plotlyExpress
!pip install plotly_express
import geopandas as gpd
from shapely.geometry import Point, Polygon
Baltimore_Map = gpd.read_file("/content/Maryland_Baltimore_City_Neighborhoods.zip")
Baltimore_Map.head(50)
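The latitude/longitude join described above can be sketched as follows. The coordinate cleaning is plain pandas; the GeoDataFrame step is wrapped in a try/except since it assumes geopandas and shapely are installed, and the sample rows are illustrative:

```python
import pandas as pd

crimes = pd.DataFrame({
    "Description": ["LARCENY", "BURGLARY", "ROBBERY"],
    "Latitude":  [39.29, 0.0, 39.31],    # 0.0 stands in for a bad record
    "Longitude": [-76.61, 0.0, -76.62],
})

# Drop rows with missing or zero coordinates before mapping
valid = crimes[(crimes["Latitude"] != 0) & (crimes["Longitude"] != 0)].copy()

try:
    import geopandas as gpd
    from shapely.geometry import Point

    # Point takes (x, y) = (longitude, latitude)
    geometry = [Point(xy) for xy in zip(valid["Longitude"], valid["Latitude"])]
    # EPSG:4326 matches the outSR=4326 parameter in the API call
    gdf = gpd.GeoDataFrame(valid, geometry=geometry, crs="EPSG:4326")
    # gdf.plot(ax=Baltimore_Map.plot(), markersize=5) would overlay the points
except ImportError:
    gdf = None  # geopandas not available; mapping step skipped
```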
I was able to partially correct the GeoPandas error from the previous presentation and color-code the unique districts of Baltimore.
To further extend the exploratory data analysis I used a pivot table to visualize the count of unique crimes in Baltimore.
import numpy as np
import pandas as pd
crimes_count_date = df.pivot_table('CrimeCode', aggfunc=np.size, columns='Description', index=df["CrimeDateTime"], fill_value=0)
crimes_count_date.index = pd.DatetimeIndex(crimes_count_date.index)
crimes_count_date = crimes_count_date.sort_index()  # rolling() needs an ordered index
plo = crimes_count_date.rolling(365).sum().plot(figsize=(30, 30), subplots=True, layout=(-1, 3), sharex=False, sharey=False)
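On a small synthetic frame, the pivot above produces a date-by-offense count matrix like this (`aggfunc="count"` counts the non-null CrimeCode entries per cell):

```python
import pandas as pd

df = pd.DataFrame({
    "CrimeDateTime": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "CrimeCode": ["4E", "6D", "4E"],
    "Description": ["LARCENY", "BURGLARY", "LARCENY"],
})

counts = df.pivot_table(values="CrimeCode", aggfunc="count",
                        columns="Description", index="CrimeDateTime",
                        fill_value=0)
counts.index = pd.DatetimeIndex(counts.index)
print(counts)
#             BURGLARY  LARCENY
# 2020-01-01         1        1
# 2020-01-02         0        1
```

A 365-day rolling sum over such a matrix then smooths the daily counts into the annualized curves plotted above.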
I will be using the ARIMA model to generate my crime forecast and prediction.
Why ARIMA?
An ARIMA (autoregressive integrated moving average) model is a statistical model for data points generated at unique and specific times. In our Baltimore crime dataset, the majority of logged crimes are coupled with a DateTime occurrence. These points, together with their district location and offense, can be used to forecast future activity. When sorted into ascending order, a trend can be visualized via the ARIMA forecast plot. An ARIMA model uses four components to describe time-series data: trend (an upward or downward tendency), seasonality (a pattern that repeats over a fixed time period), irregularity (factors outside the dataset that may influence the results), and cyclic movement (longer-term oscillations in the graphed data). An ARIMA model is specified by three key parameters that determine its predictions: p, the number of autoregressive terms; d, the order of differencing needed to make the series stationary (flat); and q, the number of lagged forecast errors in the model.
The two plots above combine the forecast with the actual plot of our time-series data. From the Baltimore Police Department data, the ARIMA model forecasts an increase in crime, as seen in the upward tail of the graph. The forecasted portion of the graph is shaded in gray so it is easily distinguishable from the historic police department data.
Before getting to the model forecasting, I visualized the time series with two orders of differencing.
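The differencing itself can be sketched with pandas `.diff()`; each pass removes one order of trend, which is what the d parameter counts:

```python
import pandas as pd

s = pd.Series([10, 12, 15, 19, 24])  # steadily accelerating series

first = s.diff()           # [NaN, 2, 3, 4, 5] -> still trending upward
second = s.diff().diff()   # [NaN, NaN, 1, 1, 1] -> roughly flat (stationary)
print(second.dropna().tolist())  # [1.0, 1.0, 1.0]
```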
ARIMA Model Results
==============================================================================
Dep. Variable: D.Month No. Observations: 2372
Model: ARIMA(1, 1, 2) Log Likelihood -8079.028
Method: css-mle S.D. of innovations 7.291
Date: Thu, 22 Jul 2021 AIC 16168.055
Time: 00:03:39 BIC 16196.913
Sample: 1 HQIC 16178.560
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
const -0.0046 0.013 -0.365 0.715 -0.029 0.020
ar.L1.D.Month 0.3207 0.155 2.063 0.039 0.016 0.625
ma.L1.D.Month -1.0973 0.161 -6.829 0.000 -1.412 -0.782
ma.L2.D.Month 0.1544 0.145 1.062 0.288 -0.130 0.439
Roots
=============================================================================
Real Imaginary Modulus Frequency
-----------------------------------------------------------------------------
AR.1 3.1180 +0.0000j 3.1180 0.0000
MA.1 1.0735 +0.0000j 1.0735 0.0000
MA.2 6.0347 +0.0000j 6.0347 0.0000
-----------------------------------------------------------------------------
<-- Model Summary
<-- This table is the coefficients table. "Values under the coef column are the weights of the respective terms" (https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/)
Using residuals.plot I was able to visualize the residuals' mean and variance. A mean close to 0 with roughly uniform variance is a good indicator.
I then used model_fit.plot_predict to plot the predicted series together with the actual series. If you look closely you can see a blue line: this is the forecast from our data, which we append to the actual data. The new piece, as seen above, is the forecasted data.
Standardized residual: the residuals float around an average of 0.
Histogram plus estimated density: again, this graph is centered around 0.
Normal Q-Q: the distribution is not skewed, because the blue dots line up with the red line.
Correlogram: also known as an ACF plot; it shows that the residual errors have little autocorrelation, meaning there is no remaining pattern in the data.
From our ARIMA model I was able to determine the forecast of crime activity. Because the CrimeDateTime column is not a complete set for the present year (2021), ending in July, the data appears to show a drop in crime counts per month. This is not because crime in the area is lessening, but because the year was still in progress when the data was pulled from the API. The ARIMA model will therefore predict that crime is trending downward, but this is not the case.
results.plot_predict(2250,2300)
Future Plans: Something I can do in the future to test the ARIMA model's accuracy would be to not test the model against future results, but rather compare it against crimes that have already occurred and see how accurately it predicts those values. For example, I could take months that have already occurred, such as June 2021 to July 2021, remove those dates from the DataFrame, and have the model predict what those levels of crime activity would have been. This way I will have a known versus "unknown" comparison to visualize the model's accuracy.
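This holdout idea can be sketched without the full ARIMA pipeline: hide the last k observations, predict them with any model, and score with mean absolute error. The counts, the choice of k, and the naive last-value baseline below are all illustrative assumptions; the real backtest would swap the baseline for the fitted ARIMA:

```python
import numpy as np

# Illustrative monthly incident counts
monthly_counts = np.array([820, 790, 805, 760, 770, 740, 735, 720], float)

k = 2  # hold out the last two months
train, test = monthly_counts[:-k], monthly_counts[-k:]

# Naive baseline: predict the last observed training value for each held-out month
preds = np.full(k, train[-1])

mae = np.mean(np.abs(preds - test))
print(mae)  # 12.5 incidents per month, on this toy series
```

Any candidate model that cannot beat this naive baseline's MAE on the held-out months is not adding predictive value.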
https://youtu.be/lE-kRzSZloE
https://www.sallerlaw.com/baltimore-crime-statistics/
http://web.engr.oregonstate.edu/~tgd/publications/mlsd-ssspr.pdf
https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/
https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf
https://www.linkedin.com/in/monty-rahman-543397142/
https://github.com/RahmanMonty
https://github.com/RahmanMonty/Data606Capstone/tree/main