Hi, and welcome to a walkthrough of an extensive term project I created as part of a graduate course for my Master's degree. The goal of this project, as a course deliverable, was to exercise machine learning, data science, or general analytics skills on a crime-related dataset. I took inspiration from the various techniques highlighted in the course material on how to data mine or "model" crime and similar data in ways that are meaningful for law enforcement. As part of the same course, I also wrote an essay covering topics in policing, data analytics, and ethics.
The reason I stress "extensive" is that I was fortunate to find and work with four large, rich, real-world datasets from the NYPD, which I will discuss in a moment. For each of these, I set out to achieve a specific visualization or predictive modeling task, backed by intuitions I developed after spending quite some time staring at the datasets themselves. Since those intuitions are my own, the data could certainly be used for a different set of tasks, or my approaches could be cross-applied. I would love to read your suggestions if you have any.
All the code and additional files can be found at this GitHub repo. Here is an outline of the write-up and a summary of tasks for each of the four datasets:
1. Quick Introduction - Some NYPD history and Data
2. Arrests (EDA, Querying, Folium Viz)
3. Shooting Incidents (Clustering, Feature Engineering, Classification)
4. Complaints (Deep Learning via ConvLSTM2D)
5. Criminal Court Summons (EDA, Pivot tables, Bar chart race)
6. References
New York City is the most populous city in the United States. It has five boroughs: Queens, Brooklyn, Manhattan, the Bronx, and Staten Island. Both crime and policing share a deep history with NYC, and the NYPD is one of the oldest police departments in the country.

Jack Maple, who rose through the Transit Police to become a New York City deputy police commissioner for "crime control strategies," devised a revolutionary concept to bring down crime. When tasked with reducing robberies, Maple identified that the most violent of them were centered in the subways. Using a large map on the wall, he pinpointed the subway locations that were robbery hotspots, with the intention of tracking and discovering any underlying patterns. He called these maps "Charts of the Future." By placing officers at these subway locations, Maple could identify shifts in robbery activity to other locations, so that police officers could be dispatched in a "rapid response." Using his methods, such crime decreased dramatically.

This notion of visualizing and "detecting" plausible crime patterns was the birth of COMPSTAT (short for COMPuter STATistics, now COMParison STATistics), a program coined by Maple himself. It was a rebranding of the "Charts of the Future," with police officers across NYC collating and loading information on criminal activity into a computer database. It turned out to be a massive success in bringing down crime in an otherwise violent and mafia-ridden NYC. Today, COMPSTAT is not an independent technology or piece of software but a program, a series of underlying tools and mapping systems that mine recorded criminal data, used by police departments across North America. What started in NYC as a method to compile a statistical summary of weekly criminal activity grew into a crime quantification and computerization program for law enforcement (there is even a spin-off called TrafficStat that studies traffic violations).
Jack Maple died in 2001. He is informally remembered as "The Crime Fighter," which is also the title of the book he wrote.
But why the emphasis on COMPSTAT? Because a refurbished version of it is open to the public via an interactive online portal. The NYPD's history with data collection also paved the way for the department to make all of its criminal statistics public, whether through reports, dashboards, or datasets. For this project, I obtained four different incident-level citywide datasets from the NYPD's stats page: https://www1.nyc.gov/site/nypd/stats/stats.page, each available either as "Historic" (the entire data across many years, updated annually) or "Year to Date" (the current year through the most recent full quarter, updated quarterly).
The datasets are: Arrests, Shooting Incidents, Complaints, and Criminal Court Summons. I went with their historic versions, which span 2006 through the end of the previous calendar year.
Each record in the "Arrests" dataset represents an arrest in NYC by the NYPD and includes information about the type of crime, the location, and the time of enforcement. As the NYPD puts it, "this data can be used by the public to explore the nature of police enforcement activity," which is precisely what we are going to do.
Some details: The dataset size is 5.15 million rows by 19 columns, and each row represents an arrest. That's roughly 5 million arrests, each with attributes such as the nature of the offense, the date and time of the arrest, the internal classification code the NYPD uses to represent that offense, and more.
Objective: Is it possible to quickly localize and see where arrests are made based on custom factors? Answer: Yes.
I started with some EDA, finding the month in which the most arrests occurred, which happens to be March. Similarly, I determined which day of the week saw the most arrests: Wednesday. Another example is the distribution of arrests among the boroughs, where Manhattan and Brooklyn are the top two. All of these visualizations are helpful, but what if I wanted to see results like these per borough, per neighborhood, per offense type, per age group, per precinct, or a combination of these? I would need to filter the data with SQL-like queries, which is easy, but I would need additional help to reflect those results on a map of NYC, and that is where folium comes in.
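Counts like these fall out of a couple of pandas one-liners. Here is a minimal sketch on a toy stand-in for the arrests table; the column names `ARREST_DATE` and `ARREST_BORO` are modeled on the dataset's schema but are assumptions here:

```python
import pandas as pd

# Toy stand-in for the arrests CSV; ARREST_DATE / ARREST_BORO are assumed names
df = pd.DataFrame({
    "ARREST_DATE": ["03/15/2010", "03/02/2011", "06/09/2012", "03/21/2013"],
    "ARREST_BORO": ["K", "M", "M", "Q"],  # K=Brooklyn, M=Manhattan, Q=Queens
})
df["ARREST_DATE"] = pd.to_datetime(df["ARREST_DATE"], format="%m/%d/%Y")

# Month and weekday with the most arrests
busiest_month = df["ARREST_DATE"].dt.month_name().value_counts().idxmax()
busiest_day = df["ARREST_DATE"].dt.day_name().value_counts().idxmax()

# Arrest counts per borough
per_boro = df["ARREST_BORO"].value_counts()
print(busiest_month, busiest_day)
print(per_boro)
```

The same `value_counts` pattern generalizes to any of the categorical columns mentioned above.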
Folium is a library that allows you to manipulate data in Python and then visualize it on a JavaScript-backed map. The following heatmap was generated with folium when I queried the dataset to show all the criminal trespasses in Brooklyn. The folium output object can be viewed within an ipynb notebook or saved externally as an HTML file. In both cases, it is fully interactive. When zooming in on the orange clusters, you can see which streets in Brooklyn are more susceptible to this particular offense. It seems that as you move towards the bay, the intensity decreases.
The dataset comprises approximately 23,000 rows with 19 columns, and each record represents a shooting incident. Each shooting incident includes information about the event, the location, and the time of occurrence, as well as details on the demographics of the suspect and victim.
Some details:
Most shootings did not result in victim deaths; in fact, only around 4,000 of the ~23k incidents resulted in murder or death. This is denoted by a true or false attribute called STATISTICAL_MURDER_FLAG. Law enforcement agencies and emergency responders have limited resources in cities with high crime rates. If certain patterns or indicators are found to be associated with a higher likelihood of victim death, it can help identify early warning signs, enabling authorities to take proactive steps in high-risk situations.
So, my objective here was: given that a shooting has taken place in NYC, can you estimate the risk of that incident resulting in victim death and, through modeling, find the key attributes responsible for it? Answer: Surprisingly, yes. More details below.
Procedure applied:
Cluster the latitude and longitude coordinates into zones via k-means, to see if there is valuable spatial information that could help improve classification, for example, if certain areas in different boroughs have distinct, relevant characteristics. Feeding raw numerical coordinates into a model can throw off predictions, so I created zones instead.
Split the shooting date and timestamp into separate variables (month of the year, week of the month, day of the month, day of the week, and time of day) so each can be fed in. Ideally, it is advisable not to include year information, and I didn't: with any type of predictive modeling, you want the model to advise you about the future, and those future years are not in the dataset.
Employ a classification algorithm that handles categorical features natively without manual encoding - LightGBM or XGBoost. I went with LightGBM.
No treatment for imbalance, only NaN drops and the feature engineering above. There is no denying that the murder attribute is extremely imbalanced. But these are real incidents, and I was inclined to use the dataset as it is with no sampling and acknowledge any bias or skews via class weights and different evaluation metrics.
RESULT: Test classification accuracy of 78% and a surprisingly decent F1-score of 63% right off the bat, without any hyperparameter tuning or probability-threshold tweaks. The following figure shows the LightGBM feature importances in descending order. The precinct in which the shooting takes place turns out to be the most important criterion, which certainly hints that governance in a few precinct locations is under stress. Also surprising was the presence of the time attributes among the top 5. At first, I was dubious of this ranking, so I ran a classification test without them, and sure enough, performance went down. It turns out there is a correlation between victim deaths from shootings and when they happened. Plausible reasons? The time needed to reach a hospital? Delays in police response to the crime location? Do share your comments.
"Complaints" is by far the largest and most extensively investigated dataset among all the data that the NYPD has to offer. It encompasses all valid felony, misdemeanor, and violation crimes reported to the NYPD since 2006. The dataset has a substantial size of 7.38 million rows by 35 columns, with each row representing a complaint. That's a significant number of complaints! Some examples of its 35 unique attributes include an indicator of whether the crime was successfully completed, attempted but failed, or was interrupted prematurely, the level of offense, offense description, premise description, transit district, and the ages of both the victim and suspect, among others.

My objective here was to embark on a unique but not entirely unfamiliar endeavor. I wanted to explore the possibility of predicting where and when future complaints might occur by studying the complaints that have occurred at specific locations and during specific time periods. Human choices are often influenced by the environment, and the environment can take on multiple definitions. With this attempt, I set out to investigate whether that environment could be defined as time and space.
For example, if there is a particular time of day when a specific grocery store in NYC is susceptible to grand larceny attacks, it would be invaluable to identify that specific time-location pair. This knowledge could then be used to prevent such robberies and understand why that pairing occurs in the first place. An in-depth crash prediction tutorial served as a significant source of inspiration for this endeavor, and I encourage you to read it. This tutorial explains the deep learning intuition behind ConvLSTM2D networks and how they can be used to model time and space, which aligns with the problem I've defined here.
Objective: Predict when and where complaints are likely to happen
Procedure applied:
Employ deep learning via ConvLSTM2D, a neural network architecture in which convolutions from a CNN are passed through a Long Short-Term Memory network. This combination theoretically enables the extraction of spatio-temporal correlations.
Discretized latitude and longitude coordinate attributes by creating an n x n grid.
Calculated all the days from January 1, 2006, until the final date in the dataset, marking when a complaint occurred or didn't occur as True or False events.
Mapped the above events with the coordinate grid to create a multi-dimensional model input.
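The data-preparation steps above can be sketched with NumPy and pandas. This is a toy illustration on synthetic complaints, not the project's actual code: the grid size, window length, and coordinate bounds are arbitrary assumptions, and the output tensor is shaped `(samples, timesteps, rows, cols, 1)` as expected by a Keras `ConvLSTM2D` layer:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_grid, n_events = 16, 2000

# Synthetic complaints: a date and a coordinate per record (assumed schema)
dates = pd.to_datetime("2006-01-01") + pd.to_timedelta(
    rng.integers(0, 365, n_events), unit="D")
lat = rng.uniform(40.50, 40.92, n_events)
lon = rng.uniform(-74.26, -73.70, n_events)

# Discretize coordinates into an n x n grid
row = np.digitize(lat, np.linspace(40.50, 40.92, n_grid + 1)[1:-1])
col = np.digitize(lon, np.linspace(-74.26, -73.70, n_grid + 1)[1:-1])

# One boolean n x n frame per day: True where at least one complaint occurred
days = np.asarray((dates - dates.min()).days)
frames = np.zeros((days.max() + 1, n_grid, n_grid), dtype=bool)
frames[days, row, col] = True

# Sliding windows of `timesteps` past days form the model input
timesteps = 7
X = np.stack([frames[i:i + timesteps] for i in range(len(frames) - timesteps)])
y = frames[timesteps:]                      # next-day grid to predict
X = X[..., np.newaxis].astype("float32")    # (samples, time, rows, cols, 1)
print(X.shape, y.shape)
```

Each training sample is therefore a short "movie" of daily complaint grids, with the following day's grid as the target.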
RESULT: Achieved a test classification accuracy of 93%. The test data's ground truth and predictions are shown below, with each yellow dot representing a complaint that occurred at a specific time and place. The visualizations are even clearer when observed in 3D, where the y-axis represents different time periods. Such a result demonstrates the ability to learn past spatio-temporal patterns to predict the future presence of complaints. More than that, the accuracy tells us there is in fact a strong relationship between location and time in estimating complaints.
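For concreteness, a network of the kind described above can be defined in Keras roughly as follows. This is a minimal sketch, not the architecture actually trained here; the layer sizes are arbitrary, and the final 1x1 convolution squashes the learned features into a per-cell complaint probability:

```python
import numpy as np
from tensorflow.keras import layers, models

timesteps, n_grid = 7, 16  # assumed window length and grid size

# Stacked ConvLSTM2D layers map `timesteps` past daily grids
# to a next-day probability grid
model = models.Sequential([
    layers.Input(shape=(timesteps, n_grid, n_grid, 1)),
    layers.ConvLSTM2D(16, kernel_size=3, padding="same",
                      return_sequences=True),
    layers.ConvLSTM2D(8, kernel_size=3, padding="same"),
    layers.Conv2D(1, kernel_size=1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Sanity check: one forward pass on dummy input
dummy = np.zeros((2, timesteps, n_grid, n_grid, 1), dtype="float32")
print(model.predict(dummy, verbose=0).shape)
```

Binary cross-entropy fits the True/False "complaint occurred" framing, though with events this sparse, metrics beyond raw accuracy are worth tracking as well.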
The "Criminal Court Summons" dataset contains a list of every recorded criminal summons issued in New York City. The dataset comprises approximately 5 million rows and 17 attributes, with each row representing a criminal court summons, meaning that someone violated a certain law, leading them into the NYC criminal court system for due process. Among the 17 attributes, several are legal in nature, including the law section number, the law description, the summons category (a general description of the violation), and the jurisdiction code (a code indicating which jurisdiction was responsible for the issued violation).
My objective was to determine which laws were violated the most in NYC over time, from 2006 to 2020. While static charts can be created, they lack a dynamic component for observing the violations in motion over the years. To address this, I focused on the top 20 law section numbers by frequency and created a pivot table from the entire dataset: the index is the year, parsed from the incident timestamp; the columns are the law section numbers; and the values are the number of times each law was violated in a given year, which is easy to compute. To create the desired visualization, I used a Python library called "bar chart race," which works exceptionally well with pivot tables. The result was an engaging animation of a kind you might find familiar if you have browsed Reddit.
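The pivot-table step looks roughly like this. The toy data and column names (`YEAR`, `LAW_SECTION_NUMBER`) are assumptions modeled on the description above:

```python
import pandas as pd

# Toy stand-in for the summons data with the year already parsed out
df = pd.DataFrame({
    "YEAR": [2006, 2006, 2006, 2007, 2007, 2007],
    "LAW_SECTION_NUMBER": ["10-125", "10-125", "221.05",
                           "10-125", "221.05", "221.05"],
})

# Rows = years, columns = law sections, values = violation counts
pivot = df.pivot_table(index="YEAR", columns="LAW_SECTION_NUMBER",
                       aggfunc="size", fill_value=0)
print(pivot)

# With the bar_chart_race package installed, the animation is one call:
# import bar_chart_race as bcr
# bcr.bar_chart_race(df=pivot, filename="summons_race.mp4", n_bars=20)
```

The library animates each row of the pivot table as one frame, so the years play back in sequence.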
RESULTS: The most frequent violations occurred under laws 10-125 and 9999. Note that 9999 is not an actual law section but a code used in the dataset to denote various offenses such as traffic violations and indecent acts like public urination. Notably, there was a sharp decline in violations under law 10-125 starting in 2016: that year, NYC decriminalized public alcohol consumption, resulting in fines instead of court appearances. Since 2016, laws 221.05 and CFR 49 have taken the top spots. Law 221.05 pertains to unlawful possession of marijuana, while CFR 49 relates to the transport of hazardous materials, which, in the NYPD's context, encompasses drugs, flammables, and dangerous weapons.
https://www1.nyc.gov/site/nypd/stats/crime-statistics/citywide-crime-stats.page
https://data.cityofnewyork.us/Public-Safety/NYPD-Arrests-Data-Historic-/8h9b-rp9u
https://data.cityofnewyork.us/Public-Safety/NYPD-Criminal-Court-Summons-Historic-/sv2w-rv3k
https://data.cityofnewyork.us/Public-Safety/NYPD-Shooting-Incident-Data-Historic-/833y-fsy8
https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i
https://towardsdatascience.com/spatial-temporal-convlstm-for-crash-prediction-411909ed2cfa
https://towardsdatascience.com/creating-bar-chart-race-animation-with-python-cdb01144074e
https://ypdcrime.com/penal.law/
https://codelibrary.amlegal.com/codes/newyorkcity/latest/NYCadmin/0-0-0-1
https://en.wikipedia.org/wiki/New_York_City_Police_Department
https://python-visualization.github.io/folium/
https://www.dexplo.org/bar_chart_race/
https://levelup.gitconnected.com/clustering-gps-co-ordinates-forming-regions-4f50caa7e4a1
https://lightgbm.readthedocs.io/en/latest/Parameters.html
https://geo511-2019.github.io/2019-geo511-project-Stella-Liao/
https://en.wikipedia.org/wiki/PredPol
http://repositori.uji.es/xmlui/bitstream/handle/10234/192286/09252093.pdf?sequence=1
https://www.kaggle.com/code/brunacmendes/new-york-crime-analysis/notebook