San Francisco Crime Data Classification

Introduction

San Francisco is best known as a technology hub and a tourist destination. However, crime in the city is also increasing at an alarming rate. Our project has two aims: to predict the top locations and times for a given category of crime, and to classify the category of a crime given its time and location. These results could help the SF police understand the underlying patterns of crime and improve public safety by directing officers to specific locations at specific times. This supports better use of resources and helps reduce crime.

Dataset

The data is a crime data set containing records from 2001 to 2017. It has more than two million records and a total of 39 crime categories. The file can be retrieved from the link below.

Visualization

Tableau is used for visualizing the crime locations as shown in the image below.

Algorithms Used and Implementation

K-Means Clustering

We implemented K-means clustering to group latitudes and longitudes into K clusters. K-means is a simple unsupervised machine learning algorithm that groups a data set into a user-specified number (K) of clusters. We used the elbow method to determine the number of clusters K. The plot below shows distortion against the number of clusters. We chose 40 clusters based on the elbow method.
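The following is a minimal sketch of K-means with an elbow-method sweep, written in plain Python so it runs without Spark. It is not the project's actual code: the function names, the synthetic points, and the three hypothetical centre coordinates are assumptions made for illustration, and the sweep covers only K = 1 to 6 instead of the range that led to 40 clusters.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two (lat, lon) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm; returns (centroids, labels, distortion)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    labels = assign(points, centroids)
    # Distortion: sum of squared distances from each point to its centroid.
    distortion = sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, distortion

# Synthetic (lat, lon) points around three hypothetical centres.
rng = random.Random(42)
hypothetical_centers = [(37.77, -122.42), (37.73, -122.48), (37.80, -122.41)]
points = [
    (rng.gauss(lat, 0.005), rng.gauss(lon, 0.005))
    for lat, lon in hypothetical_centers
    for _ in range(50)
]

# Elbow method: compute the distortion for a range of K values.
distortions = {k: kmeans(points, k)[2] for k in range(1, 7)}
for k, d in distortions.items():
    print(k, d)
```

Plotting these distortions against K gives the elbow curve; the K at which the curve stops dropping sharply is the candidate number of clusters. The labels returned by kmeans are the cluster numbers that can stand in for raw coordinates.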

Elbow Method

Sample output of K-means. These cluster numbers are used in place of latitude and longitude to provide location information to the Naive Bayes algorithm.

San Francisco Map with clusters from K-Means Clustering

Naive Bayes

Naive Bayes classification is implemented to classify the category of crime that occurred, given the time and location. The inputs are the cluster number from the K-means algorithm, the day of the week, the time (hour), and the date (month). The Naive Bayes algorithm prints the accuracy and outputs the predicted labels.
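A minimal sketch of a categorical Naive Bayes classifier over these four inputs is given here in plain Python. It is an illustration under stated assumptions, not the project's code: the function names, the toy records, and the use of Laplace smoothing (alpha) are all assumptions, since the source does not say how zero counts are handled.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """rows: feature tuples (cluster, weekday, hour, month); labels: crime categories."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (feature index, category) -> value counts
    feature_values = defaultdict(set)      # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, feature_counts, feature_values

def predict(model, row, alpha=1.0):
    """Return the category with the highest log-posterior score for one row."""
    class_counts, feature_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, value in enumerate(row):
            # Laplace-smoothed conditional probability (assumed, not from the source).
            numerator = feature_counts[(i, label)][value] + alpha
            denominator = count + alpha * len(feature_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy records: (cluster number, weekday, hour, month). Values are made up.
rows = [
    (3, "Fri", 23, 7), (3, "Sat", 22, 7), (3, "Fri", 23, 8),
    (11, "Mon", 9, 1), (11, "Tue", 10, 1), (11, "Mon", 9, 2),
]
labels = ["ROBBERY", "ROBBERY", "ROBBERY", "FRAUD", "FRAUD", "FRAUD"]

model = train_naive_bayes(rows, labels)
predicted = [predict(model, row) for row in rows]
accuracy = sum(p == t for p, t in zip(predicted, labels)) / len(labels)
print("accuracy:", accuracy)
print("predicted labels:", predicted)
```

Accuracy here is measured on the toy training rows only; on real data it would be computed on a held-out split.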

The formula below is used in Naive Bayes classification to predict the category when the hour, weekday, month, and cluster number are known.
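In its standard form, and assuming the four inputs are conditionally independent given the category, the Naive Bayes decision rule can be written as follows. The symbols are notation introduced here for illustration: c is the crime category, h the hour, d the weekday, m the month, and k the cluster number.

```latex
\hat{c} = \arg\max_{c} \; P(c)\, P(h \mid c)\, P(d \mid c)\, P(m \mid c)\, P(k \mid c)
```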

We have developed a generative Naive Bayes model to predict the 10 most probable times and locations given a crime category. The formula below is used in Naive Bayes prediction when the category is given as input; it outputs the top 10 most probable clusters (locations) and hours for the occurrence of the crime.
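One way to read this generative step is to score every (cluster, hour) pair by P(cluster | category) · P(hour | category) and keep the highest-scoring pairs. The sketch below does that in plain Python. The function name, the toy records, and the choice to rank pairs jointly instead of ranking clusters and hours separately are assumptions; the source does not specify these details.

```python
from collections import Counter
from itertools import product

def top_locations_and_hours(records, category, n=10):
    """records: iterable of (category, cluster, hour).

    Returns up to n tuples (cluster, hour, score), highest score first, where
    score = P(cluster | category) * P(hour | category).
    """
    cluster_counts = Counter()
    hour_counts = Counter()
    total = 0
    for cat, cluster, hour in records:
        if cat == category:
            cluster_counts[cluster] += 1
            hour_counts[hour] += 1
            total += 1
    if total == 0:
        return []  # category never observed
    scored = [
        (cluster, hour, (cc / total) * (hc / total))
        for (cluster, cc), (hour, hc) in product(cluster_counts.items(), hour_counts.items())
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:n]

# Toy records: (category, cluster number, hour). Values are made up.
records = [
    ("ROBBERY", 3, 23), ("ROBBERY", 3, 23), ("ROBBERY", 3, 22), ("ROBBERY", 7, 23),
    ("FRAUD", 11, 9), ("FRAUD", 11, 10),
]

top = top_locations_and_hours(records, "ROBBERY")
for cluster, hour, score in top:
    print(cluster, hour, score)
```

With this toy input, cluster 3 at hour 23 ranks first because both values are the most frequent among the Robbery records.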

The results of our generative model for the Robbery category are shown below.

Output of the Naive Bayes algorithm predicting the top ten areas and times of crime, with the category as input

Challenges

Handling large CSV files.
Preprocessing data in which the delimiter ',' appeared inside field values.
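The delimiter problem can be sketched as follows, assuming the fields that contain commas are wrapped in double quotes. The column names and the sample row are hypothetical and are not taken from the actual data file. Python's standard csv module respects the quotes, whereas a plain split on ',' breaks such fields apart.

```python
import csv
import io

# Hypothetical sample: two quoted fields contain the delimiter ','.
sample = io.StringIO(
    'Category,Descript,DayOfWeek,Location\n'
    'ROBBERY,"ROBBERY, BODILY FORCE",Friday,"(37.77, -122.42)"\n'
)

# A naive split on ',' breaks the quoted fields into extra pieces.
naive_split = sample.getvalue().splitlines()[1].split(",")
print(len(naive_split), "pieces from a naive split")

# csv.DictReader honours the quotes and reads one row at a time,
# so a large file does not have to be held in memory all at once.
rows = []
for row in csv.DictReader(sample):
    rows.append(row)
print(rows[0]["Descript"])
```

Iterating over the reader row by row, as above, also keeps memory use flat when the input is a large file opened from disk.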

Evaluation Metrics

NB Accuracy, Spark MLlib Accuracy

Technologies Used

Python, Spark

Team Members and Responsibilities

Lavanya Ayila 800960067: K-means, Naive Bayes, Elbow Method, Data Preprocessing, Final Report

Rudhra Simha Balankari 800962658: K-means, Naive Bayes, Elbow Method, Data Preprocessing, Final Report

Likhith Chinnam 800986749: K-means, Naive Bayes, Elbow Method, Data Preprocessing, Final Report
