Introduction
San Francisco is best known as a technology hub and a tourist destination. However, crime in the city is also increasing at an alarming rate. The aim of our project is twofold: to predict the top locations and times for a given category of crime, and to classify the category of a crime given its time and location. These results would help the SF police understand the underlying patterns of crime, and would aid public safety by alerting the police to target specific locations at specific times. This supports better use of resources and helps reduce crime.
Dataset
The data is a crime data set containing records from 2001 to 2017. It has more than two million records and a total of 39 crime categories. The file can be retrieved from the link below.
Visualization
Tableau is used for visualizing the crime locations as shown in the image below.
Algorithms Used and Implementation
K-Means Clustering
We implemented K-means clustering to group latitudes and longitudes into K clusters. K-means is a simple unsupervised machine learning algorithm that groups a data set into a user-specified number (K) of clusters. We used the elbow method to determine the number of clusters K, based on a plot of distortion against the number of clusters. Based on the elbow method, we chose 40 clusters.
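The clustering and elbow computation described above can be sketched in plain Python as follows. This is a minimal illustration, not the project's Spark implementation: the function names (`kmeans`, `assign`, `squared_distance`), the synthetic points, and the small range of K are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two (lat, lon) pairs."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]

def kmeans(points, k, iterations=50, seed=0):
    """Lloyd's algorithm. Returns (centroids, labels, distortion)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    labels = None
    for _ in range(iterations):
        new_labels = assign(points, centroids)
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    labels = assign(points, centroids)
    # Distortion: sum of squared distances from each point to its centroid.
    distortion = sum(
        squared_distance(p, centroids[lab]) for p, lab in zip(points, labels)
    )
    return centroids, labels, distortion

# Synthetic (hypothetical) coordinates: three blobs of 40 points each.
rng = random.Random(1)
blob_centers = [(37.78, -122.41), (37.76, -122.48), (37.72, -122.40)]
points = [
    (rng.gauss(cx, 0.003), rng.gauss(cy, 0.003))
    for cx, cy in blob_centers
    for _ in range(40)
]

# Elbow method: compute the distortion for each candidate K and look for
# the K after which the decrease flattens out.
distortions = {k: kmeans(points, k)[2] for k in range(1, 7)}
for k, d in distortions.items():
    print(k, d)
```

On the real data, the same loop would run over a wider range of K, and the distortion values would be plotted to locate the elbow.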
Elbow Method
Sample output of K-means. These cluster numbers are used instead of latitude and longitude to provide location information to the Naive Bayes algorithm.
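As a rough illustration of how a cluster number can stand in for a coordinate pair, the sketch below maps each record to the index of its nearest centroid. The centroid values, the record fields, and the helper name `nearest_cluster` are hypothetical and not taken from the project.

```python
def nearest_cluster(lat, lon, centroids):
    """Return the index of the centroid closest to (lat, lon)."""
    return min(
        range(len(centroids)),
        key=lambda j: (lat - centroids[j][0]) ** 2 + (lon - centroids[j][1]) ** 2,
    )

# Hypothetical centroids, as K-means might produce them.
centroids = [(37.78, -122.41), (37.76, -122.48), (37.73, -122.39)]

# Hypothetical records with raw coordinates.
records = [
    {"lat": 37.781, "lon": -122.412, "hour": 23},
    {"lat": 37.735, "lon": -122.392, "hour": 9},
]

# Replace the coordinate pair with a single categorical feature.
features = [
    {"cluster": nearest_cluster(r["lat"], r["lon"], centroids), "hour": r["hour"]}
    for r in records
]
```

Using a single categorical cluster number keeps the location input in the same discrete form as the hour, weekday, and month inputs.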
San Francisco Map with clusters from K-Means Clustering
Naive Bayes
Naive Bayes classification is implemented to classify the category of crime that occurred, given the time and location. The inputs are the cluster number from the K-means algorithm, the day of the week, the time (hour), and the date (month). The Naive Bayes algorithm prints its accuracy and outputs the predicted labels.
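A classifier over these four categorical inputs could look roughly like the sketch below. It is a small categorical Naive Bayes with Laplace smoothing, written in plain Python rather than Spark. The class name, the smoothing parameter `alpha`, and the toy rows are assumptions for illustration, and the project's own implementation may differ.

```python
import math
from collections import Counter, defaultdict

class CategoricalNaiveBayes:
    """Naive Bayes for categorical features with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumed; not from the source)

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.total = len(labels)
        # feature_counts[(label, feature index)][value] -> count
        self.feature_counts = defaultdict(Counter)
        self.feature_values = [set() for _ in rows[0]]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.feature_counts[(label, i)][value] += 1
                self.feature_values[i].add(value)
        return self

    def log_posterior(self, row, label):
        """Unnormalized log P(label | row) under the independence assumption."""
        score = math.log(self.class_counts[label] / self.total)
        for i, value in enumerate(row):
            count = self.feature_counts[(label, i)][value]
            denom = self.class_counts[label] + self.alpha * len(self.feature_values[i])
            score += math.log((count + self.alpha) / denom)
        return score

    def predict(self, row):
        return max(self.class_counts, key=lambda c: self.log_posterior(row, c))

# Toy rows: (cluster number, day of week, hour, month). Values are made up.
rows = [
    (3, "Fri", 23, 6), (3, "Sat", 22, 6), (3, "Fri", 23, 7),
    (7, "Mon", 9, 1), (7, "Tue", 10, 1), (7, "Mon", 9, 2),
]
labels = ["ROBBERY", "ROBBERY", "ROBBERY", "FRAUD", "FRAUD", "FRAUD"]

model = CategoricalNaiveBayes().fit(rows, labels)
predicted = [model.predict(r) for r in rows]
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
print("accuracy:", accuracy)
```

Working with log probabilities avoids numerical underflow when many small probabilities are multiplied together.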
The formula below is used in Naive Bayes classification to predict the category when the hour, weekday, month, and cluster number are known.
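For reference, the standard Naive Bayes decision rule for these four inputs can be written as follows. The symbols are notation chosen here and may differ from the project's own formula: c is a crime category, h the hour, d the day of the week, m the month, and k the cluster number.

```latex
\hat{c} = \arg\max_{c} \; P(c)\, P(h \mid c)\, P(d \mid c)\, P(m \mid c)\, P(k \mid c)
```

This follows from Bayes' rule, P(c | h, d, m, k) ∝ P(c) P(h, d, m, k | c), together with the naive assumption that the four inputs are conditionally independent given the category.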
We have developed a generative Naive Bayes model to predict the 10 most probable times and locations given a crime category. The formula below is used in Naive Bayes prediction when the category is given as input. It outputs the top 10 most probable clusters (locations) and hours for the occurrence of the crime.
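One way such a ranking could be computed is sketched below. Each (cluster, hour) pair is scored by P(cluster | category) × P(hour | category), which assumes that cluster and hour are conditionally independent given the category. The function name, the smoothing constant `alpha`, and the toy records are assumptions for illustration, not the project's actual code or data.

```python
from collections import Counter
from itertools import product

def top_locations_and_hours(records, category, n=10, alpha=1.0):
    """Rank (cluster, hour) pairs for a category.

    records: iterable of (category, cluster, hour) tuples.
    Returns up to n tuples of (probability, cluster, hour), best first.
    """
    clusters = sorted({r[1] for r in records})
    hours = sorted({r[2] for r in records})
    in_category = [r for r in records if r[0] == category]
    cluster_counts = Counter(r[1] for r in in_category)
    hour_counts = Counter(r[2] for r in in_category)
    total = len(in_category)

    def p_cluster(k):
        # Laplace-smoothed estimate of P(cluster | category)
        return (cluster_counts[k] + alpha) / (total + alpha * len(clusters))

    def p_hour(h):
        # Laplace-smoothed estimate of P(hour | category)
        return (hour_counts[h] + alpha) / (total + alpha * len(hours))

    scored = [(p_cluster(k) * p_hour(h), k, h) for k, h in product(clusters, hours)]
    scored.sort(key=lambda item: -item[0])  # stable sort, highest probability first
    return scored[:n]

# Toy records: (category, cluster number, hour). Values are made up.
toy_records = (
    [("ROBBERY", 3, 23)] * 3
    + [("ROBBERY", 3, 22), ("ROBBERY", 5, 23)]
    + [("FRAUD", 7, 9)] * 2
)

top = top_locations_and_hours(toy_records, "ROBBERY", n=3)
for probability, cluster, hour in top:
    print(cluster, hour, probability)
```

With the 40 clusters chosen above and 24 hours, this ranks 960 candidate pairs per category, so an exhaustive sort is inexpensive.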
Below are the results shown for our generative model for the category of Robbery.
Output of Naive Bayes Algorithm to predict the top ten areas of crime and time using category as input
Challenges
Evaluation Metrics
NB Accuracy, Spark MLlib Accuracy
Technologies Used
Python, Spark
Team Members and Responsibilities
Lavanya Ayila (800960067): K-means, Naive Bayes, Elbow Method, Data Preprocessing, Final Report
Rudhra Simha Balankari (800962658): K-means, Naive Bayes, Elbow Method, Data Preprocessing, Final Report
Likhith Chinnam (800986749): K-means, Naive Bayes, Elbow Method, Data Preprocessing, Final Report