For the original categories, the most common crime type was Larceny, which made up 45.65% of the observations in our test dataset. A meaningful model therefore needs a test accuracy above 45.65%, the score that can be attained simply by classifying every crime as the most common category.
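The majority-class baseline described above can be sketched in a few lines. This is only an illustration: the function name and the toy label lists are invented for the example and are not the project's data.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy from predicting the most common training label for every test row."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return majority_label, hits / len(test_labels)


# Toy labels, invented for illustration only (not the crime dataset).
train = ["Larceny"] * 5 + ["Assault"] * 3 + ["Vandalism"] * 2
test = ["Larceny"] * 4 + ["Assault"] * 4 + ["Vandalism"] * 2

label, acc = majority_baseline_accuracy(train, test)
print(label, acc)
```

Any candidate model would then be judged by whether its test accuracy exceeds this value.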
Multivariate Logistic Regression
We started with a baseline multivariate logistic model based on streetlight distance.
The prediction accuracy was below the 45.65% test accuracy obtained by predicting the most common category for every observation. Streetlight distance therefore adds little or no predictive power for categorizing crime type.
Similarly, we fitted a logistic model on streetlight density, and later one that included all predictors. In both cases the results were as poor as for streetlight distance.
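To make the comparison against the baseline concrete, here is a minimal sketch of a multiclass logistic (softmax) regression on a single feature, written with the standard library only. It is not the implementation used in this project: the helper names, learning rate, epoch count, and synthetic data are all assumptions made for the example. The synthetic feature is deliberately unrelated to the label, which is the situation in which a model has no real chance of beating the majority-class baseline.

```python
import math
import random


def softmax(scores):
    """Convert raw class scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fit_softmax(xs, ys, n_classes, lr=0.1, epochs=200):
    """Fit one weight and one bias per class by batch gradient descent."""
    w = [0.0] * n_classes
    b = [0.0] * n_classes
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * n_classes
        grad_b = [0.0] * n_classes
        for x, y in zip(xs, ys):
            probs = softmax([w[c] * x + b[c] for c in range(n_classes)])
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_w[c] += err * x / n
                grad_b[c] += err / n
        w = [wc - lr * g for wc, g in zip(w, grad_w)]
        b = [bc - lr * g for bc, g in zip(b, grad_b)]
    return w, b


def predict(w, b, x):
    """Return the index of the highest-scoring class."""
    scores = [wc * x + bc for wc, bc in zip(w, b)]
    return scores.index(max(scores))


# Synthetic data (assumption): the feature carries no information about the label.
random.seed(0)
CLASS_WEIGHTS = [0.5, 0.3, 0.2]


def make_data(n):
    xs = [random.random() for _ in range(n)]
    ys = random.choices(range(3), weights=CLASS_WEIGHTS, k=n)
    return xs, ys


train_x, train_y = make_data(300)
test_x, test_y = make_data(150)

w, b = fit_softmax(train_x, train_y, n_classes=3)
model_acc = sum(predict(w, b, x) == y for x, y in zip(test_x, test_y)) / len(test_y)

majority = max(set(train_y), key=train_y.count)
baseline_acc = test_y.count(majority) / len(test_y)

print(f"model accuracy:    {model_acc:.3f}")
print(f"baseline accuracy: {baseline_acc:.3f}")
```

The two printed numbers are the quantities compared in this section: the model is only useful if its test accuracy is clearly above the baseline.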
kNN Model
We then tried kNN models, comparing results across several values of k to find which performed best.
Based on the above prediction accuracy values, the best prediction accuracy on the test dataset is obtained for k = 50. We give little weight to the training accuracy here, since choosing k by training accuracy would tend to select an overfitted model.
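The comparison across k values can be sketched as follows. This is a hand-written kNN on synthetic two-class data, using only the standard library. The function names, the candidate k values other than 50, and the data are assumptions for illustration and do not reproduce the project's results.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, point, k):
    """Predict the most common label among the k nearest training points."""
    nearest = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], point),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train_X, train_y, X, y, k):
    """Share of points in (X, y) whose label is predicted correctly."""
    hits = sum(knn_predict(train_X, train_y, p, k) == label for p, label in zip(X, y))
    return hits / len(y)


def make_points(n):
    """Synthetic data (assumption): two overlapping Gaussian clusters."""
    X, y = [], []
    for _ in range(n):
        label = random.choice(["A", "B"])
        centre = 0.0 if label == "A" else 1.5
        X.append((random.gauss(centre, 1.0), random.gauss(centre, 1.0)))
        y.append(label)
    return X, y


random.seed(0)
train_X, train_y = make_points(200)
test_X, test_y = make_points(100)

results = {}
for k in (1, 5, 25, 50):
    train_acc = accuracy(train_X, train_y, train_X, train_y, k)
    test_acc = accuracy(train_X, train_y, test_X, test_y, k)
    results[k] = (train_acc, test_acc)
    print(f"k={k:>2}  train={train_acc:.3f}  test={test_acc:.3f}")

# Pick k by test accuracy, as in the text above.
best_k = max(results, key=lambda k: results[k][1])
print("best k by test accuracy:", best_k)
```

With k = 1 every training point is its own nearest neighbour, so the training accuracy is perfect regardless of how well the model generalizes. This is why training accuracy is a poor guide for choosing k.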
Conclusion
Based on the above results, the baseline logistic and kNN models for the original categories, whether using streetlights alone or all predictors, did not yield meaningful predictions on either the training or the test dataset. We therefore moved on to different algorithms, including ensemble methods and neural networks.
Models for New Categories