For the new categories, the most common crime type was Verbal Disputes, which made up 59.06% of the observations in our test dataset. A meaningful model therefore needs an accuracy score above 59.06%, since that score can be attained simply by classifying every crime as the most common category.
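The baseline described above can be computed directly from the label counts. The following is a minimal sketch on toy labels, not the project's data: the function name `majority_baseline_accuracy` and the placeholder category "Other" are assumptions made for illustration.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, correct / len(test_labels)

# Toy labels for illustration only ("Other" is a placeholder category).
train_labels = ["Verbal Disputes"] * 6 + ["Other"] * 4
test_labels = ["Verbal Disputes"] * 3 + ["Other"] * 2

label, baseline = majority_baseline_accuracy(train_labels, test_labels)
print(label, baseline)
```

Any model whose test accuracy does not exceed this number is doing no better than ignoring the predictors altogether.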
Multivariate Logistic Regression
We started with a baseline multivariate logistic model that used streetlight distance as its predictor.
The prediction accuracy on the test set was 59.06%, identical to the score obtained by predicting the most common category for every observation. Streetlight distance therefore adds little or no predictive power for categorizing crime type.
We then fitted the same logistic model on streetlight density, and later on all predictors. In both cases the results were similarly poor, as they had been for streetlight distance.
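One plausible reading of a model accuracy that exactly matches the baseline is that the fitted model has fallen back on class frequencies and predicts the most common category everywhere. The sketch below illustrates this on synthetic data in which the single feature is uninformative by construction. It is not the project's code or data: the from-scratch softmax regression, the helper names `fit_softmax` and `predict`, the three classes, and the 60/30/10 label mix are all assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities (shifted for numerical stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fit_softmax(xs, ys, n_classes, lr=0.5, epochs=300):
    """Fit a one-feature multinomial logistic model by batch gradient descent."""
    w = [0.0] * n_classes  # one slope per class
    b = [0.0] * n_classes  # one intercept per class
    n = len(xs)
    for _ in range(epochs):
        grad_w = [0.0] * n_classes
        grad_b = [0.0] * n_classes
        for x, y in zip(xs, ys):
            probs = softmax([w[k] * x + b[k] for k in range(n_classes)])
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == y else 0.0)
                grad_w[k] += err * x
                grad_b[k] += err
        for k in range(n_classes):
            w[k] -= lr * grad_w[k] / n
            b[k] -= lr * grad_b[k] / n
    return w, b

def predict(w, b, x):
    """Return the index of the class with the highest score."""
    scores = [w[k] * x + b[k] for k in range(len(w))]
    return max(range(len(w)), key=lambda k: scores[k])

# Synthetic data: ten centred feature values, each paired with the same
# label mix (6 of class 0, 3 of class 1, 1 of class 2), so the feature
# carries no information about the class.
xs, ys = [], []
for i in range(10):
    x = (i - 4.5) / 10
    for label in [0] * 6 + [1] * 3 + [2] * 1:
        xs.append(x)
        ys.append(label)

w, b = fit_softmax(xs, ys, n_classes=3)
predictions = [predict(w, b, x) for x in xs]
model_accuracy = sum(p == y for p, y in zip(predictions, ys)) / len(ys)
baseline_accuracy = ys.count(0) / len(ys)
print(model_accuracy, baseline_accuracy)
```

Checking the distribution of predicted classes, as `predictions` allows here, is a quick way to tell whether a fitted model has collapsed to the majority category.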
kNN Model
We then tried kNN models, comparing results across multiple values of k to find which performed best.
Based on the above prediction accuracy values, the best prediction accuracy on the test dataset is obtained for k = 10. We do not rely on the train accuracy values here, because choosing k to maximize train accuracy would overfit the model.
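The comparison across k values can be sketched as a simple sweep that records train and held-out accuracy for each candidate. This is a minimal pure-Python illustration on synthetic two-cluster data, not the project's implementation: the helpers `knn_predict`, `knn_accuracy`, and `make_blobs`, the candidate list of k values, and the data itself are assumptions.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    by_distance = sorted(
        zip(train_X, train_y),
        key=lambda pair: sum((a - c) ** 2 for a, c in zip(pair[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

def knn_accuracy(train_X, train_y, X, y, k):
    """Fraction of points in (X, y) that the k-nearest-neighbour vote gets right."""
    hits = sum(knn_predict(train_X, train_y, p, k) == label for p, label in zip(X, y))
    return hits / len(y)

def make_blobs(n, rng):
    """Two overlapping 2-D Gaussian clusters (synthetic placeholder data)."""
    X, y = [], []
    for _ in range(n):
        label = rng.choice([0, 1])
        centre = 0.0 if label == 0 else 1.5
        X.append((rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)))
        y.append(label)
    return X, y

rng = random.Random(0)
train_X, train_y = make_blobs(120, rng)
test_X, test_y = make_blobs(60, rng)

candidate_ks = [1, 5, 10, 20]  # illustrative values only
results = {}
for k in candidate_ks:
    results[k] = {
        "train": knn_accuracy(train_X, train_y, train_X, train_y, k),
        "test": knn_accuracy(train_X, train_y, test_X, test_y, k),
    }
    print(k, results[k])

best_k = max(candidate_ks, key=lambda k: results[k]["test"])
print("best k on held-out data:", best_k)
```

With k = 1 every training point is its own nearest neighbour, so train accuracy is perfect regardless of how well the model generalizes. That is why the sweep ranks k by held-out accuracy rather than train accuracy.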
Conclusion
Based on the above results, the baseline logistic model for the original categories, whether using streetlights or all predictors, did not yield meaningful predictions on either the train or the test dataset. However, we do see a slight improvement in prediction accuracy (62.298%) when all predictors are used. Prediction accuracy improves further with the kNN model, which reaches a best accuracy score of 66.204% for k = 10.
We then moved on to other algorithms, including ensemble methods and neural networks.
Models for Original Categories