Decision Tree Classifier
In this section, we start with a decision tree classifier and perform 5-fold cross-validation on all predictors to obtain cross-validation scores, predictions, and train/test accuracy scores. We created a generalized function for this so it could be reused:
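A minimal sketch of what such a helper might look like, assuming scikit-learn; the function and variable names (run_tree_cv, X_train, y_train, X_test, y_test) are illustrative, not from the original code:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

def run_tree_cv(X_train, y_train, X_test, y_test, depth):
    """Fit a depth-limited tree; report 5-fold CV and train/test accuracy."""
    model = DecisionTreeClassifier(max_depth=depth)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    return {
        "cv_mean": cv_scores.mean(),
        "train_acc": accuracy_score(y_train, model.predict(X_train)),
        "test_acc": accuracy_score(y_test, y_pred),
        "predictions": y_pred,
    }
```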
We performed this 5-fold cross-validation for every maximum tree depth up to 20 to find the depth that gives the best prediction accuracy. We then plotted the results:
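The sweep could look like the following, reusing the hypothetical run_tree_cv helper above (matplotlib assumed for the plot):

```python
import matplotlib.pyplot as plt

depths = range(1, 21)
results = [run_tree_cv(X_train, y_train, X_test, y_test, d) for d in depths]

# Plot train vs. test accuracy as a function of tree depth
plt.plot(depths, [r["train_acc"] for r in results], label="train accuracy")
plt.plot(depths, [r["test_acc"] for r in results], label="test accuracy")
plt.xlabel("max tree depth")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```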
Based on the above plot, the test accuracy is highest (around 50.4%) at a tree depth of 12. Therefore, we will use this tree depth for the other ensemble methods going forward.
Bagging
Definition: Bagging is a combination of bootstrapping and aggregation. In bagging, we use bootstrap re-sampling to create different training data sets, so each training run gives us a different tree.
Since we average over many trees for the final prediction, we can choose a large max_depth without worrying about overfitting: each individual tree is a high-variance, low-bias estimator, and we rely on the law of large numbers (averaging) to shrink that variance.
We performed bagging on all predictors with the best depth of 12 obtained from the decision tree classifier above. Again, we generalized this into a function so it could be reused.
In this bagger function, we draw bootstrap samples from our training set (the number of samples is an input parameter), fit a decision tree model to each sample, and save and return the predictions of each model.
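A sketch of such a bagger, assuming pandas inputs and scikit-learn trees; the name bagger matches the prose, but the signature and random seed are our assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagger(X_train, y_train, X_test, n_models, depth=12):
    """Fit one depth-limited tree per bootstrap sample; return an
    (n_models, n_test) array of class predictions."""
    rng = np.random.default_rng(0)  # seed is ours, for reproducibility
    n = len(X_train)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        tree = DecisionTreeClassifier(max_depth=depth)
        tree.fit(X_train.iloc[idx], y_train.iloc[idx])
        preds.append(tree.predict(X_test))
    return np.array(preds)
```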
We ran the bagger function with 55 samples, giving 55 models, and used the mode to get the most frequently predicted class for each observation. We calculated the train and test accuracies for all predictors and observed a test accuracy of 51.252%.
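The majority vote could be taken like this (a sketch using the bagger above; train accuracy can be computed the same way by predicting on X_train):

```python
import pandas as pd
from sklearn.metrics import accuracy_score

test_preds = bagger(X_train, y_train, X_test, n_models=55)  # (55, n_test)
# Mode across models: most frequent predicted class per test observation
y_hat = pd.DataFrame(test_preds).mode(axis=0).iloc[0].to_numpy()
print("Bagging test accuracy:", accuracy_score(y_test, y_hat))
```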
We can observe that the test accuracy has increased, but only marginally. However, we can take a look at the variable importance to see which features the model considered most significant.
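The chart referenced below could be produced by averaging the impurity-based importances across the bagged trees; this sketch assumes the bagger is extended to also return its fitted trees in a list named trees:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# `trees` is assumed to hold the fitted DecisionTreeClassifier objects
importances = np.mean([t.feature_importances_ for t in trees], axis=0)
(pd.Series(importances, index=X_train.columns)
   .sort_values()
   .plot.barh(title="Bagging variable importance"))
plt.show()
```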
Based on the above chart, we can see that police station distance is the most important predictor of crime type in the bagging model, followed by property average and community center distance.
Let us move on to Random Forest and see whether it increases our predictive power or selects different variables as most important.
Random Forest
Definition: In a Random Forest, we build each tree by splitting on a "random" subset of predictors at each split (hence, each is a 'random tree'). This can't be done with just one predictor, but with more predictors we can randomly choose which predictors to split on and how many to consider. We then combine many 'random trees' by averaging their predictions, which gives us a forest of random trees: a random forest.
In this section, we use a Random Forest classifier with 55 estimators, matching the number of models we used in bagging. We observe a test accuracy of 51.4%, only a very slight improvement over bagging.
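A minimal sketch with scikit-learn; reusing max_depth=12 from the depth search above is our assumption, and the random_state is ours:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=55, max_depth=12, random_state=0)
rf.fit(X_train, y_train)
print("Train accuracy:", accuracy_score(y_train, rf.predict(X_train)))
print("Test accuracy:", accuracy_score(y_test, rf.predict(X_test)))

# Variable importance chart, analogous to the bagging one above
(pd.Series(rf.feature_importances_, index=X_train.columns)
   .sort_values()
   .plot.barh(title="Random Forest variable importance"))
plt.show()
```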
Based on the above chart, we can see that police station distance is still chosen as the most important predictor of crime type, followed by property average and college and university distance.
Conclusion
Based on the ensemble methods performed above, there is still no significant change in our predictive power on either the train or test data set for the original categories we selected. The best test accuracy, 51.4%, was achieved by the Random Forest classifier.
Finally, we attempted to improve our predictions by building neural networks.
Models for New Categories