We dropped irrelevant columns such as the board game url and image_url, and kept only those we considered meaningful for our model. The included features are:
Since the data contain over 100 features, it may not make sense to include all of them in the random forest, especially when doing cross-validation. We therefore first fit a model with all the features and plotted the features by their importance.
We used the randomForest function from the randomForest package to train our models.
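A minimal sketch of this step, assuming the cleaned data frame is called `games` and the response column is `average_rating` (both names are our placeholders, not taken from the original code):

```r
library(randomForest)

# `games` is assumed to be the data frame left after dropping url/image_url etc.
set.seed(42)
rf_full <- randomForest(
  average_rating ~ .,   # regress the rating on every remaining feature
  data       = games,
  ntree      = 100,
  importance = TRUE     # needed for permutation-based importance
)

# RMSE based on the out-of-bag predictions
sqrt(mean((games$average_rating - predict(rf_full))^2))

# Rank and plot features by importance (%IncMSE for regression)
varImpPlot(rf_full, type = 1, main = "Feature Importance")
```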
Feature Importance Plot
RMSE: 0.3815
Zoom in to Top Features
RMSE: 0.4022
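If the zoomed-in plot corresponds to refitting on only the top-ranked features (the report does not show this code), the step might look like the following, continuing from the sketch above; the cutoff of 30 features is purely illustrative:

```r
# Select the top features by permutation importance from the full model
imp <- importance(rf_full, type = 1)                      # %IncMSE column
top_features <- rownames(imp)[order(imp[, 1], decreasing = TRUE)][1:30]

# Refit using only those features
set.seed(42)
rf_top <- randomForest(
  x = games[, top_features],
  y = games$average_rating,
  ntree = 100
)

sqrt(mean((games$average_rating - predict(rf_top))^2))    # OOB RMSE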
Cross Validation to Choose the Best mtry Parameter (ntree = 100)
Best RMSE: 0.3711
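A sketch of how such a cross-validation over mtry could be set up (the fold count and grid below are illustrative assumptions, not the values actually searched):

```r
library(randomForest)

set.seed(42)
k         <- 5
folds     <- sample(rep(1:k, length.out = nrow(games)))
mtry_grid <- c(10, 20, 30, 40, 50)

cv_rmse <- sapply(mtry_grid, function(m) {
  mean(sapply(1:k, function(i) {
    fit  <- randomForest(average_rating ~ ., data = games[folds != i, ],
                         ntree = 100, mtry = m)
    test <- games[folds == i, ]
    sqrt(mean((test$average_rating - predict(fit, test))^2))
  }))
})

data.frame(mtry = mtry_grid, cv_rmse = cv_rmse)
mtry_grid[which.min(cv_rmse)]   # mtry with the lowest cross-validated RMSE
```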
Cross Validation to Choose the Best ntree Parameter (mtry = 30)
One important parameter in a random forest is the number of trees to grow. We can use cross-validation to choose the best ntree.
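The same cross-validation scheme can be reused, now holding mtry at 30 and varying ntree (the grid below is again an illustrative assumption):

```r
library(randomForest)

set.seed(42)
k          <- 5
folds      <- sample(rep(1:k, length.out = nrow(games)))
ntree_grid <- seq(100, 1000, by = 150)

cv_rmse <- sapply(ntree_grid, function(nt) {
  mean(sapply(1:k, function(i) {
    fit  <- randomForest(average_rating ~ ., data = games[folds != i, ],
                         ntree = nt, mtry = 30)
    test <- games[folds == i, ]
    sqrt(mean((test$average_rating - predict(fit, test))^2))
  }))
})

data.frame(ntree = ntree_grid, cv_rmse = cv_rmse)
```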
Best RMSE: 0.3725
Based on the above, we choose ntree = 550 and mtry = 30 for our best model. The predicted values and the true average ratings are as follows:
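A sketch of the final fit and the predicted-vs-true comparison; the 80/20 train/test split is our assumption, since the report does not state how the hold-out predictions were produced:

```r
library(randomForest)

set.seed(42)
idx   <- sample(nrow(games), size = floor(0.8 * nrow(games)))
train <- games[idx, ]
test  <- games[-idx, ]

rf_final <- randomForest(average_rating ~ ., data = train,
                         ntree = 550, mtry = 30)

pred <- predict(rf_final, test)
sqrt(mean((test$average_rating - pred)^2))   # hold-out RMSE

# Predicted vs. true average ratings, with a y = x reference line
plot(test$average_rating, pred,
     xlab = "True average rating", ylab = "Predicted average rating")
abline(0, 1, col = "red")
```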