The model we settled on was a Random Forest Classifier with 17 estimators. After filtering and cleaning the data, we were left with 597,998 rows and 59 columns. We also had to one-hot encode all of the state information to make it compatible with the classifier. Shown below are the performance results of the best model we tested. We set aside 30% of the data for testing, leaving a training set of about 420,000 rows.
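As a rough sketch of this setup, assuming hypothetical column names (`STATE` for the categorical state field and `threat` for the label; the real schema may differ):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical file and column names -- the actual dataset's schema may differ.
df = pd.read_csv("wildfires_clean.csv")

# One-hot encode the categorical state column so the forest can use it.
X = pd.get_dummies(df.drop(columns=["threat"]), columns=["STATE"])
y = df["threat"]

# Hold out 30% for testing, leaving roughly 420,000 rows for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

clf = RandomForestClassifier(n_estimators=17, random_state=42)
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```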
The confusion matrix shows the number of correct classifications along the diagonal, false positives in the top right, and false negatives in the bottom left. For the purposes of this problem, false positives are the most undesirable type of result because they would lead to incorrect policy decisions. False negatives are also harmful, but less so, because resources can still be sent out later.
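For reference, the matrix can be produced with scikit-learn, whose convention places true labels on the rows and predictions on the columns (continuing the hypothetical `clf`, `X_test`, and `y_test` from the sketch above):

```python
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

y_pred = clf.predict(X_test)
cm = confusion_matrix(y_test, y_pred)

# cm[0, 1] = false positives (top right), cm[1, 0] = false negatives (bottom left)
print(cm)
ConfusionMatrixDisplay(cm).plot()
```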
Shown above are the most important features used by our random forest classifier to predict the threat status of wildfires.
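A sketch of how such a ranking can be pulled from a fitted forest via its impurity-based importances (again using the hypothetical `clf` and `X` from above):

```python
import pandas as pd

# Impurity-based importances from the fitted forest, sorted descending.
importances = pd.Series(clf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```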
In the future, one of our primary goals would be to increase the diversity of attributes in our dataset. The dataset we used depended heavily on location data and discovery day (seasonal information), and did not account for the specific climate conditions around each fire. Joining this data source with others would allow for a more robust model that better captures the key factors involved in fire proliferation. Another point of improvement would be to use cross-validation to make sure our model is not overfit to the training data. We could also use grid search to tune the hyperparameters more systematically and ensure the best possible fit, as sketched below.
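A minimal sketch of how cross-validation and a grid search could be combined, assuming scikit-learn's GridSearchCV; the parameter values shown are illustrative, not ones we actually tested:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid -- not values we actually tested.
param_grid = {
    "n_estimators": [17, 50, 100],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}

# 5-fold cross-validation over every combination in the grid;
# a scoring metric that penalizes false positives could be passed
# via the `scoring` parameter.
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```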