Figure 1: Models Classification Report
Precision = True Positive / (True Positive + False Positive)
Recall = True Positive / (True Positive + False Negative)
True Positive = Actual is Positive, Prediction is Positive
False Positive = Actual is Negative, Prediction is Positive
False Negative = Actual is Positive, Prediction is Negative
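The per-class metrics above can be reproduced directly from a model's binary predictions. The following is a minimal sketch, assuming labels where 1 indicates severe drought and 0 indicates non-severe or no drought; the toy labels are illustrative only, not the report's data.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive (severe drought) class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # actual positive, predicted positive
    fp = np.sum((y_true == 0) & (y_pred == 1))  # actual negative, predicted positive
    fn = np.sum((y_true == 1) & (y_pred == 0))  # actual positive, predicted negative
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy labels: 1 = severe drought, 0 = non-severe / no drought
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]
print(precision_recall_f1(y_true, y_pred))
```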
All models performed better at predicting non-severe or no drought than at predicting severe drought, with F1 scores for non-severe drought in the 0.97-0.98 range and high, balanced precision and recall. F1 for the positive (severe drought) category ranged from 0.66 for the baseline to 0.82 for the LSTM model, with a greater imbalance between precision and recall for all other models.
While precision was relatively high for all models, recall was consistently lower. High precision indicates that a model accurately identifies true positives (actual severe drought) and makes few false positive predictions (predicting severe drought when there is none).
Lower recall indicates that a model makes false negative predictions, incorrectly predicting non-severe drought when the actual drought score falls in the severe category.
The imbalance between precision and recall is likely due to the inherent class imbalance in the dataset.
Severe drought is rare, making up roughly 10.5% of the test dataset. In this context, recall (sensitivity) is likely the more important metric: decision-makers may value a model that detects severe drought in advance over one that underpredicts it, supporting more conservative decision-making.
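For reference, a per-class report like Figure 1 can be generated with scikit-learn's classification_report. The synthetic labels below (roughly 10% severe) are only meant to illustrate how a rare positive class can yield high precision but lower recall; they are not the report's results.

```python
from sklearn.metrics import classification_report

# Synthetic labels: 1 = severe drought (rare, ~10% of samples), 0 = non-severe / no drought
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 88 + [1] * 2 + [0] * 5 + [1] * 5  # few false positives, more false negatives

print(classification_report(y_true, y_pred, target_names=["non-severe", "severe"]))
```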
XGBoost
Figure 2: XGBoost Horizon & Window Testing Results
LSTM
Figure 3: LSTM Horizon & Window Testing Results
A shorter horizon led to better results, with macro F1 scores decreasing as the horizon increased from four weeks to 16 weeks beyond the data window.
A variety of window sizes produced similar macro F1 scores (within 0.01 of one another), suggesting that simply increasing the window from 12 to 24 weeks of data, or from 12 to 52 weeks, would not substantially improve performance.
A 24-week window appeared to perform best for the LSTM model at every horizon, with a 30-week window performing second best at all horizons except 16 weeks.
In some cases for XGBoost, using only 12 weeks of data produced equally good performance, while using 52 weeks did not necessarily lead to the best results (a sketch of this window/horizon setup follows).
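As a rough illustration of the horizon and window setup evaluated above, the sketch below slices a weekly drought-score series into training pairs. The make_windows helper, its indexing convention, and the placeholder series are hypothetical and stand in for however the actual feature pipeline was built.

```python
import numpy as np

def make_windows(series, window, horizon):
    """Build (X, y) pairs: `window` weeks of inputs used to predict the
    score `horizon` weeks after the window ends (illustrative convention)."""
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        X.append(series[start:end])
        y.append(series[end + horizon - 1])
    return np.array(X), np.array(y)

weekly_scores = np.linspace(0, 5, 200)  # placeholder weekly drought scores
for window in (12, 24, 30, 52):
    for horizon in (4, 8, 12, 16):
        X, y = make_windows(weekly_scores, window, horizon)
        # fit XGBoost / the LSTM on (X, y) here and compare macro F1 across settings
        print(window, horizon, X.shape)
```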
Figure 4: Comparison of XGBoost Results By Score
Note: No scores of 5 in the test dataset. *Negative predictions made.
Among incorrect predictions, the rate of underprediction increased with the actual drought score, while accuracy decreased.
Scores were converted to 0-5 integer categories using a 0.5 threshold for each integer.
For example, an actual score of 0.5 was rounded up to an integer value of one.
We then checked whether the actual and predicted score categories matched, i.e., whether the prediction was correct. If the prediction was incorrect, we evaluated whether the predicted score was below or above the actual score.
For an actual score of one, when the prediction was incorrect, the model predicted a smaller score 74.31% of the time.
In comparison, for an actual score of four, when the prediction was incorrect, the model predicted a smaller score 100% of the time.
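A minimal sketch of this bookkeeping is shown below, assuming continuous actual and predicted scores. The categorize and under_over_rates helpers and the toy scores are illustrative, not the exact code or data behind Figure 4.

```python
import numpy as np
import pandas as pd

def categorize(scores):
    """Round continuous drought scores to 0-5 integer categories,
    rounding .5 upward (e.g. 0.5 -> 1)."""
    return np.floor(np.asarray(scores, dtype=float) + 0.5).astype(int)

def under_over_rates(actual, predicted):
    """Per actual category: accuracy, and the share of incorrect
    predictions that fall below the actual category."""
    df = pd.DataFrame({"actual": categorize(actual), "predicted": categorize(predicted)})
    rows = []
    for cat, grp in df.groupby("actual"):
        correct = grp["predicted"] == cat
        wrong = grp.loc[~correct]
        rows.append({
            "actual": cat,
            "accuracy": correct.mean(),
            "underprediction_rate": (wrong["predicted"] < cat).mean() if len(wrong) else np.nan,
        })
    return pd.DataFrame(rows)

# Toy scores, purely to illustrate the calculation
actual = [0.2, 0.6, 1.4, 2.7, 3.9, 4.2, 1.1, 0.4]
predicted = [0.1, 0.4, 1.6, 2.1, 3.2, 3.8, 0.9, 0.6]
print(under_over_rates(actual, predicted))
```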
Figure 5: Tabular View of Avg. Actual vs. Predicted Scores
Figure 6: Avg. Discrepancy Between Actual and Predicted Scores
The LSTM model produced an average score discrepancy of 0.27 drought categories over an example window covering the last 12 weeks of the dataset.
The model predictions were very accurate for the first seven weeks, deviating by less than 0.1 in score.
Beginning in week 8, the difference between predicted and actual average scores grew substantially, particularly in weeks 10-12, which we expected given our findings from the horizon and window testing above.
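The weekly comparison in Figures 5 and 6 can be assembled along these lines. The random data below is only a placeholder for the actual and LSTM-predicted scores, and the 0.27 average discrepancy comes from the real results, not from this sketch.

```python
import numpy as np
import pandas as pd

# Placeholder long-format predictions: one row per (location, week)
# for the last 12 weeks of the test set.
rng = np.random.default_rng(0)
weeks = np.repeat(np.arange(1, 13), 50)              # 12 weeks x 50 locations
actual = rng.uniform(0, 5, size=weeks.size)
predicted = actual + rng.normal(0, 0.3, size=weeks.size)

df = pd.DataFrame({"week": weeks, "actual": actual, "predicted": predicted})

# Average actual vs. predicted score per week, plus the absolute discrepancy
weekly = df.groupby("week")[["actual", "predicted"]].mean()
weekly["discrepancy"] = (weekly["actual"] - weekly["predicted"]).abs()
print(weekly.round(2))
print("average discrepancy:", round(weekly["discrepancy"].mean(), 2))
```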