Results & Discussion

Table 1.  A comparison of model performance on the historical training data (1961-1990)

Model Performance

The random forest model had the highest accuracy on the independent, balanced test set at 65%; however, its 100% accuracy on the training data suggests overfitting (Table 1). The neural network also overfits the training data (accuracy of 91%) and has a lower accuracy than the random forest on the balanced test set. The LDA model does not overfit the training data, but it struggles to differentiate the classes and has the lowest accuracy on the independent test set. Overall, the random forest model performed best despite possible overfitting.
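The overfitting argument rests on the gap between training and test accuracy. A minimal sketch of that calculation follows; the function names are illustrative and not taken from the project code, and only the random forest figures stated in the text are used.

```python
def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def overfit_gap(train_acc, test_acc):
    """Training accuracy minus test accuracy; a large positive gap hints at overfitting."""
    return train_acc - test_acc


# Accuracies reported in the text for the random forest: 100% train, 65% test.
rf_gap = overfit_gap(1.00, 0.65)
print(f"random forest train/test gap: {rf_gap:.2f}")
```

A gap near zero (as described for the LDA model) points to underfitting or limited separability rather than overfitting.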


Figure 8. A pie chart breaking down the models' ecosystem predictions for all of North America (inc. stands for incorrect)

Model Error Overlap

About one fifth of North America was misclassified by all three models, indicating that these ecosystems may not be differentiable using the balanced climate training data (Figure 8). In contrast, all three models correctly classified about a third of North America, indicating that these ecosystems were well described by the balanced climate training data. Despite being the two most accurate models, the random forest and the neural network had the least overlap in their misclassified points, at 3.5%, suggesting that the two models use different criteria to differentiate the ecosystems.
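The overlap breakdown can be sketched as a count of which combination of models misclassified each point. The example below is an illustration only: the model names and the toy labels are assumptions, not the study data.

```python
from collections import Counter


def error_overlap(y_true, preds_by_model):
    """Share of points for each combination of models that got the point wrong.

    The empty tuple () means every model was correct; the tuple holding all
    model names means every model was wrong.
    """
    names = sorted(preds_by_model)
    counts = Counter()
    for i, truth in enumerate(y_true):
        wrong = tuple(n for n in names if preds_by_model[n][i] != truth)
        counts[wrong] += 1
    total = len(y_true)
    return {combo: c / total for combo, c in counts.items()}


# Toy labels for illustration only.
y_true = ["a", "b", "c", "d"]
preds = {
    "lda": ["a", "x", "x", "x"],
    "nn": ["a", "b", "x", "d"],
    "rf": ["a", "b", "x", "x"],
}
overlap = error_overlap(y_true, preds)
for combo, share in sorted(overlap.items()):
    print(combo or "all correct", share)
```

Each slice of a pie chart such as Figure 8 would correspond to one key of the returned dictionary.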


Accuracy per Ecosystem

The random forest model's distribution of per-ecosystem misclassification percentage is concentrated toward the left (Figure 9a), indicating that more ecosystems have an accuracy greater than 50% than not. The neural network has a similarly shaped distribution, but it is less strongly concentrated, with more ecosystems having a lower accuracy of around 50% (Figure 9b). Both the neural network and the random forest have a high misclassification percentage (about 70% and above, the right tails of Figures 9a and 9b) for fewer than 50 ecosystems. The LDA model, by contrast, has a more normal distribution and misclassifies 40 to 60% of the points for most ecosystems (Figure 9c).
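The per-ecosystem distributions are built from one misclassification percentage per class. A minimal sketch of that per-class calculation, using made-up labels rather than the study data:

```python
from collections import defaultdict


def percent_incorrect_per_class(y_true, y_pred):
    """Percentage of points of each true class that were misclassified."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        totals[truth] += 1
        if truth != pred:
            wrong[truth] += 1
    return {cls: 100.0 * wrong[cls] / totals[cls] for cls in totals}


# Toy labels for illustration only.
y_true = ["a", "a", "b", "b", "b", "b"]
y_pred = ["a", "b", "b", "b", "b", "a"]
per_class = percent_incorrect_per_class(y_true, y_pred)
print(per_class)
```

A histogram of the returned values, one value per ecosystem, would give the kind of distribution discussed above.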

Model Certainty

Model certainty must be included when interpreting envelope models: when climatic habitat matches are provided, understanding the quality of each match can help allocate resources. Essentially, if one area has a higher probability of being the correct match, that area's climate is closer to the original ecosystem's climate, and the area is a better candidate for ecosystem migration.

For all three models, the average predicted probability of the misclassified points was lower than that of the correctly classified points, indicating that the models assigned less certainty to classifications that turned out to be incorrect (Figure 10). Of the three, the RF model showed the sharpest contrast between the probability distributions of the correct and misclassified points (Figure 10a). This shows promise for developing a probability cutoff below which a recommendation would come with a warning that the climate could be novel, or that the model is unable to suggest an ecosystem with high probability, and that the location is therefore not a good candidate for assisted migration.
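One way the probability cutoff might work is sketched below. The cutoff value of 0.6, the function name, and the input format (one dictionary of class probabilities per location) are assumptions made for illustration; a real cutoff would need to be chosen from the distributions in Figure 10.

```python
def flag_low_confidence(class_probs, cutoff=0.6):
    """Return (predicted class, top probability, warn) for each location.

    `warn` is True when the top class probability falls below `cutoff`,
    meaning the recommendation should carry a warning.
    """
    results = []
    for probs in class_probs:
        best = max(probs, key=probs.get)
        results.append((best, probs[best], probs[best] < cutoff))
    return results


# Toy probabilities for illustration only.
flags = flag_low_confidence([
    {"A": 0.9, "B": 0.1},
    {"A": 0.4, "B": 0.35, "C": 0.25},
])
for pred, prob, warn in flags:
    print(pred, prob, "possible novel climate" if warn else "ok")
```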

Feature Importance

The random forest model shows the most promise. Additionally, the importance of each feature to the random forest model's predictions can be calculated directly, which is less straightforward with neural networks. This makes random forest models more interpretable.

Continentality is the most important feature for differentiating the ecosystems, followed by the precipitation variables (Figure 11). Future precipitation is uncertain because precipitation is a difficult climatic variable to model, so the high importance of the precipitation variables could compromise the model's predictions under climate projections. A sensitivity analysis of the precipitation values would help determine the model's vulnerability to incorrect precipitation projections.
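A simple form of the proposed sensitivity analysis is to scale one precipitation feature and measure how often the predicted ecosystem changes. The sketch below assumes a generic `predict` callable and a feature named `"map"` (mean annual precipitation); both are hypothetical stand-ins, and the threshold classifier is a toy, not the fitted random forest.

```python
def prediction_change_rate(predict, rows, feature, scale):
    """Fraction of rows whose predicted class changes when `feature` is multiplied by `scale`."""
    baseline = [predict(row) for row in rows]
    perturbed = [predict({**row, feature: row[feature] * scale}) for row in rows]
    changed = sum(b != p for b, p in zip(baseline, perturbed))
    return changed / len(rows)


def toy_predict(row):
    """Toy stand-in for a trained model: a single precipitation threshold."""
    return "wet" if row["map"] > 1000 else "dry"


# Toy rows for illustration only.
rows = [{"map": 900}, {"map": 1100}, {"map": 2000}, {"map": 500}]
rate = prediction_change_rate(toy_predict, rows, "map", 1.2)
print(f"share of predictions changed by a 20% precipitation increase: {rate:.2f}")
```

Repeating the calculation over a range of scale factors would show how quickly the predictions degrade as the precipitation error grows.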

The fourth and fifth most important features are the dryness indices, which are correlated with precipitation (Figure 11). This emphasizes how strongly precipitation drives the class splits. One possible reason is that the temperature features are all highly correlated with one another, which can dilute the importance assigned to any single temperature variable.

Figure 11. Feature importance of the random forest model

Overall, the random forest model shows the most promise, with the highest accuracy on the balanced independent test set, the highest accuracy per ecosystem, the clearest separation in certainty between correct and incorrect classifications, and the greatest interpretability.

Next Steps

Compare the model predictions with empirical calculations of climatic distance under climate projections. Add tree plot data so that, along with ecosystem predictions and prediction confidence, the model can also provide historical tree growth data for the climatic habitat match.