Data

Raw Data

After removing NA values (coded as -9999), the data set contained around 890,000 rows, each representing a 5 km squared point in North America, with 12 numeric climate features in different units and the categorical target variable, as shown in the summary data table in Figure 4. Climate data are usually highly correlated, and this data set is no exception; however, we attempted to choose less correlated but still biologically relevant features based on previous literature. As would be expected, MAT is highly correlated with MWMT, MCMT, NFFD, DD5, bFFP and eFFP. The two precipitation variables are highly correlated with each other but not with the other features, and the same is true of the two dryness indices.

Figure 4. In corresponding order, the climate variable acronyms represent: mean annual temperature (°C), mean warmest month temperature (°C), mean coldest month temperature (°C), continentality (difference between mean January and mean July temperature) (°C), mean annual precipitation (mm), growing season precipitation (May to September) (mm), the number of frost free days (days), the number of growing degree days above 5 °C (days), dryness index: annual climate-moisture index (unitless), dryness index: summer climate-moisture index (unitless), beginning of frost free period (day of year, 1-365), end of frost free period (day of year, 1-365). a) The data used to train the models before any transformations. b) A heat map of the Pearson correlation between all the climate variables. Red is a high positive correlation, blue is a high negative correlation, and lighter colored cells have little correlation.
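The cleaning and correlation steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for the analysis: the function names and the toy values are hypothetical, and only three of the twelve features are shown.

```python
import math

NA_SENTINEL = -9999  # missing-value code mentioned in the text


def drop_missing(rows, sentinel=NA_SENTINEL):
    """Keep only rows in which no value equals the missing-value sentinel."""
    return [row for row in rows if all(v != sentinel for v in row.values())]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(rows, features):
    """Pairwise Pearson correlations, returned as a nested dict."""
    columns = {f: [row[f] for row in rows] for f in features}
    return {a: {b: pearson(columns[a], columns[b]) for b in features}
            for a in features}


# Toy rows with invented values (not taken from the data set).
rows = [
    {"MAT": 1.0, "NFFD": 100.0, "MAP": 400.0},
    {"MAT": 5.0, "NFFD": 150.0, "MAP": 900.0},
    {"MAT": 9.0, "NFFD": 200.0, "MAP": 300.0},
    {"MAT": -9999, "NFFD": 180.0, "MAP": 700.0},  # dropped: MAT is missing
]
clean = drop_missing(rows)
corr = correlation_matrix(clean, ["MAT", "NFFD", "MAP"])
```

In the toy data, MAT and NFFD are perfectly linearly related while MAP is only weakly related to either, which mirrors the pattern described for the temperature and precipitation variables.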

Data processing

The data needed to be normalized for LDA and PCA so that the averages per ecosystem were more meaningful when calculating climatic distances. Figure 5 below shows box plots of the data before and after normalization.


Figure 5. Box plots of the climate variables after scaling. a) The raw data, scaled. b) The data scaled after log transforming MAP and MSP and taking the square root of CMI_sm and CMI.
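The transformations named in the caption can be sketched as below. Several details are assumptions rather than statements about the original analysis: scaling is taken to mean z-score standardization, the log is the natural log, and the square root is applied after shifting each index so that its minimum is zero (the text does not say how negative index values were handled). All function names and values are hypothetical.

```python
import math
import statistics


def log_transform(values):
    """Natural log; assumes strictly positive values such as precipitation in mm."""
    return [math.log(v) for v in values]


def shifted_sqrt(values):
    """Square root after shifting so the minimum is zero.

    The shift is an assumption made here so that negative index values
    do not break the square root.
    """
    offset = min(values)
    return [math.sqrt(v - offset) for v in values]


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Toy columns with invented values (not taken from the data set).
map_mm = [200.0, 400.0, 800.0, 1600.0]
cmi = [-20.0, 0.0, 16.0, 44.0]

map_scaled = standardize(log_transform(map_mm))
cmi_scaled = standardize(shifted_sqrt(cmi))
```

Because the toy precipitation values double at each step, the log transform spaces them evenly before scaling, which illustrates why a right-skewed variable becomes easier to compare across ecosystems after transformation.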

Multivariate statistics

The PCA of the ecosystem averages shows that the ecosystem classifications and averages make sense and have real climatic implications (Figure 6). The LDA shows which variables impact the clustering the most.
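As a rough illustration of the PCA step only (the LDA is not covered), the sketch below averages already-scaled features within each ecosystem and then finds the first principal component of those averages. To stay dependency-free it uses power iteration on the covariance matrix in place of a library PCA routine; this is a stand-in, not the method used for Figure 6. The ecosystem names, function names and values are all hypothetical.

```python
import math
from collections import defaultdict


def ecosystem_means(rows, features, label="ecosystem"):
    """Average each feature within each ecosystem class."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label]].append([row[f] for f in features])
    return {
        name: [sum(col) / len(col) for col in zip(*members)]
        for name, members in groups.items()
    }


def first_principal_component(points, iterations=200):
    """Leading eigenvector of the sample covariance matrix via power iteration.

    Assumes the points are not all identical and that the starting vector
    is not orthogonal to the leading eigenvector.
    """
    n, d = len(points), len(points[0])
    center = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - center[j] for j in range(d)] for p in points]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    vec = [1.0 + j for j in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Project each centered point onto the component to get its score.
    scores = [sum(r[j] * vec[j] for j in range(d)) for r in centered]
    return vec, scores


# Toy, already-scaled rows with invented values and ecosystem names.
rows = [
    {"ecosystem": "tundra", "MAT": -1.5, "MAP": -2.5},
    {"ecosystem": "tundra", "MAT": -0.5, "MAP": -1.5},
    {"ecosystem": "prairie", "MAT": 0.2, "MAP": -0.2},
    {"ecosystem": "prairie", "MAT": -0.2, "MAP": 0.2},
    {"ecosystem": "desert", "MAT": 0.5, "MAP": 1.5},
    {"ecosystem": "desert", "MAT": 1.5, "MAP": 2.5},
]
means = ecosystem_means(rows, ["MAT", "MAP"])
component, scores = first_principal_component(list(means.values()))
```

The scores order the toy ecosystems along a single climatic axis, which is the kind of separation the PCA of the real ecosystem averages is meant to reveal.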

Balancing the data

Alaska and Mexico have fewer points due to how large the ecosystems are. This could result in less accurate predictions, as shown in Figure 7 below.

Figure 7. The points randomly selected in each ecosystem to create a balanced a) test dataset and b) training dataset.
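One way to draw a balanced sample of this kind is sketched below. It assumes that the same fixed number of points is drawn at random from every ecosystem and that the training and test sets do not share points; the sample sizes, the function name, the seed and the toy rows are hypothetical, since the text does not give them.

```python
import random
from collections import defaultdict


def balanced_split(rows, n_train, n_test, label="ecosystem", seed=0):
    """Draw the same number of random points from every ecosystem.

    Returns (train, test) with no point shared between the two sets.
    """
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    groups = defaultdict(list)
    for row in rows:
        groups[row[label]].append(row)
    train, test = [], []
    for name, members in groups.items():
        if len(members) < n_train + n_test:
            raise ValueError(f"{name!r} has only {len(members)} points")
        picked = rng.sample(members, n_train + n_test)  # without replacement
        train.extend(picked[:n_train])
        test.extend(picked[n_train:])
    return train, test


# Toy rows: ecosystem "A" is five times larger than ecosystem "B".
rows = [{"id": i, "ecosystem": "A" if i < 50 else "B"} for i in range(60)]
train, test = balanced_split(rows, n_train=6, n_test=3)
```

Drawing equal counts per class means that geographically large ecosystems are sampled more sparsely than small ones, which is consistent with the lower point density noted for Alaska and Mexico.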