Plot Data Collection
Species frequency maps were created using forest plot data collected from national inventories in the U.S. and Canada (Fig. 2).
In the United States, the National Forest Inventory (NFI) is managed by the United States Forest Service. The NFI consists of permanent sample plots at a density of approximately one plot per 6,000 acres of forested land, with measurements taken every 5 to 10 years (United States Department of Agriculture, 2024). Each sampled tree is assessed for species, diameter, and basal area, providing a standardized dataset for forest monitoring.
Canada’s National Forest Inventory Program operates under a decentralized system, where each province conducts its own forest measurements (Government of British Columbia, 2023). These provincial programs follow similar methodologies, collecting species and basal area measurements and revisiting sample plots at regular intervals. Canadian plot data was also obtained from national ecological inventories (NRCAN, 2021).
Climate Data Collection
Fig 4. 5km ClimateNA data displaying mean annual temperature (MAT) across North America.
Climate data was obtained using ClimateNA software, which aggregates 1951–1980 annual climate averages into interpolated climate data points across North America (Fig. 3). ClimateNA data projected at 5km resolution across North America served as the target dataset for predictive modeling, enabling the generation of species frequency predictions across the continent.
To train the predictive models, sample climate points were generated at plot locations or across subsets of the continent, and this climate data was used as part of the training dataset.
For both training and target datasets, erroneous climate values were removed.
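The filtering step above can be illustrated with a minimal sketch. The text does not state which criteria were used to flag erroneous values, so the -9999 no-data sentinel, the record layout, and the function name `drop_erroneous` are all assumptions made for illustration.

```python
import math

NODATA = -9999.0  # assumed sentinel for missing values; not stated in the text


def drop_erroneous(records, variables):
    """Keep only records whose listed climate variables are all valid."""

    def valid(value):
        # A value is valid if it exists, is finite, and is not the sentinel.
        return value is not None and math.isfinite(value) and value != NODATA

    return [r for r in records if all(valid(r.get(v)) for v in variables)]


# Hypothetical climate points for demonstration only.
points = [
    {"id": 1, "MAT": 4.2, "MAP": 850.0},
    {"id": 2, "MAT": -9999.0, "MAP": 910.0},     # sentinel value
    {"id": 3, "MAT": 6.1, "MAP": float("nan")},  # non-finite value
]
clean = drop_erroneous(points, ["MAT", "MAP"])
```

Applying the same function to both the training and target datasets keeps the two filtered in a consistent way.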
A total of 16 climate variables were incorporated to train species distribution models.
These variables include:
Directly calculated annual variables:
MAT: mean annual temperature (°C)
MWMT: mean warmest month temperature (°C)
MCMT: mean coldest month temperature (°C)
TD: temperature difference between MWMT and MCMT, or continentality (°C)
MAP: mean annual precipitation (mm)
AHM: annual heat-moisture index ((MAT+10)/(MAP/1000))
SHM: summer heat-moisture index ((MWMT)/(MSP/1000))
Derived annual variables:
DD<0 (or DD_0): degree-days below 0°C, chilling degree-days
DD>5 (or DD5): degree-days above 5°C, growing degree-days
PAS: precipitation as snow (mm). For individual years, it covers the period between August in the previous year and July in the current year
EMT: extreme minimum temperature over 30 years (°C)
EXT: extreme maximum temperature over 30 years (°C)
CMD: Hargreaves climatic moisture deficit (mm)
RH: mean annual relative humidity (%)
CMI: Hogg’s climate moisture index (mm)
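The three variables in the list above that are given as formulas (TD, AHM, SHM) can be computed directly from the other annual variables. The sketch below is illustrative only: the function names are hypothetical, and MSP, which appears in the SHM formula but not in the list, is assumed here to be mean summer (May to September) precipitation in mm, as defined by ClimateNA.

```python
def continentality(mwmt, mcmt):
    """TD: difference between warmest- and coldest-month mean temperature (°C)."""
    return mwmt - mcmt


def annual_heat_moisture(mat, map_mm):
    """AHM: (MAT + 10) / (MAP / 1000)."""
    return (mat + 10.0) / (map_mm / 1000.0)


def summer_heat_moisture(mwmt, msp_mm):
    """SHM: MWMT / (MSP / 1000); MSP is assumed to be summer precipitation (mm)."""
    return mwmt / (msp_mm / 1000.0)


# Hypothetical site: MAT 5 °C, MWMT 18 °C, MCMT -10 °C, MAP 1000 mm, MSP 400 mm.
td = continentality(18.0, -10.0)
ahm = annual_heat_moisture(5.0, 1000.0)
shm = summer_heat_moisture(18.0, 400.0)
```

Higher AHM and SHM values indicate warmer, drier conditions, since temperature is in the numerator and precipitation in the denominator.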
Topographic Variable Preparation
Fig 5. Compound topographic index (CTI) map of North America.
Topographic rasters were created from digital elevation models (DEMs), and values were extracted for use in both the target and training datasets. Topographic variables were created at 1km resolution to better represent topographic variability at smaller spatial scales.
Fourteen topographic variables were used, including:
Compound topographic index (CTI)
Elevation
Topographic position index (TPI)
Distance to lakes and distance to oceans
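The text does not describe how the topographic variables were derived, so the following is only a sketch using common textbook definitions: TPI as a cell's elevation minus the mean elevation of its neighbours, and CTI as ln(a / tan(β)), where a is upslope contributing area and β is slope. The 3×3 neighbourhood, the minimum-slope floor, and the function names are assumptions for illustration.

```python
import math


def tpi(dem, row, col):
    """Elevation of a cell minus the mean of its (up to 8) neighbours."""
    neighbours = [
        dem[r][c]
        for r in range(max(row - 1, 0), min(row + 2, len(dem)))
        for c in range(max(col - 1, 0), min(col + 2, len(dem[0])))
        if (r, c) != (row, col)
    ]
    return dem[row][col] - sum(neighbours) / len(neighbours)


def cti(upslope_area, slope_radians, min_slope=1e-3):
    """ln(a / tan(beta)); slope is floored to avoid division by zero on flats."""
    return math.log(upslope_area / math.tan(max(slope_radians, min_slope)))


# Hypothetical 3x3 elevation grid (m) with a single raised centre cell.
dem = [
    [100.0, 100.0, 100.0],
    [100.0, 108.0, 100.0],
    [100.0, 100.0, 100.0],
]
peak_tpi = tpi(dem, 1, 1)     # positive: cell sits above its surroundings
corner_tpi = tpi(dem, 0, 0)   # negative: cell sits below its neighbourhood mean
wetness = cti(1000.0, math.atan(0.1))
```

Positive TPI values indicate ridges or peaks and negative values indicate depressions, while larger CTI values correspond to flatter cells with more upslope area, where water tends to accumulate.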
Modelling
Fig 6. Modelling overview for this project.
The modelling for this project was split into two distinct models. The first (deep neural network 1) is a land cover model used to supply additional variables to the species model and to reclassify anthropogenically modified land cover; this model was created by my colleague Nicholas Boyce.
The second model was itself split into two related models: the first (deep neural network 2a) is a presence/absence model that predicts the probability of a species being present, and the second (deep neural network 2b) predicts species frequencies using only inventory plots where the species was observed as present.
Land Cover Modelling with Deep Neural Networks:
As additional predictor variables, I used the output from the predictive land cover model provided by Boyce (2025, unpublished). Briefly, MODIS land cover classification data was used as the dependent class variable to train a deep neural network, with the agriculture and urban classes omitted from the training process. Predictor variables included 14 climate and 12 topographic variables. The model was then applied to the same variables for the same area, replacing agriculture and urban classes with the land cover class (other than agriculture and urban) that had the highest probability according to the neural network. The model outputs land cover probabilities for 19 land cover classes, e.g. the probability of deciduous forest cover, which were used as 19 additional predictor variables for individual tree species models. The removal of built-up and agricultural land allows the species models to act as habitat suitability maps, reflecting where a species' range may have been cut short by anthropogenic activity.
Fig 7. MODIS remotely sensed land cover classification (left) and modelled land cover predictions displaying the land cover class with the highest probability of occurrence (right).
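The replacement rule described above can be sketched as follows. This is not the implementation used by Boyce; the class labels, the probability dictionary, and the function name `reclassify` are hypothetical, and the sketch assumes that only cells observed as agriculture or urban are reassigned.

```python
EXCLUDED = {"agriculture", "urban"}  # hypothetical labels for the omitted classes


def reclassify(observed_class, class_probabilities):
    """Return the observed class unless it is anthropogenic; in that case,
    return the most probable class among the remaining ones."""
    if observed_class not in EXCLUDED:
        return observed_class
    # Defensive filter: a model trained without these classes should not
    # emit them, but they are excluded here in case they are present.
    candidates = {
        name: p for name, p in class_probabilities.items() if name not in EXCLUDED
    }
    return max(candidates, key=candidates.get)


# Hypothetical model output for a single grid cell.
probs = {"deciduous forest": 0.55, "grassland": 0.30, "mixed forest": 0.15}
replaced = reclassify("agriculture", probs)
unchanged = reclassify("grassland", probs)
```

The per-class probabilities themselves (here the values in `probs`) are what feed into the species models as predictors; the reassigned label is only needed for mapping.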
Species Frequency Modelling with Deep Neural Networks:
To predict species frequency across North America, we implemented a two-part zero-inflated modeling framework (DNN 2a and 2b), commonly referred to as a hurdle model (Zuur et al., 2009; Martin et al., 2005). This modelling approach addresses the prevalence of zeros in species frequency data.
The first model (DNN 2a) is a binary classifier that gives the probability of occurrence, indicating whether a species is likely to occur or not.
The second model (DNN 2b) predicts species frequency, but is trained only on plots where the species was observed as present.
The results were then combined through simple multiplication, which gave the best reflection of true species distributions.
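The multiplication step can be made concrete with a short sketch. The two inputs stand in for the outputs of DNN 2a and DNN 2b at a single grid point; the function name and the example values are hypothetical, and the range check is an added safeguard rather than something described in the text.

```python
def combined_frequency(p_presence, frequency_if_present):
    """Hurdle-style combination: probability of presence multiplied by the
    frequency predicted conditional on presence."""
    if not 0.0 <= p_presence <= 1.0:
        raise ValueError("p_presence must be a probability in [0, 1]")
    return p_presence * frequency_if_present


# Hypothetical predictions for two grid points.
likely_site = combined_frequency(0.8, 0.25)     # species probably present
unlikely_site = combined_frequency(0.05, 0.25)  # species probably absent
```

Because the conditional frequency is scaled by the probability of presence, points where the classifier considers the species unlikely are pulled toward zero even when the second model predicts a sizeable frequency.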
Sampled climate data (from ClimateNA), topographic data (from DEMs), and land cover probabilities were paired with aggregated forest plot data (from U.S. and Canadian National Forest Inventories) to train the deep neural networks. The trained networks then predicted species frequencies onto interpolated ClimateNA data at a 5km resolution across North America.
Training data: Aggregated forest inventory sample plots with matched predictor variables.
Prediction: The model was trained on the known species frequency data from the plots and then applied to the 5km spaced points covering North America, giving continent-wide frequency predictions.
Output: Predicted species frequencies were then rasterized and maps of species distributions and frequencies were created for the most prevalent North American species.
These maps indicate where species are most likely to occur under current conditions, helping validate forest inventories, inform seed selection, and better understand habitat suitability.