Table 1. Raw US plot data, with columns for individual tree codes (Tree_CN), plot codes (PLT_CN), state codes (STATECD), species codes (SPCD), diameter (DIA), basal area (BA), year of measurement, coordinates, and ecosystem codes.
Plot data were first cleaned and aggregated by plot, with each species forming a variable (column) for each plot. Species values were then scaled so that the total basal area or frequency of each plot sums to the backfilled probability of forested land cover. This ensures that species values reflect forested land cover, rather than each species' percentage of basal area on a particular plot.
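The rescaling step can be illustrated with a minimal sketch, assuming each plot is held as a mapping from species code to basal area. The function name, the example species codes, and the forest-cover probability of 0.8 are hypothetical and are not taken from the dataset.

```python
def scale_plot_to_forest_cover(species_values, forest_prob):
    """Rescale one plot's species values so that they sum to forest_prob.

    species_values: dict mapping species code -> basal area (or frequency).
    forest_prob: probability of forested land cover at the plot (0 to 1).
    """
    total = sum(species_values.values())
    if total == 0:
        # No trees recorded on the plot: every species stays at zero.
        return {sp: 0.0 for sp in species_values}
    # Each species keeps its share of the plot total, scaled by forest_prob.
    return {sp: value / total * forest_prob for sp, value in species_values.items()}


# Hypothetical plot: codes and values are illustrative only.
plot = {202: 30.0, 122: 10.0}
scaled = scale_plot_to_forest_cover(plot, forest_prob=0.8)
```

After scaling, the species shares within the plot are unchanged (here 3:1), but the plot total equals the forest-cover probability instead of 1.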
Fig 8. Plots with observations of Douglas fir, historical range from Little (1971), and expanded range. Red plots are those where observations of species have been set to 0.
Range maps from Little (1971) were then rasterized and expanded by 200 km (Fig 6). Species observations on plots falling outside the expanded range were set to 0, as they were likely either misidentifications or out-of-range plantings.
Fig 9. Introduced pseudo-plots in unforested areas.
To reduce spatial redundancy and computational load, I aggregated the dataset of approximately 800,000 forest plots to around 30,000 data points for model training. Latitude and longitude were binned into 0.5° intervals, and elevation was grouped into 250-meter bands. Within each resulting spatial-elevational bin and ecozone (Level 4 Ecosystem), plot-level variables were averaged to produce a single representative observation per group.
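The binning and averaging described above can be sketched as follows. This is a minimal illustration, assuming plots are stored as dictionaries; the field names (lat, lon, elev, ecozone, psme) and the example values are hypothetical, and only the 0.5° and 250 m bin widths come from the text.

```python
import math
from collections import defaultdict
from statistics import fmean

LATLON_STEP = 0.5   # degrees, as stated in the text
ELEV_STEP = 250.0   # meters, as stated in the text


def bin_key(plot):
    """Return the (lat bin, lon bin, elevation band, ecozone) group of a plot."""
    return (
        math.floor(plot["lat"] / LATLON_STEP),
        math.floor(plot["lon"] / LATLON_STEP),
        math.floor(plot["elev"] / ELEV_STEP),
        plot["ecozone"],
    )


def aggregate(plots, value_fields):
    """Average value_fields over all plots that share the same bin key."""
    groups = defaultdict(list)
    for plot in plots:
        groups[bin_key(plot)].append(plot)
    return {
        key: {f: fmean(p[f] for p in members) for f in value_fields}
        for key, members in groups.items()
    }


# Hypothetical plots: the first two fall in the same bin and ecozone.
plots = [
    {"lat": 45.10, "lon": -122.30, "elev": 300.0, "ecozone": "A", "psme": 0.6},
    {"lat": 45.40, "lon": -122.10, "elev": 450.0, "ecozone": "A", "psme": 0.2},
    {"lat": 45.40, "lon": -122.10, "elev": 450.0, "ecozone": "B", "psme": 0.5},
]
binned = aggregate(plots, value_fields=["psme"])
```

Using floor division for the bin index keeps bins aligned across the zero meridian and the equator, since negative coordinates round toward the lower bin edge rather than toward zero.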
100,000 pseudo-plots were introduced in unforested areas. Unforested areas were defined as those showing no forest or shrub cover in remotely sensed data and having under a 5% probability of trees or shrubs. These plots were introduced to reduce bias toward forested regions and to introduce geo-climatic conditions unsuitable for forests, ensuring the model does not overpredict into unforested areas.
Fig 10. Unscaled and untransformed predictor variables for the species distribution model.
For DNNs, predictor variables need to be approximately normal and scaled. Many climate variables are not naturally normal, and their units differ greatly (Fig. 8). Before running DNN models, variables therefore needed to be transformed and scaled.
DNN input data need to be scaled because neurons may be dominated by variables with large values and overlook those with smaller ranges (e.g., MAT from -40 to 30, MAP from 0 to 10,000).
Like mean annual precipitation (MAP) (Fig. 9), many climate variables are zero-bounded and right-skewed. Predictor variables were transformed to bring them closer to a normal distribution. All predictor columns were then scaled using Formula 1 (where z = scaled value, x = original value, μ = mean, σ = standard deviation) to ensure a mean of 0 and a standard deviation of 1.
Formula 1) Scaling formula: $z = \frac{x - \mu}{\sigma}$
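The transform-then-scale sequence can be sketched for a single predictor as below. This is an illustration under stated assumptions: the source does not name the transformation applied to each variable, so log(1 + x) is used here as one common choice for zero-bounded, right-skewed data; the population standard deviation is also an assumption, and the MAP values are hypothetical.

```python
import math
from statistics import fmean, pstdev


def standardize(values):
    """Scale values to mean 0 and standard deviation 1: z = (x - mean) / sd."""
    mu = fmean(values)
    sigma = pstdev(values, mu)  # population SD (assumed; the source does not specify)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(x - mu) / sigma for x in values]


# Hypothetical MAP values: zero-bounded and right-skewed.
map_values = [120.0, 300.0, 450.0, 800.0, 2500.0, 6000.0]

# Assumed transformation: log(1 + x), which is defined at zero and
# compresses the long right tail before scaling.
map_log = [math.log1p(x) for x in map_values]
map_scaled = standardize(map_log)
```

Applying the transformation before standardizing matters: scaling alone changes the location and spread of a variable but leaves its skew untouched.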