This page describes the preliminary data exploration performed before analysis, including checking distributions, applying transformations, and removing outliers and other erroneous data.
In some cases, ClimateNA will generate erroneous values of -9999 when data is unavailable for an input location; these need to be removed before any analysis, or they will significantly skew results. The climate variable MAR (mean annual solar radiation) was generated as -9999 in all observations, so it was removed from the analysis.
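As a minimal sketch of this cleaning step (the data frame name climate_data is illustrative, not the name used in the original analysis):

# Drop MAR, which contains -9999 in every observation, then remove any
# remaining rows with -9999 in other variables
climate_data$MAR = NULL
climate_clean = climate_data[rowSums(climate_data == -9999, na.rm = TRUE) == 0, ]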
Observations with a land cover value of 0 (land cover type = NA) are not useful for classification and may confuse the RF algorithm, so these were removed before continuing with the analysis.
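Continuing the sketch above, these observations can be filtered out (the land_cover column name is again hypothetical):

# Remove observations with land cover class 0 (land cover type = NA)
climate_clean = climate_clean[climate_clean$land_cover != 0, ]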
Random Forest is non-parametric, so normally distributed predictors are not required. However, we may want to check the distribution of classes to see whether some classes are very rare or exceedingly dominant (Figure S1).
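For example, class frequencies can be tabulated and plotted as follows (data and column names are illustrative):

# Count observations per land cover class to spot rare or dominant classes
class_counts = table(climate_clean$land_cover)
barplot(sort(class_counts), las = 2, ylab = "Observations per class")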
We can see a few rare classes, such as tropical or sub-tropical broadleaf evergreen and tropical or sub-tropical grassland. To ensure these classes are represented in the training data while preserving class proportions, we use a stratified sample to generate our random forest training data.
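A minimal sketch of such a stratified split using dplyr, assuming a hypothetical 70% training fraction (the actual fraction is not specified here):

library(dplyr)

# Sample the same fraction from every land cover class, so rare classes are
# represented and class proportions are preserved in the training data
set.seed(42)
train_data = climate_clean %>%
  group_by(land_cover) %>%
  slice_sample(prop = 0.7) %>%
  ungroup()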
To prepare the data for DNN training, we first remove erroneous values as outlined in the Random Forest preparation above, preventing -9999 values from skewing results. We then scale and normalize the variables so that variables with larger magnitudes do not overshadow those with smaller ones.
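As a sketch, each predictor can be centered and scaled to mean 0 and unit variance with base R (predictor_cols is a hypothetical vector of climate variable names):

# Standardize each predictor: subtract the column mean, divide by the sd
climate_scaled = as.data.frame(scale(climate_clean[predictor_cols]))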
Shown here are the distributions of the annual climate data generated by ClimateNA for our 5 km grid point sample. Many of the distributions appear non-normal, and the scale varies greatly between variables.
Shown here are the distributions of the ClimateNA annual variables after scaling. We can see that although the variables now share the same scale and are roughly centered around 0, many of the distributions remain markedly non-normal.
Shown here are the distributions of the ClimateNA variables after applying transformations to correct non-normal distributions and scaling. We can now see that most distributions are approximately normal and centered around zero. The data is ready for DNN training.
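One way to automate such transformations is with the bestNormalize package, which selects a normalizing transformation per variable; this is a sketch of the general approach, not necessarily the exact transformations used in the original analysis.

library(bestNormalize)

# For each predictor, pick a normalizing transformation (log, Box-Cox,
# Yeo-Johnson, ordered quantile, etc.) by normality statistics, then re-scale
transformed = as.data.frame(lapply(climate_clean[predictor_cols],
                                   function(x) bestNormalize(x)$x.t))
dnn_inputs = as.data.frame(scale(transformed))

With the predictors transformed and scaled, the DNN can be defined and trained: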
library(keras)

# Fully connected network: eight hidden ReLU layers tapering from 4096 to 32
# units, with a softmax output for multi-class land cover prediction
model = keras_model_sequential() %>%
  layer_dense(units = 4096, activation = 'relu', input_shape = c(25)) %>% # adjust for number of predictors
  layer_dense(units = 2048, activation = 'relu') %>%
  layer_dense(units = 1024, activation = 'relu') %>%
  layer_dense(units = 512, activation = 'relu') %>%
  layer_dense(units = 256, activation = 'relu') %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 32, activation = 'relu') %>%
  layer_dense(units = 19, activation = 'softmax') # adjust for number of classes

# Categorical cross-entropy expects one-hot encoded labels; Adam optimizer
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy'))

# Train for 25 epochs, holding out 20% of the training data for validation
history = model %>% fit(train_x, train_y,
                        epochs = 25, batch_size = 64,
                        validation_split = 0.2)
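To check training behavior and generalization, the fitted model can then be inspected and evaluated on a held-out test set; test_x and test_y are assumed here to have been prepared with the same transformations and scaling as the training data.

# Plot loss and accuracy curves for the training and validation splits
plot(history)

# Report loss and accuracy on held-out data
model %>% evaluate(test_x, test_y)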