This page describes the preliminary data exploration performed before analysis, including checking distributions, applying transformations, and removing outliers and other erroneous data.
In some cases, ClimateNA will generate erroneous values of -9999 when data is unavailable for an input location; these need to be removed before any analysis, or they will significantly skew results. The climate variable MAR (mean annual solar radiation) was generated as -9999 in all observations, so it was removed from the analysis.
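As a minimal sketch of this cleaning step (the data frame name climate_data is illustrative, not the name used in the original analysis):

# Drop MAR, which contains -9999 in every observation, then remove any
# remaining rows with -9999 in other variables
climate_data$MAR = NULL
climate_clean = climate_data[rowSums(climate_data == -9999, na.rm = TRUE) == 0, ]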
Observations with a land cover value of 0 (land cover type = NA) are not useful for classification and may confuse the RF algorithm, so these were removed before continuing with the analysis.
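Continuing the sketch above, these observations can be filtered out (the land_cover column name is again hypothetical):

# Remove observations with land cover class 0 (land cover type = NA)
climate_clean = climate_clean[climate_clean$land_cover != 0, ]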
Random Forest is non-parametric, so normally distributed predictors are not required. However, we may want to check the distribution of classes to see whether some classes are very rare or exceedingly dominant (Figure S1).
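For example, class frequencies can be tabulated and plotted as follows (data and column names are illustrative):

# Count observations per land cover class to spot rare or dominant classes
class_counts = table(climate_clean$land_cover)
barplot(sort(class_counts), las = 2, ylab = "Observations per class")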
We can see a few rare classes, such as tropical or sub-tropical broadleaf evergreen and tropical or sub-tropical grassland. To ensure these classes are represented in the training data while preserving class proportions, we use a stratified sample to generate our random forest training data.
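A minimal sketch of such a stratified split using dplyr, assuming a hypothetical 70% training fraction (the actual fraction is not specified here):

library(dplyr)

# Sample the same fraction from every land cover class, so rare classes are
# represented and class proportions are preserved in the training data
set.seed(42)
train_data = climate_clean %>%
  group_by(land_cover) %>%
  slice_sample(prop = 0.7) %>%
  ungroup()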
To prepare the data for DNN training, we first remove erroneous values as outlined in the Random Forest preparation above, preventing -9999 values from skewing results. We then scale and normalize the variables so that variables with larger magnitudes do not overshadow those with smaller ones.
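As a sketch, each predictor can be centered and scaled to mean 0 and unit variance with base R (predictor_cols is a hypothetical vector of climate variable names):

# Standardize each predictor: subtract the column mean, divide by the sd
climate_scaled = as.data.frame(scale(climate_clean[predictor_cols]))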
Shown here are the distributions of the annual climate data generated by ClimateNA for our 5 km grid point sample. Many of the distributions appear non-normal, and the scale varies greatly between variables.
Shown here are the distributions of the ClimateNA annual variables after scaling. We can see that although the variables now share the same scale and are roughly centered around 0, many of the distributions remain markedly non-normal.
Shown here are the distributions of the ClimateNA variables after applying transformations to correct non-normal distributions and scaling. We can now see that most distributions are approximately normal and centered around zero. The data is ready for DNN training.
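One way to automate such transformations is with the bestNormalize package, which selects a normalizing transformation per variable; this is a sketch of the general approach, not necessarily the exact transformations used in the original analysis.

library(bestNormalize)

# For each predictor, pick a normalizing transformation (log, Box-Cox,
# Yeo-Johnson, ordered quantile, etc.) by normality statistics, then re-scale
transformed = as.data.frame(lapply(climate_clean[predictor_cols],
                                   function(x) bestNormalize(x)$x.t))
dnn_inputs = as.data.frame(scale(transformed))

With the predictors transformed and scaled, the DNN can be defined and trained: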
library(keras)

# Fully connected network: eight hidden ReLU layers tapering from 4096 to 32
# units, with a softmax output for multi-class land cover prediction
model = keras_model_sequential() %>%
  layer_dense(units = 4096, activation = 'relu', input_shape = c(25)) %>% # adjust for number of predictors
  layer_dense(units = 2048, activation = 'relu') %>%
  layer_dense(units = 1024, activation = 'relu') %>%
  layer_dense(units = 512, activation = 'relu') %>%
  layer_dense(units = 256, activation = 'relu') %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 64, activation = 'relu') %>%
  layer_dense(units = 32, activation = 'relu') %>%
  layer_dense(units = 19, activation = 'softmax') # adjust for number of classes

# Categorical cross-entropy expects one-hot encoded labels; Adam optimizer
model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_adam(),
  metrics = c('accuracy'))

# Train for 25 epochs, holding out 20% of the training data for validation
history = model %>% fit(train_x, train_y,
                        epochs = 25, batch_size = 64,
                        validation_split = 0.2)
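To check training behavior and generalization, the fitted model can then be inspected and evaluated on a held-out test set; test_x and test_y are assumed here to have been prepared with the same transformations and scaling as the training data.

# Plot loss and accuracy curves for the training and validation splits
plot(history)

# Report loss and accuracy on held-out data
model %>% evaluate(test_x, test_y)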