Data Walkthrough

Assignment of geological classes

Figure 3: Bar plot of lake chemistry samples used in analysis divided by both geomorphic and geological classification

The geomorphic classes assigned to each lake datapoint overlap substantially with the mapped surficial geology units in which each datapoint sits. The majority of organic lakes occur within organic deposits, and the majority of moraine lakes occur within moraine deposits of varying geomorphology (Mk, Mh, Mp, Mv); the next largest contributors to the moraine lakes are glaciofluvial sediments which contained ice and marine sediments reworked by glacial activity (Gx, Yk). Two geomorphic classes (alluvial plain and ice-contact deposit) each occur entirely within a single surficial geology class (Ap, ICD).

Three surficial geology classes were excluded from analysis due to small sample size: Yk, Mv and C.

Chemistry Data

Table 4: Summary of chemical variables available in dataset (NitrateNitriteN omitted due to low sample size). Abbreviations which do not follow chemical notation are as follows: Alk=alkalinity, ColourAp and ColourTrue=water colour, Cond=conductivity, DN=dissolved N, DOC=dissolved organic carbon, DP=dissolved phosphorus, Hard=water hardness, OP=orthophosphate, SO=sulfur compounds, TDS=total dissolved solids, TN=total nitrogen, TP=total phosphorus, TSS=total suspended solids.

Figure 4: Percentage of lakes per geomorphic class analyzed for each chemical parameter

Figure 5: Percentage of lakes per geological class analyzed for each chemical parameter

The major ions and nutrients analyzed in this sampling campaign are shown in Table 4. Due to inconsistencies between sampling campaigns and budget constraints, not all lake samples were analyzed for all chemical components (Figure 4). Notably, only one RLL sample had data for the majority of the chemical suite.

Because of this, two sets of variables were used: a limited set that retained the RLL class, and a broader suite that excluded all RLL samples.

Fortunately, when working with geological units the missing data were more evenly distributed (Figure 5), and two of the units with the most missing data (Yk and C) were omitted entirely due to small sample size.
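The coverage summaries behind Figures 4 and 5 (percentage of lakes in each class analyzed for each parameter) can be sketched in Python as below. This is an illustrative sketch only, not the code used to produce the figures: the function names, the record layout, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict

def coverage_by_class(samples, class_key, parameters):
    """Percentage of samples in each class with a non-missing value per parameter."""
    totals = defaultdict(int)
    present = defaultdict(lambda: defaultdict(int))
    for sample in samples:
        cls = sample[class_key]
        totals[cls] += 1
        for param in parameters:
            if sample.get(param) is not None:  # None marks a value that was not analyzed
                present[cls][param] += 1
    return {
        cls: {p: 100.0 * present[cls][p] / totals[cls] for p in parameters}
        for cls in totals
    }

def complete_parameters(coverage, cls):
    """Parameters measured for every sample of one class (usable without imputation)."""
    return [p for p, pct in coverage[cls].items() if pct == 100.0]

# Hypothetical records: the values are invented purely to exercise the functions.
samples = [
    {"geomorphic": "RLL", "TN": 0.4, "Cond": None},
    {"geomorphic": "RLL", "TN": 0.6, "Cond": 120.0},
    {"geomorphic": "organic", "TN": 0.3, "Cond": 80.0},
]
coverage = coverage_by_class(samples, "geomorphic", ["TN", "Cond"])
rll_complete = complete_parameters(coverage, "RLL")
```

Passing a different `class_key` (for example a surficial geology column) would give the per-geological-unit view in the same way.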

Imputation with RandomForest

For the variables with missing values, imputation with missForest (a package built on random forests) was tested by artificially removing datapoints and comparing the model predictions to the true values. From these tests, it was determined that Alk, ColourAp, ColourTrue, Cond, DN, Hard, pH, SO and TDS could be imputed to fill in missing values. Imputations of DP, F, NH3, NO3 and Si were not satisfactory; for these variables, the decision whether to omit the variable or to remove only the NA-containing datapoints was made on a case-by-case basis.
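The validation idea described above (hide known values, impute them, then compare predictions with the hidden truth) can be sketched in Python as follows. This is not the missForest workflow used here: the imputer is a deliberately simple least-squares stand-in rather than a random forest, and the function names and the synthetic values are assumptions made for the example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def hide_values(column, fraction, rng):
    """Replace a random fraction of the known values with None."""
    known = [i for i, v in enumerate(column) if v is not None]
    n_hide = max(2, round(fraction * len(known)))  # need >= 2 points to correlate
    hidden = rng.sample(known, n_hide)
    masked = list(column)
    for i in hidden:
        masked[i] = None
    return masked, hidden

def linear_impute(target, predictor):
    """Stand-in imputer: fill gaps in target from a least-squares line on predictor."""
    pairs = [(x, y) for x, y in zip(predictor, target) if y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    slope = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x, _ in pairs)
    return [y if y is not None else my + slope * (x - mx)
            for x, y in zip(predictor, target)]

def validation_r(target, predictor, fraction=0.2, seed=0):
    """Correlation between imputed and true values for the artificially hidden entries."""
    rng = random.Random(seed)
    masked, hidden = hide_values(target, fraction, rng)
    filled = linear_impute(masked, predictor)
    return pearson([filled[i] for i in hidden], [target[i] for i in hidden])

# Synthetic, invented data: a perfectly linear pair, so the imputation should be near exact.
tds = [float(v) for v in range(10, 60)]
cond = [1.5 * v + 4.0 for v in tds]
r = validation_r(cond, tds, fraction=0.2)
```

Repeating `validation_r` over a range of `fraction` values would give a curve of the same kind as the one described for Figure 6.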

As the RLL class would have only one datapoint with data for most of the chemical suite (sample size of 11, with ~90% of datapoints missing for the majority of chemical parameters in Figure 4), this class was omitted from the data subset undergoing imputation.

Figures 6 and 7: Imputation performance measured by correlation of predicted and observed values with increasing percentages of the dataset removed (left), and with 20% artificially removed for validation (right)

Data Normalization

The majority of the lake chemistry data were right-tailed (Figure 8) and were transformed to approximate a normal distribution (Figure 9) using square root transformations with a constant value determined through manual trial and error; minimum values per variable were calculated to ensure all datapoints remained above zero before transformation. Transformations were tested on each variable before imputation, and applied after imputation of the dataset. The transformations are listed in Table 5, below.
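A minimal Python sketch of a shifted square root transformation is given below. It assumes the constant is added before the square root is taken, i.e. sqrt(x + c), which the text does not state explicitly; the function names, the constant, and the sample values are likewise invented for illustration. Skewness is used here as a rough stand-in for the visual check against Figures 8 and 9.

```python
import math

def shifted_sqrt(values, constant):
    """Apply sqrt(x + constant), refusing constants that leave any value at or below zero."""
    shifted = [v + constant for v in values]
    if min(shifted) <= 0:
        raise ValueError("constant too small: all shifted values must be above zero")
    return [math.sqrt(v) for v in shifted]

def skewness(values):
    """Moment-based skewness; positive values indicate a right tail."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Invented right-tailed sample containing a zero, so a positive constant is required.
raw = [0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 5.0, 8.0, 13.0, 40.0]
transformed = shifted_sqrt(raw, constant=0.5)
skew_before = skewness(raw)
skew_after = skewness(transformed)
```

Trying several constants and comparing `skew_after` is one way to mimic the manual trial and error described above.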

Figures 8 and 9: Raw (left) and transformed (right) water chemistry values. Note the omission of OP and NO2 from transformation.

Final Variable Sets

For the subset of datapoints containing the RLL class, TN, TP, Ca, DOC, K, Mg, and Na were used; none of these variables required imputation.

For the subset of datapoints omitting the RLL class, all variables were used except those listed in Table 6. F was omitted because of inconsistencies in the data itself, which likely reflect errors in sample handling. Si was omitted because of the number of rows that would have to be dropped if it were retained.

Table 5: Transformations and imputations performed on variables. NAs denote variables which were present for all rows. Asterisks denote variables which were omitted after imputation was attempted.

Table 6: Variables omitted from analysis.

Preliminary PCA

This principal component analysis was not the objective of the research, but it does allow for better visualization and understanding of the lake chemistry data. When all chemical variables in the analysis were included (which required the omission of RLL-classified datapoints), overall variation between samples occurred in terms of cations (roughly parallel to PC1) and nutrients (roughly parallel to PC2). TDS, which is the mass of residue left after evaporating a given volume of water, sits between these two axes. One geomorphic class, polygonal patterned ponding, exhibited much more variation than the other classes.

Figure 10: Principal component analysis of extended chemical suite grouped by geomorphic class, used to visualize trends in variation before further analysis
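For readers unfamiliar with how variables come to lie "roughly parallel" to a component, the sketch below computes the first principal component of a small standardized table in plain Python. It is an illustration under stated assumptions, not the analysis behind Figure 10: power iteration is used in place of a full eigendecomposition, and the function names and the values for Ca, Mg and TP are invented.

```python
import math

def standardize(rows):
    """Centre each column and scale it to unit (population) standard deviation."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def correlation_matrix(z):
    """Correlation matrix of already standardized rows."""
    n, p = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / n for j in range(p)] for i in range(p)]

def first_component(matrix, iterations=500):
    """Leading eigenvector (loadings) and eigenvalue by power iteration."""
    p = len(matrix)
    # Uneven start vector; power iteration fails only if it is exactly
    # orthogonal to the leading eigenvector, which is unlikely in practice.
    vec = [1.0 + i for i in range(p)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(p))
                     for i in range(p))
    return vec, eigenvalue

# Invented values: columns are Ca, Mg, TP. Ca and Mg rise together; TP does not.
rows = [(1.0, 2.0, 5.0), (2.0, 4.0, 3.0), (3.0, 6.0, 6.0), (4.0, 8.0, 2.0), (5.0, 10.0, 4.0)]
corr = correlation_matrix(standardize(rows))
loadings, eigenvalue = first_component(corr)
explained_share = eigenvalue / len(corr)  # fraction of total variance on PC1
```

In this toy table the two cations receive identical loadings on PC1, which is the sense in which correlated variables point along the same axis in a biplot.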
