Data Exploration

Data Table

The data table for this study (figure 6) contains five categorical landscape variables and 58 continuous water biogeochemistry variables in three subgroups: organic matter indicators (n=18), physico-chemical properties (n=18), and trace metal concentrations (n=22). 

Figure 6. An excerpt of the data table used in this study, showing the measured landscape factors, organic matter parameters, physico-chemical properties, and trace metal concentrations.

Exploratory Graphics

Dissolved Organic Matter Indices: Checking for Known Relationships

Because organic matter lability is a continuum from more protein-like to more humic-like compounds, we expect positive relationships between indices representing the same "type" of organic matter and negative relationships between indices representing opposite ends of the continuum. Scatter plots were used to confirm these expected relationships within our dataset.

Figure 7. Scatter plots of Coble Peak values of northern lakes. Peaks b and t are associated with protein-like compounds and peaks a, m, and c are associated with humic-like compounds.

Coble peaks b and t correspond to more protein-like organic matter compounds, while peaks a, m, and c reflect more humic-like compounds. These contrasting properties are seen in the scatter plots to the left (figure 7), where positive relationships are observed between the protein-like peaks (b, t) and between the humic-like peaks (m, c). Likewise, there are negative relationships between the protein-like peaks (b, t) and the humic-like peaks (m, c).

Peak a is typically associated with humic-like compounds but has a positive relationship with protein-like peaks b and t in this dataset. This is not overly concerning because organic matter composition is a continuum and does not have exclusive categories. Additionally, the high number of organic matter composition indices used in this study will compensate for any inconsistencies with a single index.
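As a numerical complement to the scatter plots, the direction of each pairwise relationship can be checked with a rank correlation matrix. The sketch below is illustrative only: the column names (peak_b, peak_t, peak_m, peak_c), the helper correlation_signs, and the synthetic values are assumptions made for this example and are not taken from the study's data table or analysis scripts.

```python
import numpy as np
import pandas as pd


def correlation_signs(df, columns, method="spearman"):
    """Return the pairwise correlation matrix and the sign of each coefficient."""
    corr = df[columns].corr(method=method)
    return corr, np.sign(corr)


# Synthetic illustration only: one hidden gradient running from
# protein-like (0) to humic-like (1) organic matter.
rng = np.random.default_rng(0)
gradient = rng.uniform(0, 1, 50)
demo = pd.DataFrame({
    "peak_b": 1 - gradient + rng.normal(0, 0.05, 50),  # protein-like
    "peak_t": 1 - gradient + rng.normal(0, 0.05, 50),  # protein-like
    "peak_m": gradient + rng.normal(0, 0.05, 50),      # humic-like
    "peak_c": gradient + rng.normal(0, 0.05, 50),      # humic-like
})

corr, signs = correlation_signs(demo, list(demo.columns))
print(signs)
```

A rank-based (Spearman) coefficient is used here because it does not require the variables to be normally distributed, which suits a check made before any transformation.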

The Biological Index (BIX) and Fluorescence Index (FI) are both correlated with more recently produced organic matter, while the Humification Index (HIX) reflects older, more humic-like compounds. 

As expected, we see a positive relationship between BIX and FI and a negative relationship between each of these indices and HIX (figure 8).

Figure 8. Scatter plots of organic matter composition index values of northern lakes. Biological index (BIX) and fluorescence index (FI) are associated with protein-like compounds and humification index (HIX) is associated with humic-like compounds.

Data Inspection & Normalization

While many water biogeochemistry parameters had points that appeared to be outliers, these points actually represent lakes with unique water biogeochemistry. Because these points represent actual observations of lake conditions, no outliers were excluded from data analysis.


Many lake samples had trace metal concentrations below the detection limit of the instrument. In these cases, the analytical lab recorded the concentration as the detection limit. Trace metals for which more than 1/3 of samples were at or below the detection limit were excluded from data analysis, as these values may not accurately reflect variation in lake water biogeochemistry.

Notably, quarry sites commonly had relatively high concentrations of most of the trace metals that were removed under this detection limit criterion.

Variables removed: Be, Cd, Cs, Pb, Se, Ag, Sr, Tl, U, Zn, Mn, NH3, SO
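The detection limit screen described above can be sketched as a simple column filter. This is a minimal illustration under stated assumptions, not the procedure actually run for this study: the function name drop_censored_variables, the detection limit values, and the concentrations in the demo table are all hypothetical.

```python
import pandas as pd


def drop_censored_variables(df, detection_limits, max_fraction=1 / 3):
    """Drop columns where more than max_fraction of samples sit at or below
    the detection limit (values recorded as the limit count as censored)."""
    dropped = []
    for column, limit in detection_limits.items():
        values = df[column].dropna()
        fraction_censored = (values <= limit).mean()
        if fraction_censored > max_fraction:
            dropped.append(column)
    return df.drop(columns=dropped), dropped


# Hypothetical concentrations and detection limits, for illustration only.
demo = pd.DataFrame({
    "Cd": [0.01, 0.01, 0.01, 0.05, 0.01, 0.02],  # 4 of 6 at the limit
    "Fe": [12.0, 30.5, 8.1, 0.5, 44.0, 19.3],    # 1 of 6 at the limit
})
limits = {"Cd": 0.01, "Fe": 0.5}

kept, dropped = drop_censored_variables(demo, limits)
print(dropped)
```

Because the lab recorded non-detects as the detection limit itself, the comparison uses "at or below" rather than "below" so that those recorded values are counted as censored.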


Histograms and boxplots were used to check all water biogeochemistry parameters for normality (figure 9). Many variables had non-normal distributions, primarily right-skewed (most values clustered at the low end with a long tail toward high values), as is common with water chemistry data. Because planned analytical methods, such as principal component analysis, assume normality, variables with non-normal distributions were transformed using a log or inverse log transformation.

Variables Right-Skewed: Coble Peak c, C3, DOC, all physico-chemical properties (excluding pH) and trace metals

Variables Left-Skewed: Coble Peak a, pH

Figure 9. Boxplots of standardized values of water biogeochemistry variables before (left) and after (right) transformation to achieve near-normal distribution.
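The skewness screening and log transformation step could be sketched as follows. This is an illustration under assumptions rather than the study's actual workflow: the skewness threshold of 1.0, the function name transform_skewed, the base-10 logarithm, and the synthetic data are all choices made for this example. Only the log transform for variables with a long tail toward high values is shown; the inverse log transformation is not implemented because its exact form is not specified in the text.

```python
import numpy as np
import pandas as pd


def transform_skewed(df, threshold=1.0):
    """Log10-transform strictly positive columns whose sample skewness
    exceeds the threshold; leave all other columns unchanged."""
    out = df.copy()
    applied = {}
    for column in df.columns:
        skew = df[column].skew()
        if skew > threshold and (df[column] > 0).all():
            out[column] = np.log10(df[column])
            applied[column] = "log10"
        else:
            applied[column] = "none"
    return out, applied


# Synthetic illustration only: one variable with a long upper tail
# and one roughly symmetric variable.
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "DOC": rng.lognormal(mean=1.0, sigma=1.0, size=200),
    "example_symmetric": rng.normal(7.5, 0.5, 200),
})

transformed, applied = transform_skewed(demo)
print(applied)
print(demo.skew().round(2), transformed.skew().round(2), sep="\n")
```

Comparing the skewness before and after transformation gives a quick numerical counterpart to the paired boxplots in figure 9.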