This page describes the preliminary data exploration carried out before analysis: checking distributions, applying transformations, and removing outliers or other erroneous data. It also presents some visualizations of the dataset used.
The ecozone data was spatially joined to the weather station observations to assign ecozone codes to each weather station. Observations missing ecozone data were removed.
A log transformation, f(x) = log(x + 1), was applied to the precipitation data to reduce skewness, and missing average temperature values were calculated from the minimum and maximum temperature readings.
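As a rough illustration of these two preprocessing steps, the sketch below applies f(x) = log(x + 1) and fills a missing average temperature. It rests on two assumptions not stated above: that the logarithm is the natural log, and that a missing average is estimated as the midpoint of the minimum and maximum readings. The function names and sample records are hypothetical.

```python
import math


def log_transform(precip):
    """Apply f(x) = log(x + 1), assuming the natural log; zero precipitation maps to 0."""
    return math.log1p(precip)


def fill_mean_temp(tavg, tmin, tmax):
    """Return tavg if present; otherwise the midpoint of tmin and tmax (assumed rule)."""
    if tavg is not None:
        return tavg
    if tmin is None or tmax is None:
        return None  # cannot estimate without both extremes
    return (tmin + tmax) / 2.0


# Made-up monthly records for illustration only.
records = [
    {"precip": 0.0, "tavg": 4.5, "tmin": 1.0, "tmax": 8.0},
    {"precip": 120.0, "tavg": None, "tmin": -2.0, "tmax": 6.0},
]

cleaned = [
    {
        "log_precip": log_transform(r["precip"]),
        "tavg": fill_mean_temp(r["tavg"], r["tmin"], r["tmax"]),
    }
    for r in records
]
```

Adding 1 before taking the log keeps months with zero precipitation defined, which a plain log would not.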
Linear models were fit with the monthly climate observations (mean monthly precipitation and average monthly temperature) as the response and the Level 4 ecozone delineation as the predictor.
Residuals were extracted from the linear models, with both the average and maximum residuals calculated to identify observations deviating most from the ecozone averages.
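The residual step can be sketched as follows. When a linear model has a single categorical predictor, its fitted value for each observation is the mean of that observation's group, so each residual is the observation minus its ecozone mean for that month. The sketch assumes that the "average" and "maximum" residuals are taken over absolute values across months; the text above does not specify this. The function name, field names, and sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def ecozone_residuals(rows, months):
    """Rank stations by how far their monthly values sit from their ecozone means."""
    # Group stations by ecozone so the per-month group means can be computed.
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["ecozone"]].append(row)

    # Fitted value of a one-factor linear model = ecozone mean for that month.
    zone_means = {
        zone: {m: mean(r[m] for r in members) for m in months}
        for zone, members in grouped.items()
    }

    ranked = []
    for row in rows:
        resid = {m: row[m] - zone_means[row["ecozone"]][m] for m in months}
        magnitudes = [abs(v) for v in resid.values()]
        ranked.append({
            "station": row["station"],
            "residuals": resid,
            "mean_abs_residual": mean(magnitudes),  # assumed definition
            "max_abs_residual": max(magnitudes),    # assumed definition
        })

    # Largest deviations first.
    ranked.sort(key=lambda r: r["max_abs_residual"], reverse=True)
    return ranked


# Made-up temperatures for three stations in one ecozone.
rows = [
    {"station": "A", "ecozone": "z1", "jan": -10.0, "jul": 18.0},
    {"station": "B", "ecozone": "z1", "jan": -12.0, "jul": 20.0},
    {"station": "C", "ecozone": "z1", "jan": -2.0, "jul": 19.0},
]
ranked = ecozone_residuals(rows, ["jan", "jul"])
```

In this made-up example station C ranks first, because its January value is furthest from the ecozone mean.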
The modified dataset, including residuals, was exported to Excel and sorted by maximum residual, with the largest deviations listed at the top of the sheet.
A heatmap was applied to the monthly climate observations and their residuals, and the highest-magnitude residuals (both positive and negative) were inspected visually to detect observations that deviate substantially from expected patterns.
No significant outliers were found and the data is ready for further analysis.
These boxplots show the distributions of the monthly average temperatures reported by the weather stations. They are subdivided into biomes to enhance readability.
Clear seasonal patterns are visible, and the distributions are approximately normal and fairly tight.
These boxplots show the distributions of the monthly average precipitation reported by the weather stations. They are subdivided into biomes to enhance readability.
Seasonal patterns are less clear here than in the temperature data, but some are still present. The y-axis uses a log scale to show the distributions more clearly. Many of the boxplots show considerably more variance than the temperature boxplots.
Before performing principal component analysis (PCA) on our data, we must first check the assumption of normality and ensure that the variables are scaled, so that variables with larger values do not overshadow those with smaller values.
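Scaling matters because PCA looks for directions of greatest variance, so a variable measured in large units would otherwise dominate the components. A minimal sketch of one common approach, z-score standardization, is given below. The text does not say which scaling method was used, so the choice is an assumption, and the function name and sample values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: subtract the mean, divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("a constant variable cannot be standardized")
    return [(v - mu) / sigma for v in values]


# Made-up values for two variables on very different scales.
annual_temp_c = [-1.5, 0.8, 2.4, 3.9, 5.1]
annual_precip_mm = [310.0, 450.0, 520.0, 780.0, 1240.0]

scaled = {
    "temp": standardize(annual_temp_c),
    "precip": standardize(annual_precip_mm),
}
```

After standardization both variables have mean 0 and standard deviation 1, so neither outweighs the other purely because of its units.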
Shown here are the distributions of the annual climate data generated by ClimateNA for our 5 km grid point sample. Many of the distributions appear non-normal, and the scale varies greatly between variables.
Shown here are the distributions of the ClimateNA annual variables after scaling. Although the variables now share the same scale and are roughly centered around 0, many of the distributions still appear markedly non-normal, because scaling changes a variable's location and spread but not its shape.
Shown here are the distributions of the ClimateNA variables after transforming the non-normal variables and then scaling. Most distributions are now approximately normal and centered around zero. The data is ready for PCA.
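The effect of transforming before scaling can be checked numerically with a skewness statistic. The sketch below is illustrative only: the transformations applied to each ClimateNA variable are not specified here, so it assumes a log(x + 1) transform on a single right-skewed variable, and the function names and sample values are made up.

```python
import math
from statistics import mean, pstdev


def skewness(values):
    """Population skewness: the mean cubed z-score (0 for a symmetric sample)."""
    mu = mean(values)
    sigma = pstdev(values)
    return mean(((v - mu) / sigma) ** 3 for v in values)


def transform_and_scale(values):
    """Apply log(x + 1) (assumed transform), then standardize to mean 0 and sd 1."""
    logged = [math.log1p(v) for v in values]
    mu = mean(logged)
    sigma = pstdev(logged)
    return [(v - mu) / sigma for v in logged]


# Made-up right-skewed sample.
raw = [1.0, 2.0, 3.0, 4.0, 50.0, 200.0]
prepared = transform_and_scale(raw)

skew_before = skewness(raw)
skew_after = skewness(prepared)
```

For this sample the skewness drops after the transform, while the final standardization leaves it unchanged and only fixes the mean and spread.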