This page uses the following spreadsheet. You should open it in a new window for reference, and you can make your own personal copy to play around with the variables.
Partial Correlation is a technique used to measure the relationship between two variables while controlling for the effect of other variables that correlate with both.
For example, let's consider a hypothetical study with the following variables:
Motivation (M): Students are surveyed on their motivation to study science
Engagement (E): This is a metric calculated based on engagement within SLS
Learning (L): This is the difference between students' grades on a post-test vs pre-test
Columns A to D are used to generate an artificial dataset. You can adjust the parameters in rows 2-4 of columns B-D to see how the analysis continues to apply (these are the yellow boxes)
In this case study, we take advantage of the fact that this is an artificial dataset to set up a clear-cut example.
Let's tune the parameters behind the scenes such that
Engagement is correlated to motivation: E = bₑ M + cₑ
Learning is correlated to engagement: L = bₗ E + cₗ
We also introduce a normally-distributed random error to simulate a real world spread of results. The amount of error can be adjusted using row 3 (labeled as "error tuning" in the spreadsheet).
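To see how such a dataset arises, here is a minimal Python sketch of the same generating process. The parameter values and variable names below are illustrative, not the spreadsheet's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters standing in for the yellow boxes in the spreadsheet
n = 200
b_e, c_e = 2.0, 1.0    # Engagement = b_e * Motivation + c_e
b_l, c_l = -1.0, 5.0   # Learning   = b_l * Engagement + c_l
noise = 0.5            # "error tuning"

M = rng.uniform(1, 5, n)                       # surveyed motivation scores
E = b_e * M + c_e + rng.normal(0, noise, n)    # engagement driven by motivation
L = b_l * E + c_l + rng.normal(0, noise, n)    # learning driven by engagement

# All three pairwise correlations come out strong, including Learning vs
# Motivation, even though L was generated with no direct dependence on M.
print(np.corrcoef(E, M)[0, 1])
print(np.corrcoef(L, E)[0, 1])
print(np.corrcoef(L, M)[0, 1])
```

Because L depends on E, and E depends on M, the L-M correlation emerges automatically, which is exactly the effect the scatterplots illustrate.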
You can see from the scatterplots in columns F to K that:
As expected, there is a correlation between Engagement and Motivation.
Similarly, there is also a correlation between Learning and Engagement.
However, because Learning correlates to Engagement, which in turn correlates to Motivation, we end up with a correlation between Learning and Motivation!
Even though we expected correlations #1 and #2, correlation #3 is an emergent property that gives the illusion of a direct relationship between learning and motivation!
Partial correlation is used to adjust for the effect of a particular variable (usually phrased as "partialled out"). In this case, we want to adjust for Engagement to see whether Motivation truly affects Learning (i.e. we "partial out" the effect of Engagement).
Let's pretend we don't know that this is an artificial dataset whose 'true' relationships we already know. In a real-world experiment, we might suspect that some of the correlations in the data are spurious. So we propose the following hypothesis:
Engagement correlates to Motivation
Learning correlates to Engagement
which is expressed in the following model: Motivation → Engagement → Learning
Now, we want to test this hypothesis!
The core philosophy is as follows:
If the relationship between Motivation and Engagement is known, we can determine the expected value of Motivation for each value of Engagement.
Similarly, if the relationship between Learning and Engagement is known, we can determine the expected value of Learning for each value of Engagement.
Comparing the expected value of Motivation and Learning to the actual data (observed value) of Motivation and Learning, we see some discrepancies.
We then compute (Observed value – Expected value), i.e. the residual, for each matched pair of Motivation and Learning values corresponding to the same value of Engagement. The reason for this is that by subtracting the expected value that was calculated from Engagement, we remove the expected influence of Engagement.
Therefore, plotting (Observed value – Expected value) for Learning against Motivation will tell us the influence of Motivation on Learning after partialling-out the effect of Engagement.
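The steps above can be sketched in Python with simulated data (all names and parameter values here are illustrative, not the spreadsheet's own calculation):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data in the same spirit as the spreadsheet: Learning depends
# only on Engagement, which in turn depends on Motivation.
n = 200
M = rng.uniform(1, 5, n)
E = 2.0 * M + 1.0 + rng.normal(0, 0.5, n)
L = -1.0 * E + 5.0 + rng.normal(0, 0.5, n)

# Step 1: regress Motivation on Engagement to get the expected M for each E.
b_m, c_m = np.polyfit(E, M, 1)
M_resid = M - (b_m * E + c_m)       # Observed - Expected for Motivation

# Step 2: regress Learning on Engagement to get the expected L for each E.
b_l, c_l = np.polyfit(E, L, 1)
L_resid = L - (b_l * E + c_l)       # Observed - Expected for Learning

# Step 3: correlate the residuals. With Engagement partialled out, the
# correlation between Learning and Motivation collapses towards zero.
r_partial = np.corrcoef(L_resid, M_resid)[0, 1]
print(round(r_partial, 3))
```

Because Learning was generated with no direct dependence on Motivation, the residual correlation comes out near zero, mirroring the flat scatterplot described below.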
This is shown in Columns L to O of the spreadsheet:
You can see that there is pretty much no evidence of a direct relationship between Learning and Motivation: the regression weight is close to zero, and the coefficient of determination (r²) is extremely low, suggesting that the spread is mostly random and not closely related to the line of best fit.
Our conclusion is that there is no evidence of a relationship between Learning and Motivation, so the apparent effect between Learning and Motivation is actually due to the mediating effect of Engagement.
For the sake of demonstration, in columns Q to U, we test the opposite hypothesis:
The result is as shown:
From this, we can see that even after removing the mediating effect of motivation, there is still evidence of a correlation between Learning and Engagement. Therefore, we are more likely to reject this new hypothesis as compared to our previous hypothesis.
Additionally, the obtained weight (b = -0.987) is fairly close to the value used to generate this artificial dataset (b = -1). Naturally, the intercept c differs, since taking residuals re-centers the data at 0.
For further technical reading, see: https://en.wikipedia.org/wiki/Partial_correlation
It is actually very uncommon for researchers to calculate (Observed – Expected) values (i.e. residuals), because the partial correlation can be calculated directly from the pairwise correlation coefficients using the following general formula:

r₁₂,₃ = (r₁₂ − r₁₃ r₂₃) / √[(1 − r₁₃²)(1 − r₂₃²)]

where r₁₂,₃ is the correlation coefficient r between variables 1 and 2 after partialling out the effect of variable 3, while r₁₂ is the correlation coefficient between variables 1 and 2; and so on.
Reminder: r² is the coefficient of determination. Taking the root of r² gives r, which is the correlation coefficient.
See also: https://en.wikipedia.org/wiki/Coefficient_of_determination#Coefficient_of_partial_determination
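As a sketch, the formula can be applied directly to the pairwise correlation coefficients of a simulated dataset (the data-generating values here are illustrative, not the spreadsheet's):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data as before: variable 1 = Learning, 2 = Motivation,
# 3 = Engagement, with Learning depending only on Engagement.
n = 200
M = rng.uniform(1, 5, n)
E = 2.0 * M + 1.0 + rng.normal(0, 0.5, n)
L = -1.0 * E + 5.0 + rng.normal(0, 0.5, n)

def r(x, y):
    """Pearson correlation coefficient between two arrays."""
    return np.corrcoef(x, y)[0, 1]

r12, r13, r23 = r(L, M), r(L, E), r(M, E)

# Partial correlation of Learning and Motivation, controlling for Engagement,
# computed from the pairwise coefficients alone (no residuals needed).
r12_3 = (r12 - r13 * r23) / np.sqrt((1 - r13**2) * (1 - r23**2))
print(round(r12_3, 3))   # near zero: no direct L-M relationship
```

This gives the same answer as the residual-based procedure shown earlier, which is why researchers rarely compute residuals by hand.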
Furthermore, you can simply run the data through statistical software (e.g. SPSS, R, Python) and obtain these values automatically. The scatterplots above therefore serve no practical purpose; they are only there to help the learner understand the underlying principles of partial correlation.
The functionality of the embedded spreadsheet is fairly straightforward.
Columns L to O perform partial correlation to remove the effect of Engagement to evaluate the direct relationship between Learning and Motivation.
In rows 3-4, the coefficients for the linear relationships are calculated.
Row 3 yields the coefficients for M = bₘ E + cₘ, allowing us to predict Motivation for a given value of Engagement.
Row 4 yields the coefficients for L = bₗ E + cₗ, allowing us to predict Learning for a given value of Engagement.
Below the scatterplot, columns L and M make use of the coefficients in rows 3-4 to calculate the expected values, and columns N and O calculate (Observed – Expected) values for each row of the raw data. We then plot columns N and O in the scatterplot.
Columns Q to T have identical functionality, but instead remove the effect of Motivation to determine the relationship between Learning and Engagement.