Module 18

Multiple regression II

Introduction

  • The current module covers advanced topics in multiple regression, including (1) dummy coding for categorical variables, (2) partial correlation and part correlation, and (3) multicollinearity.

    • Dummy variable: From what we have learned so far, we know that linear regression analyzes relationships between continuous variables. If we need to include a categorical variable as a regression predictor, we can dummify it: use the binary values 0 and 1 to represent the absence and presence of a qualitative attribute.

    • Partial correlation and part correlation: From the previous module (Module 17), we know that we can examine an association while taking the effect of a third variable into account. These concepts, and how to implement them, are especially important when you would like to (1) remove a confounding effect and/or (2) rule out a possible alternative explanation of a result. How a third variable may influence the original effect of the IV on the DV will be discussed further in Module 19.

1. Dummy Variable

1.1 What is a dummy variable and why do we need it?

  • A dummy variable is a numeric variable that represents categorical data with two or more levels, such as gender, faculty, race, experimental condition, etc.

  • Creating dummy variables allows you to test the effect of a categorical IV when the statistical test only allows continuous IVs (as in regression analysis).

  • A dummy variable is a dichotomous, quantitative variable that takes only two values, 0 or 1, to indicate the absence or presence of a categorical effect on the outcome; the level coded 0 is the reference level.

    • Some variables are binary by nature, such as gender. If participants identified themselves as "female" or "male", you can recode the variable as female = 0, male = 1 (or vice versa); you are then estimating the effect of the male group with the female group as the reference level.

    • You can also recode a continuous variable into a dummy variable, for example, the number of romantic relationships (0 = no experience of a romantic relationship; 1 = experience of at least one romantic relationship).

    • Some variables have more than two groups, like Faculty (1 = BA, 2 = BSS, 3 = BBA) or Experimental Condition (e.g., 1 = experimental group A, 2 = experimental group B, 3 = control group). We then need a standard procedure to dummify the categorical IV.

1.2 How to do dummy coding on a categorical variable (conceptually)?

  • We need to create new dummy variables. The number of dummy variables is the number of groups in the categorical variable minus 1. For example, if the categorical variable has 3 levels, the number of dummy variables is k − 1 = 3 − 1 = 2; if it has 4 levels, the number of dummies is 3, and so on.

Here are the steps to create dummy variables:

    • Assume we have a categorical variable Condition with three values: 1 = Experimental Group A, 2 = Experimental Group B, 3 = Control.

    • We need to create two dummy variables; let's name them DummyA and DummyB.

      • DummyA: if Condition equals Group A (Condition == 1),
        the value of DummyA is 1; otherwise, 0.

      • DummyB: if Condition equals Group B (Condition == 2),
        the value of DummyB is 1; otherwise, 0.

      • The Control condition is then indicated by both DummyA and DummyB being 0.

  • The illustration on the right shows what correct dummy coding looks like. Below is one possible way to create dummy variables using "Transform", described step by step.
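The coding rule above can also be sketched in a few lines of Python with pandas (a hypothetical minimal example; the module itself uses jamovi's Transform menu, but the logic is identical):

```python
import pandas as pd

# Hypothetical Condition variable: 1 = Group A, 2 = Group B, 3 = Control
df = pd.DataFrame({"Condition": [1, 2, 3, 1, 2, 3]})

# k - 1 = 2 dummies; the Control group (both dummies = 0) is the reference
df["DummyA"] = (df["Condition"] == 1).astype(int)
df["DummyB"] = (df["Condition"] == 2).astype(int)

print(df)
```

Each row of Control ends up with DummyA = 0 and DummyB = 0, which is exactly how the reference level is encoded.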

1.3 Example of creating a set of dummy variables for "Faculty"

An instructor is interested in whether the number of hours students spend on social network sites (SNS) influences their academic performance (GPA). He collected data from the university, but noticed that students' faculty could be a confounding factor in the SNS–GPA relation (e.g., different faculties have different grading standards). Thus, he conducted a hierarchical regression to examine whether, after accounting for students' faculty, SNS significantly and incrementally predicts GPA.

He decided to recode Faculty (values: 1 = BA; 2 = BSS; 3 = BBA) into two dummy variables: dummyBA and dummyBSS. This is how he did the dummy coding:

  • if the Faculty is BA, the value of dummyBA is 1; dummyBSS is 0

  • if the Faculty is BSS, the value of dummyBA is 0; dummyBSS is 1

  • if the Faculty is BBA, the values of both dummyBA and dummyBSS are 0

To perform this,

  • click "Transform", select "Faculty" as the "Source variable", and then click "Create New Transform"

  • create one Transform for each dummy variable: click "Add recode condition", then

    • for dummyBA: set if $source (Faculty) == 1, use 1, else use 0

    • for dummyBSS: set if $source (Faculty) == 2, use 1, else use 0

After creating two dummy variables for Faculty, select "Linear Regression” under “Regression” in jamovi, and then

  • select

    • "GPA" as Dependent Variable

    • "SNS" as Covariates,

    • "dummyBA" & "dummyBSS" as Factors

  • Since he would like to consider the effect of Faculty on GPA first, we put "dummyBA" & "dummyBSS" in Block 1, and then "SNS" in Block 2.


By doing so, we can rely on the F-test for Block 1 to tell whether the two dummy variables (dummyBA and dummyBSS) significantly predict the dependent variable.
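Under the hood, this blockwise comparison is an incremental F-test between two nested models. A sketch in Python with statsmodels, using simulated stand-in data (the module's actual dataset and variable values are not reproduced here):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Simulated stand-in data (hypothetical, not the module's dataset)
rng = np.random.default_rng(0)
n = 300
faculty = rng.integers(1, 4, n)             # 1 = BA, 2 = BSS, 3 = BBA
sns = rng.uniform(0, 10, n)                 # hours on social network sites
gpa = 3.3 - 0.03 * sns + rng.normal(0, 0.3, n)

df = pd.DataFrame({
    "GPA": gpa,
    "SNS": sns,
    "dummyBA": (faculty == 1).astype(int),
    "dummyBSS": (faculty == 2).astype(int),
})

# Block 1: faculty dummies only; Block 2: add SNS
block1 = smf.ols("GPA ~ dummyBA + dummyBSS", data=df).fit()
block2 = smf.ols("GPA ~ dummyBA + dummyBSS + SNS", data=df).fit()

# Incremental F-test: does SNS add predictive power beyond faculty?
print(anova_lm(block1, block2))
print("Delta R^2 =", block2.rsquared - block1.rsquared)
```

The F statistic reported by `anova_lm` for the model comparison corresponds to the Block 2 "R² change" test that jamovi reports in its Model Comparisons table.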


(For other statistics, such as the model F-test, parameter t-tests, and assumption checks, please refer to the previous module.)


To interpret the result, we can conclude that:

  • The effect of faculty on GPA was not significant. The Block 1 model (including the two dummy variables) was not significant, F(2, 997) = 0.82, p = .44; none of the dummy variables were significant, ps > .19.

  • After accounting for the effect of faculty, SNS significantly predicted GPA, B = -0.03, SE = 0.01, p < .001, and explained an additional 3% of the variance.

To interpret the results for the dummy variables (although in the current example they are not significant):

  • The estimated effect of dummyBA was positive but non-significant, suggesting that BA students (versus BBA students, the reference group) have a slightly higher GPA, though the effect was not significant, B = 0.02, SE = 0.03, p = .45.

  • The estimated effect of dummyBSS was positive but non-significant, suggesting that BSS students (versus BBA students, the reference group) have a slightly higher GPA, though the effect was not significant, B = 0.04, SE = 0.03, p = .19.

Regression model equation:

Estimated GPA = 3.296 + 0.021 × dummyBA + 0.037 × dummyBSS − 0.028 × SNS

(given dummyBA: 1 = BA, 0 = otherwise; dummyBSS: 1 = BSS, 0 = otherwise; BBA students have both dummies equal to 0)
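To read the equation, plug in values. For instance, for a hypothetical BA student (dummyBA = 1, dummyBSS = 0) who spends 2 hours on SNS:

```python
# Predicted GPA for a BA student (dummyBA = 1, dummyBSS = 0) with SNS = 2,
# using the fitted coefficients from the regression equation above
gpa_hat = 3.296 + 0.021 * 1 + 0.037 * 0 - 0.028 * 2
print(round(gpa_hat, 3))  # 3.261
```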

2. Partial correlation vs. Part correlation

2.1 Partial Correlation

2.1.1 What is Partial Correlation?

  • Partial correlation measures the strength of the relationship between two variables while controlling for the effect of one or more other variables (also known as control variables or covariates).

  • It is very similar to the ordinary Pearson correlation: it ranges from −1 (perfect negative correlation) to +1 (perfect positive correlation). However, the partial correlation can be larger or smaller than the zero-order correlation between the two variables (still within the range −1 to +1), depending on the effect of the control variable(s).

  • Partial correlation is best understood in terms of multiple regression: the partial correlation coefficient for a predictor X describes the relationship between Y and X when all other predictors in the model are held fixed. The r statistic displayed with the main regression results is the partial correlation.

  • The general form of partial correlation (computed from the pairwise correlations), and the equivalent form calculated from a multiple regression, are as follows:

    r_XY·Z = (r_XY − r_XZ · r_YZ) / √[(1 − r_XZ²)(1 − r_YZ²)]

where r is the correlation coefficient, X is the independent variable, Y is the dependent variable, and Z is the control variable

OR

    r_partial = t_k / √(t_k² + df_res)

where t_k is the Student t statistic for the kth term in the linear regression model and df_res is the residual degrees of freedom
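The first form can be checked numerically. Below is a sketch in Python with NumPy on simulated (hypothetical) data; the test of its correctness is that the formula matches the residual-based definition of partial correlation (correlate x and y after regressing each on z):

```python
import numpy as np

# Simulated (hypothetical) data: z confounds the x-y relationship
rng = np.random.default_rng(1)
z = rng.normal(size=500)
x = 0.6 * z + rng.normal(size=500)
y = -0.5 * z + 0.3 * x + rng.normal(size=500)

# Pairwise (zero-order) Pearson correlations
r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]

# Partial correlation of x and y, controlling for z
r_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(f"zero-order r = {r_xy:.3f}, partial r = {r_xy_z:.3f}")
```

Because z is positively related to x but negatively related to y here, removing it changes the observed x–y correlation noticeably.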

2.1.2 Assumption of Partial Correlation

  • Before using partial correlation, there are five assumptions you need to check in order to obtain valid results:

    • Assumption 1: There must be one independent variable and one dependent variable, both measured on a continuous (interval or ratio) scale.

    • Assumption 2: You have one or more control variables (also known as covariates) used to adjust the relationship between the independent and dependent variables. The control variables must also be continuous.

    • Assumption 3: There must be a linear relationship among all three variables; that is, the control variable is associated with both the independent and dependent variables, and the independent variable is associated with the dependent variable.

    • Assumption 4: There should be no significant outliers (or as few as possible). Because partial correlation is sensitive to outliers, the correlation coefficient and the line of best fit may be greatly affected, resulting in inaccurate conclusions.

    • Assumption 5: All variables should be approximately normally distributed, as bivariate normality between variables is needed to assess the statistical significance of the partial correlation.

2.1.3 Example of partial correlation

Recent studies suggest that the more time students spend on social media, the worse their school grades tend to be, particularly among secondary students. Some people argue that this occurs because of procrastination. However, it remains unknown whether this also applies to university students. So we want to know the relationship between hours spent on social media and GPA, given that procrastination is associated with both hours spent on social media and GPA. In order to find the true relationship, we need to control for procrastination (the IPS variable).


  • Q: How do we find the relationship between "SNS" and "GPA" while controlling for "IPS"?

  • A: We use “Partial Correlation” under “Regression”.

Example 18.2.1 Partial correlation.mp4

Result Interpretation

  • The zero-order correlation between SNS and GPA was significant, r(998) = -.17, p < .001. After considering the effect of IPS using partial correlation, the correlation between SNS and GPA was still significant, r(997) = -.16, p < .001.

2.2 Part Correlation

2.2.1 What is Part Correlation (Semi-Partial Correlation)?

  • Part correlation, also known as semi-partial correlation, is almost the same as partial correlation, except that the control variable is held constant for the independent variable but not for the dependent variable. This recalls a familiar concept: the confounding variable.

  • In other words, part correlation takes into account the influence of the control variable on the independent variable only; it does not control for the influence of the control variable on the dependent variable.

  • The general form of part correlation is as follows:

    r_Y(X·Z) = (r_XY − r_YZ · r_XZ) / √(1 − r_XZ²)

where r is the correlation coefficient, X is the independent variable, Y is the dependent variable, and Z is the control variable

  • r_Y(X·Z) is sometimes referred to as a first-order part correlation, to note that the correlation controls for only one other variable.

  • The main reason for conducting part correlation instead of partial correlation is that part correlation shows how much unique variance the independent variable explains relative to the total variance in the dependent variable, rather than relative to only the variance left unaccounted for by the control variables.
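The distinction can be made concrete in code. A sketch in Python with NumPy on simulated (hypothetical) data: the control variable z is removed from the IV x only, never from the DV y:

```python
import numpy as np

# Simulated (hypothetical) data: z (think Sleep) drives x (think SNS)
rng = np.random.default_rng(2)
z = rng.normal(size=500)
x = 0.7 * z + rng.normal(size=500)   # IV, influenced by the control
y = -0.4 * x + rng.normal(size=500)  # DV

r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]

# Part (semi-partial) correlation: only x is adjusted for z
r_part = (r_xy - r_yz * r_xz) / np.sqrt(1 - r_xz**2)
print(f"zero-order r = {r_xy:.3f}, part r = {r_part:.3f}")
```

Equivalently, the part correlation is the plain Pearson correlation between the raw y and the residual of x after regressing x on z, which is why it measures the unique contribution of x against the total variance of y.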

2.2.2 Example of part correlation

The same study also suggests that sleeping hours could be another factor that influences students' school performance, apart from the time they spend on social media. Some people argue that this occurs because of a trade-off effect: time on social media reduces sleeping hours. So we want to know the relationship between hours spent on social media and GPA, where sleeping hours are associated with social media use; that is, we remove the effect of sleeping hours from the social media variable, but not from GPA. To find such a relationship, we control for sleeping hours on the IV side only.

  • Q: How do we find the relationship between "SNS" and "GPA" while controlling for "Sleep"?

  • A: We use "Partial Correlation" under "Regression" in jamovi, choosing the semipartial (part) correlation type.

Example 18.3.1 Part correlation.mp4

Result Interpretation

  • The zero-order correlation between SNS and GPA was significant, r(998) = -.174, p < .001. After removing the effect of Sleep from SNS using the semi-partial correlation, the correlation between SNS and GPA was still significant, r(997) = -.177, p < .001.

Module Exercise (4% of total course assessment)

Complete the exercise!

    • Now, if you think you're ready for the exercise, you can check your email for the link.

    • Remember to submit your answers before the deadline in order to earn the credits!