The correlation coefficient indicates linear correlation. If you are only interested in the linear relationship between variables (i.e., how well they fit on a straight line), then correlation is a good choice.
Sometimes, however, you are interested in how much a change in X1 (usually scale level) contributes to the change in Y (usually scale level), i.e., what is the slope of the straight line. To answer this question, we conduct a linear regression analysis with X1 as the independent variable and Y as the dependent variable.
Unlike correlation, linear regression can handle more than one independent variable (X1, X2, X3, …) at a time. We can conduct a linear regression analysis with X1, X2, X3, … as independent variables and Y as the dependent variable.
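In other words, the regression estimates an equation of the following form (b0, b1, b2, … are used here simply as labels for the intercept and the coefficients):

Y = b0 + b1*X1 + b2*X2 + b3*X3 + ...

Each coefficient tells you how much Y is expected to change when the corresponding X increases by one unit, holding the other independent variables constant.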
Note: If you have a more complicated model that involves more than two levels of variables, e.g., X1, X2, X3 affect Y1, Y2, and then X2, X3, Y1 affect Z1, Z2, …, then linear regression will not work. In this case you need structural equation modelling (SEM), which is a more advanced technique and is beyond the scope of this website. SEM is not available in the standard installation of SPSS. You need AMOS instead, which I think is very difficult and troublesome to use. Instead, I would recommend the SEMLj module of Jamovi for that purpose. You can see the instructions here: https://semlj.github.io/gui.html.
To run a linear regression analysis in SPSS, do Analyze -> Regression -> Linear:
Put the dependent variable (Y) into the Dependent field. Put all the independent variables (X1, X2, X3, …) into the Independent(s) field.
Click the Statistics button. Check Estimates under Regression Coefficients, and also check Model fit on the right.
If you want to show all the scatter plots in a scatter plot matrix (useful for a quick glance), you can use Graphs -> Chart Builder -> Scatter/Dot -> Scatterplot Matrix, and then drag all the variables concerned into the chart. You can also choose to fit lines to the scatter plots by selecting Linear fit lines in the Element Properties.
SPSS will generate a few tables in the output.
In the Model Summary table, the Adjusted R Square value (usually between 0 and 1) tells you how well the model fits the data. A value close to 1 indicates a good fit; a value close to 0 indicates a poor fit.
In the ANOVA table, if the Sig. (p-value) is <0.05 (or another significance level you have chosen), it means the independent variables, taken together, can reliably predict the dependent variable.
In the Coefficients table, the Unstandardized Coefficients B and the Standardized Coefficients Beta values tell you the strength of the effect of the corresponding variable. The Sig. value, if <0.05 (or another significance level you have chosen), tells you that the coefficient is statistically different from zero. Here you can compare the coefficients and find out the most important factors of Y: the larger the coefficient (in absolute value), the larger the effect of the factor on Y. On the other hand, if Sig. is large, it means that the coefficient is not statistically different from zero, or in other words there is no evidence that the variable has an effect on Y.
If you are doing a one-tailed test, i.e., you want to know if the coefficient is positive (or negative) rather than just non-zero, then you should divide the p-value by two to get the one-tailed p-value (provided the coefficient has the hypothesized sign), and draw your conclusions from it.
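For example (with made-up numbers): if the two-tailed Sig. of a coefficient is 0.08 and the coefficient has the sign you hypothesized, then the one-tailed p-value is 0.08 / 2 = 0.04, which is significant at the 0.05 level.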
To run a linear regression analysis in Jamovi, do Analyses -> Regression -> Linear Regression:
Put the dependent variable (Y) into the Dependent Variable box. Put all the independent variables (X1, X2, X3, …) into the Covariates box.
If you have nominal or ordinal independent variables, put them into the Factors box.
Under Model Fit, choose R and adjusted R-squared.
Under Model Coefficients, choose Standardized Estimate.
If you prefer to show a scatter plot of the variables, you may install the scatr module in Jamovi. After installation, go to Analyses -> Exploration -> Scatterplot. Inside the dialog, choose the X and Y variables, and then choose Linear under Regression Line.
Jamovi will generate a few tables in the output.
In the Model Fit Measures table, the Adjusted R-squared value (usually between 0 and 1) tells you how well the model fits the data. A value close to 1 indicates a good fit; a value close to 0 indicates a poor fit.
In the Model Coefficients table, the Standardized Estimate values tell you the strength of the effect of the corresponding variable. The p value, if <0.05 (or another significance level you have chosen), tells you that the coefficient is statistically different from zero. Here you can compare the coefficients and find out the most important factors of Y: the larger the coefficient (in absolute value), the larger the effect of the factor on Y. On the other hand, if p is large, it means that the coefficient is not statistically different from zero, or in other words there is no evidence that the variable has an effect on Y.
If you are doing a one-tailed test, i.e., you want to know if the coefficient is positive (or negative) rather than just non-zero, then you should divide the p-value by two to get the one-tailed p-value (provided the coefficient has the hypothesized sign), and draw your conclusions from it.
If the dependent variable is ordinal, you can use ordinal regression:
SPSS: Analyze -> Regression -> Ordinal
Jamovi: Analyses -> Regression -> Logistic Regression -> Ordinal Outcomes
which is unfortunately much more complicated. If you are dealing with only one independent variable at a time, you can use a chi-squared test instead.
If the dependent variable is nominal with two levels (e.g., Pass vs. Fail), you can use binary logistic regression:
SPSS: Analyze -> Regression -> Binary Logistic
Jamovi: Analyses -> Regression -> Logistic Regression -> 2 Outcomes Binomial.
If there are more than two levels, use multinomial logistic:
SPSS: Analyze -> Regression -> Multinomial Logistic.
Jamovi: Analyses -> Regression -> Logistic Regression -> N Outcomes Multinomial.
Again, the interpretation is less straightforward than that of regular regression for scale-level variables. If you are dealing with only one independent variable at a time, you can use a chi-squared test instead.
If your variables are not measured on the same scale, then you need to rescale the variables.
For example, if variable A is in [1,5] but M is in [1,3], then comparing their unstandardized coefficients directly would be misleading. The easiest way to fix the problem is to use the standardized coefficients (instead of the unstandardized coefficients) in the output to interpret your results. This way, all the variables are automatically rescaled to standardized z scores, so their coefficients can be compared on an equal footing.
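For reference, standardizing a variable means converting it to a z score with the following formula (SPSS and Jamovi do this for you automatically when computing the standardized coefficients):

A_standardized = (A - mean of A) / (standard deviation of A)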
Alternatively, you can do the rescaling manually if you prefer. For example, suppose that A is in [1,5] but you want to rescale it to [0,1]; then you can apply the following formula in Compute Variable and use the new rescaled variable in your regression:
A_rescaled = (A-1)/4
When A=1, A_rescaled becomes 0. When A=5, A_rescaled becomes 1.
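More generally, to rescale a variable from its original range [min, max] to [0,1], you can use:

A_rescaled = (A - min) / (max - min)

With A in [1,5], this reduces to (A-1)/(5-1) = (A-1)/4, i.e., the formula above.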
(However, see also Note #4 below if you have a moderator effect in your model.)
If you want to study the effect of A on B, but that effect is moderated by another variable M (i.e., the strength of the effect depends on M), then you need to consider the moderator effect as well.
For example, if you think that service quality would affect customer satisfaction, but the effect could be different for customers of different ages, then age is a moderator. You need to consider its effect as well.
To do this, use Compute Variable to create an interaction term using the formula A*M (i.e., multiply the two variables together), and then add this new variable as an independent variable (along with A, M, and other independent variables) in the regression analysis. The coefficient of this interaction term will tell you whether the moderator effect exists or not.
If you want to further interpret how the moderator M moderates the effect of A on B, then you need to consider all the statistically significant coefficients of A, M, and A*M. Let's say you find the following statistically significant relationship:
B = c1*A + c2*M + c3*A*M
Then, you can rearrange the terms to get:
B = (c3*M + c1)*A + c2*M
This form shows clearly that the "coefficient" of A is (c3*M + c1), which is not a constant but depends on M. That is exactly the meaning of the moderator effect. In addition, the moderator also has a direct effect on B, as indicated by the term c2*M.
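As a made-up numerical example, suppose the estimates are c1 = 0.5, c2 = 0.3, and c3 = 0.2. Then the "coefficient" of A is:

0.2*M + 0.5

so when M = 0 the effect of A on B is 0.5, and when M = 1 it rises to 0.7. In this hypothetical case, the moderator M strengthens the effect of A on B.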
The same complication of different units of measurement applies here. Unless the variables A and M are already rescaled to [0,1] using the method in Note #3 above, the resulting interaction term A*M will have a different range than A and M individually. For example, if A is from 1 to 5, and M is also from 1 to 5, then A*M could give you anything from 1 to 25, which makes its coefficient hard to compare with the coefficients of A and M.
To fix the problem, again either you use the standardized coefficients to interpret your findings, or you rescale the variables manually.
To rescale manually, you can rescale (using Compute Variable) all the variables so that they are between 0 and 1:
A_rescaled = (A-1)/4
M_rescaled = (M-1)/4
Since both A_rescaled and M_rescaled are now in [0,1], their product A_rescaled * M_rescaled will also be in [0,1]:
A_M_rescaled = A_rescaled * M_rescaled
Or if you don't want to touch those variables, you can modify the formula of A*M as follows:
A_M_rescaled = ((A-1)/4)*((M-1)/4)*4+1
This way, A_M_rescaled will now go from 1 to 5 as well. Adjust the numbers if you have other ranges in the original variables.
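In general, if A is in [a_min, a_max], M is in [m_min, m_max], and you want the interaction term to run from L to U (these names are just placeholders for your own values), the formula becomes:

A_M_rescaled = ((A - a_min)/(a_max - a_min)) * ((M - m_min)/(m_max - m_min)) * (U - L) + L

With A and M both in [1,5] and a target range of [1,5], this reduces to the formula above.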