Q5 Knowledge Base Bivariate Regression

Once Pearson r has been calculated and the relationship between the two metric variables (the dependent variable and the independent variable) has been determined to be statistically significant, the next logical question is: how can Pearson r be used in the criminal justice sciences to make predictions? (Note: if Pearson r is not statistically significant, it is not possible to make predictions based on that pair of metric variables.)

In this example the subjects in the study are "neighborhoods".

Example: A public administration official from Chicago wanted to ascertain if there is a way to predict which neighborhoods are likely to incur the most property crimes. She conducted a study wherein a panel from the state board of crime statistics visited each neighborhood. Based on the census and other data, they rated each neighborhood on its susceptibility to property crime. Each neighborhood received a “crime score” based on a composite of selected variables.

She wanted to predict the number of property crimes per month based on “crime score.” “Property crimes” had a mean of 66.7 per month with a standard deviation of 12.928. “Crime score” had a mean of 3.5 with a standard deviation of 1.080. The relationship between property crimes and crime score produced a Pearson r of .911.

Theory: Property Crimes will be higher among neighborhoods with higher crime scores

Null hypothesis: There is not a statistically significant relationship between Property crimes and Crime score

Results: Reject null, r(8) = .911, p < .05 (rcrit = .632)

Analysis: Theory supported. r² = .829, so about 83% of the variance in property crimes is explained by crime score; that leaves only 17% attributable to other factors.
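The reported results can be checked with a few lines of Python. This is only a sketch under stated assumptions: the critical t of 2.306 (two-tailed, .05 level, 8 degrees of freedom) is taken from a standard t table rather than from the study, and the variable names are illustrative.

```python
import math

r = 0.911        # Pearson r reported in the text (rounded)
df = 8           # degrees of freedom from r(8)
t_crit = 2.306   # assumption: two-tailed critical t for df = 8, alpha = .05 (t table)

# Convert the critical t into the critical r for the same degrees of freedom
r_crit = t_crit / math.sqrt(t_crit**2 + df)

# Coefficient of determination: proportion of variance explained
r_squared = r**2

print(f"r_crit = {r_crit:.3f}")
print(f"r^2    = {r_squared:.3f}")
print("Reject null" if abs(r) > r_crit else "Fail to reject null")
```

Because the sketch starts from the rounded r of .911, r² comes out at about .830 rather than the .829 in the text, which presumably was computed from an unrounded r. Either way it rounds to 83%.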

How is the number of property crimes in a given neighborhood predicted when that neighborhood's crime score is known? It is a simple matter of using the “prediction equation” (also called the regression equation):

Y^ = a + b(x) + e

Where:

Y^ = The predicted score on the dependent variable. In this case, the number of property crimes for a given neighborhood

a = The constant (or intercept). It is the number of property crimes when crime score is 0.

b = Slope -- the regression coefficient (amount of increase in property crimes for each incremental increase in crime score)

X = Crime score for the neighborhood

e = Error in prediction

When predicting a score on a dependent variable (Y^), based on that subject's score on the independent variable (X), simply plug the score on the independent variable, in place of the "X", in the regression (prediction) equation.

For a crime score of 2, insert a "2" for the independent variable:

Y^ = a + b(x) + e

Y^ = a + b(2) + e

For a crime score of 5, insert a "5" for the independent variable:

Y^ = a + b(x) + e

Y^ = a + b(5) + e

There are only three other terms to find values for in order to solve the equation and produce a predicted score: "a", "b", and "e".

How is “b” obtained? Simply divide the standard deviation of the dependent variable (sy) by the standard deviation of the independent variable (sx), and multiply by Pearson r:

b = r * (sy / sx)

b is the slope, or regression coefficient (amount of increase in Y for every one-unit increase in X)

r is the Pearson correlation coefficient between X and Y

sy is the Standard deviation of Y

sx is the Standard deviation of X

b = r * (sy / sx)

b = .911 * (12.9276/1.0801)

b = 10.905
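The slope calculation above can be sketched in Python as follows. The function name `slope` is illustrative, and the inputs are the rounded values printed in the text.

```python
def slope(r, s_y, s_x):
    """Regression coefficient b = r * (s_y / s_x)."""
    return r * (s_y / s_x)

r = 0.911        # Pearson r (rounded, as reported)
s_y = 12.9276    # standard deviation of property crimes (Y)
s_x = 1.0801     # standard deviation of crime score (X)

b = slope(r, s_y, s_x)
print(f"b = {b:.3f}")
```

With the rounded r of .911 the result is about 10.90, slightly below the 10.905 shown in the text; the small difference is presumably because the text used an unrounded r.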

How is “a” obtained? Simply multiply the regression coefficient (b) times the mean of the independent variable (X̄), and subtract the result from the mean of the dependent variable (Ȳ):

a = Ȳ - b(X̄)

a is the constant (or intercept): the value of Y when X is equal to zero.

Ȳ is the mean of Y

X̄ is the mean of X

b is the regression coefficient

a = Ȳ - b(X̄)

a = 66.7 - 10.905(3.5)

a = 28.533
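The same arithmetic for the constant, as a minimal Python sketch. The function name `intercept` is illustrative; the inputs are the means and the slope given in the text.

```python
def intercept(mean_y, mean_x, b):
    """Constant a = mean of Y minus b times mean of X."""
    return mean_y - b * mean_x

mean_y = 66.7    # mean property crimes per month
mean_x = 3.5     # mean crime score
b = 10.905       # regression coefficient from the previous step

a = intercept(mean_y, mean_x, b)
print(f"a = {a:.4f}")
```

The result is approximately 28.53, in line with the 28.533 shown above.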

Now it is a simple matter of plugging the values into the regression (prediction) equation.

What is the predicted number of crimes for a neighborhood that has a crime score of 2, and for a neighborhood that has a crime score of 5?

Crime score = 2

Y^ = a + b(x) + e

Y^ = a + b(2) + e

Y^ = 28.533 + 10.905(2) + e

Y^ = 28.533 + 21.810 + e

Y^ = 50.343 + e

Crime score = 5

Y^ = a + b(x) + e

Y^ = a + b(5) + e

Y^ = 28.533 + 10.905(5) + e

Y^ = 28.533 + 54.524 + e

Y^ = 83.057 + e
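The two predictions above can be reproduced with a small Python function. This is a sketch, not part of the original study: the name `predict_crimes` is illustrative, and the error term e is left out because it is unknown for any single neighborhood.

```python
A = 28.533   # constant (intercept) from the text
B = 10.905   # regression coefficient (slope) from the text

def predict_crimes(crime_score, a=A, b=B):
    """Predicted property crimes per month for a given crime score."""
    return a + b * crime_score

for score in (2, 5):
    print(f"Crime score {score}: predicted crimes = {predict_crimes(score):.3f}")
```

For a crime score of 5 the rounded coefficients give about 83.058 rather than 83.057; the text appears to carry more decimal places of b through the calculation.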

"Y" is typically used to symbolize a score on the dependent variable. Y^ is the predicted score on the dependent variable

“a” is the constant (or intercept). It is the value on the dependent variable when the independent variable is zero, but it must be thought of as a theoretical value, since often it is impossible to have an independent variable equal to zero.

The regression coefficient is symbolized by “b”. Sometimes called the slope, it is the amount of increase in the dependent variable for every incremental increase in the independent variable. (The amount of increase in property crimes for each incremental increase in crime score)

An "X" is typically used to symbolize a score on the independent variable. In this case, it is the crime score for the neighborhood.

The "e" is the error in prediction. Perfect prediction could only be accomplished if Pearson r were equal to +1 or -1.

6. Add both limits of the 95% confidence interval to the predicted score to create the 95% confidence interval around that score. For a crime score of 2 (X = 2), the predicted score is Y^ = 50.343:

Upper limit: (5.328 * 1.96) + 50.343 = 10.443 + 50.343 = 60.786

Lower limit: (5.328 * -1.96) + 50.343 = -10.443 + 50.343 = 39.900

The 95% confidence interval for a predicted score of 50.343 is from 39.900 to 60.786.

                         X = 2     X = 5
Lower limit (95% CI)     39.900    72.614
Predicted score (Y^)     50.343    83.057
Upper limit (95% CI)     60.786    93.500
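The interval limits can be recomputed with a short Python sketch. Two assumptions are made here: the value 5.328 is taken as given from the text and is treated as the standard error of the estimate, and the cross-check uses one common formula for it, s_y * sqrt(1 - r²), with the rounded values of s_y and r. The function name `interval_95` is illustrative.

```python
import math

Z_95 = 1.96           # z value for a 95% interval, as used in the text
SE_ESTIMATE = 5.328   # value used in the text (assumed: standard error of the estimate)

def interval_95(predicted, se=SE_ESTIMATE, z=Z_95):
    """Return (lower, upper) 95% limits around a predicted score."""
    margin = z * se
    return predicted - margin, predicted + margin

# Rough cross-check of 5.328 from rounded inputs: s_y * sqrt(1 - r^2)
se_check = 12.9276 * math.sqrt(1 - 0.911**2)

for x, y_hat in ((2, 50.343), (5, 83.057)):
    lower, upper = interval_95(y_hat)
    print(f"X = {x}: {lower:.3f} to {upper:.3f}")
print(f"cross-check of standard error: {se_check:.2f}")
```

The limits agree with the figures worked out for crime scores of 2 and 5, and the cross-check lands at about 5.33, close to the 5.328 used in the text.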

How to interpret an SPSS output for bivariate regression

The following example uses data from the General Social Survey to explore the relationship between the dependent variable (confidence in the existence of God) and the independent variable (years of education). The descriptive statistics outline the dependent variable and the independent variable used in the prediction.

Introduction: The purpose of this study was to examine the relationship between belief in God and education, and to assist with the prediction of belief in God from education. It was theorized that a person’s confidence in the existence of God would decrease as their education increases.

Participants: The data for this study were taken from a random sample of 1500 respondents who participated in the 1991 General Social Survey. Complete data were obtained on 1310 of the 1500 cases.

Variables: The variable “God”, the dependent variable, was determined by the respondent’s score on a belief-in-God scale, which is on a Likert scale ranging from 1 to 6. High scores indicate a stronger confidence in the existence of God. “Educ”, the independent variable, was years of education obtained, as determined by the highest year of school the respondent completed.

Descriptive Statistics: The variable “God” had a range of 5 with a minimum of 1 and a maximum of 6, the mean was 5.21, and the standard deviation was 1.31. “Educ” had a range of 20 with a minimum of 0 and a maximum of 20, a mean of 12.89, and a standard deviation of 2.99.

Null Hypothesis: There is not a statistically significant relationship between belief in God and education.

Results: Reject null, F(1, 1308) = 37.886, p < .001.

Analysis: The theory is supported. The relationship is statistically significant. The effect size of education on confidence in the existence of God is around 3% (r² = .028). The data from this study show that, using the constant as a starting point (a = 6.159), a person’s predicted confidence in the existence of God decreases by the amount of the regression coefficient (b = -.074) for each additional year of education attained. Although the relationship is statistically significant and education can be used as a predictor, these results should be interpreted with a degree of caution. The magnitude of effect is quite low (3%), which suggests that other factors account for most of the variability in a person’s belief in God. Further research is necessary to explore additional independent variables that might account for the unexplained variance.
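To make the interpretation concrete, the reported constant and regression coefficient can be plugged into the prediction equation. This is a sketch: the function name `predict_god_score` and the education levels of 12, 16, and 20 years are illustrative choices within the observed range of 0 to 20, not values from the study.

```python
A = 6.159    # constant (intercept) reported in the analysis
B = -0.074   # regression coefficient per additional year of education

def predict_god_score(years_of_education, a=A, b=B):
    """Predicted score on the 1-to-6 belief-in-God scale."""
    return a + b * years_of_education

for years in (12, 16, 20):
    print(f"{years} years of education: predicted score = {predict_god_score(years):.3f}")
```

Each additional four years of education lowers the predicted score by only about 0.3 points on the 6-point scale, which fits the low effect size. At 0 years the equation returns the constant itself, 6.159, which is above the scale maximum of 6, a reminder that the constant is a theoretical value.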