This tutorial uses the UCR data, which is loaded with the rest of the class datasets. The data was collected in my Introduction to Probability and Statistics for Engineering and the Sciences class at UCR in 2004. We’ll use the Height and Weight variables.
Review of Correlation
Correlation is a number between -1 and 1 that describes the strength and kind of linear association between two numeric variables.
The strength is given by the absolute value of the correlation (disregard the sign):
0 to 0.2 no correlation
0.2 to 0.6 weak correlation
0.6 to 0.8 moderate correlation
0.8 to 1.0 strong correlation
These ranges are a rule of thumb, which varies by discipline. Psychologists are stoked if they get a 0.6 correlation and physicists tend to be uninterested if theirs is below 0.9.
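The rule of thumb above can be sketched as a small lookup function. This is only an illustration: the name cor_strength is hypothetical, and the choice to place a boundary value such as exactly 0.6 in the higher category is an assumption, since the ranges above do not say which side the boundaries belong to.

```r
# Hypothetical helper (not part of R or the class materials): label the
# strength of a correlation using the rule-of-thumb ranges listed above.
# Assumption: a value exactly on a boundary goes to the higher category.
cor_strength <- function(r) {
  stopifnot(is.numeric(r), all(abs(r) <= 1, na.rm = TRUE))
  labels <- c("no correlation", "weak", "moderate", "strong")
  # findInterval() counts how many cut points are <= |r| (0 to 3)
  labels[findInterval(abs(r), c(0.2, 0.6, 0.8)) + 1]
}

# The sign is ignored, so -0.85 is labelled the same as 0.85
cor_strength(c(-0.85, 0.1, 0.65))
```

In a discipline with different conventions, only the cut points in the findInterval() call would need to change.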
The kind of correlation is positive if the sign is positive and negative if the sign is negative.
The square of the correlation is the coefficient of determination, which is the proportion of the variation in the response variable explained by the explanatory variable.
cor() Command
> attach(UCR)
> cor(Height,Weight)
Descriptive statistics should always be accompanied by their corresponding graphs. For correlation, it is the scatterplot:
> plot(Height, Weight)
The absence of major outliers or unusual shapes supports the general linear trend indicated by the correlation. This is a moderate positive linear correlation. Height could be used to predict weight for these students.
> cor(Height,Weight)^2
About 62% of the variation in the students' weights can be explained by the variation in their heights.
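One way to see why the squared correlation is read as a proportion of variation explained is to compare it with the R-squared of a simple linear regression, which is the same number when there is one explanatory variable. The sketch below uses the built-in women dataset (height and weight of 15 women) as a stand-in, since the UCR data is only available with the class datasets; the variable names r, fit and r_squared are illustrative.

```r
# 'women' ships with R; it is a stand-in here, not the UCR class data
r <- cor(women$height, women$weight)

# Simple linear regression of weight on height
fit <- lm(weight ~ height, data = women)
r_squared <- summary(fit)$r.squared

# The squared correlation and the regression R-squared agree
all.equal(r^2, r_squared)
```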
> cor.test(Height,Weight)
The null hypothesis of zero correlation can easily be tested in R with cor.test(). Our hypothesis of a relationship between height and weight is supported. In addition, we're 95% confident that the true correlation in the population is between 0.69 and 0.85. A sample correlation this large would be extremely unlikely if the true correlation were zero (p-value < 2.2e-16).
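The pieces of the cor.test() output discussed above can also be pulled out individually, which is handy when reporting them. A minimal sketch, again using the built-in women dataset as a stand-in for the UCR data (so the numbers will differ from those quoted above); the name result is illustrative.

```r
# Store the test result instead of just printing it
result <- cor.test(women$height, women$weight)

result$estimate   # the sample correlation
result$conf.int   # confidence interval (95% by default)
result$p.value    # p-value for the null hypothesis of zero correlation
```

A different confidence level can be requested with the conf.level argument, for example cor.test(x, y, conf.level = 0.99).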
If we were to look for other correlations between the numeric variables, such as between Height and GPA or between Height and the number of hours of TV watched per week, we would not find evidence of a relationship.
> plot(Height, GPA)
> cor(Height, GPA)
> cor.test(Height, GPA)
> plot(Height, TVWk)
> cor.test(Height, TVWk)
Using cor() alone on these variables returns NA because a few students didn't respond to some of the items. cor.test() drops the incomplete pairs, so it gives the correlation along with the rest of the test information.
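If only the correlation is wanted, cor() can be told to skip incomplete pairs through its use argument. The sketch below shows this on simulated data with two missing responses; the vectors x and y are made up for illustration and are not the UCR variables.

```r
# Simulated data standing in for survey items with non-responses
set.seed(1)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)
y[c(3, 17)] <- NA   # two students did not answer

cor(x, y)           # NA, because of the missing values

# Use only the pairs where both values are present
r_complete <- cor(x, y, use = "complete.obs")

# cor.test() drops incomplete pairs itself, so the estimates match
r_test <- unname(cor.test(x, y)$estimate)
all.equal(r_complete, r_test)
```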
> detach(UCR)
Exercises
These exercises use the States95 dataset. For directions on how to access it, see the Getting Started with R tutorial.
1. Select three numeric variables you'd like to explore the relationship between. Make scatterplots between each pair. Compute the correlation coefficient between each pair. What do you observe?
2. Here is a command for creating multiple scatterplots in one display:
> library(car)
> scatterplotMatrix(States95[,c(3,6,9)],smoother=FALSE,groups=States95$region,diagonal="hist",col=2:5)
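For the pairwise correlations asked for in exercise 1, cor() also accepts a data frame of numeric columns and returns the whole correlation matrix in one call. The sketch below uses the built-in mtcars dataset as a stand-in, because the States95 column names are not shown here; the names vars and cor_matrix are illustrative.

```r
# Three numeric columns from a built-in dataset (stand-in for States95)
vars <- mtcars[, c("mpg", "wt", "hp")]

# Correlation between every pair of columns; pairwise.complete.obs
# skips missing values pair by pair (mtcars has none, but survey-style
# data often does)
cor_matrix <- cor(vars, use = "pairwise.complete.obs")
round(cor_matrix, 2)
```

The diagonal is always 1 (each variable with itself), and the matrix is symmetric, so only the entries above or below the diagonal need to be read.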