10 Correlation Analysis
For illustration, we are going to use the Boston Housing price data. You can find the description in the link below.
> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"
> housing = read.table(url)
How many variables and rows are there?
506 observations and 14 variables.
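The counts can be checked with base R's dim(), nrow() and ncol(). The sketch below is only an illustration: it uses the built-in mtcars data frame as a stand-in so that it runs without a download, and the helper name shape_of is made up for this example.

```r
# Hypothetical helper: report the number of rows and columns of a data frame.
shape_of <- function(df) {
  c(rows = nrow(df), variables = ncol(df))
}

# mtcars is a stand-in here; replace it with housing once the data is loaded.
shape <- shape_of(mtcars)
print(shape)

# dim() returns the same two numbers as an unnamed vector.
print(dim(mtcars))
```

Calling shape_of(housing) on the data loaded above should report the 506 rows and 14 variables given in the answer.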
Set user-friendly names for the columns
> colnames(housing) = c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
Find the correlations among the variables
Reduce the number of decimal places in the correlation matrix to make it easier to read
> round(cor(housing), 2)
Find answers to the following questions:
- Which variables have a correlation with MEDV whose absolute value is greater than 0.5?
- Which variables are positively correlated with MEDV and which are negatively correlated?
- Which pairs of variables are strongly correlated (absolute correlation coefficient close to 0.80)?
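One possible way to read these answers off the correlation matrix programmatically is sketched below. The helper names strong_with_target and strong_pairs are hypothetical. Treating "close to 0.80" as "at least 0.80" is an assumption made for the example. The built-in mtcars data frame (with mpg as the target) stands in for the housing data so the code runs offline.

```r
# Hypothetical helpers for reading a correlation matrix; names are illustrative.

# Correlations of every other variable with `target`, keeping |r| > threshold.
strong_with_target <- function(df, target, threshold = 0.5) {
  r <- cor(df)[, target]
  r <- r[names(r) != target]   # drop the target's correlation with itself
  r[abs(r) > threshold]
}

# Variable pairs whose absolute correlation is at least `threshold`
# (assumption: "close to 0.80" is read as |r| >= 0.80).
strong_pairs <- function(df, threshold = 0.8) {
  m <- cor(df)
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(m) >= threshold & upper.tri(m), arr.ind = TRUE)
  data.frame(var1 = rownames(m)[idx[, "row"]],
             var2 = colnames(m)[idx[, "col"]],
             r    = round(m[idx], 2))
}

# mtcars and "mpg" are stand-ins; with the housing data use housing and "MEDV".
strong   <- strong_with_target(mtcars, "mpg")
positive <- names(strong)[strong > 0]   # positively correlated with the target
negative <- names(strong)[strong < 0]   # negatively correlated with the target
pairs    <- strong_pairs(mtcars)

print(round(strong, 2))
print(pairs)
```

With the housing data, the corresponding calls would be strong_with_target(housing, "MEDV") and strong_pairs(housing).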
Find the confidence interval for the correlation between a pair of variables
> cor.test(housing$MEDV, housing$LSTAT)
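The object returned by cor.test() can also be inspected piece by piece. The sketch below pulls out the estimate, the confidence interval and the p-value. It uses two columns of the built-in mtcars data frame as stand-ins so that it runs without the download.

```r
# mtcars columns are stand-ins; with the housing data use
# housing$MEDV and housing$LSTAT instead.
ct <- cor.test(mtcars$mpg, mtcars$wt)   # Pearson correlation by default

print(ct$estimate)   # sample correlation coefficient
print(ct$conf.int)   # confidence interval (conf.level = 0.95 by default)
print(ct$p.value)    # p-value for the null hypothesis of zero correlation

# A different confidence level can be requested explicitly.
ct99 <- cor.test(mtcars$mpg, mtcars$wt, conf.level = 0.99)
print(ct99$conf.int)
```

A higher confidence level gives a wider interval, so the 99% interval contains the 95% one.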
Find the statistical significance of the correlations. For that, you can use the Hmisc package.
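As a rough sketch, a matrix of p-values can be built in base R by calling cor.test() on every pair of columns. If Hmisc is installed, its rcorr() function returns the correlations together with their p-values in one call. The helper name cor_pvalues is hypothetical, and the built-in mtcars data frame stands in for the housing data so the code runs offline.

```r
# Hypothetical helper: matrix of p-values from pairwise cor.test() calls (base R only).
cor_pvalues <- function(df) {
  vars <- colnames(df)
  p <- matrix(NA_real_, length(vars), length(vars), dimnames = list(vars, vars))
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      # the diagonal stays NA: a variable is not tested against itself
      if (i != j) p[i, j] <- cor.test(df[[i]], df[[j]])$p.value
    }
  }
  p
}

# mtcars is a stand-in; with the housing data pass housing instead.
pvals <- cor_pvalues(mtcars)
print(signif(pvals[1:4, 1:4], 2))

# If Hmisc is installed, rcorr() returns correlations (r), counts (n) and p-values (P).
if (requireNamespace("Hmisc", quietly = TRUE)) {
  rc <- Hmisc::rcorr(as.matrix(mtcars))
  print(signif(rc$P[1:4, 1:4], 2))
}
```

rcorr() expects a matrix, which is why the data frame is converted with as.matrix() first.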
A p-value is the probability of seeing a correlation at least as extreme as the one in the sample if the null hypothesis (no correlation in the population) were true. Hypothesis tests use the p-value to weigh the strength of the evidence (what the data are telling you about the population). It is a number between 0 and 1, and it is interpreted as follows:
- A small p-value (< 0.05) indicates strong evidence against the null hypothesis, so you reject it.
- A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you cannot reject it.
- A marginal p-value (close to 0.05) is not conclusive.
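The decision rule above can be written as a small function. This is only a sketch: the helper name decide and the default cutoff alpha = 0.05 are assumptions for illustration, and pairs of columns from the built-in mtcars data frame stand in for pairs of housing variables.

```r
# Hypothetical helper: turn a p-value into a reject / fail-to-reject decision.
decide <- function(p, alpha = 0.05) {
  ifelse(p < alpha, "reject null hypothesis", "fail to reject null hypothesis")
}

# mtcars columns are stand-ins for any pair of housing variables.
p_strong <- cor.test(mtcars$mpg, mtcars$wt)$p.value     # strongly related pair
p_weak   <- cor.test(mtcars$drat, mtcars$qsec)$p.value  # weakly related pair

print(decide(p_strong))
print(decide(p_weak))
```

The function is vectorised through ifelse(), so it can also be applied to a whole vector or matrix of p-values.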
Exercise: find the correlations in the Swiss fertility dataset (built into R as swiss).