Correlation (co-relation) is the degree to which there is a linear relationship between two variables. One common way of quantifying it is the Pearson product-moment correlation coefficient.
Convention: ρ = population correlation coefficient, r = sample correlation coefficient.
As the magnitude of r increases, the strength of linear association increases.
r^2 can be interpreted as variance accounted for (shared variance), e.g. r^2(xy) = 0.50 can be interpreted as "y accounts for 50% of the variance in x", "x accounts for 50% of the variance in y", or "x and y share 50% of their variance in common".
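For instance, a minimal sketch with simulated data (NumPy/SciPy; the data and coefficients are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(size=200)   # linearly related, plus noise

r, p = stats.pearsonr(x, y)          # sample correlation and its p-value
print(f"r = {r:.3f}, shared variance r^2 = {r**2:.3f}, p = {p:.3g}")
```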
Tests of significance of correlations are based on the assumptions that
the residuals (deviations from the regression line) are normally distributed, and
the variability of the residuals is the same for all values of the independent variable (homoscedasticity).
Monte Carlo simulations suggest that these assumptions can be relaxed when
departure from normality is not large, and
your sample size is not too small (rule of thumb: with n = 50 there is little chance of bias, and with n = 100 you do not have to worry about the normality assumption).
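One way to eyeball these assumptions (a rough sketch with simulated data; np.polyfit and a Shapiro-Wilk test are just one convenient choice):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(size=200)

# fit the regression of y on x and look at the residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

print(stats.shapiro(residuals))   # rough check of residual normality
# plotting residuals against x shows whether their spread is roughly constant
```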
Outliers
A single outlier can completely change the regression slope and hence the correlation value.
Detect: Always examine data with a scatterplot.
Solution: There is no generally agreed-upon method for removing outliers, but many researchers exclude observations that lie more than 1.5-2 SD from the group/cell mean.
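For example, a crude screen along those lines (a sketch only; the 2 SD cutoff is arbitrary and a scatterplot should always come first):

```python
import numpy as np

def flag_outliers(values, n_sd=2.0):
    """Flag values lying more than n_sd standard deviations from the mean."""
    values = np.asarray(values, dtype=float)
    return np.abs(values - values.mean()) > n_sd * values.std(ddof=1)

# e.g. keep only rows where neither variable is flagged before recomputing r:
# keep = ~(flag_outliers(x) | flag_outliers(y))
```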
Inhomogeneous samples
The sample might be grouped by some ignored factor.
Detect: Scatterplot! Examine with exploratory multivariate techniques like cluster analysis.
Solution: Run correlation analysis separately for each cluster.
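A minimal sketch of the per-cluster approach with pandas (the toy data and column names are made up; the two clusters differ in level but have no within-cluster relationship):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "cluster": ["A"] * 50 + ["B"] * 50,
    "x": np.r_[rng.normal(0, 1, 50), rng.normal(3, 1, 50)],
    "y": np.r_[rng.normal(0, 1, 50), rng.normal(3, 1, 50)],
})

print("pooled r:", df["x"].corr(df["y"]))   # inflated by the ignored grouping factor
for name, group in df.groupby("cluster"):
    print(f"cluster {name}: r = {group['x'].corr(group['y']):.3f}")   # near zero within clusters
```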
Nonlinear relationship between variables
Detect: Scatterplot.
Solution: No easy answer -- there is no nonlinear version of Pearson r. You could try
Spearman r, a nonparametric rank-based correlation for ordinal variables (and hence unaffected by monotonic nonlinearity). Spearman r is generally somewhat less sensitive than Pearson r.
If the curve is monotonic, try transforming one or both variables to remove the nonlinearity and then run the correlation on the transformed data (a common transformation is the log; see the sketch after this list).
Identify a function that best describes your curve and test the goodness of fit of your data to that function.
Identify grouping variables that might explain the underlying nonlinearity and run an ANOVA on the data.
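For instance, with a monotonic but nonlinear (roughly exponential) relationship, Spearman's r stays high, Pearson's r is attenuated, and a log transform of y restores linearity (a sketch with made-up data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 4, 200)
y = np.exp(x) * rng.lognormal(sigma=0.2, size=200)   # monotonic but nonlinear in x

print("Pearson  r:", stats.pearsonr(x, y)[0])
print("Spearman r:", stats.spearmanr(x, y)[0])        # rank-based, unaffected by the monotonic bend
print("Pearson r after log transform:", stats.pearsonr(x, np.log(y))[0])
```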
The correlation coefficient is not normally distributed when the correlation is not 0, even if the population has a normal distribution, so you cannot test such correlations with a t-test. To get around this, Fisher's transformation (arctanh) both normalizes the coefficient (its distribution tends to normal rapidly as sample size increases) and stabilizes its variance. (A wonderful explanation with historical context is given in Nicholas Cox's article on correlation and Fisher's z in a 2008 issue of the Stata Journal.)
For any Pearson coefficient r, the Fisher transformation z of r is defined as
$$ z = \operatorname{arctanh}(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r}. $$
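As a sketch of the standard use of the transformation for a confidence interval (assuming the usual approximation that z is roughly normal with standard error 1/sqrt(n-3); r and n are made up):

```python
import numpy as np
from scipy import stats

r, n = 0.45, 80                              # sample correlation and sample size (made up)
z = np.arctanh(r)                            # Fisher transformation of r
se = 1.0 / np.sqrt(n - 3)                    # approximate standard error of z
lo, hi = z + np.array([-1, 1]) * stats.norm.ppf(0.975) * se
print("95% CI for rho:", np.tanh(lo), "to", np.tanh(hi))   # back-transform to the r scale
```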
Common statistical misconceptions that you find even in textbooks!
(Tversky & Kahneman 1971)
The false belief that results from not properly understanding variability and the implications of the law of large numbers (the sample mean tends towards the population mean; put differently, the variability of the sample mean tends to zero as the sample size increases), leading to
underestimating the size of confidence intervals
overestimating the significance of hypothesis tests
over-confidence in replicability
Sound reasoning: Larger samples usually lead to more precise estimates of the population mean (Finch 1998).
Fallacy: Results with low p values have larger effect sizes than results with higher p values.
Sound reasoning: Sample size and effect size both heavily influence test statistics, so low p values may result from large samples with small effect sizes. This fallacy sometimes arises from ignoring the influence of sample size on test statistics, and is sometimes an example of the inverse fallacy in logic. A large enough effect size can lead to very low p values even if the sample is small, but a small effect combined with a very large sample can produce an equally low p value.
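A quick simulation of the second point (a sketch; the 0.05 SD effect and n = 100,000 per group are arbitrary choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100_000
group_a = rng.normal(0.00, 1, n)
group_b = rng.normal(0.05, 1, n)             # tiny true effect: 0.05 SD

t, p = stats.ttest_ind(group_a, group_b)
print(f"Cohen's d is about 0.05, yet p = {p:.1e}")   # "significant" despite a negligible effect
```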
Fallacy: Non-significant effects mean no or negligible effect.
Sound reasoning: As with the magnitude fallacy above, a non-significant result may simply reflect an underpowered sample rather than the absence of an effect.
Castro Sotos AE, Vanhoof S, Van den Noortgate W, Onghena P. (2007). Students’ misconceptions of statistical inference: a review of the empirical evidence from research on statistics education. Educ Res Rev 2: 98–113. https://doi.org/10.1016/j.edurev.2007.04.001
Kühberger A, Fritz A, Lermer E & Scherndl T. (2015). The significance fallacy in inferential statistics. BMC Res Notes 8(84). https://doi.org/10.1186/s13104-015-1020-4
standardized regression coefficient (predictor and dependent variable standardized by subtracting their means and dividing by their standard deviations) = Pearson's correlation coefficient
correlation = standardized covariance between the dependent variable Y and the predictor X
correlation = geometric average of the slopes of the regression of Y on X and of the regression of X on Y
correlation = square root of R-squared, with the sign of the slope of the regression of Y on X (these equalities are written out in symbols below)
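In symbols (assuming the usual notation, where $b_{Y \cdot X}$ is the least-squares slope of the regression of Y on X, $s_X$ and $s_Y$ are sample standard deviations, and $\beta_{Y \cdot X}$ is the standardized slope):
$$ r_{XY} = \frac{\operatorname{cov}(X, Y)}{s_X s_Y}, \qquad r_{XY} = \operatorname{sign}(b_{Y \cdot X})\sqrt{b_{Y \cdot X}\, b_{X \cdot Y}}, \qquad r_{XY} = \operatorname{sign}(b_{Y \cdot X})\sqrt{R^2}, \qquad \beta_{Y \cdot X} = b_{Y \cdot X}\,\frac{s_X}{s_Y} = r_{XY}. $$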
Pearson's correlation coefficient r (sample) or ρ (population):
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\,s_x s_y} = \frac{1}{n-1}\sum_i z_{x_i} z_{y_i} $$
That is, r is the average cross-product of z-scores. When we are estimating using the population standard deviations rather than the sample standard deviations, that is, when the divisor is n rather than n-1, the averaging factor becomes 1/n instead of 1/(n-1).
For a simple linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, the least-squares slope can be written as $\hat{\beta}_1 = r\,(s_y/s_x)$, so when both variables are standardized the slope equals r -- in other words, the standardized regression coefficient is the correlation coefficient.
More in detail, the least squares estimates are
$$ \hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}. $$
The amount of variation in the observed y values not explained by the model is the error (residual) sum of squares:
$$ SSE = \sum_i (y_i - \hat{y}_i)^2. $$
The total sum of squares SST, the regression sum of squares SSR, and the coefficient of determination R^2 (also see Robbie Beane's nice derivation of R^2) are
$$ SST = \sum_i (y_i - \bar{y})^2, \qquad SSR = \sum_i (\hat{y}_i - \bar{y})^2 = SST - SSE, \qquad R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}. $$
square of Pearson's correlation coefficient = R^2 of simple linear regression
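A quick numerical check of this identity (a sketch with simulated data; np.polyfit is just one way to get the fitted values):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 2.0 + 1.5 * x + rng.normal(size=100)

r = stats.pearsonr(x, y)[0]
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x
R2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(r ** 2, R2)   # the two values agree up to floating-point error
```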
neither is a test of causality
both can test more than just linear relationships (although correlation typically refers to a linear relationship, and people most often fit linear regressions)
regression can make predictions: it yields an equation for predicting the dependent variable from the predictor(s)
focus of correlation: the strength and direction of the association between two variables, treated symmetrically (neither is designated the dependent variable)