Correlation Assumptions

The premise of a correlation coefficient, as with other inferential statistics, is that the data are a sample of observed points drawn from a larger population. We have not examined the entire population because doing so is not possible or feasible. Instead, we examine the sample to decide whether the linear relationship we see between x and y in the sample data is strong enough evidence to conclude that a linear relationship between x and y exists in the population. Before conducting a correlation analysis, however, we need to make sure that its assumptions are met. Every analysis makes some assumptions about the data, so checking them is a routine part of statistical work.
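As a minimal sketch of the sample-to-population reasoning above, the following Python computes the sample correlation coefficient r and the usual t statistic for testing whether the population correlation is zero. The data values and the function names `pearson_r` and `t_statistic` are made up for illustration and do not come from the original text.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """Test statistic for H0: population correlation is zero (n - 2 degrees of freedom).

    Undefined when r is exactly +1 or -1, because the denominator becomes zero.
    """
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Hypothetical sample of 8 paired observations (invented for this example).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

r = pearson_r(x, y)
t = t_statistic(r, len(x))
print(f"r = {r:.3f}, t = {t:.2f}, df = {len(x) - 2}")
```

The t statistic is compared against a t distribution with n - 2 degrees of freedom. That comparison is only trustworthy when the assumptions of the analysis are reasonable for the data, which is why they need to be checked first.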

The specific assumptions underlying the correlation coefficient are:

1. There is a linear relationship in the population between the y variable and the x variable.
2. The populations from which the x and y variables are drawn are normally distributed.
3. The standard deviations of the population y values about the line are equal for each value of x. In other words, each of these normal distributions of y values has the same shape and spread about the line. In statistics, we call this homoscedasticity. (See the figure below.)
4. The data are at least ordinal. They may also be interval or ratio.


In the figure above (a visual depiction of assumption 3), the y values for each x value are normally distributed about the line with the same standard deviation. For each x value, the mean of the y values lies on the line of best fit. More y values lie near the line than are scattered further away from the line.
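The picture described above can be approximated with a small simulation. This is only a sketch: the intercept, slope, common standard deviation, and x values are assumed for illustration and are not taken from the original text. For each x value, it draws many y values from a normal distribution centered on the line with the same spread, then reports the observed mean and standard deviation.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Assumed population line and common spread (illustrative values only).
INTERCEPT, SLOPE, SIGMA = 2.0, 0.5, 1.0
X_VALUES = [1, 2, 3, 4, 5]
DRAWS_PER_X = 2000

summary = {}
for x in X_VALUES:
    line_value = INTERCEPT + SLOPE * x
    # y values for this x: normal about the line, same SIGMA for every x.
    ys = [random.gauss(line_value, SIGMA) for _ in range(DRAWS_PER_X)]
    summary[x] = (line_value, statistics.mean(ys), statistics.stdev(ys))

for x, (line_value, mean_y, sd_y) in summary.items():
    print(f"x={x}: line={line_value:.2f}  mean(y)={mean_y:.2f}  sd(y)={sd_y:.2f}")
```

Under these assumptions, the mean of the simulated y values should sit close to the line at every x, and the standard deviations should all be close to the same value. If the spread grew or shrank with x instead, that would suggest assumption 3 does not hold.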

References:

  1. https://courses.lumenlearning.com/introstats1/chapter/testing-the-significance-of-the-correlation-coefficient/
