Nonsense Regressions

Spurious (Nonsense) Regressions

A spurious (nonsense) regression arises when a pair of variables appears to have a very strong relationship even though there is in fact no relationship, or only a weak logical relationship, between them.

Consider a regression Y = a + bX + e. The regressor X is taken to determine the regressand Y if the t-statistic for X is significant. Classical theory shows that the chance of getting a significant t-statistic for X is very small if there is in fact no relation between X and Y. However, this theory rests on certain assumptions, and these assumptions are usually not valid for time series data. As a result, the chance of getting a significant t-statistic is very high when using time series data, even if the two variables have no logical relation between them.
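The claim in the preceding paragraph can be checked with a small simulation. The sketch below is only an illustration under stated assumptions, not the method used for the regressions on this page: it assumes that the time series behave like independent random walks (a common but hypothetical stand-in for trending economic data), uses simulated rather than real data, and the function names (ols_t_stat, rejection_rate, white_noise, random_walk) are invented for this example. It counts how often the slope t-statistic exceeds 1.96 in absolute value when X and Y are generated independently.

```python
import numpy as np


def ols_t_stat(y, x):
    """Fit y = a + b*x + e by OLS; return the slope t-statistic and R-squared."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])          # intercept column plus regressor
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # OLS coefficients (a, b)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)              # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)         # conventional OLS covariance
    t_slope = beta[1] / np.sqrt(cov[1, 1])
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return t_slope, r2


def white_noise(rng, n):
    """Independent draws: the classical assumptions hold."""
    return rng.standard_normal(n)


def random_walk(rng, n):
    """Cumulated shocks: a trending series that violates those assumptions."""
    return np.cumsum(rng.standard_normal(n))


def rejection_rate(make_series, n_obs=50, n_sims=2000, seed=0):
    """Share of simulations in which two INDEPENDENT series give |t| > 1.96."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        y = make_series(rng, n_obs)
        x = make_series(rng, n_obs)  # generated separately, so unrelated to y
        t_slope, _ = ols_t_stat(y, x)
        hits += int(abs(t_slope) > 1.96)
    return hits / n_sims


print("white noise :", rejection_rate(white_noise))
print("random walks:", rejection_rate(random_walk))
```

Under these assumptions, the rate for white noise should come out close to the nominal 5 percent, while the rate for random walks should be far higher, although the two series are unrelated by construction in both cases.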

Consider the following variables:

Y: Merchandise imports (current US$), Maldives

X1: Services, value added (current US$), Bangladesh

X2: Population urban, Guinea

X3: Rural population as a percentage of total, Benin

It is obvious that variable Y does not depend on any of the X variables.

The Regression 1:

Dependent variable: Merchandise imports (current US$), Maldives

Regression Output

In the output above, R-squared is 88%. Under the conventional interpretation, 88% of the variation in the dependent variable, i.e. Maldives' imports, is explained by the independent variable, i.e. services value added of Bangladesh. Yet it is hard to see how the services value added of Bangladesh could affect the imports of the Maldives.

The t-statistic for services value added is 19.17. Under the conventional interpretation, this t-statistic is highly significant, which indicates that services value added of Bangladesh cannot be excluded from the regression equation for the dependent variable. These results are very hard to defend on theoretical grounds.

The Regression 2:

Dependent variable: Merchandise imports (current US$), Maldives

Regression Output:

In this output, R-squared is 70%, which under the conventional interpretation means that 70% of the variation in the dependent variable, i.e. Maldives' imports, can be attributed to the urban population of Guinea. But there is no channel through which the population of Guinea could affect Maldivian imports.

The t-statistic for the urban population of Guinea is 10.65. Under the conventional interpretation, this t-statistic is highly significant, which indicates that the variable cannot be excluded from the regression equation for the dependent variable.

The Regression 3:

Dependent variable: Merchandise imports (current US$), Maldives

Regression Output:

Like the outputs above, this output also shows a very strong relation, this time between Maldivian imports and the rural population as a percentage of the total in Benin. This relation cannot be defended on theoretical grounds either.

How to avoid spurious regression:

The existence of spurious regression has been the subject of a lot of debate. It is hard to follow this debate and to work out how to cope with the problem of spurious regression. However, some important lessons are summarized below:

1. The conventional measures of goodness of fit and of the strength of the relation between two variables are simply invalid for time series data. The R-squared and the t-statistics are misleading and tell us nothing reliable about the nature of the relationship between the variables. There are many sophisticated tools that can cope with the problem of spurious regression, and students must learn time series techniques if they want to avoid spurious regressions.

2. With cross-sectional data, one can also encounter the problem of spurious regression. Visit the page "Spurious Regression in Cross Sectional Data" to see some examples of spurious regression in cross-sectional data. The main lessons for avoiding spurious regression there are the following:

a. Include all relevant variables in the regression. A regression that omits a relevant variable produces biased results.

b. Do not merge groups that are not homogeneous. See the page "Spurious Regression in Cross Sectional Data" for details about non-homogeneous groups.

c. The regression to be estimated must rest on strong theoretical grounds. There should be a clear link between the independent and dependent variables.
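Lesson 1 above refers to time series techniques without naming any. One of the simplest, offered here only as an illustration and not as the remedy this page prescribes, is to regress the period-to-period changes of the series instead of their levels. The sketch below again assumes independent random walks as a hypothetical stand-in for real data, and its function names (slope_t_stat, compare_levels_and_differences) are invented for this example.

```python
import numpy as np


def slope_t_stat(y, x):
    """t-statistic of b in y = a + b*x + e, using the simple-regression formulas."""
    xc = x - x.mean()
    yc = y - y.mean()
    sxx = xc @ xc
    b = (xc @ yc) / sxx                # OLS slope
    resid = yc - b * xc                # residuals (intercept absorbed by centering)
    s2 = resid @ resid / (len(y) - 2)  # residual variance estimate
    return b / np.sqrt(s2 / sxx)


def compare_levels_and_differences(n_obs=50, n_sims=2000, seed=1):
    """Share of |t| > 1.96 for independent random walks, in levels and in first differences."""
    rng = np.random.default_rng(seed)
    sig_levels = 0
    sig_diffs = 0
    for _ in range(n_sims):
        # Two unrelated random walks (assumed data-generating process).
        y = np.cumsum(rng.standard_normal(n_obs))
        x = np.cumsum(rng.standard_normal(n_obs))
        sig_levels += int(abs(slope_t_stat(y, x)) > 1.96)
        # np.diff turns each series into its period-to-period change.
        sig_diffs += int(abs(slope_t_stat(np.diff(y), np.diff(x))) > 1.96)
    return sig_levels / n_sims, sig_diffs / n_sims


levels_rate, diffs_rate = compare_levels_and_differences()
print("significant in levels     :", levels_rate)
print("significant in differences:", diffs_rate)
```

Under these assumptions, most of the regressions in levels should appear significant, while the regressions in differences should be significant only about as often as the nominal 5 percent level suggests. Differencing is not always appropriate (for example, it discards information about long-run relationships), so this should be read as a diagnostic sketch rather than a general recipe.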

LESSONS from spurious regressions

We cannot start a research project simply by asking what the effect of X on Y is: some questions of this type make sense, while others do not. For example, if I ask what the effect of fertilizer on crop yields is, this makes sense and we can run regressions to find out.

If I ask what the effect of migration on crop yields is, this does not make sense, and regression will not give us an answer.