Heteroscedasticity

The main idea...

Heteroscedasticity, meaning "differing dispersion", occurs when the variability of a random variable is correlated with the magnitude of the variable (i.e. the size of its values), conditional on some other variable (Figure 1). This violates the assumption of equal residual variance made by most linear hypothesis-testing methods and renders many significance tests and confidence interval estimates invalid. It also reduces the "efficiency" of, for example, ordinary-least-squares (OLS) approaches to regression. This means that OLS parameter estimates no longer have the smallest variance attainable among linear unbiased estimators, so a fit established by OLS is less precise than it could be.
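
For a simple linear regression, the distinction can be written compactly. The notation here is an illustrative assumption rather than part of the original text: y_i is the regressand, x_i the regressor, and ε_i the error term of observation i.

```latex
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad
\text{homoscedastic: } \operatorname{Var}(\varepsilon_i \mid x_i) = \sigma^2, \qquad
\text{heteroscedastic: } \operatorname{Var}(\varepsilon_i \mid x_i) = \sigma_i^2 .
```

In the homoscedastic case a single variance applies to every observation; in the heteroscedastic case the variance may differ from one observation to the next, for instance by growing with x_i.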

Figure 1: Illustrations of a) homoscedastic and b) heteroscedastic data. The variability of Y conditional on X is dependent on the magnitude of X and Y in panel b. If heteroscedasticity occurs in a collection of variables, analyses that depend on variability (variance, standard deviation, etc.) being uncorrelated with the magnitude of a given variable will be invalidated. This is particularly harmful for many significance-testing methods.

Testing for heteroscedasticity

The Breusch–Pagan test and White (1980) test are appropriate to test for continuous changes in variance (e.g. Figure 1b) in regression models. Both tests perform well for large samples (n > 100) and may not be accurate for smaller samples. The Breusch–Pagan test is interactive, as the researcher can specify which explanatory variables are of interest. Selection of appropriate explanatory variables is essential for a good test.
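
As a rough illustration of the idea behind the Breusch–Pagan test, the sketch below uses the studentised (Koenker) form of the statistic for a single explanatory variable: the squared OLS residuals are regressed on x, and n times the R² of that auxiliary regression is compared with a chi-square distribution with one degree of freedom. The simulated data, the function names and the choice of a single regressor are assumptions made for this example; for real analyses an established implementation is preferable.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r_squared(x, y):
    """Coefficient of determination of the fitted line."""
    intercept, slope = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def breusch_pagan(x, y):
    """Studentised Breusch-Pagan statistic and p-value for one regressor."""
    intercept, slope = fit_line(x, y)
    squared_resid = [(yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: squared residuals on x; LM = n * R^2.
    lm = max(len(x) * r_squared(x, squared_resid), 0.0)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2))
    return lm, p_value


# Simulated data (an assumption for illustration only).
rng = random.Random(1)
x = [rng.uniform(1, 10) for _ in range(500)]
y_constant = [2 + 3 * xi + rng.gauss(0, 2) for xi in x]        # constant spread
y_growing = [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x]  # spread grows with x

lm_constant, p_constant = breusch_pagan(x, y_constant)
lm_growing, p_growing = breusch_pagan(x, y_growing)
print("constant spread: LM = %.2f, p = %.3g" % (lm_constant, p_constant))
print("growing spread:  LM = %.2f, p = %.3g" % (lm_growing, p_growing))
```

A small p-value suggests that the residual variance changes with x. With several explanatory variables the auxiliary regression would include all of those selected by the researcher, and the degrees of freedom would equal their number.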
The Goldfeld–Quandt test is appropriate to detect "lumpy" changes in variability (e.g. imagine a series of bulges in the scatter of Figure 1a). This test examines the variance of residuals in subgroups of data points. The number of subgroups must be defined by the researcher and often follows natural or imposed grouping in the data set (e.g. "treated" and "untreated" experimental plots).
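
The subgroup comparison at the heart of the Goldfeld–Quandt test can be sketched as follows for two groups. This minimal version assumes the groups are formed by ordering the data on x and discarding a central band; the function names, the 20% drop fraction and the simulated data are assumptions made for this example. Only the F statistic and its degrees of freedom are returned, because the Python standard library has no F distribution; a p-value could be obtained from a statistics library such as SciPy (e.g. scipy.stats.f.sf).

```python
import random


def rss_of_line_fit(x, y):
    """Residual sum of squares after fitting y = intercept + slope * x by OLS."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def goldfeld_quandt(x, y, drop_fraction=0.2):
    """F statistic comparing residual variance in the high-x and low-x groups."""
    pairs = sorted(zip(x, y))                 # order observations by x
    n = len(pairs)
    n_group = (n - int(n * drop_fraction)) // 2
    low, high = pairs[:n_group], pairs[n - n_group:]
    df = n_group - 2                          # two fitted parameters per group
    rss_low = rss_of_line_fit(*zip(*low))
    rss_high = rss_of_line_fit(*zip(*high))
    f_stat = (rss_high / df) / (rss_low / df)
    return f_stat, df, df


# Simulated data (an assumption for illustration only).
rng = random.Random(2)
x = [rng.uniform(1, 10) for _ in range(500)]
y_constant = [2 + 3 * xi + rng.gauss(0, 2) for xi in x]
y_growing = [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

f_constant, df1, df2 = goldfeld_quandt(x, y_constant)
f_growing, _, _ = goldfeld_quandt(x, y_growing)
print("constant spread: F = %.2f on (%d, %d) df" % (f_constant, df1, df2))
print("growing spread:  F = %.2f on (%d, %d) df" % (f_growing, df1, df2))
```

An F statistic well above 1 indicates larger residual variance in the high-x group. Where the data carry a natural grouping, the two groups would be defined by that grouping rather than by ordering on x.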
To test for multivariate heteroscedasticity, Holgersson and Shukur (2004) suggest the Wald, Lagrange multiplier (LM), likelihood ratio (LR) and multivariate Rao F tests.

Correcting for heteroscedasticity

If the change in variability with magnitude is regular, some data transformations may remove heteroscedastic behaviour. Normalising transformations often result in more homoscedastic behaviour between variables. This is the simplest correction and should be explored first.
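
To illustrate how a transformation can stabilise variance, the sketch below simulates data with multiplicative noise, for which the spread of y grows in proportion to its magnitude, and compares the residual spread in the upper and lower halves of x before and after a log transformation. The data-generating model, the function names and the use of a log transform on both variables are assumptions chosen for this example; other transformations may suit other data.

```python
import math
import random


def residuals_of_line_fit(x, y):
    """Residuals after fitting y = intercept + slope * x by OLS."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - intercept - slope * xi for xi, yi in zip(x, y)]


def spread_ratio(x, y):
    """Mean squared residual in the upper half of x divided by the lower half."""
    resid = residuals_of_line_fit(x, y)
    ordered = [e for _, e in sorted(zip(x, resid))]
    half = len(ordered) // 2
    low = sum(e * e for e in ordered[:half]) / half
    high = sum(e * e for e in ordered[half:]) / (len(ordered) - half)
    return high / low


# Simulated data with multiplicative noise (an assumption for illustration).
rng = random.Random(3)
x = [rng.uniform(1, 10) for _ in range(1000)]
y = [2 * xi * math.exp(rng.gauss(0, 0.3)) for xi in x]

raw_ratio = spread_ratio(x, y)
log_ratio = spread_ratio([math.log(xi) for xi in x], [math.log(yi) for yi in y])
print("spread ratio on the raw scale: %.2f" % raw_ratio)
print("spread ratio on the log scale: %.2f" % log_ratio)
```

A ratio close to 1 indicates similar residual spread across the range of x. Note that a transformation also changes the form of the fitted relationship, so parameter estimates must be interpreted on the transformed scale.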
To improve parameter estimates derived from heteroscedastic data, the influence of "noisier" data (i.e. data points which show high variability relative to a model) can be down-weighted in approaches such as weighted-least-squares regression. The higher the weight attributed to a given data point, the more precise that data point is asserted to be. Weights are often set to be inversely proportional to a data point's variance. This approach requires confident estimates of the weights, which are assumed to be correct, and appropriate handling of outliers.
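
A minimal sketch of weighted least squares for a single regressor is given below. It assumes, purely for illustration, that each point's error standard deviation is known exactly, so that weights of 1/variance can be used; in practice the weights must usually be estimated. The function names and the simulation settings are likewise assumptions. Repeating the simulation many times allows the spread of the weighted and unweighted slope estimates to be compared.

```python
import random
import statistics


def weighted_line_fit(x, y, w):
    """Weighted least squares for y = intercept + slope * x with weights w."""
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def simulate(rng, n=50):
    """Simulated data whose error standard deviation grows with x (assumption)."""
    x = [rng.uniform(1, 10) for _ in range(n)]
    sd = [0.5 * xi for xi in x]
    y = [2 + 3 * xi + rng.gauss(0, si) for xi, si in zip(x, sd)]
    return x, y, sd


rng = random.Random(4)
ols_slopes, wls_slopes = [], []
for _ in range(300):
    x, y, sd = simulate(rng)
    equal_weights = [1.0] * len(x)                  # OLS is WLS with equal weights
    inverse_variance = [1.0 / s ** 2 for s in sd]   # noisier points count for less
    ols_slopes.append(weighted_line_fit(x, y, equal_weights)[1])
    wls_slopes.append(weighted_line_fit(x, y, inverse_variance)[1])

ols_mean, wls_mean = statistics.mean(ols_slopes), statistics.mean(wls_slopes)
ols_spread, wls_spread = statistics.stdev(ols_slopes), statistics.stdev(wls_slopes)
print("OLS slope: mean %.3f, spread %.3f" % (ols_mean, ols_spread))
print("WLS slope: mean %.3f, spread %.3f" % (wls_mean, wls_spread))
```

Both sets of estimates should centre on the simulated slope of 3, with the weighted estimates scattering less from one replicate to the next. This advantage depends on the weights being close to correct.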
White's heteroscedasticity-consistent covariance matrix (HCCM; White, 1980) approach has been widely adopted, implemented, and built upon; however, it may not perform well on datasets with fewer than 200 objects. Tests based on HCCM and more appropriate to small sample sizes have been proposed (Cribari-Neto, 2004; Long and Ervin, 2000) with new approaches still emerging (e.g. Cribari-Neto and Da Silva, 2011).
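
For a single regressor, the slope element of a heteroscedasticity-consistent covariance matrix reduces to a short formula, sketched below alongside the classical OLS standard error. "HC0" denotes White's original estimator and "HC3" a leverage-adjusted variant of the kind examined by Long and Ervin (2000) for smaller samples. The function name, the returned dictionary and the simulated data are assumptions made for this example, not a reference implementation.

```python
import math
import random


def slope_standard_errors(x, y):
    """Classical, HC0 and HC3 standard errors of the OLS slope (one regressor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [xi - mean_x for xi in x]
    sxx = sum(d * d for d in dx)
    slope = sum(d * (yi - mean_y) for d, yi in zip(dx, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Classical estimate: assumes one common residual variance.
    classical = math.sqrt(sum(e * e for e in resid) / (n - 2) / sxx)
    # HC0: each squared residual stands in for that point's own variance.
    hc0 = math.sqrt(sum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    # HC3: residuals are inflated by 1 / (1 - leverage) before squaring.
    leverage = [1.0 / n + d * d / sxx for d in dx]
    hc3 = math.sqrt(
        sum((d * e / (1.0 - h)) ** 2 for d, e, h in zip(dx, resid, leverage))
    ) / sxx
    return {"slope": slope, "classical": classical, "hc0": hc0, "hc3": hc3}


# Simulated data whose error spread grows with x squared (an assumption).
rng = random.Random(5)
x = [rng.uniform(1, 10) for _ in range(4000)]
y = [2 + 3 * xi + rng.gauss(0, 0.1 * xi ** 2) for xi in x]

result = slope_standard_errors(x, y)
for name in ("slope", "classical", "hc0", "hc3"):
    print("%-9s %.4f" % (name, result[name]))
```

The slope estimate itself is unchanged; only its standard error differs between the approaches. In this simulation, where the noisiest points are also those furthest from the mean of x, the classical formula would be expected to understate the uncertainty of the slope.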

Ignoring heteroscedasticity

Ignoring heteroscedasticity may result in less precise (yet still unbiased) parameter estimates under OLS approaches. Establishing how imprecise the parameter estimates are may be challenging, however. Standard covariance matrix estimators resulting from OLS approaches (in contrast to those from HCCMs) become biased, which affects any method that relies on them. Error estimates derived from the residuals of a regression on heteroscedastic data will almost certainly be incorrect, and any hypothesis tests that rely on them will be invalid.

References

Cribari-Neto F (2004) Asymptotic inference under heteroskedasticity of unknown form. Comput Stat Data Anal. 45(2): 215-233.
Cribari-Neto F, Da Silva WB (2011) A new heteroskedasticity-consistent covariance matrix estimator for the linear regression model. Adv Stat Anal. 95(2): 129-146.
Holgersson HET, Shukur G (2004) Testing for multivariate heteroscedasticity. J Stat Comput Sim. 74(12): 879-896.
Long JS, Ervin LH (2000) Using heteroscedasticity consistent standard errors in the linear regression model. Am Stat. 54(3): 217-224.
White H (1980) A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica. 48(4): 817-838.

Implementations

R
- bptest() from the lmtest package performs the Breusch–Pagan test on a linear model object.
- gqtest() from the lmtest package performs the Goldfeld–Quandt test on a linear model object.
- hccm() from the car package computes heteroscedasticity-corrected covariance matrices (White's correction and later variants) for a linear model object.
- The package sandwich contains functions able to generate model-robust standard error estimators for cross-sectional, time series and longitudinal data.