It seems to me that when the assumptions of linear regression are broken, it gives rise to more statistical techniques and creates more employment opportunities for statisticians :)
Assumption # 1: The relationship between the dependent variable and the independent variables is linear.
Diagnostics used to check assumptions: A scatter plot of the dependent variable versus each independent variable can help validate this assumption by showing whether the relationship looks linear, and whether it is positive, negative, or absent.
Solution:
Apply a nonlinear transformation (e.g., log or inverse) to the dependent and/or independent variables.
Use a log transformation only if the values are positive, since the log of a nonpositive value is undefined.
Add a nonlinear term, such as a polynomial in an independent variable, to the regression equation; note that the model remains linear in the parameters.
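A rough sketch of the first and third fixes in SAS, assuming a hypothetical dataset MYDATA with variables Y and X (adjust the names to your data):

    /* Visual check of linearity */
    proc sgplot data=mydata;
       scatter x=x y=y;
    run;

    /* Log-transform the response (requires positive values) and add a squared term */
    data mydata2;
       set mydata;
       if y > 0 then log_y = log(y);
       x_sq = x * x;
    run;

    /* The model with the polynomial term is still linear in the parameters */
    proc reg data=mydata2;
       model log_y = x x_sq;
    run;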
Assumption # 2: Error terms have a zero population mean.
Diagnostics used to check assumptions: N/A
Solution: Add a constant / intercept term to the model.
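In Proc Reg, for example, the intercept is included by default; the sketch below (hypothetical dataset MYDATA) shows the default and the NOINT option that would suppress it:

    proc reg data=mydata;
       model y = x;   /* intercept included by default */
       /* model y = x / noint;  would force the fit through the origin */
    run;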
Assumption # 3: The independent variables are uncorrelated with the error terms.
Diagnostics used to check assumptions: If the system of equations is linear, compare the number of equations with the number of unknowns to see whether they are equal and the system can be solved. If the equations are nonlinear, check whether the number of reduced-form parameters equals the number of parameters in the original structural equation.
Solution:
Simultaneous equation modeling can correct for this type of endogeneity bias by using a two-stage least squares (2SLS) procedure to jointly estimate two equations in which the dependent variable of one equation appears as a predictor in the other. In the first stage, each endogenous variable (y1 and y2) is regressed on the exogenous variables in the system, and the predicted values from these regressions are saved. In the second stage, the first-stage predicted values are substituted for the endogenous variables on the right-hand side of each structural equation, which is then estimated to obtain the final estimates for y1 and y2.
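A minimal 2SLS sketch using Proc Syslin, assuming hypothetical endogenous variables Y1 and Y2 and exogenous variables X1-X3 (the procedure performs both stages internally):

    proc syslin data=mydata 2sls;
       endogenous y1 y2;        /* jointly determined variables */
       instruments x1 x2 x3;    /* exogenous variables used in the first stage */
       model y1 = y2 x1 x2;     /* structural equation for y1 */
       model y2 = y1 x3;        /* structural equation for y2 */
    run;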
Assumption # 4: Independence (the error terms across observations should not be correlated with each other - no serial correlation in the residuals).
Diagnostics used to check assumptions:
1) Examine the autocorrelation plot of the residuals to see whether the residual autocorrelations fall within the 95% confidence bands around zero, roughly -2/sqrt(n) < autocorrelation < 2/sqrt(n). Pay close attention to significant correlations at the first couple of lags and in the vicinity of the seasonal period.
2) Use the Durbin-Watson statistic to test whether the residual autocorrelation at the first lag is significant.
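Both diagnostics can be produced in SAS; a sketch assuming a hypothetical dataset MYDATA with variables Y and X:

    /* Durbin-Watson statistic, with residuals saved for further checks */
    proc reg data=mydata;
       model y = x / dw;
       output out=resids r=resid;
    run;

    /* ACF plot of the residuals with confidence bands */
    proc arima data=resids;
       identify var=resid;
    run;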
Solution:
1) If the positive serial correlation is not serious, say a lag-1 residual autocorrelation between 0.2 and 0.4 or a Durbin-Watson statistic between 1.2 and 1.6, it is possible to fine-tune the model by adding lags of the dependent variable and/or lags of the independent variables.
* Add an AR(1) and/or MA(1) term if you are running Proc ARIMA (see the sketch after this list).
2) If there is negative serial correlation in the residuals, say a lag-1 residual autocorrelation below -0.3 or a Durbin-Watson statistic above 2.6, this may suggest that lagged variables cannot correct the problem because one or more variables in the model have been overdifferenced.
3) If there is significant correlation at the seasonal period, say at lag 4 for quarterly data, lag 12 for monthly data, or lag 6 for monthly data with a semi-annual cycle, this may indicate that seasonality is not properly accounted for in the model. To avoid seasonal patterns leaking into your forecasts, make sure that all your dependent and independent variables are properly adjusted for seasonality.
4) If serial correlation is a serious problem (say a Durbin-Watson statistic well below 1.0, or autocorrelations well above 0.5), this may indicate a problem in the structure of the model. It may be that the necessary transformations have not been applied to the dependent and/or independent variables, or that the time series is non-stationary because the appropriate differencing has not been applied to one or more variables in the dataset. This is especially likely if the ACF plot decays slowly over time.
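The Proc ARIMA fix mentioned in item 1 might look like the following sketch (hypothetical series Y with input X; if differencing is needed, use e.g. var=y(1) in the IDENTIFY statement):

    proc arima data=mydata;
       identify var=y crosscorr=(x);   /* declare the input series */
       estimate p=1 q=1 input=(x);     /* AR(1) and MA(1) error terms */
    run;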
Assumption # 5: Error terms are homoscedastic.
Diagnostics used to check assumptions:
1) Plot the residuals against time to see whether they bounce around randomly or show a trend toward larger residuals over time, which would suggest that serial correlation may be a concern / problem.
2) Plot the residuals against the predicted values to see whether the residuals have a constant variance; a fan or funnel shaped pattern suggests heteroscedasticity might be a problem. Run White's test and the Breusch-Pagan test using Proc IML and/or Proc Model to determine whether the error terms are indeed heteroscedastic.
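A sketch of the Proc Model route, assuming a simple model with hypothetical variables Y and X:

    proc model data=mydata;
       parms b0 b1;
       y = b0 + b1*x;
       fit y / white breusch=(1 x);   /* White's test and Breusch-Pagan test */
    run;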
Solution:
1) If the residuals are indeed heteroscedastic, we can try a square root, inverse, or log transformation of the dependent variable and/or one or more independent variables. However, visually inspecting residual plots relies on prior experience and can lead to subjective interpretation, so other possible solutions include:
2) Construct a formal hypothesis test of constant variance rather than relying on visual inspection alone.
3) After applying the necessary transformation(s), re-examine the plot of residuals versus predicted values and rerun White's test and the Breusch-Pagan test (via Proc IML and/or Proc Model) to see whether heteroscedasticity remains a problem.
4) Weighted least squares (see the sketch after this list).
5) Median regression.
6) Use heteroscedasticity-robust standard errors.
7) Use shorter intervals of data in which volatility is more nearly constant.
8) Use an autoregressive model to fit the error variance, i.e., an ARCH (autoregressive conditional heteroscedasticity) model.
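Sketches of three of these fixes in SAS (hypothetical dataset MYDATA; the weight column W would have to be computed beforehand, e.g. as the inverse of the estimated error variance):

    /* Weighted least squares (item 4) */
    proc reg data=mydata;
       weight w;
       model y = x;
    run;

    /* Heteroscedasticity-robust standard errors (item 6) */
    proc reg data=mydata;
       model y = x / hcc;
    run;

    /* ARCH(1) model for the error variance (item 8) */
    proc autoreg data=mydata;
       model y = x / garch=(q=1);
    run;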
Assumption # 6: The error terms are normally distributed.
Diagnostics used to check assumptions:
1) Examine a normal probability plot or a Q-Q plot of the residuals; if the points fall approximately along a straight line, we can be reasonably confident that the standardized residuals are normally distributed.
2) Conduct the Shapiro-Wilk test or the Kolmogorov-Smirnov test to formally test the hypothesis that the residuals are normally distributed.
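Both checks can be run with Proc Univariate on the saved residuals; a sketch with hypothetical variables Y and X:

    proc reg data=mydata;
       model y = x;
       output out=resids r=resid;   /* save residuals */
    run;

    proc univariate data=resids normal;   /* NORMAL prints Shapiro-Wilk and K-S tests */
       var resid;
       qqplot resid / normal(mu=est sigma=est);   /* Q-Q plot against a fitted normal */
    run;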
Solution:
Influential observations and/or outliers should not be removed from the dataset if they are not the result of data entry errors, represent events that are likely to recur in the future, provide valuable information about the values of some of the coefficients, or help provide accurate estimates of the magnitudes of the prediction errors. They can be removed if they are merely data entry errors or if they represent events that are unlikely to recur in the future.
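To identify which observations are influential in the first place, Proc Reg's influence diagnostics can help; a sketch with hypothetical variables Y and X:

    proc reg data=mydata;
       model y = x / influence r;              /* leverage, DFFITS, residual analysis */
       output out=diag cookd=cd rstudent=rs;   /* Cook's distance, studentized residuals */
    run;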
Assumption # 7: There are no linear dependencies among independent variables.
Diagnostics used to check assumptions: Examine the variance inflation factors (VIF) and condition indices of the independent variables; large values indicate near-linear dependencies.
Solution: Drop one of the collinear variables.
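A sketch of the VIF diagnostic in Proc Reg (hypothetical predictors X1-X3; a common rule of thumb flags VIF values above 10):

    proc reg data=mydata;
       model y = x1 x2 x3 / vif collin;   /* variance inflation factors and condition indices */
    run;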