Why Logistic vs. Linear Regression

The material below is based on an article by Robert Wolfe (attached at the bottom of this page).

Linear regression models are not usually the most appropriate models for proportions. Why?

1. Linear regression models often do not give a good approximation for the mean of a dichotomous variable. That mean is a proportion and must lie between 0 and 1, but the linear regression equation puts no inherent constraints on the value of µ_{Y|x}.

- For example, the regression equation µ_{Y|x} = 0.5 + 0.01 × FTB could approximate the fraction with Y = 1 in the population for FTB in the range 20 to 30 (fitted values 0.7 to 0.8), but it cannot represent a fraction for FTB over 50, because it would yield values greater than 1.
- It can be especially difficult to ensure that the linear regression equation yields results between 0 and 1 when the regression equation includes several independent variables.
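
The range problem can be checked numerically. The sketch below evaluates the linear equation from the example and, for contrast, a logistic curve. The names linear_mean and logistic_mean and the coefficients b0 = 0.0 and b1 = 0.04 are assumptions chosen for illustration; they are not fitted to any data and do not come from the article.

```python
import math


def linear_mean(ftb):
    """Linear equation from the example above: 0.5 + 0.01 * FTB."""
    return 0.5 + 0.01 * ftb


def logistic_mean(ftb, b0=0.0, b1=0.04):
    """Logistic curve with hypothetical coefficients (illustrative only)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * ftb)))


for ftb in (20, 30, 50, 60, 100):
    lin = linear_mean(ftb)
    # Flag linear values that cannot be interpreted as a proportion.
    note = "" if 0.0 <= lin <= 1.0 else "  <- not a valid proportion"
    print(f"FTB={ftb:>3}  linear={lin:.2f}  logistic={logistic_mean(ftb):.3f}{note}")
```

The linear values pass 1 once FTB exceeds 50, while the logistic values stay strictly between 0 and 1 for any FTB, which is the constraint the linear equation lacks.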

2. Ordinary least squares statistical inference for the coefficients of the linear regression model rests on several assumptions.

- One requirement is that either the response variable has a normal distribution or else the sample size is large. A dichotomous response takes only two values, so it cannot have a normal distribution, but this is not an important issue if the sample size is large.
- More important is the assumption of constant variance. The variance of a dichotomous measure depends upon the probability that Y = 1. In fact, if p denotes the probability that Y = 1, then the variance of Y is p × (1 − p). Because p changes with the independent variables, the variance changes with them as well, so the assumption fails. Least squares methods are not efficient unless the variance is constant.
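
To see how much the variance moves with p, the formula p × (1 − p) can be tabulated directly. This is a minimal sketch; the function name bernoulli_variance is an assumption introduced here for illustration.

```python
def bernoulli_variance(p):
    """Variance of a 0/1 variable with P(Y = 1) = p, i.e. p * (1 - p)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return p * (1.0 - p)


# The variance peaks at p = 0.5 and shrinks toward 0 as p nears 0 or 1.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"p={p:.1f}  variance={bernoulli_variance(p):.2f}")
```

In the earlier example, the fitted fraction rises from 0.7 at FTB = 20 to 0.8 at FTB = 30, so the variance of Y falls from 0.21 to 0.16 over that range rather than staying constant.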

3. In summary, the linear regression model often gives the wrong answer for dichotomous response data. Even when the model could give the right answer, the usual estimates are more variable than are estimates based on other methods (e.g., the logistic model).
