Logistic regression explains binary outcome data, such as whether or not someone has a disease, in terms of continuous predictor variables (risk factors such as age, weight, and blood pressure) by fitting a logistic curve to the data. The fitted regression can then be used to predict the probability of the outcome for any individual whose values of the risk factors are known.
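Throughout, by a logistic curve we mean the standard model; writing p(x) for the probability of the outcome at predictor value x, with location parameter \beta_0 and slope \beta_1,

p(x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}, \qquad \text{equivalently} \qquad \log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x,

with further slope terms \beta_2 x_2, \beta_3 x_3, \ldots when there are several risk factors.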
Traditionally, logistic regressions assume that the outcome state and the values of all of the predictor variables are precisely known. We want to generalize this statistical method
to handle interval data for the predictor variables (age range, weight class, etc.).
We developed software with a straightforward algorithm that fitted logistic regressions using nonlinear least-squares regression (you wiggle the location and slope parameters to find the values that minimize the sum of squared vertical differences between the fitted curve and the data points). Our intention was to generalize this software to handle interval values for the risk factors. But our statistician friend Bill Huber told us that nonlinear least squares is not an appropriate way to do statistical regression analysis; the problem is that it does not deliver confidence intervals for the fitted parameters. He suggested that only an approach based on maximum likelihood would be acceptable. We need to find the modern, maximum-likelihood approach to computing logistic regressions that produces confidence intervals for the fitted parameters. Open-source software would be great, or at least a paper that spells out the algorithms.
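As a point of reference, here is a minimal sketch of that standard maximum-likelihood route in R, on simulated precise data; the variable names (age, disease), the sample size, and the "true" parameter values are made up for illustration.

## Minimal sketch: maximum-likelihood logistic regression with confidence
## intervals, on simulated precise (non-interval) data.
set.seed(1)
n       <- 200
age     <- runif(n, 20, 80)
disease <- rbinom(n, 1, plogis(-6 + 0.1 * age))   # outcomes from a "true" logistic curve

fit <- glm(disease ~ age, family = binomial)      # maximum-likelihood fit (iteratively reweighted least squares)
summary(fit)                                      # estimates, standard errors, Wald tests
confint.default(fit)                              # Wald (normal-approximation) confidence intervals
confint(fit)                                      # profile-likelihood confidence intervals
                                                  # (supplied by the MASS package in older versions of R)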
As a follow-on question, we will want to ask: what is the traditional way that interval censoring of the risk-factor values has been accounted for in such logistic regressions? We will then want to create an alternative to this traditional approach to logistic regression with censored data, one that respects our conception of intervals, i.e., that intervalizes the regression in our way so it doesn't depend on assumptions such as independence of the errors among the intervals.
Finally, we will need to implement reasonably efficient software (written in any convenient language, including R) for logistic regression with interval data. The software should support both the traditional approach to interval censoring and our starker, epistemologically purer treatment of intervals. It should also compute confidence intervals for the regression parameters, which would coincide with the traditional confidence intervals when the data are precise and be somewhat wider when the data are intervals.
<<
Q and A between Scott (SF) and Masatoshi (MS)
SF: Is it always true that prediction with logistic regression improves with the number of predictors?
MS: I do not think that it is always the case; adding weak or redundant predictors can overfit the data and actually worsen prediction for new patients.
SF: What is the best approach to assessing the patient's risk of having the disease using logistic regression when he/she has good data for a predictor x1 but only an interval for a predictor x2?
MS: We may work on the problem in two ways. The first approach would be to conduct a standard logistic regression using only x1 and to ignore x2. The other approach would be to conduct a logistic regression using the fixed x1 and the intervalized x2. Practically, we would generate a scalar value of x2 from within its interval and do a logistic regression given x1 and that x2, repeating this procedure many times until the possible combinations of x1 and x2 are covered (see the sketch following this exchange).
>>
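To make the repeated-selection idea concrete, here is a minimal sketch in R, written under assumptions of our own: the simulated data, the interval widths, the number of repetitions K, the hypothetical patient's values, and the use of Wald confidence intervals are illustrative choices, not a settled design. Each pass selects one scalar value of x2 inside each observed interval, refits the ordinary maximum-likelihood regression, and records the fitted x2 coefficient, its confidence interval, and the predicted risk for the patient; the envelope over all passes then brackets the answers.

## Repeated-selection sketch for interval-valued x2 (illustrative data and settings).
set.seed(1)
n  <- 150
x1 <- rnorm(n)                                  # precisely known predictor
x2 <- rnorm(n)                                  # underlying values, observed only as intervals
y  <- rbinom(n, 1, plogis(-0.5 + 1.0 * x1 + 0.8 * x2))
x2.lo <- x2 - runif(n, 0, 1)                    # the interval data actually recorded for x2
x2.hi <- x2 + runif(n, 0, 1)

new.x1 <- 0.2                                   # a hypothetical patient: precise x1,
new.x2.lo <- 0.0; new.x2.hi <- 1.5              # x2 known only to lie in [0.0, 1.5]

K <- 200                                        # number of scalar selections to try
results <- replicate(K, {
  x2.sel <- runif(n, x2.lo, x2.hi)              # one scalar selection inside each data interval
  fit <- glm(y ~ x1 + x2.sel, family = binomial)
  ci  <- confint.default(fit, parm = "x2.sel")  # Wald interval for the x2 coefficient
  risk <- predict(fit,
                  newdata = data.frame(x1 = new.x1,
                                       x2.sel = runif(1, new.x2.lo, new.x2.hi)),
                  type = "response")
  c(estimate = unname(coef(fit)["x2.sel"]), lower = ci[1], upper = ci[2], risk = unname(risk))
})

## Envelope over the selections: with zero-width intervals every refit is identical,
## so these collapse to the single traditional answers; with genuine intervals they are wider.
range(results["estimate", ])                          # spread of point estimates of the x2 coefficient
c(min(results["lower", ]), max(results["upper", ]))   # widened confidence interval for that coefficient
range(results["risk", ])                              # bracketing interval on the patient's predicted risk

Sampling the selections at random is only one possible scheme; a grid over each interval, or just the interval endpoints, could be substituted without changing the overall envelope idea.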
<<
A standard Bayesian approach to logistic regression is explained in separate documents. (The documents are very equation-heavy and texifying the equations takes a long time; for now I have put them up as they are to illustrate the method.) Three formats are available: (1) docx, (2) doc, and (3) pdf.
An interval version of the Bayesian approach to logistic regression is explained in separate documents as well. (These too are very equation-heavy; again, for now I have put them up as they are to illustrate the method.) Three formats are available: (1) docx, (2) doc, and (3) pdf.
>>