Panchavarnam Baskaran - Extra activity 5 - Week 6

Predicting Hospital Length of Stay

This time, I want to explore the negative binomial distribution (NBD). When someone is admitted to hospital, each day they hope to be discharged, so the stay can be viewed as a run of daily attempts that continues until discharge finally succeeds. Count data of this kind is a natural candidate for an NBD model.

There is a dataset from Microsoft, which I got from Kaggle. It has 100k data points on patients admitted to hospital, with indicators of their health condition and how long they stayed in the hospital.

The Google sheet is here. Across the 100k patients, the length of stay has a mean of 4 days and a variance of 5.57. Because the variance exceeds the mean (overdispersion), which a Poisson model cannot represent, the data is a good candidate for the NBD.
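As a quick check on the overdispersion argument, the two summary statistics quoted above can be turned into a dispersion index and a rough method-of-moments value for the dispersion parameter θ. This is only a sketch in Python, not the R fit described in this write-up, and it assumes the mean/dispersion form of the NBD in which Variance = μ + μ²/θ. The variable names are illustrative.

```python
# Summary statistics quoted in the text (mean 4 days, variance 5.57).
mean_los = 4.0
var_los = 5.57

# Dispersion index: equals 1 for a Poisson; above 1 signals overdispersion.
dispersion_index = var_los / mean_los

# Method-of-moments estimate of theta, solving Var = mu + mu^2 / theta.
# Only meaningful when the variance exceeds the mean.
theta_mom = mean_los ** 2 / (var_los - mean_los)

print(f"dispersion index: {dispersion_index:.4f}")
print(f"moment estimate of theta: {theta_mom:.2f}")
```

A moment estimate like this is not the maximum-likelihood value a model fit would report, but it gives a sanity check on the order of magnitude to expect for θ.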

I used a few lines of R code to fit an NBD to the data. The fit returned these parameters:

Estimate: The estimated value of the intercept. In this case, it is log(μ), where μ is the mean of the negative binomial distribution. The value is 1.386552, so μ = exp(1.386552) ≈ 4.00103 is the average number of days of stay. This is very close to the actual average of 4 computed in the Google sheet.
Std. Error: The standard error of the estimate.
Theta (fit$theta): The parameter of the negative binomial distribution that controls the dispersion. Specifically, the variance of the distribution is given by: Variance = μ + μ²/θ. A smaller θ means more overdispersion; as θ grows large, the variance approaches the Poisson value μ.
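The back-transformation of the intercept and the variance formula can be checked numerically. The sketch below is in Python rather than the R used for the fit, and it assumes the mean/dispersion form of the negative binomial probability mass function. The fitted θ is not quoted in this write-up, so the value of theta here is a placeholder chosen only for illustration.

```python
import math

def nb_pmf(k, mu, theta):
    """P(X = k) for a negative binomial with mean mu and dispersion theta."""
    log_p = (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)

intercept = 1.386552       # reported in the text, on the log scale
mu = math.exp(intercept)   # back-transform to days
theta = 10.0               # placeholder: the fitted theta is not quoted in the text

# Check the mean and variance formulas by summing over the support.
# The upper limit of 500 is far into the tail for a mean near 4.
ks = range(0, 500)
mean_num = sum(k * nb_pmf(k, mu, theta) for k in ks)
var_num = sum((k - mean_num) ** 2 * nb_pmf(k, mu, theta) for k in ks)

print(f"mu = {mu:.5f}")
print(f"numerical mean = {mean_num:.5f}, numerical variance = {var_num:.5f}")
print(f"formula variance = {mu + mu ** 2 / theta:.5f}")
```

The numerical mean should match exp(intercept), and the numerical variance should match μ + μ²/θ for whatever θ is plugged in.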

Interpretation

Histogram with Fitted Distribution: The histogram shows the observed frequency of hospital stays. The red line represents the fitted negative binomial distribution. A good fit would mean the red line closely follows the shape of the histogram.
Q-Q Plot: In the Q-Q plot, if the data follows a negative binomial distribution, the points should lie approximately along the red 45-degree line. Deviations from this line indicate discrepancies between the observed data and the fitted distribution.
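To make the Q-Q comparison concrete, here is a rough Python sketch of how the plotted pairs could be computed: each sorted observation is matched with the negative binomial quantile at the same plotting position. The sample is a handful of made-up values, not the hospital data, and mu and theta are placeholders rather than the fitted values. The helper names are hypothetical.

```python
import math

def nb_pmf(k, mu, theta):
    """P(X = k) for a negative binomial with mean mu and dispersion theta."""
    log_p = (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
    )
    return math.exp(log_p)

def nb_quantile(p, mu, theta):
    """Smallest k with P(X <= k) >= p, for 0 < p < 1."""
    k, cdf = 0, nb_pmf(0, mu, theta)
    while cdf < p:
        k += 1
        cdf += nb_pmf(k, mu, theta)
    return k

def qq_pairs(sample, mu, theta):
    """Pair each sorted observation with its theoretical quantile."""
    n = len(sample)
    ordered = sorted(sample)
    probs = [(i + 0.5) / n for i in range(n)]  # plotting positions in (0, 1)
    return [(nb_quantile(p, mu, theta), x) for p, x in zip(probs, ordered)]

toy_sample = [1, 2, 2, 3, 3, 4, 4, 5, 6, 8]  # made-up values, not the hospital data
pairs = qq_pairs(toy_sample, mu=4.0, theta=10.0)
for theoretical, observed in pairs:
    print(theoretical, observed)
```

Plotting the first element of each pair against the second gives the Q-Q plot; points near the 45-degree line indicate agreement between the data and the fitted distribution.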

The Google Sheet has the model-generated data plotted over the actual observations.

The comparison of the actuals and the model is below. The bars are from the data and the red line is generated by the model. The R code used is also given in the Google sheet. This Google Doc has more analysis.

The alternative model, a Poisson distribution, was also tried and compared with the NBD using a chi-squared test, which rejected the null hypothesis. Full details are in the Google sheet and Doc.
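The text does not say which form of chi-squared test was used. One common choice for comparing a Poisson with an NBD is a likelihood-ratio test, whose statistic is compared against a chi-squared distribution with one degree of freedom. The Python sketch below illustrates that approach under stated assumptions: the sample is made up and is not the hospital data, θ is found by a crude grid search rather than a proper optimizer, and the function names are hypothetical.

```python
import math

def poisson_loglik(sample, lam):
    """Log-likelihood of the sample under a Poisson with mean lam."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in sample)

def nb_loglik(sample, mu, theta):
    """Log-likelihood under a negative binomial with mean mu, dispersion theta."""
    return sum(
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + mu))
        + k * math.log(mu / (theta + mu))
        for k in sample
    )

toy_sample = [0, 0, 1, 1, 2, 2, 3, 4, 6, 9, 12, 15]  # made-up, overdispersed

# With no covariates, the sample mean is the maximum-likelihood mean in both models.
mu_hat = sum(toy_sample) / len(toy_sample)

# Crude grid search for theta on a log scale from 0.01 to 1e6.
theta_grid = [10 ** (e / 50) for e in range(-100, 301)]
theta_hat = max(theta_grid, key=lambda t: nb_loglik(toy_sample, mu_hat, t))

ll_pois = poisson_loglik(toy_sample, mu_hat)
ll_nb = nb_loglik(toy_sample, mu_hat, theta_hat)
lr_stat = 2 * (ll_nb - ll_pois)

CRITICAL_5PCT_DF1 = 3.841  # chi-squared critical value, 1 degree of freedom, 5% level
print(f"theta_hat = {theta_hat:.3f}, LR statistic = {lr_stat:.2f}")
print("reject Poisson" if lr_stat > CRITICAL_5PCT_DF1 else "do not reject Poisson")
```

Under this reading, the null hypothesis is that the Poisson model is adequate, and a large statistic favors the NBD. Because the Poisson sits at the boundary of the NBD (θ going to infinity), using the plain one-degree-of-freedom critical value is conservative.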