Not all count data sets that have a lot of zeros are necessarily zero-inflated, i.e. have too many zeros that conventional distributions to model counts, e.g. Poisson or negative binomial (NB) will only be able to fit the data poorly.
Zeros can be true/structural zeros or excess/sampling zeros.
What to do? Which model to use?
Check if the data is "zero-inflated."
If yes, should you use a zero-inflated models or hurdle models?
Fit the models that you think are theoretically relevant below (you will only know which model is better after fitted them).
Compare model fits uisng AIC (see Feng 2021 for other possible metrics)
can give more accurate predictions when data has lots of zeros
can be used to characterize zeros separately from non-zeros (positives)
computationally more intensive than traditional models
need larger sample sizes for reliable parameter estimates
interpretation can be complex/require expertise
if data not truly zero-inflated, can potentially lead to overfitting and biased results
Just because you have a lot of zeros, does not necessarily mean your data is zero-inflated.
Plot histogram of your counts. Fit models and check if predicted zeros are close to actual number of zeros in the data. If the expected/predicted zeros are a lot fewer than the actual zeros in your data, it might not necessarily mean that your data is zero-inflated (maybe you're missing a covariate or mismodeling some other way?) -- but likely. It's a bit of a judgment call.... Often, the strategy is when you suspect your data to be zero-inflated, fit a variety of models and compare model fits.
There are two families of models to help model zero-inflated data:
Zero-inflated (ZI) models
Hurdle (zero-altered) models
Both families model the data as a mixture of two distributions:
count component: a distribution that handles structural/real counts (with distributions such as Poisson or negative binomial) and
zero component: another distribution for "zeros only". This is most commonly described using the binomial distribution (for both families).
These families differ in how they treat zeros: ZI models allow the source of zeroes to be "real data" or otherwise, while hurdle models assume that zeroes occur from one source. Technically, zero-inflated models have an additional probability mass in the count component to account for zeros, while hurdle models are pure mixtures of zero and non-zero outcomers.
ZI models cannot handle zero deflation (fewer zeroes than expected), unlike hurdle models.
Zero-inflated models allow there to be two separate processes that underlie zeroes and (positive) counts. They allow zeros to either be true/structural (and therefore handled with positive counts with an untruncated count distribution) or modeled by a distribution that only allows for zeroes. A logistic regression model (i.e. Bernouilli process) is used to predict which distribution the zero belongs to. Allowing for separate processes theoretically might make sense. To take an example from The Analysis Factor, if you were modeling the number of alcohol drinks consumed daily, you would have some participants who drink but consumed none a particular day by chance -- and it would not make sense to group that "zero" with another participant who simply never drinks.
Two common models in this family are
zero-inflated Poisson distribution (ZIP) (Lambert 1992)
basically the Poisson distribution (<img src="http://www.google.com/chart?cht=tx & chf=bg,s,FFFFFF00 & chco=000000 & chl=\frac{\lambda^ye^{-\lambda}}{y!}" /> for y>0) weighted by (1-α), where α = the probability of excess zeroes
assumes mean = variance
zero-inflated negativel binomial (ZINB)
combines the NB distribution and the logit distribution
a generalization of the ZIP allowing for data (under-/over-)dispersion, i.e. does not assume mean = variance
like ZIP, the mean is modeled with a Poisson distribution while the dispersion/shape of the distribution is modeled with a gamma function
Both ZIP and ZINB model the probability of excess zeroes with a binomial distribution.
Anecdotally, ZINB models tend to fit zero-inflated data better than ZIP. Anecdotally, there also might not be much of a difference in fits between ZINB and standard NB models (as indicated by likelihood ratio tests and visually comparing model predictions and actual data graphically), but theoretically, it might make more sense to adopt one or the other.
According to Zuur et al., if no zero-inflation use ZIP. If zero-inflation, use ZINB.
Hurdle models (Mullahy 1986; Heilbron 1994) treats all zero data to be from the same structural source. It is a mixture model comprising a truncated count distribution (e.g. truncated Poisson or truncated NB or geometric distribution), and the zero component model is typically either a binomial, censored Poisson, NB, or geometric distribution.
Overview: Poisson or Negative Binomial model might still be able to deal with data with many zeros, depending on the covariates in the model, so first fit these standard regression models to help calculate check dispersion. If data is not overdispersed and these standard regression models fit reasonably well, they are simpler and easier to interpret. Otherwise, fit zero-inflated and/or hurdle models and compare all fitted models.
After each model fit, always check the usual fit diagnostics:
par(mfrow = c(2,2)) ## lay out the 4 diagnostic plots at a glance
plot(model)
And to show model results:
confint(model) ## coeff estim w/confidence intervals
summary(model) ## usual model info and
sjPlot::plot_model(model) ## forest plot of coeff estim
plot(ggeffects::ggpredict(model, terms = c("X1","X2")))
Or save model estimate table as an HTML page:
sjPlot::tab_model(model, file ='file.html')
For your count variable Y in your data dataX:
nbins <- 100 ## modify how many counts to bin together
ggplot2::ggplot(dataX, aes(Y)) +
geom_histogram(bins = nbins)
Is variance > mean?
var(dataX$Y)
mean(dataX$Y)
Package performance also has an overdispersion test/plot (but need to fit a Poisson or NB regression model first -- see below)
If the variance is much larger than the mean, skip this part. If they are roughly the same, then try fitting a Poisson model.
## using R core libraries
m.pois <- glm(Y ~ X1 + X2,
data = dataX,
family = poisson(link = "log"))
## IMPORTANT: check in output for the line
## "Residual deviance: ... on ... degrees of freedom"
summary(m.pois)
For Poisson models, check if the residual deviance divided by the degrees of freedom is larger than 1, an alternative model like the quasi-Poisson (which has an dispersion parameter) or a negative binomial should be considered as it means that the standard errors from the Poisson model are not accurate (and hence neither are the coefficient estimates).
At this stage, we can also run the overdispersion test from Gelman and Hill (2007, p.115) with check_overdispersion from the performance package -- p-value < 0.05 indicates overdispersion:
performance::check_overdispersion(m.pois)
or we can do this manually:
E2 <- resid(m.pois, type = "pearson")
N <- nrow(dataX)
p <- length(coef(m.pois))
sum(E2^2) / (N - p)
Alternatively, the DHARMa package (Residual Diagnostics for Hierarchical Multi-Level / Mixed Regression Models) also has a testDispersion() function, which tests for data dispersion by simulation, by default, or the same test as performance::check_overdispersion(). To use the simulation version:
m.pois.sim <- DHARMa::simulateResiduals(m.pois)
DHARMa::testDispersion(m.pois.sim)
plot(m.pois.sim)
Also good to visually check for overdispersion by plotting the standardized deviance residuals. A good-fitting model will have most of the points between [-2,2]. Usually when have variance > mean, i.e. overdispersion, more observed data points will lie above the mean=variance line.
plot(log( fitted(m.pois) ),
log( (dataX$Y - fitted(m.pois))^2 ),
xlab = expression(hat(mu)),
ylab = expression((y-hat(mu))^2))
title('Overdispersion check: Poisson model')
abline(0,1) ## variance = mean line
The quasi-Poisson is way of correcting the standard errors, not a distribution. It is basically a Poisson but weights the data with an estimated dispersion parameter τ, where τ > 1 indicates overdispersion and τ < 1 indicates underdispersion.
The problem with quasi models is that there is no maximum likelihood method and hence no IC or dispersion tests (not even with DHARMa).
m.qpois <- glm(Y ~ X1 + X2,
data = dataX,
family = quasipoisson)
summary(m.qpois)
This returns a dispersion estimate (φ; if > 1 then data is estimated to be over-dispersed)
For more robust coefficient estimates , we can re-compute the Wald tests using sandwich standard errors.
Cameron and Trivedi (2009) recommend using robust standard errors for the parameter estimates than the default Wald tests returned by glm() to control for mild violation of the distribution assumption that the variance equals the mean. We can use lmtest's coeftest():
lmtest::coeftest(m.qpois, vcov = sandwich)
or use package sandwich to obtain the robust standard errors (take the square root of the variances of the coefficients, which are on the diagonal of the variance-covariance matrix, to get the standard error) and calculate the p-values accordingly:
## calculate the variance-covariance matrix
m.qpois.swcov <- sandwich::vcovHC(m.qpois, type="HC0")
## variances of the coeffs are on the diag, std.err = sqrt(var)
std.err <- sqrt(diag(cov.m1))
## print out info and calc p-value
sn.nblks.pois.robustEst <- cbind(Estimate= coef(sn.nblks.pois),
"RobustSE" = std.err,
"Pr(>|z|)" = 2 * pnorm(abs(coef(sn.nblks.pois)/std.err), lower.tail=FALSE),
"95%LL" = coef(sn.nblks.pois) - 1.96 * std.err,
"95%UL" = coef(sn.nblks.pois) + 1.96 * std.err)
The NB distribution reputedly handles over-dispersion more completely.
m.nb <- MASS::glm.nb(Y ~ X1 + X2,
data = dataX,
link = "log")
summary(m.nb)
Anm alternative to the NB model that also is designed to handle data dispersion problems is the CMP (Conway & Maxwell 1962), which is a generalized It also is geometric or Bernouilli. CMP handles both over- and under-dispersion with two parameters: rate parameter λ and dispersion parameter ν. It is part of the exponential family and becomes Poisson when ν = 1, over-dispersed when ν < 1, under-dispersed when ν < 1.
The CMP is a special case of loglinear models (like ordinary Poisson regression) where the log of the dependent variable is expressed as a linear combination of the factors, so need to back-transform exp(Estimate) for interpretation. The variance of the CMP includes a dispersion parameter.
sn.nerrs.cmp <- mpcmp::glm.cmp(Y ~ X1 + X2,
data = dataX)
broom::tidy(sn.nerrs.cmp) ## print model estiomates in a tidy way
summary(sn.nerrs.cmp)
How many zeros are in our data?
sum(dataX$Y == 0)
For instance, take the NB model.
## grab est'd means:
m.nb.preds <- predict(m.nb,
type = "response")
## est'd dispersion/shape param of the NB distrib:
m.nb.theta <- summary(m.nb)$theta
## the distrib of each observation, using the 2 est'd NB params
## if this were a Poisson model, then use
## dpois(x = 0,
## lambda = predict(m.poiss, type = "response") ) ) )
m.nb.prob0 <- dnbinom(x = 0,
mu = m.nb.preds,
size = m.nb.theta )
The number of zeros predicted by this model is the sum of the probabilities (last line of code above):
round( sum(m.nb.prob0) )
This number should be about the same as the number of actual zeros in the data.
If not, consider treating the data as zero-inflated and use a zero-inflated model or hurdle model.
ZIP also assumes that mean=var -- so make sure that assumption holds in your data.
Example call:
m.zip <- pscl::zeroinfl(Y ~ X1 + X2,
data = dataX,
dist = "poisson" ## "negbin" or or "geometric")
)
Does it handle the number of observed zeros?
pr <- predict(m.zip,type="zero") # ?
mu <- predict(m.zip,type="count") # ?
zip <- pr + (1-pr)*exp(-mu) # also predict(m.zip,type="prob")[,1]
mean(zip)
sn.df %>% select(Sex,Grp) %>%
cbind(.,
Count = predict(m.zip, type = "count"),
Zero = predict(m.zip, type = "zero") ) %>%
group_by(X1,X2,Count,Zero) %>%
summarise(count_min = min(Count),
count_max = max(Count),
zero_min = min(Zero),
zero_max = max(Zero)
)
Example with hurdle Poisson:
m.hurdle_pois <- pscl::hurdle(Y ~ X1 + X2,
data = dataX,
dist = "poisson", # or "negbin"
zero.dist = "binomial")
By design, the hurdle model will always predict the same number of zeros observed. We can also predict the expected mean count using both components of the hurdle model. In a hurdle model, the expected count given the predictors is a product of two things: a ratio and a mean. The ratio is the probability of a non-zero in the zero process (logistic distribution) divided by the probability of a non-zero in the second untruncated process (Poisson or NB distribution). The mean is for the untruncated version of the positive-count process. For more details on this expression, truncated counts, and hurdle models in general, see Cameron and Trivedi (2013).
To get the mean count of both components of the hurdle model, extract the ratio and the mean by specifying the argument 'type' for the first 5 observations, multiply them and confirm they equal the expected hurdle count:
# multiply ratio of non-zero prob and mean for untruncated process
predict(m.hurdle_pois, type = "zero")[1:5] * predict(m.hurdle_pois, type = "count")[1:5]
# is the above equal to the hurdle model expected count below?
predict(m.hurdle_pois, type = "response")[1:5]
Check Clay Ford's RPub to see how the hurdle ratio is calculated.
Like with all regression models, check that model assumptions are met and fits are decent.
For count regression models, Tukey (1977) also suggested plotting hanging rootograms. In these plots, the observed frequencies are plotted as bars, which hang from a curve representing the model fit. When the base of bars are below zero on the y-axis (sqrt of frequency of that value), then the model underpredicts that value, suggesting overdispersion (bar above zero = overpredicts that value).
## Need to install countreg pkg from R-Forge instead of CRAN
## install.packages("countreg", repos="http://R-Forge.R-project.org")
maxct <- 20 ## up to how many counts?
countreg::rootogram(mod.hurdle, max = maxct)
Create a list of all your fitted models, e.g.
ms <- list("Pois" = m.pois,
"QPois" = m.qpois,
"NB" = m.nb,
"ZINB" = m.zinb,
"Hurdle-Pois" = m.hurdle_pois)
to facilitate model comparison -- need to assess on a variety of dimensions:
estimated regression coefficients
List the coefficient estimations:
sapply(ms, function(x) coef(ms))
How similar are the estimates? Package sjPlot also allows you to quickly plot the estimates from different models:
sjPlot::plot_models(m.pois, m.qpois, m.nb, m.zinb, m.hurdle_pois,
show.values = T,
show.p = F,
p.shape = T) +
theme_bw()
estimated standard errors
Also recalculate estimations based on sandwich estimation for the Poisson model:
cbind("Pois" = sqrt(diag(vcov(m.pois))),
"PoisSW" = sqrt(diag(sandwich(m.pois))),
sapply(fm[-1], function(x) sqrt(diag(vcov(x)))))
model likelihood
Theoretically, the higher the likelihood value, the higher the likelihood the model can generate the data (i.e. the better the model fit), but this is also disputed...
rbind(logLik = sapply(ms,
function(x) round(logLik(x), digits = 0)),
DoF = sapply(ms,
function(x) attr(logLik(x), "df")))
Check for any models with particularly low log-likelihood values, which would indicate a bad fitting model. Not sure why cannot calculate likelihood for quasi-Poisson and sandwich-adjusted Poisson models....
Do models predict as many observed zeros?
round(c("Obs" = sum(dt$ofp < 1),
"pois" = sum( dpois(0, fitted(m.pois)) ),
"NB" = sum( dnbinom(0, mu = fitted(m.nb),
size = m.nb$theta)),
"ZINB" = sum(predict(m.zinb, type = "prob")[,1]),
"Hurdle-Pois" = sum(predict(m.hurdle_pois, type = "prob")[,1])
)
)
Model comparison metrics
AIC answers question of whether it is worth having all the extra parameters. Likelihood ratio test (LRT) answers whether it is worth having any of the extra parameters.
AIC and BIC have different uses for different conditions -- AIC for tapering effects and BIC for when there are no effects at all or a few big effects and all others are zero effects (no intermediate effects, no tapering effects) (Burnham & Anderson 2004). BIC is better than AIC when there is more heterogeneity.
Lots of ways to do this in R.
Get several metrics AIC, BIC, LR at once:
vcdExtra::LRstats(m.pois, m.qpois, m.nb, m.zinb, sortby="BIC")
## alternatively:
performance::compare_performance(m.pois, m.qpois, m.nb, m.zinb)
zero-inflated data
when the model distribution cannot handle the number of zeros in the data
excess zeros
zeros in the data that occur in excess of a possible process, i.e. zeros that occurred due to randomness study design, sampling, or testing error.
structural / positive / true zeros
zeros in the data that are meaningful, i.e. not due to error or sampling.
over-dispersed data
data that has a (conditional) variance that is greater than the (conditional) mean. This is common in count data sets and can happen because of heterogeneity of the sample studied/nature of the dependent variable, clustering, important missing covariates in the model, or outliers. Poisson regression assumes that mean = variance, unlike negative binomial (NB) regression which explicitly models dispersion in the data.
log-likelihood
a metric of model fit. The natural logarithm of the likelihood, or the joint probability (i.e. probability density) of the observed data having been generated from the parameters of a statistical model. The log is taken because it transforms the product of the densities into a sum, which is mathematically more convenient and numerically more stable (no need to worry about difficulties in distinguishing zero from very small numbers or very large numbers from infinity which can be encountered because the maximum likelihood problem is often solved numerically)
residual deviance
= -2*loglikelihood
a measure of how well the response variable can be predicted by a model with p predictor variables. The lower the value, the better the model.
Chi-squared statistic for models = null deviance - residual deviance (where null deviance is when the model only has an intercept term and no other predictors) with the associated p-value based on p predictor variables as the degrees of freedom
## p-value of residual deviance.
1-pchisq(2*logLik(model0) - 2*logLik(model1),
length(coef(model0)) - length(coef(model1)) )
## If p is very small, check as well that we meet the rule of thumb that the number of cell counts < 5 should be <20% to take deviance test seriously
sum(fitted(model1)<5)/length(fitted(model1))
Poisson regression
regression modeling using the Poisson distribution, which gives the probability of an event happening a certain number of times within a given interval of time or space. The Poisson distribution is a discrete probability distribution characterized by one parameter, λ, which describes both the mean µ and the variance σ².
Negative binomial regression
modeling using the negative binomial distribution (NB, aka Pascal distribution), which gives the probability of the number of successes and number of failures in a given number of trials. It is often used to model count data that is over-/under-dispersed. The traditional NB regression model (NB2), is based on the Poisson-gamma mixture distribution. It is a generalization of the Poisson regression in that it has the same mean structure as the Poisson but an extra parameter to capture data dispersion.
JW Tukey (1977). Exploratory data analysis (Vol. 2). Reading, Mass.
AF Zuur, EN Ieno, N Walker, AA Saveliev, GM Smith (2009) Mixed Effects Models and Extensions in Ecology with R. Springer. [code/suppl material]