Once we have estimated parameters under a particular model, we should see whether the model appears to fit the data. Lack of fit can indicate violation of model assumptions and can lead to biased estimation. There are several approaches to goodness of fit that have been applied to CMR analyses.
Expected value approach
In the simplest approach, we first compute maximum likelihood estimates of the model parameters and then, taking these as the parameter values, compute expected values of the data (or summaries of the data). We then compare the observed to the expected values and compute a test statistic (such as a chi-square). For example, with the Lincoln-Petersen model we have 3 parameters: N, p1, and p2. We can compute maximum likelihood estimates of these (N^, p1^, p2^) and then use them to compute expected values for the 3 observable capture history frequencies: x10, x01, and x11, the numbers of animals observed on the first occasion only, the second occasion only, or both occasions. The expected values would be computed as
E(x10)= Np1(1-p2)
E(x01)= Np2(1-p1)
E(x11)= Np1p2
with the MLEs N^, p1^, p2^ replacing N, p1, and p2. We then form a chi-square statistic by computing
(Observed-Expected)^2/Expected
and then summing these values. This statistic should follow a chi-square distribution with df = (number of observation cells) - (number of parameters estimated). For the Lincoln-Petersen model that yields 0 df, so there is no formal test, but the chi-square value will still give us an idea of whether the model is "off" or not (the value should be around 1 or less). I have implemented this for the earlier rabbit data here. If we constructed a different model, say one assuming p1 = p2, then we would have a test with 1 df and a computable p-value.
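To make the mechanics concrete, here is a minimal R sketch of this expected-value check for a two-occasion (Lincoln-Petersen type) model; the capture history counts and the fitted values below are hypothetical placeholders, not the rabbit data referenced above.

# Chi-square GOF via expected values for a 2-occasion (Lincoln-Petersen) model
# obs: counts of the 3 observable histories; N, p1, p2: fitted values (MLEs)
lp_gof <- function(obs, N, p1, p2) {
  expected <- c(x10 = N * p1 * (1 - p2),
                x01 = N * p2 * (1 - p1),
                x11 = N * p1 * p2)
  chisq <- sum((obs - expected)^2 / expected)
  list(expected = expected, chisq = chisq)
}

# Hypothetical counts and (approximate) Lincoln-Petersen estimates for them
obs <- c(x10 = 45, x01 = 30, x11 = 15)
lp_gof(obs, N = 180, p1 = 0.33, p2 = 0.25)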
Contingency tables
For larger problems (k > 2 capture occasions) the above approach tends to fall apart, because there are many possible capture histories (e.g., 1023 if k = 10) but most of these will typically have frequencies of 0 or 1. An alternative is to form various summaries of the data and apply a series of tests based on contingency tables. This is more or less the approach we will use later for CJS models, and it is implemented in the program RELEASE. This approach can also be useful for closed models, but mainly for testing for heterogeneity between samples (e.g., evidence that capture probabilities differ among study areas).
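To give the flavor of this idea, here is a small sketch of a contingency-table test for heterogeneity between two study areas; the areas and counts are made up for illustration.

# Hypothetical numbers of marked animals recaptured vs. not recaptured in two areas;
# a standard chi-square contingency test asks whether recapture probability differs
tab <- matrix(c(18, 42,    # area A: recaptured, not recaptured
                35, 25),   # area B: recaptured, not recaptured
              nrow = 2, byrow = TRUE,
              dimnames = list(area = c("A", "B"),
                              fate = c("recaptured", "not_recaptured")))
chisq.test(tab)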
Deviance approach
The deviance, computed as -2 ln(likelihood), is a measure of fit that by definition is minimized under maximum likelihood. Theoretically, the deviance follows a chi-square distribution under the null hypothesis that the model is true, with degrees of freedom that depend on the data dimensions and the number of parameters. Deviance/df is a measure of lack of fit (or overdispersion) and should be close to 1 if the model assumptions are being met. The deviance is computed automatically for all our MLE models, but it tends to overstate lack of fit in practice, which motivates the more computationally intensive bootstrap approach discussed below.
We can illustrate the use of the deviance as a measure of fit by returning to the edwards.eberhardt example and fitting the null (M0) model. Focusing on just the deviance and related output, we get
> deviance<-m0$results$deviance
> df<-m0$results$deviance.df
> pvalue<-1.-pchisq(deviance,df)
> print("chisq, df, p")
[1] "chisq, df, p"
> print(c(deviance,df,pvalue))
[1] 364.8259 42.0000 0.0000
> print("c-hat=deviance/df")
[1] "c-hat=deviance/df"
> print(deviance/df)
[1] 8.686332
This seems to indicate a strong lack of fit for this model (significant chi-square value, deviance/df >> 1). The code to perform these computations in R is here. We will return to this example and use the alternative bootstrap method to assess fit below.
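The linked script is not reproduced here, but a minimal sketch of how the M0 fit and this deviance check might be set up with RMark (assuming the usual specification with p and c constant and shared; the script may differ in details) looks something like this:

library(RMark)

# edwards.eberhardt: example closed-capture data set distributed with RMark
data(edwards.eberhardt)

# Null model M0: constant capture probability, with p and c shared
m0 <- mark(edwards.eberhardt, model = "Closed",
           model.parameters = list(p = list(formula = ~1, share = TRUE)),
           output = FALSE, delete = TRUE)

# Deviance-based goodness-of-fit check
deviance <- m0$results$deviance
df <- m0$results$deviance.df
pvalue <- 1 - pchisq(deviance, df)
print(c(deviance, df, pvalue))
print(deviance/df)   # c-hat = deviance/df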
Parametric bootstrap approach
As noted above, the deviance and other approaches can overstate lack of fit. As an alternative, we can use a computer-intensive, simulation-based approach called the parametric bootstrap. The basic steps (sketched in code following the example below) are:
Select a fit statistic, such as the deviance, and compute its value for the data at hand under a specified model (e.g., M0, the null model, for simplicity)
Estimate the parameters of this model from the data, obtaining maximum likelihood estimates of the parameters as well as a value for the deviance
Then, assuming that the specified model is correct and that the estimated parameter values are the true parameter values for this model, simulate capture histories. In the simple example this means generating N rows of capture histories (one for every animal in the "population"), then flipping a coin with probability of heads = p for capture at each of the k capture occasions and assigning 1 for captured (heads) and 0 for not captured (tails)
Use the simulated capture histories as data to compute estimates of N, p, and the deviance
Repeat this process a large number of times (say 1000) to generate a distribution of deviance values
Finally, compare the observed deviance (from your data) to the distribution of simulated deviances to see if the observed value is "unusually large" (e.g., greater than the 95th percentile of the simulated values). If it is, we conclude there is lack of fit; otherwise not. We can also compute a bootstrap c-hat statistic as (observed deviance)/(mean simulated deviance) and judge whether this value is much larger than 1.
We can illustrate this approach by returning to the edwards.eberhardt example. The results based on the data provide estimates of N = 96.25 (which rounds to N = 96 for the simulation), p = 0.082, and deviance = 364.83. The values of N and p are used to simulate data for each of the 96 animals and 18 capture occasions as Bernoulli outcomes with probability of success p = 0.082. Each simulated data set is run through RMark to compute estimates and a deviance, and these are summarized (means and 95% intervals). Here are results for 100 simulations:
[1] "simulated deviance results"
> x1
$dev_mean
[1] 296.7123
$dev_l
2.5%
234.662
$dev_u
97.5%
354.9116
>
By comparison, the observed deviance (364.8) is a bit higher than the upper 97.5% quantile of the simulated values (about 355), indicating some lack of fit:
> print("observed deviance")
[1] "observed deviance"
> deviance
[1] 364.8259
We can also examine the ratio of the observed deviance to the mean simulated value and compute a bootstrap c-hat value:
> print("bootstrap c-hat")
[1] "bootstrap c-hat"
> deviance/x1$dev_mean
[1] 1.229561
Here is the code to perform these calculations.
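The linked code is not reproduced here, but a minimal sketch of this parametric bootstrap, using the estimates reported above (N = 96, p = 0.082, k = 18 occasions) and assuming the same M0 specification as before, might look like this:

library(RMark)

# Estimates from the M0 fit to the edwards.eberhardt data (see above)
N.hat <- 96
p.hat <- 0.082
k <- 18
nreps <- 100   # increase to 1000 or more for a more reliable distribution

sim.dev <- numeric(nreps)
for (i in 1:nreps) {
  # Simulate capture histories: N animals x k occasions of Bernoulli(p) outcomes
  caps <- matrix(rbinom(N.hat * k, size = 1, prob = p.hat), nrow = N.hat, ncol = k)
  caps <- caps[rowSums(caps) > 0, , drop = FALSE]   # animals never captured are not observed
  simdata <- data.frame(ch = apply(caps, 1, paste, collapse = ""),
                        stringsAsFactors = FALSE)

  # Refit M0 to the simulated data and save its deviance
  fit <- mark(simdata, model = "Closed",
              model.parameters = list(p = list(formula = ~1, share = TRUE)),
              output = FALSE, delete = TRUE, silent = TRUE)
  sim.dev[i] <- fit$results$deviance
}

# Compare the observed deviance to the simulated distribution
obs.dev <- 364.83
quantile(sim.dev, c(0.025, 0.975))   # simulated 95% interval
mean(sim.dev >= obs.dev)             # proportion of simulated deviances >= observed
obs.dev / mean(sim.dev)              # bootstrap c-hat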
Note that the simulation will be more reliable as the number of replicates gets large (1000 or more); however, this will tend to be slow, since the data have to be simulated and the model run each time. Try running 1000 simulations to see if you get different results.
Things to keep in mind
Although it is a good idea to see if a particular model fits the data, goodness of fit is no panacea.
Just because a model fits the data does not mean it is "true". This will particularly be the case when data are sparse -- often very simple models will fit, but only because the data cannot support more complexity.
Conversely, a closer fit of a model to the data is not always a good thing. Remember, we can make a model fit arbitrarily well just by adding parameters (which will make the likelihood go up and the deviance go down). But we pay a price for adding too many parameters -- increased variance of the estimates. This is where AIC comes in -- balancing fit (bias is reduced by adding parameters) against variance (which is reduced by taking parameters away).
The above ideas lead to a general approach along these lines:
See if any model in your set of candidate models fits the data reasonably well. Often this will be a general (so-called global) model that contains many parameters
Compute AIC for the models in the model set (including the "global model") to see which one(s) perform best (have the lowest AIC)
Use either the lowest-AIC model or (better) compute model-averaged parameter estimates using the AIC weights.
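In RMark, this workflow might be sketched as follows; the candidate set here (M0 and Mt for the edwards.eberhardt data) is just an illustration, and note that the closed models report f0 (the number never captured) rather than N directly.

library(RMark)
data(edwards.eberhardt)

# A small illustrative candidate set of closed models
m0 <- mark(edwards.eberhardt, model = "Closed",
           model.parameters = list(p = list(formula = ~1, share = TRUE)),
           output = FALSE, delete = TRUE)
mt <- mark(edwards.eberhardt, model = "Closed",
           model.parameters = list(p = list(formula = ~time, share = TRUE)),
           output = FALSE, delete = TRUE)

# AIC table for the candidate set
model.set <- collect.models()
model.set$model.table

# Model-averaged estimate of f0 using the AIC weights
# (N-hat = f0-hat + number of distinct animals captured)
model.average(model.set, parameter = "f0")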