Simulation has many uses, including:
Goodness of fit testing
Study design
Evaluation of estimator accuracy and robustness to assumption violations
Simulation for goodness of fit
We have already seen simulation applied to evaluating model fit, when we discussed bootstrap simulation earlier. The approach was illustrated by the Dipper goodness of fit example. To summarize the steps again quickly:
A general ("global") model, such as the Phi(g*t)p(g*t) model in CJS, is fit to the data set, and the deviance or another measure of model discrepancy is computed.
The estimates are treated as fixed parameter values and used to simulate data of the same dimensions (number of groups and occasions) under the general model; a deviance is calculated for each simulated data set.
The mean and quantiles of the simulated deviances are compared to the value from the sample (real) data set. If the sample deviance is larger than expected by chance, judged against the simulated distribution, the model is said to lack fit.
Often the sample deviance divided by the mean simulated deviance is taken as an estimate of c, the variance inflation factor, and is used to adjust likelihoods, variances, and AIC values.
These steps are carried out for the Dipper example in the script below.
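To make the mechanics concrete, here is a minimal Python sketch of the parametric bootstrap, using a simple binomial constant-p model as a stand-in for the full CJS likelihood. The group counts, the 500 replicates, and all function names are illustrative assumptions, not the Dipper analysis:

```python
import math
import random

def binom_loglik(y, n, p):
    # binomial log-likelihood kernel, with the convention 0*log(0) = 0
    ll = 0.0
    if y > 0:
        ll += y * math.log(p)
    if n - y > 0:
        ll += (n - y) * math.log(1.0 - p)
    return ll

def deviance(ys, ns):
    # deviance of the constant-p model against the saturated model
    phat = sum(ys) / sum(ns)
    l_mod = sum(binom_loglik(y, n, phat) for y, n in zip(ys, ns))
    l_sat = sum(binom_loglik(y, n, y / n) for y, n in zip(ys, ns))
    return 2.0 * (l_sat - l_mod)

rng = random.Random(1)
ys, ns = [7, 12, 4, 15], [20, 20, 20, 20]   # made-up group counts
d_obs = deviance(ys, ns)                    # discrepancy for the "real" data
phat = sum(ys) / sum(ns)                    # fitted value, treated as fixed

# parametric bootstrap: simulate from the fitted model, refit, collect deviances
sim_devs = []
for _ in range(500):
    ys_sim = [sum(rng.random() < phat for _ in range(n)) for n in ns]
    sim_devs.append(deviance(ys_sim, ns))

c_hat = d_obs / (sum(sim_devs) / len(sim_devs))     # variance inflation factor
p_value = sum(d >= d_obs for d in sim_devs) / len(sim_devs)
```

A small p_value (observed deviance in the upper tail of the simulated distribution) indicates lack of fit, and c_hat is the ratio used to adjust likelihoods, variances, and AIC.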
Simulation for study design
Simulation can also be a very important tool for study design, by helping us to quickly visualize data that could arise from a given design, and the implications for estimator accuracy and precision. The basic steps are illustrated here for a closed CMR analysis, focusing on estimation of abundance.
Specify estimated values for the parameters of the population to be sampled. Often these are based on a preliminary study, e.g., a study providing estimates of N approx. 100 and p approx. 0.3
Specify how the sample data are distributed. Usually this involves making assumptions about statistical distributions, homogeneity of parameters, etc.
Specify the model under which parameters will be estimated. This model can have the same assumptions as the generating model (the model used to simulate the data) or it may be different. For example, we might generate data under a model in which p varies among individuals, but estimate N under a model that assumes p is homogeneous.
Estimate the parameter(s) of interest from the simulated data and keep track of their values in a data frame
Repeat some large number of times and summarize the results, e.g.:
Mean estimates (compare to true value: difference = bias)
Standard deviation - empirical estimate of sampling variation
CV = SD/mean = measure of relative precision
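In outline, the steps above might look like the following Python sketch for closed CMR under model M0. The values N = 100, p = 0.3, T = 5 occasions, and 200 replicates are assumed for illustration, and the grid-search MLE is a simple stand-in for whatever estimation routine you actually use:

```python
import math
import random
import statistics

def profile_loglik(N, M, ndot, T):
    # M0 profile log-likelihood for N, with p profiled out at ndot/(N*T)
    if N < M or ndot <= 0 or ndot >= N * T:
        return -math.inf
    phat = ndot / (N * T)
    return (math.lgamma(N + 1) - math.lgamma(N - M + 1)
            + ndot * math.log(phat) + (N * T - ndot) * math.log(1 - phat))

def estimate_N(M, ndot, T, Nmax=1000):
    # crude integer grid-search MLE of N
    return max(range(max(M, 1), Nmax), key=lambda N: profile_loglik(N, M, ndot, T))

rng = random.Random(42)
N_true, p, T, nsim = 100, 0.3, 5, 200    # assumed pilot-study values

Nhats = []
for _ in range(nsim):
    # one capture count per animal over T occasions, constant p
    caps = [sum(rng.random() < p for _ in range(T)) for _ in range(N_true)]
    M = sum(c > 0 for c in caps)         # distinct animals captured
    ndot = sum(caps)                     # total captures
    Nhats.append(estimate_N(M, ndot, T))

mean_N = statistics.mean(Nhats)          # compare to N_true: difference = bias
sd_N = statistics.stdev(Nhats)           # empirical sampling variation
cv_N = sd_N / mean_N                     # relative precision
```

Changing N_true, p, or T and rerunning shows how design choices (more occasions, higher capture effort) trade off against precision.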
The attached code performs these steps for an example of closed CMR to estimate N. In this example, the generating model and the estimating model are the same: p is assumed constant. We could have changed those assumptions; for example, p could vary randomly across individual animals (see the example of how to do this at the bottom of the code).
For comparison I provide an approach based on the expected values of the constant p model. The results are very similar to the simulation-based results when the number of simulation replications is 100 or more, as expected. Since the expected value approach is much faster (virtually instantaneous vs. several minutes), you may wonder why we bother with simulation at all. Several reasons:
The expected value approximation only works when we can assume the constant p model. If there is likely to be variation due to behavior, temporal factors, or individual heterogeneity, it is not the appropriate approach.
It requires us to assume the estimation model and the generating model are the same (again, constant p); this is not a requirement for simulation. With simulation we can generate data under one set of assumptions and estimate parameters under different assumptions (or under several models, each with different assumptions).
So, the expected value approximation is a good start to get you in the ballpark, but almost always you will want to do some simulation under more realistic assumptions.
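A minimal sketch of the expected-value idea, assuming the same constant-p setup: plug the expected summary statistics E[M] (distinct animals caught) and E[total captures] into the M0 profile likelihood, maximize once, and approximate the SE from the numerical curvature at the maximum. The parameter values are assumed, and the curvature-based SE is one of several possible approximations:

```python
import math

T, N_true, p = 5, 100.0, 0.3
q = 1.0 - p
EM = N_true * (1.0 - q**T)       # expected number of distinct animals caught
Endot = N_true * T * p           # expected total captures

def profile_loglik(N, M, ndot, T):
    # M0 profile log-likelihood in (continuous) N, p profiled out
    phat = ndot / (N * T)
    if N < M or phat <= 0.0 or phat >= 1.0:
        return -math.inf
    return (math.lgamma(N + 1.0) - math.lgamma(N - M + 1.0)
            + ndot * math.log(phat) + (N * T - ndot) * math.log(1.0 - phat))

# ternary search: one maximization replaces thousands of simulated fits
lo, hi = EM, 1000.0
for _ in range(200):
    m1 = lo + (hi - lo) / 3.0
    m2 = hi - (hi - lo) / 3.0
    if profile_loglik(m1, EM, Endot, T) < profile_loglik(m2, EM, Endot, T):
        lo = m1
    else:
        hi = m2
N_hat = (lo + hi) / 2.0

# numerical curvature of the profile log-likelihood -> approximate SE
h = 0.5
curv = (profile_loglik(N_hat + h, EM, Endot, T)
        - 2.0 * profile_loglik(N_hat, EM, Endot, T)
        + profile_loglik(N_hat - h, EM, Endot, T)) / h**2
se_hat = 1.0 / math.sqrt(-curv)
```

One function evaluation per design, which is why this approach is virtually instantaneous: but it is tied to the constant-p assumptions described above.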
Simulation for estimator robustness
Often, we are interested in "what happens" if we use a particular model (say M0) to fit the data, but the data really are better described by a different model (say Mh). In this specific case, we would say that "the assumption of homogeneous p is being violated and p exhibits a degree of individual heterogeneity." We might be more specific and ask what happens if p has a specific amount of heterogeneity, say that represented by a variance and a statistical distribution such as the Beta. By "what happens" we generally mean "what happens to the bias and precision (together, the accuracy) of the estimates." We can evaluate this using simulation and the following basic steps:
Specify the model we think could represent the data, say a model with heterogeneous p
Generate data under the above model
Use these data to estimate parameters under the model whose robustness (sensitivity to assumption violation) we are trying to evaluate
Compare the estimates to the true values of the parameters
Replicate and compute bias, mean square error, and variance.
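A sketch of these steps in Python, generating under an Mh-style model (individual p_i drawn from a Beta(1.2, 2.8), giving mean 0.3 with substantial spread; these values, the replicate count, and the grid-search estimator are all illustrative assumptions) and estimating under M0:

```python
import math
import random
import statistics

def profile_loglik(N, M, ndot, T):
    # M0 profile log-likelihood for N, with p profiled out at ndot/(N*T)
    if N < M or ndot <= 0 or ndot >= N * T:
        return -math.inf
    phat = ndot / (N * T)
    return (math.lgamma(N + 1) - math.lgamma(N - M + 1)
            + ndot * math.log(phat) + (N * T - ndot) * math.log(1 - phat))

def estimate_N(M, ndot, T, Nmax=1000):
    # M0 estimator: the model whose robustness we are evaluating
    return max(range(max(M, 1), Nmax), key=lambda N: profile_loglik(N, M, ndot, T))

rng = random.Random(7)
N_true, T, nsim = 100, 5, 200

Nhats = []
for _ in range(nsim):
    # generating model Mh: each animal gets its own capture probability
    ps = [rng.betavariate(1.2, 2.8) for _ in range(N_true)]
    caps = [sum(rng.random() < pi for _ in range(T)) for pi in ps]
    M = sum(c > 0 for c in caps)
    ndot = sum(caps)
    Nhats.append(estimate_N(M, ndot, T))     # estimate under the wrong model, M0

bias = statistics.mean(Nhats) - N_true
rmse = math.sqrt(statistics.mean((n - N_true) ** 2 for n in Nhats))
```

With this kind of heterogeneity the M0 estimates of N are biased low, because hard-to-catch animals are underrepresented among the distinct captures.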
I have applied this approach to a closed CMR example, evaluating robustness of M0 when Mh is the "true" model. See the attached script.
Simulation of different types of CMR data
The above examples illustrate simulation for specific data structures / models (CJS, closed CMR), but obviously CMR (and, more broadly, statistical inference in general) deals with many other types of data. However, the generic problem is the same. As long as we can define the statistical likelihood, we can generate data under assumed conditions / parameter values and proceed with bootstrapping, sampling simulation, or evaluation of robustness. Elsewhere I have assembled code that provides Monte Carlo simulation for common data structures. This code can be readily adapted to specific needs, whether these are for bootstrapping, evaluation of alternative sampling designs, or evaluation of estimator robustness under assumption violations.