Simulation has many uses, but one of the most important is simulation for study design. The basic steps of a simulation for this purpose are:
Specify the parameters of the population to be sampled. Often this is based on a preliminary study, e.g., a study providing estimates of N of about 100 and p of about 0.3.
Specify how the sample data are distributed. Usually this involves making assumptions about statistical distributions, homogeneity of parameters, etc.
Specify the model under which parameters will be estimated. This model can have the same assumptions as the generating model (the model used to simulate the data) or it may be different. For example, we might generate data under a model in which p varies among individuals, but estimate N under a model that assumes p is homogeneous.
Estimate the parameter(s) of interest from the simulated data and keep track of their values in a data frame.
Repeat some large number of times and summarize the results, e.g.:
Mean of the estimates (compare to the true value: the difference is the bias)
Standard deviation of the estimates - an empirical estimate of sampling variation
CV = SD/mean - a measure of relative precision
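The attached code is not reproduced here, but the steps above can be sketched in a few lines. In this sketch I assume two capture occasions and Chapman's bias-corrected Lincoln-Petersen estimator; the values of N, p, and the number of replicates are illustrative, not taken from the attached code.

```python
# Sketch of a simulation study for closed CMR abundance estimation.
# Assumptions: two capture occasions, constant capture probability p,
# Chapman's bias-corrected Lincoln-Petersen estimator.
import numpy as np

rng = np.random.default_rng(42)

N, p, nsims = 100, 0.3, 1000           # true abundance, capture prob, replicates
N_hat = np.empty(nsims)

for i in range(nsims):
    occ1 = rng.random(N) < p           # captured on occasion 1?
    occ2 = rng.random(N) < p           # captured on occasion 2?
    n1, n2 = occ1.sum(), occ2.sum()    # captures per occasion
    m2 = (occ1 & occ2).sum()           # recaptures
    # Chapman's bias-corrected Lincoln-Petersen estimator
    N_hat[i] = (n1 + 1) * (n2 + 1) / (m2 + 1) - 1

bias = N_hat.mean() - N                # mean estimate minus truth
sd = N_hat.std(ddof=1)                 # empirical sampling variation
cv = sd / N_hat.mean()                 # relative precision
print(f"mean N.hat = {N_hat.mean():.1f}, bias = {bias:.1f}, SD = {sd:.1f}, CV = {cv:.3f}")
```

With these settings the mean of N.hat should land close to the true N, illustrating the (near) unbiasedness of the estimator when the generating and estimating assumptions match.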
The attached code performs these steps for an example of closed CMR to estimate N. In this example, the generating model and the estimating model are the same: p is assumed constant. We could have changed those assumptions; for example, p could follow a random distribution across individual animals (see the example of how to do this at the bottom of the code).
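Generating data with p varying among individuals is a small change to the data-generation step. A minimal sketch, assuming each animal's capture probability is drawn from a beta distribution (my choice of distribution and parameters, picked so the mean is still about 0.3):

```python
# Sketch: generating capture histories when p varies among individuals (Mh).
# The Beta(2, 4.67) distribution is an illustrative assumption (mean ~0.3).
import numpy as np

rng = np.random.default_rng(1)
N, K = 100, 5                           # abundance, number of capture occasions

p_i = rng.beta(2, 4.67, size=N)         # each animal gets its own capture prob

# capture-history matrix: rows = animals, columns = occasions
hist = rng.random((N, K)) < p_i[:, None]
observed = hist[hist.any(axis=1)]       # drop animals never captured
print(observed.shape[0], "animals observed at least once")
```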
For comparison I also ran the approach based on expected values under the constant-p model that we saw in Week 6. The results are very similar to the simulation-based results when the number of simulation replications is 100 or more, as expected. Since the expected value approach is much faster (virtually instantaneous vs. several minutes) you are probably wondering why we bother with simulation. Several reasons:
The expected value approximation only works when we can assume the constant-p model. If the situation likely involves variation due to behavior, temporal factors, or individual heterogeneity, it is not the appropriate approach.
It also requires the estimation model and the simulation model to be the same (again, constant p), which is not a requirement for simulation. With simulation we can generate data under one set of assumptions and estimate parameters under different assumptions (or under several models, each with different assumptions).
So, the expected value approximation is a good start to get you in the ballpark, but almost always you will want to do some simulation under more realistic assumptions.
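As one small illustration of why the expected value route is instantaneous: under the constant-p model the key quantities are simple closed-form expressions, so no replication is needed. The formulas below are standard for a K-occasion constant-p design; treating them as "the Week 6 approach" is my assumption.

```python
# Sketch of the expected-value shortcut under the constant-p model.
# With K occasions, the chance an animal is caught at least once is
# p_star = 1 - (1 - p)^K, so the expected number of distinct animals
# caught is N * p_star -- computed directly, with no simulation loop.
N, p, K = 100, 0.3, 5
p_star = 1 - (1 - p) ** K
print(f"p* = {p_star:.3f}; expected animals captured at least once = {N * p_star:.1f}")
```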
Finally, as alluded to above, we can do things like simulate data under one model (say Mh) and estimate under another (say M0). I did that in the example at the bottom of the code. Try running this for 100 replicates and compare to the results where the data were generated under the constant-p model. You should see two things:
Estimator precision goes down (CV goes up)
The mean N.hat is now substantially different from N. This is bias.
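The mismatch experiment can be sketched as follows. This is not the attached code: I assume beta-distributed individual capture probabilities for the generating (Mh) model and solve the constant-p (M0) likelihood equations by fixed-point iteration; all parameter values are illustrative.

```python
# Sketch: generate under Mh (heterogeneous p), estimate under M0 (constant p),
# to illustrate the resulting negative bias in N.hat.
import numpy as np

rng = np.random.default_rng(7)
N, K, nsims = 100, 5, 200
N_hat = np.empty(nsims)

for i in range(nsims):
    p_i = rng.beta(2, 4.67, size=N)             # individual capture probs, mean ~0.3
    hist = rng.random((N, K)) < p_i[:, None]    # capture histories
    M = hist.any(axis=1).sum()                  # distinct animals caught
    n_tot = hist.sum()                          # total captures over all occasions
    # M0 likelihood equations: M = N.hat * (1 - (1 - p.hat)^K),
    # with p.hat = n_tot / (K * N.hat); solve by fixed-point iteration
    Nh = float(M)
    for _ in range(100):
        p_hat = n_tot / (K * Nh)
        Nh = M / (1 - (1 - p_hat) ** K)
    N_hat[i] = Nh

print(f"true N = {N}, mean N.hat = {N_hat.mean():.1f}")
```

Because M0 ignores the heterogeneity (the hardest-to-catch animals are systematically missed), the mean of N.hat falls noticeably below the true N.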
The upshot is that by violating model assumptions we get an estimate that is both biased and imprecise, in other words, inaccurate.
Finally, note that all of the above is generic: it is not specific to closed CMR for abundance estimation, or for that matter to CMR at all. It is a general approach for investigating the properties of estimators under hypothetical circumstances that are under the investigator's control.