There are a number of ways to estimate the sample size needed for a particular problem; their ease and complexity depend on our assumptions about the population and the sampling situation. Generically, the approaches fall into mathematical approaches, such as those based on normal theory, and simulation approaches, where we simulate data under possible designs and evaluate their performance for a candidate estimator. Which approach we use also depends on the type of question we are asking: for instance, whether we are simply interested in obtaining a precise (short-CI) estimate, or in designing a study to detect a specific difference at a given combination of Type I and Type II error rates.
Normal theory (Finite and infinite population sampling)
Most standard sample size formulas in textbooks rest on some kind of asymptotic assumption, often normal theory. These work well for a number of problems but break down in many of our later applications; still, they are a good place to start. In the attached code I build and apply a user-defined function sample_size() to estimate the sample size n needed for a desired precision of an estimated mean, where we are sampling from a finite population of N available units (see Formula 5.1 of Conroy and Carroll 2009). Sometimes we either don't know N or assume that N is very large relative to n, in which case we simply make N a very big number. The last two functions in the code take a given sample size n and allocate it to strata under stratified random sampling, under either proportional allocation (considering only the relative sizes of the strata) or optimal allocation (also considering the cost and variance of sampling if these vary among strata).
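The attached code itself is not reproduced here, but the formulas involved are standard, so here is a minimal Python sketch of what such functions might look like. The name sample_size() mirrors the function described above; the allocation function names are hypothetical. The sketch assumes the usual normal-theory sample size n0 = (z·s/d)^2 with finite population correction n = n0/(1 + n0/N), proportional allocation n_h = n·N_h/N, and Neyman-style optimal allocation with n_h proportional to N_h·s_h/√c_h:

```python
import math

def sample_size(s, d, N, z=1.96):
    """Sample size to estimate a mean to within +/- d (95% CI
    half-width by default), with a finite population correction
    for a population of N available units."""
    n0 = (z * s / d) ** 2          # infinite-population sample size
    return math.ceil(n0 / (1 + n0 / N))

def proportional_allocation(n, N_h):
    """Allocate total sample n to strata in proportion to the
    stratum sizes N_h (relative stratum sizes only)."""
    N = sum(N_h)
    return [round(n * Nh / N) for Nh in N_h]

def optimal_allocation(n, N_h, s_h, c_h):
    """Neyman-style optimal allocation: n_h proportional to
    N_h * s_h / sqrt(c_h), where s_h is the stratum SD and
    c_h the per-unit sampling cost in stratum h."""
    w = [Nh * sh / math.sqrt(ch) for Nh, sh, ch in zip(N_h, s_h, c_h)]
    W = sum(w)
    return [round(n * wi / W) for wi in w]
```

For example, sample_size(s=10, d=2, N=500) gives a noticeably smaller n than the same call with a very large N, which is the finite population correction at work; with equal costs, optimal_allocation reduces to allocating in proportion to N_h·s_h.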
Simulation approach
Simulation offers a great deal of flexibility in modeling the data problem, and will usually be our method of choice when we get into sample size problems in more complex models such as capture-mark-recapture. The basic idea of simulation involves these steps:
1. Decide on the distributional assumptions of the data. These can be very complex and non-standard, but we'll start with some simple 'standard' examples to illustrate.
2. Decide on candidate values for the parameters of the population from which the data will be taken. Often these are estimates (means, SDs, etc.) from a pilot sample; sometimes we will conduct an analysis and plug in the pilot estimates, all in the same code.
3. Select a candidate sample size n and draw a sample of this size from the population.
4. Compute the estimate from the sample.
5. Repeat this process a large number of times.
6. Report summary values for the resulting distribution of estimates (mean, SD, CV, quantiles, etc.).
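The steps above can be sketched compactly. This is an illustrative Python version, not the attached code; it assumes normally distributed data with candidate parameter values mu and sigma, and summarizes the precision (CV) of the sample mean across replicate samples:

```python
import random
import statistics

def simulate_cv(n, mu=50.0, sigma=10.0, reps=10_000, seed=1):
    """Draw `reps` samples of size n from Normal(mu, sigma),
    compute the sample mean each time, and summarize the
    distribution of the resulting estimates."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
             for _ in range(reps)]
    m = statistics.fmean(means)
    sd = statistics.stdev(means)
    return {"mean": m, "sd": sd, "cv": sd / m}

# Compare candidate sample sizes: CV should shrink roughly as 1/sqrt(n)
for n in (10, 25, 50, 100):
    print(n, round(simulate_cv(n)["cv"], 4))
```

Running the loop over several candidate n and picking the smallest n whose CV meets the precision target is the basic sample size decision; here the CV of the mean should track sigma/(mu·√n).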
I've set up code to simulate data and precision (CV) results for a couple of simple cases involving normally distributed data and Poisson count data, and then generalized the last of these to inject a random effect (so technically violating Poisson assumptions).
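Again as an illustration rather than the attached code, the Poisson-with-random-effect idea can be sketched as follows: each observation's rate is the nominal mean multiplied by a lognormal random effect (a normal deviate on the log scale, bias-corrected so the expected rate stays at lam), so sd_re > 0 induces overdispersion and the counts are no longer strictly Poisson:

```python
import math
import random
import statistics

def simulate_poisson_cv(n, lam=5.0, sd_re=0.0, reps=5_000, seed=1):
    """CV of the sample mean for counts that are Poisson given a
    per-observation rate; sd_re is the SD of a normal random effect
    on the log scale (0 recovers plain Poisson data)."""
    rng = random.Random(seed)

    def draw_pois(rate):
        # Knuth's multiplication method; fine for modest rates
        L, k, p = math.exp(-rate), 0, 1.0
        while True:
            p *= rng.random()
            if p <= L:
                return k
            k += 1

    est = []
    for _ in range(reps):
        sample = [draw_pois(lam * math.exp(rng.gauss(0.0, sd_re)
                                           - sd_re ** 2 / 2))
                  for _ in range(n)]
        est.append(statistics.fmean(sample))
    m = statistics.fmean(est)
    sd = statistics.stdev(est)
    return sd / m
```

Comparing simulate_poisson_cv(n, sd_re=0.0) against the same call with sd_re > 0 shows the practical consequence: the extra-Poisson variation inflates the CV, so a sample size chosen under pure Poisson assumptions would be too small.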