Data Augmentation

Data augmentation is an approach developed by Royle, Kéry, and a number of others, especially in a Bayesian context. It is based on two very simple ideas involving a) latent variables and b) imagining "data" that could potentially exist but have not actually been observed, and then modeling these "virtual data" along with our real data. Data augmentation has a number of important applications relevant to us, including abundance and species richness estimation, and is easily implemented in BUGS and JAGS. To illustrate, take a simple closed-population abundance CMR problem in which we observe (i.e., capture) 10 animals over 4 occasions. We might have capture histories something like this:

1010

1111

1000

0010

1001

1110

1011

1010

0010

0011

Just looking at these data we know 2 things about the problem (if we can assume the population is indeed closed over the 4 occasions). First, we know that there are at least 10 animals in the population, so a lower bound on N is 10. Second, we know that the capture probability p < 1; otherwise all the capture histories would have been 1111. If we take the simplest possible model, we have just these 2 parameters, N and p, assumed constant and homogeneous, and we want to use the data to estimate them (or to make posterior inference if we're being Bayesians). A quick likelihood sketch follows.
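To make this concrete, here is a minimal sketch (my illustration, not MARK's implementation) of fitting the constant-p closed model to the 10 histories above by maximizing the M0 likelihood directly in R, treating N as continuous; the function name negll and the parameterization are just choices for this example.

# The 10 observed capture histories from above
ch <- c("1010","1111","1000","0010","1001","1110","1011","1010","0010","0011")
Y  <- t(sapply(strsplit(ch, ""), as.integer))   # 10 x 4 matrix of 0/1
n  <- nrow(Y); T <- ncol(Y); y.tot <- sum(Y)    # n = 10, T = 4, y.tot = 21

# Negative log-likelihood for M0, up to a constant:
# log L(N, p) = lgamma(N+1) - lgamma(N-n+1) + y.tot*log(p) + (N*T - y.tot)*log(1-p)
negll <- function(par) {
  N <- n + exp(par[1])     # transform guarantees N >= n
  p <- plogis(par[2])      # logit transform keeps 0 < p < 1
  -(lgamma(N + 1) - lgamma(N - n + 1) +
      y.tot * log(p) + (N * T - y.tot) * log(1 - p))
}
fit <- optim(c(log(5), 0), negll)
c(N.hat = n + exp(fit$par[1]), p.hat = plogis(fit$par[2]))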

There are a number of ways to do this by maximum likelihood, e.g., as implemented in MARK, and we could also formulate a Bayesian model. If we did the latter, we would have to construct a model with priors on p and on U = N - 10, where U is the unknown number of unmarked animals. That is actually fairly easy, and for very simple problems it leads to a nice decomposition of the model into conditionally independent pieces that can be sampled via Gibbs sampling, something we don't do here but which is far more efficient (when it works) than Metropolis-Hastings (which is ordinarily what BUGS and JAGS use, although they actually use a sort of hybrid of the two). But this approach becomes slow, difficult, or unworkable for very large problems, especially those involving random effects and hierarchical structure. Also, prior distributions on U can be tricky to specify, and the algorithm can be sensitive to the choice of prior and to starting values.

Data augmentation allows us to transform the problem slightly in a way that speeds things up and eliminates some of the above issues. Essentially, we are going to convert the CMR problem into one that looks very much like an occupancy problem. We do this by adding an arbitrary number of all-zero rows to the capture data matrix, representing animals that we never captured but could have captured had they been in the population. How do we know how many to add? It turns out not to matter too much, as we'll see, so let's just start by adding 10 rows to the original 10; for now we are saying that there could be up to 10 more animals out there that we didn't capture.

1010

1111

1000

0010

1001

1110

1011

1010

0010

0011

0000

0000

0000

0000

0000  

0000

0000

0000

0000  

0000
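In code, the augmentation step is trivial: just bind all-zero rows onto the observed history matrix. A small sketch, reusing the matrix Y from the earlier snippet:

nz    <- 10                                       # number of all-zero pseudo-animals to add
Y.aug <- rbind(Y, matrix(0, nz, ncol(Y)))         # augmented capture-history matrix
dim(Y.aug)                                        # now 20 x 4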

We then model this structure in 2 ways. First, we model what we can call an inclusion probability, omega, which is the probability that an "animal" (i.e., one of our rows) is available for capture. Based on this, we model a latent variable z[i], which indicates whether "animal" i is included in the population. Finally, we model the probability of capture of "animal" i at each occasion as z[i]*p, meaning of course that an animal not in the population cannot be captured. A few points are worth noting. First, the total number of rows M and omega together imply a prior on N, since N is just the sum of the z[i]; with omega ~ Uniform(0, 1), the induced prior on N is essentially uniform on 0 to M. Second, M must be large enough that the posterior of N is not piled up against M; if it is, simply add more all-zero rows and refit. Third, the resulting model is a zero-inflated binomial, which is exactly why it resembles an occupancy model.
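In JAGS or BUGS this structure takes only a few lines. The following is a generic sketch of the constant-p, data-augmented model, with M the total number of rows (real plus all-zero) and variable names of my choosing:

model {
  omega ~ dunif(0, 1)              # inclusion probability
  p     ~ dunif(0, 1)              # capture probability
  for (i in 1:M) {                 # M = observed animals + all-zero rows
    z[i] ~ dbern(omega)            # latent state: is row i a real animal?
    for (t in 1:T) {
      y[i, t] ~ dbern(z[i] * p)    # a row can only produce 1s if z[i] = 1
    }
  }
  N <- sum(z[])                    # derived abundance
}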

Example

I modified an example from Kéry and Schaub section 6.2.4, where data are simulated under a heterogeneity (Mh) model, and then JAGS (with the jagsUI package in R) is used, after data augmentation, to provide inference. Note that, as indicated in the comments, the data structure can be simplified to a series of binomials (one for each animal), since we are assuming for now that there is no temporal variation in p. That is, for each animal the outcome is the number of occasions x out of T at which the animal (or pseudo-animal) was captured. In any case, you can see that virtually the whole posterior distribution of N lies to the right of the number of animals caught (around 80) and that the posterior mean of N is around 100, close to the true value, but the distribution has a long right tail, something to be expected when there is a lot of uncertainty due to individual heterogeneity. There may be a number of animals out there with a very low probability of being captured, and you are never going to get them in your sample. Oh well.
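Here is a condensed sketch in the spirit of that example; the simulation settings, priors, file name, and MCMC settings are my illustrative choices, not necessarily those of the book's code.

library(jagsUI)

# Simulate Mh data: N = 100 animals, T = 5 occasions, logit-normal p
set.seed(1)
N <- 100; T <- 5
p <- plogis(rnorm(N, qlogis(0.4), 1))   # individual capture probabilities
y.true <- rbinom(N, T, p)               # captures per animal out of T
y.obs  <- y.true[y.true > 0]            # we only see animals captured at least once

# Data augmentation: append nz all-zero pseudo-animals
nz <- 150
y  <- c(y.obs, rep(0, nz))

# JAGS model: zero-inflated binomial with logit-normal heterogeneity in p
cat("
model {
  omega  ~ dunif(0, 1)                 # inclusion probability
  mean.p ~ dunif(0, 1)                 # mean capture probability
  mu    <- logit(mean.p)
  sd     ~ dunif(0, 5)                 # heterogeneity SD on the logit scale
  tau   <- 1 / (sd * sd)               # JAGS parameterizes normals by precision
  for (i in 1:M) {
    z[i]        ~ dbern(omega)         # latent: real animal or not?
    lp[i]       ~ dnorm(mu, tau)       # individual random effect
    logit(p[i]) <- lp[i]
    y[i]        ~ dbin(z[i] * p[i], T) # binomial shortcut: p constant over time
  }
  N <- sum(z[])                        # derived abundance
}
", file = "Mh.jags")

out <- jags(data = list(y = y, M = length(y), T = T),
            inits = function() list(z = rep(1, length(y)), sd = 1),
            parameters.to.save = c("N", "mean.p", "sd", "omega"),
            model.file = "Mh.jags",
            n.chains = 3, n.iter = 25000, n.burnin = 5000, n.thin = 2)
print(out)

The line y[i] ~ dbin(z[i] * p[i], T) is the simplification mentioned above: with no temporal variation in p, the T Bernoulli outcomes for each animal collapse into a single binomial count.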

By the way, notice that this is in fact just another random effects model, but this time the random effects are with respect to individual animals, not time occasions. So there are a lot of them, which is one reason these models are so darned slow to run.

Next: Review exercise 2 - Bayesian analysis of CMR data