Commonly we have multiple data structures that may be related to the same parameters we are interested in modeling in the population. For example, abundance (N) may appear either explicitly or implicitly in the model structure for 2 data sets. If we can reasonably assume that both samples represent the same (or at least overlapping) populations, then we should be able to "share" information from the 2 (or more) data structures to obtain inference. This can be particularly useful when one type of data is relatively more expensive or difficult to collect (but perhaps better at estimating the parameter) than the other type. In such cases it can be more efficient to take a larger sample using the "cheap" data, but augment this with a sub-sample on which the "expensive" data are also collected. This general approach, known as double sampling, is very common in sample survey design and may already be familiar to you (think: aerial photos and timber cruising; or aerial surveys of ducks with calibration via ground counts).
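The mechanics of double sampling can be sketched with a simple ratio estimator, one standard way to calibrate the cheap sample against the expensive sub-sample. The counts below are made up purely for illustration:

```python
# Hypothetical double-sampling sketch: a "cheap" count on all 10 units
# (e.g., aerial counts) plus an accurate "expensive" count on a sub-sample
# of the first 5 units (e.g., ground counts). All values are invented.
cheap = [12, 8, 15, 10, 9, 14, 11, 13, 7, 16]
expensive = [15, 10, 18, 13, 11]

# Ratio estimator: calibrate the cheap counts by the expensive/cheap ratio
# on the sub-sample, then apply that ratio to the full cheap sample.
ratio = sum(expensive) / sum(cheap[:len(expensive)])
total_estimate = ratio * sum(cheap)
print(round(total_estimate, 1))
```

The expensive sub-sample corrects the systematic undercount of the cheap method, while the large cheap sample keeps the variance of the total down.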
Example- Integrating occupancy and abundance estimation
We can illustrate this general approach by connecting methods we are already familiar with, namely occupancy sampling on the one hand, and methods directed at abundance estimation, such as distance sampling or CMR, on the other. Suppose we have a very large number of sites (or equivalently a large area) over which we are interested in estimating total abundance. Potentially we could sample all (or a very large fraction) of the sites, and for each site use an unbiased procedure (e.g., very accurate counts, double observer counts corrected for detectability, distance sampling, or closed CMR models) to estimate N, and then total these estimates to get an estimate of overall abundance. Such an approach would work well in theory, but in practice is likely to be limited to a small number of sites due to cost, time, and other limitations.
Alternatively, we could sample all of the sites (or a very large fraction) using occupancy sampling, which would require us to form replicates to deal with detection, but otherwise should be much cheaper, since we only have to detect 1 or more animals to obtain a detection, and detection can be via sign, etc. (vs., for example, capturing and marking animals in a CMR study). On sites with 1 or more detections, we know that N > 0, but that's it. Or is it? In fact, maybe we know more, as we'll see.
Suppose that we get detection data (say for 5 replicates on each site) over 20 sites (our sample should really be bigger; this is just for illustration). We might have data like this:
detections <- c(4, 3, 2, 2, 4, 2, 4, 2, 3, 3, 3, 5, 4, 3, 4, 4, 2, 3, 5, 0)
Suppose that on the first 10 of these (we will assume that the plots are randomly ordered, so the fact that these are the first 10 doesn't matter) we could very accurately observe abundance, but we have no data for the last 10. Our abundance data look like this:
N <- c(81, 110, 85, 106, 149, 94, 75, 49, 143, 64, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
We are interested in inference on N on all 20 sites, but 10 are 'missing'. We will use a Bayesian model to fill in the rest and to get inference on the total across all sites. We do this by taking advantage of the relationship between abundance and detection under the Royle-Nichols model:
p[i] = 1-(1-r)^N[i]
where p[i] is the detection probability for a single replicate sample when abundance is N[i]. The observed data y[i], the numbers of detections in k replicate samples, are used with a binomial likelihood
y[i]~dbin(p[i],k)
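To see how the Royle-Nichols relationship ties abundance to the detection data, here is a small numerical sketch. The value of r is assumed for illustration, not estimated from the data above:

```python
# Illustrative sketch of the Royle-Nichols relationship p = 1 - (1 - r)^N.
# r here is an assumed per-animal, per-replicate detection probability.
r = 0.01
k = 5  # replicates per site

for N in (10, 50, 100, 150):
    p = 1 - (1 - r) ** N   # site-level detection probability per replicate
    expected = k * p       # expected number of detections out of k replicates
    print(N, round(p, 3), round(expected, 2))
```

Note how p saturates toward 1 as N grows: sites with large N are detected on most replicates, sites with small N on few, which is exactly the signal the model exploits to recover N from detection counts.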
where p[i] is determined from the Royle-Nichols relationship above. Under this model, we expect more detections where abundance is higher, and fewer where it is lower. Essentially we are going to estimate r (the per-animal detection probability) using the data where we have both binomial detections y[i] and abundance data N[i], then predict the 'missing' N[i] via the model, and finally obtain total N by summation. We can also impose prior assumptions on the N[i], in this case allowing for considerable site-to-site heterogeneity in counts via a Poisson-gamma mixture model. The code to simulate the abundance and occupancy data and fit the Bayesian model is here.
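As a rough stand-in for the Bayesian model, the same logic can be sketched with maximum likelihood: estimate r from the 10 sites with known N, then invert the Royle-Nichols relationship at the observed detection frequency to predict the missing N. This is a crude plug-in (no priors, no posterior averaging), shown only to make the information flow concrete:

```python
import math

# Detection counts (5 replicates) on 20 sites; N known on the first 10.
y = [4, 3, 2, 2, 4, 2, 4, 2, 3, 3, 3, 5, 4, 3, 4, 4, 2, 3, 5, 0]
N = [81, 110, 85, 106, 149, 94, 75, 49, 143, 64]
k = 5

def loglik(r):
    # Binomial log-likelihood of y on the sites with known N,
    # with p[i] = 1 - (1 - r)^N[i] from the Royle-Nichols model.
    ll = 0.0
    for yi, Ni in zip(y[:10], N):
        p = 1 - (1 - r) ** Ni
        ll += yi * math.log(p) + (k - yi) * math.log(1 - p)
    return ll

# Grid search for the MLE of r (crude but sufficient for a sketch).
grid = [i / 10000 for i in range(1, 500)]
r_hat = max(grid, key=loglik)

def predict_N(yi):
    # Invert p = 1 - (1 - r)^N at the observed frequency y/k, clamping
    # away from 0 and 1 so the inversion stays finite.
    p_obs = min(max(yi / k, 0.05), 0.95)
    return math.log(1 - p_obs) / math.log(1 - r_hat)

predicted = [predict_N(yi) for yi in y[10:]]
total = sum(N) + sum(predicted)
print(round(r_hat, 4), round(total))
```

The Bayesian model does the same thing more honestly: the uncertainty in r and in each missing N[i] propagates into the posterior for the total, rather than being hidden by plug-in point estimates.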
Conroy et al. (2008) used this basic approach in an adaptive, 2-phase sampling design. The basic idea was that in the first phase we would take occupancy samples on all sites, and if the number of detections exceeded a threshold, go ahead and do CMR sampling. The Conroy et al. (2008) approach is more elaborate, because instead of assuming that the N[i] are observed perfectly (as in the simple example here), the N[i] are updated by CMR data on sites where these exist. Conroy et al. showed that this approach can be more efficient than attempting CMR sampling everywhere, especially where abundance is very heterogeneous, leading to a situation where some sites are very sparsely occupied (and where CMR sampling would yield very poor results).
Combining data structures - (2) joint recapture and recovery data