Determining sample size needed for a single-season study
Before conducting an occupancy study, we should have reasonable assurance that the data we plan to collect will provide estimates that are suitably precise for our study goals. In an occupancy study we are mainly interested in the precision of the estimate of site occupancy probability (psi), or possibly in a precise estimate of the number of sites occupied in a finite study area. The precision of these estimates will, in turn, depend on a number of factors, some under our control and some not:
Actual site occupancy probability (psi) - assumed not under our control
Number of sites in our study area (S) - often not under our control
Detection probability (p) - sometimes under our control (e.g., it can be increased with increased per-sample effort), but we will generally assume that p is not under our control
Number of sites sampled (s) - this may be the total number of sites in our study area (s = S), or (often) a sample of s sites from some larger list of sites (s < S, finite sample), or s < S with S assumed infinitely large. We will assume that our choice of s is under our control
Number of replicates (occasions = k) per site - k is assumed to be under our control.
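To see why replication matters, note that the probability an occupied site is detected at least once in k visits is p* = 1 - (1 - p)^k (this quantity, pstar, appears in the variance function below). A minimal R sketch with an illustrative per-visit detection probability:

```r
# Cumulative detection probability p* = 1 - (1 - p)^k: the chance that an
# occupied site is detected on at least one of its k replicate visits.
p <- 0.3            # per-visit detection probability (illustrative value)
k <- 1:5            # number of replicate visits per site
pstar <- 1 - (1 - p)^k
round(pstar, 3)
# [1] 0.300 0.510 0.657 0.760 0.832
```

Even a modest per-visit p accumulates quickly with replication, which is why k trades off against s in the designs explored below.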
Other things being equal, we will have a fixed amount of resources (money, time), and we will need to decide how to allocate them. If the costs of visiting individual sites and of repeatedly sampling a given site were the same (they usually are not, but assume so for simplicity), then, for example, an s = 100, k = 2 study would take the same effort as an s = 50, k = 4 study, but the two would yield quite different data, with possibly very different precision in estimated psi. So a simplified version of the study design question is:
What combinations of s (sites sampled) and k (replicates per site) will provide us with the precision needed for our study?
We will approach this problem by using a variance approximation based on expectations (MacKenzie et al. 2006) that produces the coefficient of variation (cv = se(psi)/psi) as a function of s, k, psi, and p. We will then fix values of psi and s (or psi and p) and explore combinations of p and k (or s and k) in terms of the expected value of the cv.
First let's specify a directory that will contain any input used or output created and saved.
> #set here the directory where your data files are located
> data_dir<-"C:/mydir"
> setwd(data_dir)
I have written a small user-defined function that computes the cv for given inputs of psi, p, s (nsites), and k (nreps):
>
> sample_approx<-function(psi,p,nsites,nreps)
+ {
+ k<-nreps
+ s<-nsites
+ # probability a site is detected at least once in k visits
+ pstar<-1-(1-p)^k
+ # components of the expected-value variance approximation
+ d1<-(s*pstar)+psi*(1-pstar)*k*p*(1-pstar)
+ d2<-s*pstar*(pstar*(1-p)-k*p*(1-pstar))
+ # approximate var(psi_hat): binomial sampling variation plus a
+ # component due to imperfect detection
+ var_psi<-psi*(1-psi)/s +psi*(1-pstar)/(d1*d2)
+ # return cv = se(psi_hat)/psi
+ return(sqrt(var_psi)/psi)
+ }
>
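As a quick check, calling the function with the first design considered below (psi = 0.8, p = 0.2, s = 100, k = 2) reproduces the first cv in the table that follows. The function definition is repeated here so the snippet runs standalone:

```r
# sample_approx as defined above, repeated so this snippet is self-contained
sample_approx <- function(psi, p, nsites, nreps) {
  k <- nreps
  s <- nsites
  pstar <- 1 - (1 - p)^k
  d1 <- (s * pstar) + psi * (1 - pstar) * k * p * (1 - pstar)
  d2 <- s * pstar * (pstar * (1 - p) - k * p * (1 - pstar))
  var_psi <- psi * (1 - psi) / s + psi * (1 - pstar) / (d1 * d2)
  sqrt(var_psi) / psi
}

# cv for psi = 0.8, p = 0.2, 100 sites, 2 replicates per site
sample_approx(psi = 0.8, p = 0.2, nsites = 100, nreps = 2)
# [1] 0.1473776
```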
I then used this function to produce plots for two types of comparisons of study designs. In the first, we assume a fixed occupancy probability (e.g., psi = 0.8) and number of sites (e.g., s = 100) and then look at the cv achieved by different combinations of detection probability (p = 0.2 to 0.5) and replicates (k = 2, 3, 4, 5).
> #PLOTS
> #1- fixed number of sites and occupancy prob, vary p and n reps
> detect<-c(.2,0.3,0.4,.5)
> nsites<-100
> psi<-0.8
> sim_expts1<-data.frame(nsites=numeric(0),psi=numeric(0),p=numeric(0),nreps=numeric(0),cv=numeric(0))
> for (p in detect)
+ {
+ for (nreps in 2:5)
+ {
+
+ cv<-sample_approx(psi,p,nsites,nreps)
+ cv<-data.frame(nsites,psi,p=p,nreps=nreps,cv=cv)
+ sim_expts1<-rbind(sim_expts1,cv)
+
+ }
+ }
> sim_expts1
nsites psi p nreps cv
1 100 0.8 0.2 2 0.14737756
2 100 0.8 0.2 3 0.07564258
3 100 0.8 0.2 4 0.05927628
4 100 0.8 0.2 5 0.05413498
5 100 0.8 0.3 2 0.07892687
6 100 0.8 0.3 3 0.05617907
7 100 0.8 0.3 4 0.05208666
8 100 0.8 0.3 5 0.05091049
9 100 0.8 0.4 2 0.06035667
10 100 0.8 0.4 3 0.05203719
11 100 0.8 0.4 4 0.05067438
12 100 0.8 0.4 5 0.05028642
13 100 0.8 0.5 2 0.05426001
14 100 0.8 0.5 3 0.05080960
15 100 0.8 0.5 4 0.05025790
16 100 0.8 0.5 5 0.05010235
>
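One way to read this table is to ask, for each detection probability, the smallest number of replicates that achieves a target cv (0.06 below, a threshold chosen purely for illustration). A sketch, with sample_approx repeated so it runs standalone:

```r
# sample_approx as defined above, repeated so this snippet is self-contained
sample_approx <- function(psi, p, nsites, nreps) {
  k <- nreps
  s <- nsites
  pstar <- 1 - (1 - p)^k
  d1 <- (s * pstar) + psi * (1 - pstar) * k * p * (1 - pstar)
  d2 <- s * pstar * (pstar * (1 - p) - k * p * (1 - pstar))
  var_psi <- psi * (1 - psi) / s + psi * (1 - pstar) / (d1 * d2)
  sqrt(var_psi) / psi
}

# Smallest k (among 2-5) meeting an illustrative target cv of 0.06
target <- 0.06
for (p in c(0.2, 0.3, 0.4, 0.5)) {
  cvs <- sapply(2:5, function(k) sample_approx(0.8, p, 100, k))
  k_min <- if (any(cvs <= target)) (2:5)[which(cvs <= target)[1]] else NA
  cat("p =", p, " smallest k with cv <=", target, ":", k_min, "\n")
}
```

With the values in the table above, this reports k = 4 at p = 0.2, k = 3 at p = 0.3 and p = 0.4, and k = 2 at p = 0.5.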
> p_levels<-detect
> lab<-paste("psi=",psi," n sites=",nsites,"p=",p_levels[4],"(solid)","p=",p_levels[1],"-",p_levels[3],"(dashed)")
> lims<-range(sim_expts1$cv)
> with(sim_expts1,plot(nreps[p==p_levels[4]],cv[p==p_levels[4]],type="l",lty=1,main=lab,ylab="CV",xlab="Number of reps",xaxt="n",ylim=lims))
> axis(1, at = 1:5)
> for (px in p_levels[1:3])
+ {
+ with(sim_expts1,lines(nreps[p==px],cv[p==px],lty=2))
+ }
> savePlot(filename="cv_vs_reps.by.p.jpg",type="jpg")
>
Note: savePlot does not work in RStudio. You must either export the plot from the RStudio plot window, or set up the graphics device as an external file. To do the latter, use commands like
> jpeg(file="myplot.jpg")
> plot(x,y)
> dev.off()
This produces the plot of cv versus number of replicates for each detection probability (saved as cv_vs_reps.by.p.jpg).
The second analysis assumes the same psi = 0.8 and a fixed p (e.g., p = 0.3), and then looks at cv vs. number of reps (k = 2 to 5), with one curve for each specified number of sites sampled (s = 25, 50, 75, and 100).
Depending on the situation, one or the other (or both) of these plots (both saved in the working directory) may be useful. Alternatively, users may modify the program to incorporate cost factors and conduct a more formal optimization (e.g., using numerical optimization procedures in R).
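As a sketch of the cost-based extension suggested above: suppose each site costs c0 to set up and each replicate visit costs c1 (both values below are hypothetical, as is the budget B), and we search the budget-feasible (s, k) combinations for the smallest expected cv. The function is repeated so the snippet runs standalone:

```r
# sample_approx as defined above, repeated so this snippet is self-contained
sample_approx <- function(psi, p, nsites, nreps) {
  k <- nreps
  s <- nsites
  pstar <- 1 - (1 - p)^k
  d1 <- (s * pstar) + psi * (1 - pstar) * k * p * (1 - pstar)
  d2 <- s * pstar * (pstar * (1 - p) - k * p * (1 - pstar))
  var_psi <- psi * (1 - psi) / s + psi * (1 - pstar) / (d1 * d2)
  sqrt(var_psi) / psi
}

# Hypothetical cost structure (all values illustrative): c0 = cost to set up
# a site, c1 = cost of one replicate visit, B = total budget
c0 <- 50; c1 <- 10; B <- 10000
psi <- 0.8; p <- 0.3
best <- NULL
for (k in 2:10) {
  s <- floor(B / (c0 + c1 * k))   # most sites affordable with k visits each
  if (s < 2) next
  cv <- sample_approx(psi, p, s, k)
  if (is.null(best) || cv < best$cv) best <- list(s = s, k = k, cv = cv)
}
best  # budget-feasible (s, k) with the smallest expected cv
```

A full treatment would use a formal optimizer (e.g., optim in R) over a finer design grid, but a grid search like this is often adequate for study planning.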
The code for the above analyses is provided here. Finally, as noted, this analysis is based on expectations and variance approximations. I have also provided here alternative code based on Monte Carlo simulation and more exact maximum likelihood estimation.