Setting up an occupancy analysis in R
Although there are several choices for accessing occupancy modeling from within R, the unmarked package provides an easy framework for MLE analysis. Additionally, the way models are formulated in unmarked is very similar to the syntax of methods used for building linear models (lm) and generalized linear models (glm) in R, and so should already be familiar.
First, make sure you are running a reasonably current version of R. If you are not, please update to the latest version.
Second, because unmarked is not part of the standard installation of R, you will need to go to one of the CRAN mirrors and install the package before you can build models. This is done by going to the menu and selecting Packages > Install package(s). Alternatively, the command
>install.packages("unmarked")
can be entered either in the console or from a script file.
Once the package is installed it resides on the hard disk of your computer and does not need to be reinstalled (unless there are updates), but will need to be loaded each time R is run using the command
>library(unmarked)
Again, this can be entered each session from the command line or a script file (easier), but if you find that annoying you can have the package loaded automatically as part of the startup process by modifying the R profile. I don't do this routinely, in part because it's not a good idea to load a bunch of packages into memory unless you are actually going to use them in that session.
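If you do decide you want unmarked loaded automatically, one way (a minimal sketch of the standard R startup mechanism; adapt or skip as you see fit) is to define a .First() function in a .Rprofile file in your home or working directory:
# contents of .Rprofile (a file, not console input) -- sketch only
.First <- function() {
  library(unmarked)  # load unmarked at the start of every session
}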
A certain amount of help is available via the built-in command
>?unmarked
which provides documentation in html format. A pdf manual provides somewhat more detailed coverage, and an accompanying article provides further elaboration.
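If you prefer to browse, the standard R help utilities below will open the package index and list any installed vignettes (these are base R / utils functions, not part of unmarked itself):
> help(package = "unmarked")
> vignette(package = "unmarked")
> browseVignettes("unmarked")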
Once you have the programs organized, the first step in starting an analysis is to get the data organized so that the program can read it. An easy format to work with -- easy because it is simple to create from other programs like Excel or Access, and because it is easily digestible by unmarked -- is the familiar comma-delimited text file (*.csv) that we've been working with. For an occupancy analysis, the input data fall into 3 categories:
Detection histories in which the rows of the data will usually be sites and the columns will represent detection occasions or other units of replication,
Site-specific covariates, which are attributes measured on each of the sites that potentially explain site-to-site variability in detection and/or occupancy, and
Sample-specific or "observation level" covariates, which are attributes measures at each of the sample occasions (replicates)
In the simplest cases (our first examples, at least) site-specific covariates do not vary over time (so are fixed in time for a site) and sample-specific covariates do not vary among sites, but more generally site-specific covariates can also vary over sample occasions, and conversely sample-specific covariates can vary among sites. Also, in our example we are going to read the detection histories and the covariates (at least the site-specific ones) from the same data file, but in many cases it is easier to store these different types (and dimensions) of data in separate files.
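As a sketch of that alternative (the file names here are hypothetical, not files provided with this exercise), the reads might look like:
> #hypothetical separate input files -- sketch only
> y <- read.csv("blgr_detections.csv")
> blgr.site <- read.csv("blgr_site_covariates.csv")
> blgr.obs <- read.csv("blgr_obs_covariates.csv")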
We will illustrate the setup of an occupancy analysis with data for Blue Grosbeaks (Guiraca caerulea) on 41 old fields planted to longleaf pines (Pinus palustris) in southern Georgia, USA. Surveys were 500 m transects across each field and were completed three times during the breeding season in 2001. The data are located in a comma-delimited text file. The first column (which is optional) is a label for each site, simply numbered 1-41. The next 3 columns are the detection histories for each site on each of 3 occasions during the 2001 breeding season. For now we are going to ignore the covariates and other information in the spreadsheet, and focus on the encounter histories for each site (columns B, C, and D, starting in row 3 and running to row 43; 41 sites total).
To read the data in, R needs to know where the file is located on your computer. An easy way to accomplish this is to establish a 'working directory' that will house the data and receive any output files created. I use commands like these, included in my R script:
>data_dir<-"C:/mydir"
>setwd(data_dir)
The first line creates an alias (data_dir) for the folder C:/mydir, which I can change at any time. The second line sets the working directory to that folder.
I am first going to read in just the detection histories (columns 2-4). The commands to do this are:
>data<-read.csv("blgr.csv")
>#detection data rows are sites columns are detection replicates
>y<-data[,2:4]
>n<-nrow(data)
The penultimate line creates an object that contains just the detection histories; it is a 2-dimensional array with 41 rows and 3 columns. The last line is simply a counter that I can use later for dimensioning other parts of the data (like covariates).
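A quick check that I find useful at this stage (my addition; not strictly necessary) is to confirm the dimensions and peek at the first few rows:
> #confirm 41 rows and 3 columns of detection data
> dim(y)
> head(y)
> n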
Next, let's read in the site-specific covariates in columns 5-9.
>#site level (individual) covariates
>blgr.site<-data[,5:9]
This is also a 2-dimensional array, but with 41 rows and 5 columns.
There are no time-specific covariates per se in this analysis, but we are going to want to test for time-specific variation in detection. To do that, we will need to create a time variable that behaves like a factor (rather than a numerical value) in R. We could accomplish this by creating dummy variables in a data frame, but it is easier to define a factor in R.
The code to do this is simple:
> #create time factor and use as covariate
> #observation level (time specific) covariates
> time<-as.factor(rep(c(1,2,3),n))
> blgr.obs<-data.frame(time)
>
We confirm that time is a factor:
> time
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1
[38] 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2
[75] 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
[112] 1 2 3 1 2 3 1 2 3 1 2 3
Levels: 1 2 3
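As a quick sanity check (my addition, not in the original script), the factor should contain one value per site-by-occasion combination, i.e. 41 x 3 = 123 entries:
> #should equal n*3 = 123 observation-level values
> length(time)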
Finally, we put the detection histories and covariates together into a single unmarked data frame called blgr:
>#put everything together in unmarked data frame
>#note that covariates can come from separate files
>blgr <- unmarkedFrameOccu(y = y, siteCovs = blgr.site,obsCovs=blgr.obs)
which can be summarized by
>#summary of unmarked data frame
>summary(blgr)
This command creates a table of summary statistics, always useful for making sure the data have been read in correctly (e.g., I should expect only 0s and 1s in the y matrix, plus NAs for any sites not sampled on some occasions, but no other values).
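Another check I sometimes add (again, not part of the original script) is to tabulate the raw detection matrix directly; the table should contain only 0s and 1s, plus NAs if some site-occasion combinations were not sampled:
> #tabulate detection values: expect only 0, 1, and possibly NA
> table(as.matrix(y), useNA = "ifany")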
The blgr object is actually not a standard data frame like we created above, but a special type of R object (an S4 object) that is directly readable by the unmarked package as an occupancy data frame. At this point we could save this object and, if we wanted to, throw away the original data (I don't recommend that!). A few points:
Again, we could have read the covariates in from separate files;
Covariates are not required if all we are doing is fitting the simplest occupancy model (no site effects, constant detection over sample occasions). In that case we could have specified
blgr <- unmarkedFrameOccu(y = y)
The fact that we had to create a time covariate (a factor, which R expands into dummy variables) in order to model a time effect (which we'll do later) will be annoying to people used to the pre-formed models in MARK and PRESENCE. It is simply the price you pay for the much greater flexibility -- and similarity to glm modeling -- you get in unmarked.
Finally, you may have wondered about the last 3 columns of the csv input file, which were not read into the unmarked frame. These are counts of animals detected in each detection period, and are not used in the basic occupancy analysis. We will revisit these when we consider analyses that deal with abundance estimation via repeated counts.
Basic model: homogeneous detection and occupancy
The single-season models we are running at this point are implemented in unmarked using the function occu(). Once we've set up the unmarked occupancy data frame, it is very easy to create and run models. To start, the constant detection and occupancy model (no time, no covariate effects) is created by
>fm1<-occu(~1 ~1,blgr)
This syntax will be used over and over again, so let's get familiar with it. In general it is
model<-occu(~detection_formula ~occupancy_formula, dataframe)
where the detection and occupancy formulas are in the form of generalized linear models. In this case, to be explicit, we are modelling both detection and occupancy as intercept-only models (~1) using a logit transformation, so for example detection probability is
logit(p) = β0, that is, p = exp(β0) / (1 + exp(β0)),
with a corresponding, simple relationship for occupancy probability Ψ:
logit(Ψ) = α0.
In fact, occu() provides estimates of these 2 intercepts (β0 for detection, α0 for occupancy), so if (as is usually the case) we want estimates of p and Ψ we need to perform an inverse logit (expit) transformation. Unmarked has a built-in backTransform() function, which we can use here. Back to the model:
>#CREATING MODELS
>#simple occupancy model
>fm1<-occu(~1 ~1,blgr)
When we type the model name fm1, we see:
> fm1
Call:
occu(formula = ~1 ~ 1, data = blgr)
Occupancy:
Estimate SE z P(>|z|)
2.04 0.751 2.71 0.00666
Detection:
Estimate SE z P(>|z|)
0.206 0.242 0.851 0.395
AIC: 172.1898
So we get estimates and an AIC value, but the estimates are on the logit scale. The built-in function backTransform() will provide the real-scale estimates of detection ('det') and occupancy ('state'):
> #back transformations
> backTransform(fm1,'det')
Backtransformed linear combination(s) of Detection estimate(s)
Estimate SE LinComb (Intercept)
0.551 0.0599 0.206 1
Transformation: logistic
> backTransform(fm1,"state")
Backtransformed linear combination(s) of Occupancy estimate(s)
Estimate SE LinComb (Intercept)
0.885 0.0766 2.04 1
Transformation: logistic
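As a cross-check (this step is my addition, using base R's inverse-logit function rather than backTransform()), applying plogis() to the logit-scale coefficients should reproduce the same real-scale values:
> #manual inverse-logit of the logit-scale estimates
> plogis(coef(fm1, type = "det"))    #should match 0.551
> plogis(coef(fm1, type = "state"))  #should match 0.885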
Finally, unmarked provides an Empirical Bayes estimate and confidence interval for the number / proportion of study sites occupied. This method essentially uses the estimate of psi, together with the detection data, to characterize the distribution of the 0/1 occupancy indicators z for each site (under the premise that z ~ Bernoulli(psi)); summing the modes across sites gives a point estimate, and summing the lower and upper quantiles provides an empirical CI on the number / proportion of occupied sites. To generate this for the first model's estimates we use
>
> #empirical Bayes estimate of proportion of sites occupied
> s<-nrow(y)
> re <- ranef(fm1)
> EBUP <- bup(re, stat="mode")
> CI <- confint(re, level=0.95)
> print("95 % EB interval on number sites occupied")
[1] "95 % EB interval on number sites occupied"
> rbind(s_occup = c(Estimate = sum(EBUP), colSums(CI)) )
Estimate 2.5% 97.5%
s_occup 33 33 41
> print("95 % EB interval on proportion of sites occupied")
[1] "95 % EB interval on proportion of sites occupied"
> rbind(PAO = c(Estimate = sum(EBUP), colSums(CI))/s )
Estimate 2.5% 97.5%
PAO 0.804878 0.804878 1
If you compare this statistic to the back-transformed estimate of psi you'll notice it is somewhat lower. Conceptually, it represents the proportion of the finite sample of sites (41 in this case) that is occupied, whereas psi estimates the probability of occupancy over a hypothetically infinite list of sites.
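To make the comparison concrete (my addition, reusing objects created above), you can put the two quantities side by side:
> #finite-sample proportion of the 41 sites occupied (EB) vs. estimated psi
> sum(EBUP)/s
> plogis(coef(fm1, type = "state"))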
Adding covariates to detection and occupancy modeling
Given the data frame setup we've already established, it is quite straightforward to add models containing covariates for predicting detection and occupancy. I have created 5 more models with different combinations of detection and occupancy covariates:
>#time specific detection, constant occupancy
>fm2<-occu(~time ~1,blgr)
>#constant detection, occupancy predicted by bqi
>fm3<-occu(~1 ~bqi,blgr)
>#time specific detection, occupancy predicted by bqi
>fm4<-occu(~time ~bqi,blgr)
>#constant detection, occupancy predicted by log(field size)
>fm5<-occu(~1 ~log(field.size),blgr)
>#constant detection, occupancy predicted by log(field size)+bqi
>fm6<-occu(~1 ~log(field.size)+bqi,blgr)
We could build more (if they are a priori sensible) but let's stop here.
At this point we are going to want to summarize and compare the models and make some choices. One approach is to use a built-in model selection procedure. Applying that here we have
>fmlist<-fitList(p_dot_psi_dot=fm1,p_t_psi_t=fm2,p_dot_psi_bqi=fm3,p_t_psi_bqi=fm4,p_dot_psi_field=fm5,p_dot_psi_field_bqi=fm6)
> #model selection -- ranked list of models and AIC
> modSel(fmlist)
nPars AIC delta AICwt cumltvWt
p_dot_psi_dot 2 172.19 0.00 0.381 0.38
p_dot_psi_bqi 3 173.12 0.93 0.239 0.62
p_dot_psi_field 3 173.92 1.73 0.160 0.78
p_dot_psi_field_bqi 4 175.11 2.92 0.088 0.87
p_t_psi_t 4 175.30 3.11 0.081 0.95
p_t_psi_bqi 5 176.23 4.04 0.051 1.00
>
The above indicates that the model with highest support is the null p(.) psi(.) model. You can see estimates under individual models by typing the model name, e.g.,
> fm3
Call:
occu(formula = ~1 ~ bqi, data = blgr)
Occupancy:
Estimate SE z P(>|z|)
(Intercept) 2.69 1.41 1.909 0.0563
bqi -1.39 1.55 -0.898 0.3690
Detection:
Estimate SE z P(>|z|)
0.206 0.242 0.851 0.395
AIC: 173.1211
However, these estimates are on the logit scale. You can obtain real-scale estimates via the backTransform() function, but when covariates are present you will need to specify a value for the covariates for use in the linear function that predicts p or psi. What we are going to do instead is take a model averaging approach with basically these steps:
We specify a quantity that we are trying to predict for each model (e.g., psi)
We specify a value or range of values for the covariates; often these will be averages over the data but they can be any value.
We generate predictions under each model and compute a weighted average where the weights are the AIC weights.
We compute a standard error and confidence interval on the linear scale
Finally, we transform the estimate and CI to the real (probability) scale
We will follow these steps for the example data and model set.
Multi-model comparison and model averaging
Unmarked has built in functions that make the task of model comparison and averaging easier. There are essentially 3 steps involved:
Compute AIC and model weights for a specified list of models. This is done by first specifying a list of models that have already been fitted. For our example we have
>fmlist<-fitList(p_dot_psi_dot=fm1,p_t_psi_t=fm2,p_dot_psi_bqi=fm3,p_t_psi_bqi=fm4,p_dot_psi_field=fm5,p_dot_psi_field_bqi=fm6)
Then run the list through a model selection function
>modSel(fmlist)
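If you want to keep the AIC table for a report, the modSel() help in unmarked shows that the result can be coerced to an ordinary data frame and written out (worth double-checking against your installed version):
> #export the model selection table (optional)
> ms <- modSel(fmlist)
> out.df <- as(ms, "data.frame")
> write.csv(out.df, "blgr_modsel.csv", row.names = FALSE)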
Finally, we specify a list of covariate values for which we want to predict responses, and the parameter type (occupancy, detection) that we wish to predict. The simplest thing is to use the site-level covariates observed in the data to produce predictions:
> #predict over observed covariates
> new<-siteCovs(blgr)
> preddata.psi <- predict(fmlist, type = "state",new=new,appendData=T)
The data frame preddata.psi now contains the predicted occupancy, SE, and confidence interval for each combination of site covariates in the data. Frequently, however, we are interested in predictions at just certain levels of the covariates, not necessarily those observed in the data. For example, we can look at 6 combinations of field size and bqi by:
> #create a new data frame to use for prediction
> new<-data.frame(field.size=rep(c(5,10,50),2),bqi=c(rep(0,3),rep(1,3)))
> #model averaged predictions for combinations of bqi and field size
> preddata.psi.new<- predict(fmlist, type = "state",new=new,appendData=T)
> preddata.psi.new
Predicted SE lower upper field.size bqi
1 0.9115003 0.08522441 0.5522803 0.9840954 5 0
2 0.9066233 0.08262643 0.5796188 0.9817112 10 0
3 0.8887186 0.11740230 0.4768348 0.9823565 50 0
4 0.8558248 0.12375452 0.5068063 0.9693401 5 1
5 0.8503238 0.11271154 0.5497560 0.9648285 10 1
6 0.8309011 0.14531345 0.4658041 0.9668595 50 1
Note that we enter the field size values in original units (ha), since the log transformation is done within the model formula.
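A quick base-R plot (my addition, just a sketch) is one way to visualize how the model-averaged occupancy predictions vary with field size for the two bqi levels:
> #plot model-averaged psi against field size; open circles bqi=0, filled circles bqi=1
> with(preddata.psi.new, plot(field.size, Predicted, ylim = c(0, 1),
+   pch = ifelse(bqi == 1, 16, 1), xlab = "Field size (ha)", ylab = "Predicted occupancy"))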
All of the above steps are saved in the attached R script file, which you can modify for other applications.