CJS Models
The CJS (Cormack-Jolly-Seber) model is the backone of many of the open-population CMR models we will deal with, so a firm understanding of this model and its applications is essential. I will illustrate the CJS model with a well known example of ringing (banding) and live recapture data for the European Dipper. This data example has been used many times to illustrate CJS and was a key example in the Lebreton et al. (1992) Ecological Monograph. The data set consists of 7 marking and recapture occasions for male and female Dippers and will be used to illustrate:
Basic model structure of time (vs.cohort) dependency in survival and recapture probabiities
Use of grouping covariates (in this case, sex) , a special case of individual covariates
Use of time-specific covariates to model variation in survival or recapture
Evaluation of model fit and comparison of alternative models
I will take you through the basic steps of building CJS models in RMark below. More details are provided in Appendix C of Evan Cooch's online Gentle Introduction to MARK.
The Dipper data
First, make sure that the RMark library has been properly loaded into memory. If not already loaded use the command
>library(RMark)
The Dipper data are stored in an RMark dataframe. Recall that we discussed earlier how to format CMR data for RMark, and I will assume that your data are in this format, as are the Dipper data. Load these data using the command
> data(dipper)
We can display the data
>dipper
> dipper
ch sex
1 0000001 Female
2 0000001 Female
3 0000001 Female
4 0000001 Female
5 0000001 Female
6 0000001 Female
7 0000001 Female
8 0000001 Female
......
263 0110000 Male
264 0110110 Female
265 0111000 Female
266 0111000 Male
267 0111100 Female
268 0111100 Female
etc
What we see is that each line contains a capture history character string (1's and 0's) (7 columns in total), followed by a character "Male" or "Female" denoting sex of the bird. The capture occasions are yearly trapping efforts starting in 1980. We use this information together with the knowledge that we will use CJS modeling to "process" the data in a form that will provide more meaningful information for analysis:
#process dipper data to CJS format
dipper.processed=process.data(dipper,model="CJS",begin.time=1980,groups="sex")
This will, for example, allow us to use "sex" instead of a generic group variable for modeling, and will keep track of occasions by actual calendar years, making output more interpretable. Finally, we will create a "design data" (*.ddl) by the command
> dipper.ddl=make.design.data(dipper.processed)
Although not absolutely necessary (a ddl file is created by default in the mark() command), this step is good practice and will provide us with some useful details we can come back to later.
Creating models
Models are (deceptively) easy to create in RMark; we can create our first simple model by just using the above 2 ingredients (the processed data and ddl) as arguments in the mark() command:
> my_first_model<-mark(dipper.processed,dipper.ddl)
Several things happen as a result of this command:
You will get a fair amount of screen output describing the model, estimates, and so on.
The results (and data) will all be stored in an object called my_first_model
Several files with prefixes mark001, mark002, etc. will be created in your working directory
We will talk mostly about the first 2 of these; the 3rd will be handy for more advanced applications (or if you want to port your analysis over to the MARK GUI).
Let's first look at the screen output first few lines:
> my_first_model=mark(dipper.processed,dipper.ddl)
STOP NORMAL EXIT
Output summary for CJS model
Name : Phi(~1)p(~1)
Npar : 2
-2lnL: 666.8377
AICc : 670.866
These lines indicate, first, that a normal run (no errors) was achieved; second, they tell us what specific model was fit for the CJS analysis, and then the provide a number of additional statistics (number of parameters, deviance, AICc). In particular note that the model
Phi(~1)p(~1)
indicates that only intercepts were used to model variation in Phi and p. You may wonder "where did this model come from" and the answer is "it is the default model used by mark if a CJS model structure is specified (you did that in the "process" step). As a general rule, I recommend explicitly creating models, which we will do below. But first let's look at some of the other output. First, we have the "beta" estimates. These are estimates of parameters on the logit scale, and in this case are simply the 2 intercepts (for Phi and p), along with SEs and CIs.
Beta
estimate se lcl ucl
Phi:(Intercept) 0.2421484 0.1020127 0.0422034 0.4420933
p:(Intercept) 2.2262661 0.3251094 1.5890517 2.8634805
Following this are a series of tables containing values for the "real" (i.e., back-transformed) parameters Phi and p. Notice that these are in the form of diagonal matrices, with the rows and columns labeled by years (1980-86). Specifically, the rows stand for the release cohort (year in which a group of marked animals were released together and the columns for either the encounter occasions (p) or the interval over which survival occurred (Phi). There is a separate matrix for each parameter type, and for each group (sex) , so 4 in all (2 for Phi and 2 for p) in this case. The values in these matrices line up with the corresponding parameter index in what are known as the Parameter Index Matrices (PIMS) . We will get into these in more detail below, but briefly they will represent how parameters are potentially varying over groups, occasions, or cohorts (or combinations of these things). One thing you might have noticed is that all the values in the Phi and p matrices are a identical. This is because the model --which we will specify in the Design Matrix-- specifies no variation in any of these factors. RMark however starts from a very general premise that you may want to allow all kinds of variation in these parameters-- so these are placeholders to allow that variation. In fact, a simplified version of the PIMS would have only a single value for each of Phi and p (2 parameters), and would provide the same results.
Let's create some additional, more interesting models. These models will be based on how, given the data structure, we might expect the parameters to vary over occasions (time) or between groups (sex). For now , we are not going to allow variation due to the cohort of release, that is, we are going to use standard CJS assumptions that surviving animals from previous cohorts are captured and survive with the same probabilities as new releases. Later we will relax this under assumptions that allow for cohort (usually this means age) dependency.
We will represent these types of variation for each parameter by creating formulas that should look familiar by now-- because they follow standard R formatting for generalized linear models. For Phi we will have:
> Phi.dot=list(formula=~1)
> Phi.sex=list(formula=~sex)
> Phi.t=list(formula=~time)
> Phi.sex.t=list(formula=~sex+time)
> Phi.sex.T=list(formula=~sex*time)
representing respectively no variation (intercept only), variation by sex only, sex + time additive model, sex*time interactive model. Likewise for p we have
> p.dot=list(formula=~1)
> p.sex=list(formula=~sex)
> p.t=list(formula=~time)
> p.sex.t=list(formula=~sex+time)
> p.sex.T=list(formula=~sex*time)
We can put these together to form 25 alternative models (5 Phi * 5 p). I'll just show the commands for the first 5 but the rest are in the Rscript file I created.
>mod1=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.dot))
>mod2=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.sex))
>mod3=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.t))
>mod4=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.sex.t))
>mod5=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.sex.T))
After running all 25 models, we can use the collect.models() built-in function to sort them out by AIC values. This function just looks for all the models that have been run and are still in memory, so remember that we actually ran mod1 twice (the first time as "mymodel" with the default intercept values). We don't want this to appear twice in the AIC tables, so let's remove one (doesn't matter which)
>rm(myexample)
>collect.models()
I have saved the screen output to a text file. Ignore the first column (an arbitrary index for each model) and focus the AIC values and AIC weights. The models are ordered by ascending AIC values and descending weights, with the "best" models first. Clearly the simple Phi(~1)p(~1) model ranks best, but is closely followed by models allowing sex variation in Phi or p. On the other hand, most of the models incorporating time effects rank fairly low. We will discuss later how to deal with model uncertainty (e.g., model selection, model averaged estimation).
Keep in mind that we still have the model objects for any model we didn't delete (remove), and these can be used to generate summaries and other information. For instance
>summary(mod10)
produces a table of summary values for mod10
>mod10$results
produces a detailed set of results including variance-covariance matrices.
>mod10$output
tells us where the MARK files are associated with this model (handy if we want to port things over to MARK).
>attributes(mod10)
gives a full set of attributes of this object.
Now, some of you have used MARK before and may have even run the Dipper example. If so, you know that you could use the MARK GUI to create the above models (and more) as "pre-defined models", by manipulating PIM charts, design matrices, etc. So maybe you are not impressed that we just wrote a bunch of script to do the same thing. In my mind there are several reasons why you might actually prefer the RMark approach:
By and large RMark forces you to explicitly come to grips with what the model is, in notation that is fairly clear, something that may not happen at all if you auto-magically create a bunch of models
As you build more and more complex models, and many of them, you will eventually reach a point where creating models in the MARK interface becomes tedious and error prone.
The graphical users interface (GUI) in MARK places a certain amount of additional overhead on computing resources and in extreme cases can lead to crashes. RMark eliminates this overhead while interacting directly with the numerical guts of MARK.
Once you build a basic set of model statements many models can be created quickly by combining scripts, as illustrated in the above example. Once I built the 5 statements describing variation in Phi and p, and incorporated one set of these into the mark() function, I was able to quickly create 25 models by copying, pasting, and a relatively modest amount of text editing. It took all of 10 minutes, and I type slowly.
Script writing builds character and will make you a better person.
PIMS and Design matrices: a closer look
As mentioned earlier, RMark allows for extremely flexible and general modeling of CJS and other data. Consequently, there is a certain amount of imbedded complexity that the average user may or may not need to know about. Nevertheless, there are some basics that you should know about. I will try to present these in as unconfusing a way as possible, returning to the Dipper example to illustrate various points. Taking that example, let's consider what was accomplished by the following steps
> data_dir<-"C:/users/mike/Dropbox/teaching/WILD8390/spring2014/CJS"
> setwd(data_dir)
> rm(list=ls())
>
> library(RMark)
> #load dataframe containing dipper example
> data(dipper)
> #process dipper data to CJS format
The first few lines took care of some basic tasks such as setting a working directory, loading the RMark library, and loading the Dipper data. The line
> dipper.processed=process.data(dipper,model="CJS",begin.time=1980,groups="sex")
established that we are following a basic CJS (vs. some other) model structure, that the first time occasion is to be labeled as 1980, and that there is a grouping variable named "sex". The line
> dipper.ddl=make.design.data(dipper.processed)
creates a very general design data structure for these data and this model. If you examine this data structure (captured here in a text file) you can see that 1) it is composed of 2 sublists, one for Phi and one for p, and 2) for each there are 42 rows. The rows correspond to the "unsimplified Parameter Index Matrix (PIM)" that we will look at in more detail below; the columns are various parameter and model indices and factors that potentially allow modeling by group, time, cohort, age, and other factors. This data structure is built using the make.design.data() function, and can be added to later on as we will see (for instance, to allow inclusion of other covariates). When we build models that include various terms (time, sex, etc) these terms must reference the columns in this dataframe (e.g., if you mis-spell sex as Sex rmark will not know what you are talking about).
Let's build a few models to show how this works in practice. Again, we can define a handful of formulas for Phi and p
> Phi.sex=list(formula=~sex)
> Phi.t=list(formula=~time)
> Phi.sex.t=list(formula=~sex+time)
> p.sex=list(formula=~sex)
> p.t=list(formula=~time)
> p.sex.t=list(formula=~sex+time)
and use these to build some models. Let's build 3 models, in descending order of complexity:
>#create a model with sex+time for both phi and p
>mod1=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.sex.t,p=p.sex.t))
>#create a simpler model sex only for phi sex+time for p
>mod2=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.sex,p=p.sex.t))
#create an even model with phi.sex and p.sex
>mod3=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.sex,p=p.sex))
We're going to ignore the parameter estimates and so on for now, and concentrate on the parameter and model structures for each of these models. Taking mod1, the most complex of the 3, let's first examine the "unsimplified" PIMs.
>PIMS(mod1,"Phi",simplified=FALSE)
>PIMS(mod1,"p",simplified=FALSE)
These contain 84 parameter indices, allowing group,time, and cohort specificity in both Phi and p, far more detail than is going to be supported by data or even statistically identifiable. In fact, the unsimplified PIMs for all our models will have this same dimension. But, even our more complex models are going to simplify this structure somewhat by collapsing some dimensions. A standard simplification in CJS models is to specify that Phi and p depend on time (occasion) but not release cohort. We can display the simplified PIMS by
>PIMS(mod1,"Phi",simplified=T)
>PIMS(mod1,"p",simplified=T)
When you do this you will see that the indices change across columns (occasions) in each group, but no longer across rows; this results in a reduction from 84 to 24 parameter indices, and represents the full group*time CJS model.
However, we are not done-- the design further operates on the PIMS to create the model specified in our model script. Recall that this model is sex+time for both Phi and p, and so lacks the interaction term. We can see the design
>#rows for Phi
>mod1$design.matrix[1:12,]
>#rows for p
>mod1$design.matrix[13:24,]
You will see that this design matrix has 24 rows (1 for each parameter in the PIMs) but only 14 columns. The columns relate to the parameters of the (logit-scale) linear model used to create the model. E.g., all the Phi rows have a 1 in the intercept column, a 1 or 0 in the sex column (0 if female, 1 if male), and a 1 or 0 in each time occasion column denoting which time occasion is being modeled (the first occasion being represented by the intercept).
Let's continue with the slightly simpler model, mod2. This model considers sex but not time variation for Phi, and sex+time variation for p. Display the simplified PIM structure for this model
>#simplified PIM structure
>PIMS(mod2,"Phi",simplified=T)
>PIMS(mod2,"p",simplified=T)
You will see that only 2 parameter indices are now needed for Phi (one for each sex), but 12 are needed for p, for 14 in total. The design matrix then operates on this simpler model; it has 14 rows (one for each parameter index) but only 9 columns: an intercept and sex column for Phi; intercept, sex, and time columns for p. If you get a summary of the estimates of this model
>summary(mod2)
you will see that there are 9 "Beta" parameters estimated; these estimates correspond to the logit-scale intercept and coefficients for each of the columns of the design matrix, and are used to derived the 'real' parameter estimates by inverse logit tranformation. Finally, consider mod3, with only group effects for Phi and p. The simplified PIM structure
>PIMS(mod3,"Phi",simplified=T)
>PIMS(mod3,"p",simplified=T)
has just 4 indices, and the design matrix
> mod3$design.matrix
has 4 rows and 4 columns, so in fact the design matrix has not really changed anything (there are 4 estimated parameters).
All this seems fairly complicated but is actually relatively straightforward and is done behind the scenes for you if you follow these steps:
Correctly specify the data and model structure in the process statement
Make sure all model terms that are used are included in the design data (again, this is done automatically for simple cases, but will require some manipulation for more complex ones)
Write out the terms correctly for the linear logit models
Just keep in mind that
The actual model is the one that results from the computations in the design-PIM matrices
The number of estimable parameters in the model is always the same as the number of columns in the design matrix.
I have included the above commands in an R script file for your convenience.
Power Point Presentation displaying PIMS and Design Matrices for several models for the Dipper example
Including covariates in the model
There are 2 different types of covariates that we can consider in a CJS analysis and they are handled differently.
Design covariates relate to the specified model structure and include time, cohorts, and groups (e.g., sex in the Dipper example). These are handled by manipulating the design data structure (ddl)
Individual covariates are attributes that vary from individual animal to animal (e.g., mass upon capture). Individual covariates remain with the individual encounter history file (as additional columns on that file)
As pointed out in Appendix C, this is not a truly clear-cut separation, since obviously some attributes like sex do vary from animal to animal. For most purposes, design covariates are attributes that are categorical and which lend themselves to grouping animals, and individual covariates are numerical attributes (length, mass, etc.). We already have seen an example of individual covariates, in the known fates analysis of black ducks in Week 8. Here I use the Dipper example to illustrate a couple of ways that design covariates can be included.
Lebreton et al. (1992) incorporated one of the first (if not the very first) uses of a time-specific covariate in CJS analysis. They observed that years between 1980 and 1986 were either flood years (1981, 1982) or normal years, and created an indicator variable =1 if the year was flood and 0 if normal. The code to incorporate this into the dll requires some fairly simple R data handling tricks:
> dipper.ddl$Phi$Flood=0
> dipper.ddl$Phi$Flood[dipper.ddl$Phi$time==1981 | dipper.ddl$Phi$time==1982]=1
The modified dll is then used with a series of model formulas and definitions to create and compare 3 alternative models;
>Phi.Flood=list(formula=~Flood)
>Phi.dot=list(formula=~1)
>Phi.time=list(formula=~time)
>p.dot=list(formula=~1)
>dipper.phi.flood.p.dot=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.Flood,p=p.dot))
>dipper.phi.t.p.dot=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.time,p=p.dot))
>dipper.phi.dot.p.dot=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.dot))
>collect.models()
The results indicated fairly strong evidence (AIC weight of 0.89) for flood as a predictor of survival, with this model performing better than either the Phi.dot model or the model with generic variation in time (Phi.time). The beta estimate of -0.5599740 (95 % CI= -0.9840706 ,-0.1358773) indicates that survival was lower in flood than normal years.
I also used the artificial "effort" variable constructed in Appendix C to illustrate how slightly more complicated covariates can be incorporated. Here a capture effort variable was created in a dataframe
> df=data.frame(group=c(rep("Female",6),rep("Male",6)),time=rep(c(1981:1986),2), effort=rep(c(10,5,2,8,1,2),2))
>df
group time effort
1 Female 1981 10
2 Female 1982 5
3 Female 1983 2
4 Female 1984 8
5 Female 1985 1
6 Female 1986 2
7 Male 1981 10
8 Male 1982 5
9 Malco 1983 2
10 Male 1984 8
11 Male 1985 1
12 Male 1986 2
This is then merged with the dll using a built-in function merge_design.covariates():
> dipper.ddl$p=merge_design.covariates(dipper.ddl$p,df,bygroup=TRUE)
Then parameter formulas are constructed involving this covariate as a predictor of p, and used to construct some additional models
> p.effort.plus.sex=list(formula=~effort+sex)
> p.effort=list(formula=~effort)
>
> dipper.phi.dot.p.effort.sex=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.effort.plus.sex))
>dipper.phi.dot.p.effort=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.effort))
>dipper.phi.flood.p.effort=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.Flood,p=p.effort))
>collect.models()
In this example, the Phi(~Flood)p(~effort) now winds up being one of the top models, indicating some support for the effort covariate.
I have included these examples in an R script for convenience.
Model fit, model selection, and model averaging
Model fit
As seen in earlier examples (e.g., the closed CMR models in Week 6) there are a number of approaches for goodness of fit. CJS models have actually received a lot of attention in this area, and some well-developed procedures are available. I have already mentioned the deviance / df approach, and that can work reasonably well for CJS models. Program RELEASE conducts a series of chi-square contingency table tests for standard CJS models, in which a goodness of fit test is produced for the general groups*time model (the most complex model we have created so far). The commands to perform the RELEASE gof test in RMark for the dipper example are:
> dipper.processed=process.data(dipper,groups=("sex"))
> release.gof(dipper.processed)
RELEASE NORMAL TERMINATION
Chi.square df P
TEST2 7.5342 6 0.2743
TEST3 10.7735 15 0.7685
Total 18.3077 21 0.6295
The test suggests quite good model fit. By comparison a computation of deviance/ df for that model (mod25) provides
> mod25$results$deviance/mod25$results$deviance.df
[1] 3.761788
which suggests some lack of fit or overdispersion. We will pursue this further with a parametric bootstrap test of fit.
Multi-model inference
In general, we are not going to want to assume that a single model is operative, but rather make inference (estimate parameters, make comparisons, make predictions) taking into account model uncertainty as expressed by the AIC weights. We already did this for known fates (illustrated by the black duck problem); the approach is very similar under CJS. I illustrate this with an example from Great Tits, a titmouse ringed (banded) in Spain as part of a study of selection for "tie size" (size of a black breast band), a marking on males that seems to be correlated with survival (Senar et al. in press). Here are the data (in MARK format) for adult and subadult Great Tits, and the R code to format the data for RMark, read individual covariates, conduct age-specific CJS analysis, and compute and plot model-averaged predictions of survival.