CJS Models

The CJS (Cormack-Jolly-Seber) model is the backbone of many of the open-population CMR models we will deal with, so a firm understanding of this model and its applications is essential. I will illustrate the CJS model with a well-known example of ringing (banding) and live-recapture data for the European Dipper. This example has been used many times to illustrate CJS and was a key example in the Lebreton et al. (1992) Ecological Monograph. The data set consists of 7 marking and recapture occasions for male and female Dippers and will be used to illustrate the analyses below.

I will take you through the basic steps of building CJS models in RMark below. More details are provided in Appendix C of Evan Cooch's online Gentle Introduction to MARK.

The Dipper data

First, make sure that the RMark library has been properly loaded into memory. If it is not already loaded, use the command

>library(RMark)

The Dipper data are stored in an RMark dataframe. Recall that we discussed earlier how to format CMR data for RMark, and I will assume that your data are in this format, as are the Dipper data.  Load these data using the command

> data(dipper)

We can display the data:

> dipper
         ch    sex
1   0000001 Female
2   0000001 Female
3   0000001 Female
4   0000001 Female
5   0000001 Female
6   0000001 Female
7   0000001 Female
8   0000001 Female
......
263 0110000   Male
264 0110110 Female
265 0111000 Female
266 0111000   Male
267 0111100 Female
268 0111100 Female
etc.

What we see is that each line contains a capture-history character string of 1's and 0's (7 columns in total, one per occasion), followed by a character value "Male" or "Female" denoting the sex of the bird. The capture occasions are yearly trapping efforts starting in 1980. We use this information, together with the knowledge that we will use CJS modeling, to "process" the data into a form that will provide more meaningful information for analysis:

#process dipper data to CJS format

dipper.processed=process.data(dipper,model="CJS",begin.time=1980,groups="sex")

This will, for example, allow us to use "sex" instead of a generic group variable for modeling, and will keep track of occasions by actual calendar years, making output more interpretable. Finally, we will create "design data" (a ddl object) with the command

> dipper.ddl=make.design.data(dipper.processed)

Although not absolutely necessary (design data are created by default within the mark() call), this step is good practice and will provide us with some useful details we can come back to later.

Creating models

Models are (deceptively) easy to create in RMark; we can create our first simple model by just using the above 2 ingredients (the processed data and ddl) as arguments in the mark() command:

> my_first_model<-mark(dipper.processed,dipper.ddl)

Several things happen as a result of this command: 1) MARK runs and a summary of the fitted model is printed to the screen, 2) the results are stored in the R object my_first_model, and 3) the MARK input and output files are written to the working directory.

We will talk mostly about the first 2 of these; the 3rd will be handy for more advanced applications (or if you want to port your analysis over to the MARK GUI).

Let's first look at the first few lines of the screen output:

> my_first_model=mark(dipper.processed,dipper.ddl)

STOP NORMAL EXIT

Output summary for CJS model
Name : Phi(~1)p(~1)
Npar :  2
-2lnL:  666.8377
AICc :  670.866

These lines indicate, first, that a normal run (no errors) was achieved; second, they tell us what specific model was fit for the CJS analysis; and then they provide a number of additional statistics (number of parameters, -2 log-likelihood, AICc). In particular, note that the model

Phi(~1)p(~1) 

indicates that only intercepts were used to model variation in Phi and p. You may wonder "where did this model come from?" The answer is that it is the default model used by mark() when a CJS model structure is specified (you did that in the "process" step). As a general rule, I recommend explicitly creating models, which we will do below. But first let's look at some of the other output. First, we have the "beta" estimates. These are estimates of parameters on the logit scale, and in this case are simply the 2 intercepts (for Phi and p), along with SEs and CIs.

Beta
                 estimate        se       lcl       ucl
Phi:(Intercept) 0.2421484 0.1020127 0.0422034 0.4420933
p:(Intercept)   2.2262661 0.3251094 1.5890517 2.8634805

Following this are a series of tables containing values for the "real" (i.e., back-transformed) parameters Phi and p. Notice that these are in the form of triangular matrices, with the rows and columns labeled by years (1980-1986). Specifically, the rows stand for the release cohort (the year in which a group of marked animals was released together) and the columns for either the encounter occasions (p) or the intervals over which survival occurred (Phi). There is a separate matrix for each parameter type and for each group (sex), so 4 in all (2 for Phi and 2 for p) in this case. The values in these matrices line up with the corresponding parameter indices in what are known as the Parameter Index Matrices (PIMs). We will get into these in more detail below, but briefly, they represent how parameters are potentially allowed to vary over groups, occasions, or cohorts (or combinations of these). One thing you might have noticed is that all the values in the Phi and p matrices are identical. This is because the model, which we will specify in the design matrix, specifies no variation in any of these factors. RMark, however, starts from a very general premise that you may want to allow all kinds of variation in these parameters, so these indices are placeholders to allow that variation. In fact, a simplified version of the PIMs would have only a single value for each of Phi and p (2 parameters) and would provide the same results.
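As a quick check, you can back-transform the logit-scale intercepts yourself and compare them to the real estimates; a minimal sketch using the values printed above (the full table of real estimates is also stored in the model object):

> #inverse-logit of the reported intercepts: roughly 0.56 for Phi and 0.90 for p
> plogis(0.2421484)
> plogis(2.2262661)
> #table of real estimates stored by RMark
> my_first_model$results$real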

Let's create some additional, more interesting models. These models will be based on how, given the data structure, we might expect the parameters to vary over occasions (time) or between groups (sex). For now, we are not going to allow variation due to the cohort of release; that is, we are going to use the standard CJS assumption that surviving animals from previous cohorts are captured and survive with the same probabilities as new releases. Later we will relax this under assumptions that allow for cohort (usually this means age) dependency.

We will represent these types of variation for each parameter by creating formulas that should look familiar by now, because they follow standard R formatting for generalized linear models. For Phi we will have:

> Phi.dot=list(formula=~1)

> Phi.sex=list(formula=~sex)

> Phi.t=list(formula=~time)

> Phi.sex.t=list(formula=~sex+time)

> Phi.sex.T=list(formula=~sex*time)

representing, respectively, no variation (intercept only), variation by sex only, variation by time only, an additive sex + time model, and an interactive sex*time model. Likewise for p we have

> p.dot=list(formula=~1)

> p.sex=list(formula=~sex)

> p.t=list(formula=~time)

> p.sex.t=list(formula=~sex+time)

> p.sex.T=list(formula=~sex*time)

We can put these together to form 25 alternative models (5 Phi * 5 p). I'll just show the commands for the first 5, but the rest are in the R script file I created.

>mod1=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.dot))

>mod2=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.sex))

>mod3=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.t))

>mod4=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.sex.t))

>mod5=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.sex.T))
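As an aside, rather than typing all 25 mark() calls by hand, RMark can gather the Phi.xxx and p.xxx formula lists defined above and run every combination for you. A minimal sketch, assuming the formula lists above are in the workspace (in the RMark documentation this is usually wrapped in a small function):

> #collect all objects named Phi.* and p.* and build the list of 25 candidate models
> cjs.model.list=create.model.list("CJS")
> #fit each model with mark() and return them as a collected marklist
> cjs.results=mark.wrapper(cjs.model.list,data=dipper.processed,ddl=dipper.ddl)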

After running all 25 models, we can use the built-in collect.models() function to sort them out by AICc values. This function simply looks for all the models that have been run and are still in memory, so remember that we actually fit the intercept-only model twice (the first time as my_first_model). We don't want this model to appear twice in the AICc table, so let's remove one copy (it doesn't matter which):

> rm(my_first_model)

> collect.models()

I have saved the screen output to a text file. Ignore the first column (an arbitrary index for each model) and focus on the AICc values and AICc weights. The models are ordered by ascending AICc values and descending weights, with the "best" models first. Clearly the simple Phi(~1)p(~1) model ranks best, but it is closely followed by models allowing sex variation in Phi or p. On the other hand, most of the models incorporating time effects rank fairly low. We will discuss later how to deal with model uncertainty (e.g., model selection, model-averaged estimation).
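If you assign the result of collect.models() to an object, the ranking table is stored along with the models and can be re-displayed or subset later; a minimal sketch (dipper.results is simply the name I use here):

> #store the collected models; the AICc table is kept in the model.table element
> dipper.results=collect.models()
> dipper.results$model.table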

Keep in mind that we still have the model objects for any model we didn't delete (remove), and these can be used to generate summaries and other information. For instance

>summary(mod10)

produces a table of summary values for mod10

>mod10$results 

produces a detailed set of results including variance-covariance matrices. 

>mod10$output

tells us where the MARK files associated with this model are located (handy if we want to port things over to MARK).

>attributes(mod10)

gives a full set of attributes of this object. 

Now, some of you have used MARK before and may have even run the Dipper example. If so, you know that you could use the MARK GUI to create the above models (and more) as "pre-defined models", by manipulating PIM charts, design matrices, etc.   So maybe you are not impressed that we just wrote a bunch of script to do the same thing.  In my mind there are several reasons why you might actually prefer the RMark approach:

PIMS and Design matrices: a closer look

As mentioned earlier, RMark allows for extremely flexible and general modeling of CJS and other data. Consequently, there is a certain amount of embedded complexity that the average user may or may not need to know about. Nevertheless, there are some basics that you should understand. I will try to present these as clearly as possible, returning to the Dipper example to illustrate various points. Taking that example, let's consider what was accomplished by the following steps

> data_dir<-"C:/users/mike/Dropbox/teaching/WILD8390/spring2014/CJS"

> setwd(data_dir)

> rm(list=ls())

>

> library(RMark)

> #load dataframe containing dipper example

> data(dipper)

> #process dipper data to CJS format

The first few lines took care of some basic tasks: setting a working directory, clearing the workspace, loading the RMark library, and loading the Dipper data. The line

> dipper.processed=process.data(dipper,model="CJS",begin.time=1980,groups="sex")

established that we are following a basic CJS (vs. some other) model structure, that the first time occasion is to be labeled as 1980, and that there is a grouping variable named "sex".  The line

> dipper.ddl=make.design.data(dipper.processed)

creates a very general design data structure for these data and this model. If you examine this data structure (captured here in a text file) you can see that 1) it is composed of 2 sublists, one for Phi and one for p, and 2) each sublist has 42 rows. The rows correspond to the "unsimplified" Parameter Index Matrix (PIM) that we will look at in more detail below; the columns are various parameter and model indices and factors that potentially allow modeling by group, time, cohort, age, and other factors. This data structure is built using the make.design.data() function and can be added to later on, as we will see (for instance, to allow inclusion of other covariates). When we build models that include various terms (time, sex, etc.), these terms must reference the columns in this dataframe (e.g., if you mis-spell sex as Sex, RMark will not know what you are talking about).
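To see this structure for yourself, here is a quick sketch of how to inspect the Phi sublist (the p sublist can be examined the same way):

> #columns available for modeling Phi (group, sex, time, cohort, age, etc.)
> names(dipper.ddl$Phi)
> #first few of the 42 rows of Phi design data
> head(dipper.ddl$Phi)
> nrow(dipper.ddl$Phi)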

 

Let's build a few models to show how this works in practice. Again, we can define a handful of formulas for Phi and p

> Phi.sex=list(formula=~sex)

> Phi.t=list(formula=~time)

> Phi.sex.t=list(formula=~sex+time)

> p.sex=list(formula=~sex)

> p.t=list(formula=~time)

> p.sex.t=list(formula=~sex+time)

and use these to build some models.  Let's build 3 models, in descending order of complexity:

 

>#create a model with sex+time for both phi and p

>mod1=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.sex.t,p=p.sex.t))

>#create a simpler model: sex only for Phi, sex+time for p

>mod2=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.sex,p=p.sex.t))

>#create an even simpler model, with sex only for both Phi and p

>mod3=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.sex,p=p.sex))

 

We're going to ignore the parameter estimates and so on for now, and concentrate on the parameter and model structures for each of these models.  Taking mod1, the most complex of the 3, let's first examine the "unsimplified" PIMs. 

>PIMS(mod1,"Phi",simplified=FALSE)

>PIMS(mod1,"p",simplified=FALSE)

 

These contain 84 parameter indices, allowing group, time, and cohort specificity in both Phi and p, far more detail than is going to be supported by the data or even statistically identifiable. In fact, the unsimplified PIMs for all our models will have this same dimension. But even our more complex models are going to simplify this structure somewhat by collapsing some dimensions. A standard simplification in CJS models is to specify that Phi and p depend on time (occasion) but not release cohort. We can display the simplified PIMs by

>PIMS(mod1,"Phi",simplified=T)

>PIMS(mod1,"p",simplified=T)

When you do this you will see that the indices change across columns (occasions) in each group, but no longer across rows; this results in a reduction from 84 to 24 parameter indices, and represents the full group*time CJS model. 

However, we are not done: the design matrix further operates on the PIMs to create the model specified in our model script. Recall that this model is sex+time for both Phi and p, and so lacks the interaction term. We can see the design matrix

>#rows for Phi

>mod1$design.matrix[1:12,]

>#rows for p

>mod1$design.matrix[13:24,]

You will see that this design matrix has 24 rows (1 for each parameter in the PIMs) but only 14 columns. The columns relate to the parameters of the (logit-scale) linear model used to create the model. E.g., all the Phi rows have a 1 in the intercept column, a 1 or 0 in the sex column (0 if female, 1 if male), and a 1 or 0 in each time occasion column denoting which time occasion is being modeled (the first occasion being represented by the intercept).
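You can confirm these dimensions and see the column labels directly; a quick sketch:

> #24 rows (one per simplified parameter index) by 14 columns (logit-scale model terms)
> dim(mod1$design.matrix)
> colnames(mod1$design.matrix)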

 

Let's continue with the slightly simpler model, mod2.  This model considers sex but not time variation for Phi, and sex+time variation for p. Display the simplified PIM structure for this model

>#simplified PIM structure

>PIMS(mod2,"Phi",simplified=T)

>PIMS(mod2,"p",simplified=T)

You will see that only 2 parameter indices are now needed for Phi (one for each sex), but 12 are needed for p, for 14 in total. The design matrix then operates on this simpler model; it has 14 rows (one for each parameter index) but only 9 columns: an intercept and sex column for Phi; intercept, sex, and time columns for p.  If you get a summary of the estimates of this model

>summary(mod2)

you will see that there are 9 "Beta" parameters estimated; these estimates correspond to the logit-scale intercepts and coefficients for each of the columns of the design matrix, and are used to derive the 'real' parameter estimates by inverse logit transformation. Finally, consider mod3, with only group effects for Phi and p. The simplified PIM structure

 >PIMS(mod3,"Phi",simplified=T)

>PIMS(mod3,"p",simplified=T)

 has just 4 indices, and the design matrix

> mod3$design.matrix

has 4 rows and 4 columns, so in fact the design matrix has not really  changed anything (there are 4 estimated parameters).

 

All this seems fairly complicated, but it is actually relatively straightforward and is done behind the scenes for you if you follow the steps we used above: process the data, make the design data, define formula lists for Phi and p, and pass these to mark(). Just keep in mind that the formulas can only refer to columns that exist in the design data, and that the design matrix, not the PIMs alone, determines the model that is actually fit.

I have included the above commands in an R script file for your convenience.

A PowerPoint presentation displaying PIMs and design matrices for several models for the Dipper example is also provided.

 

Including covariates in the model

 

There are 2 different types of covariates that we can consider in a CJS analysis, and they are handled differently: design (group- or time-specific) covariates, which are attached to the design data, and individual covariates, which are attributes of individual animals.

As pointed out in Appendix C, this is not a truly clear-cut separation, since obviously some attributes, like sex, do vary from animal to animal. For most purposes, design covariates are categorical attributes that lend themselves to grouping animals, whereas individual covariates are numerical attributes (length, mass, etc.). We have already seen an example of individual covariates in the known-fates analysis of black ducks in Week 8. Here I use the Dipper example to illustrate a couple of ways that design covariates can be included.

 

Lebreton et al. (1992) incorporated one of the first (if not the very first) uses of a time-specific covariate in CJS analysis. They observed that the years between 1980 and 1986 were either flood years (1981, 1982) or normal years, and created an indicator variable equal to 1 if the year was a flood year and 0 if normal. The code to incorporate this into the ddl requires some fairly simple R data handling:

> dipper.ddl$Phi$Flood=0

> dipper.ddl$Phi$Flood[dipper.ddl$Phi$time==1981 | dipper.ddl$Phi$time==1982]=1
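A quick sanity check (a sketch) that the indicator was set as intended, i.e., equal to 1 only in 1981 and 1982:

> #cross-tabulate time against the new Flood column
> table(dipper.ddl$Phi$time,dipper.ddl$Phi$Flood)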

The modified ddl is then used with a series of model formulas and definitions to create and compare 3 alternative models:

>Phi.Flood=list(formula=~Flood)

>Phi.dot=list(formula=~1)

>Phi.time=list(formula=~time)

>p.dot=list(formula=~1)

>dipper.phi.flood.p.dot=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.Flood,p=p.dot))

>dipper.phi.t.p.dot=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.time,p=p.dot))

>dipper.phi.dot.p.dot=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.dot))

>collect.models()

The results indicated fairly strong evidence (AICc weight of 0.89) for flood as a predictor of survival, with this model performing better than either the Phi.dot model or the model with generic variation over time (Phi.time). The beta estimate of -0.5599740 (95% CI = -0.9840706, -0.1358773) indicates that survival was lower in flood years than in normal years.
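One way to put this coefficient on a more interpretable scale is to exponentiate it, giving an approximate odds ratio for survival in flood versus normal years; a quick sketch using the values reported above:

> #odds ratio (roughly 0.57) and 95% CI for surviving a flood year relative to a normal year
> exp(c(-0.5599740,-0.9840706,-0.1358773))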

I also used the artificial "effort" variable constructed in Appendix C to illustrate how slightly more complicated covariates can be incorporated. Here a capture effort variable was created in a dataframe:

> df=data.frame(group=c(rep("Female",6),rep("Male",6)),time=rep(c(1981:1986),2),  effort=rep(c(10,5,2,8,1,2),2))

> df
    group time effort
1  Female 1981     10
2  Female 1982      5
3  Female 1983      2
4  Female 1984      8
5  Female 1985      1
6  Female 1986      2
7    Male 1981     10
8    Male 1982      5
9    Male 1983      2
10   Male 1984      8
11   Male 1985      1
12   Male 1986      2

This is then merged with the design data (ddl) using the built-in function merge_design.covariates():

> dipper.ddl$p=merge_design.covariates(dipper.ddl$p,df,bygroup=TRUE)

Then parameter formulas involving this covariate as a predictor of p are constructed and used to build some additional models:

> p.effort.plus.sex=list(formula=~effort+sex)

> p.effort=list(formula=~effort)

>

> dipper.phi.dot.p.effort.sex=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.effort.plus.sex))

>dipper.phi.dot.p.effort=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.effort))

>dipper.phi.flood.p.effort=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.Flood,p=p.effort))

>collect.models()

In this example, the Phi(~Flood)p(~effort) model now winds up being one of the top models, indicating some support for the effort covariate.

I have included these examples in an R script for convenience.

 

Model fit, model selection, and model averaging

Model fit

As seen in earlier examples (e.g., the closed CMR models in Week 6), there are a number of approaches for assessing goodness of fit. CJS models have actually received a lot of attention in this area, and some well-developed procedures are available. I have already mentioned the deviance/df approach, and that can work reasonably well for CJS models. Program RELEASE conducts a series of chi-square contingency table tests for standard CJS models, producing a goodness-of-fit test for the general groups*time model (the most complex model we have created so far). The commands to perform the RELEASE GOF test in RMark for the dipper example are:

> dipper.processed=process.data(dipper,groups=("sex"))

> release.gof(dipper.processed)

RELEASE NORMAL TERMINATION

      Chi.square df      P
TEST2     7.5342  6 0.2743
TEST3    10.7735 15 0.7685
Total    18.3077 21 0.6295

The test suggests quite good model fit. By comparison, a computation of deviance/df for that model (mod25) provides

> mod25$results$deviance/mod25$results$deviance.df

[1] 3.761788

which suggests some lack of fit or overdispersion.  We will pursue this further with a parametric bootstrap test of fit.
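If we do conclude that an overdispersion adjustment is needed, a variance inflation factor (c-hat) can be applied to a collected set of models so that ranking is based on QAICc; a minimal sketch, assuming dipper.results is the marklist from collect.models() above and using an illustrative (not estimated) c-hat value:

> #re-rank the collected models under an assumed c-hat of 2 (value shown for illustration only)
> adjusted.table=adjust.chat(chat=2,dipper.results)
> adjusted.table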

 

Multi-model inference

In general, we are not going to want to assume that a single model is operative, but rather make inference (estimate parameters, make comparisons, make predictions) taking into account model uncertainty as expressed by the AICc weights. We already did this for known fates (illustrated by the black duck problem); the approach is very similar under CJS. I illustrate this with an example from Great Tits, a titmouse ringed (banded) in Spain as part of a study of selection for "tie size" (the size of the black breast band), a marking on males that seems to be correlated with survival (Senar et al. in press). Here are the data (in MARK format) for adult and subadult Great Tits, and the R code to format the data for RMark, read individual covariates, conduct an age-specific CJS analysis, and compute and plot model-averaged predictions of survival.
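Before turning to that example, the basic mechanics of model averaging in RMark can be sketched with the dipper models already in hand (assuming dipper.results is the marklist from collect.models() above):

> #model-averaged real estimates of Phi across all models, weighted by AICc weights
> Phi.avg=model.average(dipper.results,parameter="Phi",vcv=TRUE)
> #estimates with unconditional standard errors
> Phi.avg$estimates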