We will now return to the Dipper example we just ran in MARK and illustrate how to construct the same models in RMark using R scripts, obtain output similar to MARK's, perform model averaging, and more.
Again, I must emphasize, we are not performing any fundamentally different analyses using RMark, since RMark requires MARK to do the computations. What we are gaining includes:
A bit more in-depth understanding of what these models actually are composed of
Some nicer data handling (input, output, graphing) provided by R
More flexibility when it comes to creating 'non-standard' models involving complex group and age/cohort structure, covariates, and prediction
But initially, we are mainly reproducing analyses that we could have done in MARK (and vice versa), because it's important to understand how a tool works before we get 'creative'.
First, we need to make sure that RMark is available (installed).
> require(RMark)
We will then load the Dipper example, which is already saved as an RMark object (it is one of the built-in data examples). We could have read this from the original data (or reformatted the MARK input file), but there is no need because these are identical data except for the formatting.
> #load pre-formatted RMark data object for this example
> data(dipper)
If you examine the dipper data (simply put >dipper on the command line) you will see that it is a series of capture history strings (these are character strings, not numbers) followed by "Male" or "Female". In general, that second column could be an attribute (or we could have several attributes, like age, weight, etc.) of the individual animal that is recorded on its first capture. In this case "Male" or "Female" defines a grouping variable or factor, and we must tell R this fact, in addition to information as to the basic data structure (CJS, here). Optionally, we can tell RMark that the initial capture time is 1980 (which it is).
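Before processing, a few standard R commands give a quick look at the raw data frame (a sketch; output not shown here):
> #examine the raw dipper data
> head(dipper) #capture history strings plus sex
> str(dipper) #ch is a character vector, sex a factor
> table(dipper$sex) #number of birds in each group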
> #process dipper data to CJS format
> #necessary because sex is a factor (group) not a continuous covariate
> dipper.processed=process.data(dipper,model="CJS",begin.time=1980,groups="sex")
The next step is optional (because it is done automatically in the mark() function wrapper) but will help us see the details of how models are built; it also will be needed if we need to (later on) add or manipulate covariates.
> #define design data
> #this is not necessary because it is automatically created in the mark() function.
> #but it allows us to see (and change) all the details of the design data
> dipper.ddl<-make.design.data(dipper.processed)
Building models and understanding PIMs and DMs
We are ready to start building some models. As a preliminary step, we can create some parameter formulas that we will use below in various combinations. As an example:
> Phi.sex=list(formula=~sex)
specifies that Phi varies only as a function of sex (not time); we will see more specifically how this works below.
Others we can create are:
> Phi.t=list(formula=~time) #time
> Phi.sex.t=list(formula=~sex+time) #additive sex and time
> Phi.sex.T=list(formula=~sex*time) #interactive sex and time
> p.sex=list(formula=~sex)
> p.t=list(formula=~time)
> p.sex.t=list(formula=~sex+time)
> p.sex.T=list(formula=~sex*time)
> Phi.dot=list(formula=~1) #no variation over groups or time
> p.dot=list(formula=~1) #no variation over groups or time
OK, let's build a model. We will start with the model that specifies interactive sex and time effects:
> #Phi(g*t)p(g*t) model
> Phi.sex.T_p.sex.T=mark(dipper.processed,model.parameters=list(Phi=Phi.sex.T,p=p.sex.T))
When you run these commands, you will get summary output that will include
AIC statistics (adjusted or not)
Parameter count (adjusted or not)
Beta (design matrix parameters) estimates
Real parameter (Phi and p) estimates corresponding to the PIMs
We will discuss each of these as we proceed. Before we do, let's take a minute to discuss PIMs, because these can be a source of confusion but really aren't that complicated. First, for this model we will display the 'unsimplified' PIMs. For the CJS model, the parameters (Phi and p) potentially vary by both release cohort and time; each triangular PIM has 6 + 5 + 4 + 3 + 2 + 1 = 21 entries per group, so with 2 groups there are 42 parameters each for Phi and p, or 84 altogether. E.g., for Phi we have
> #unsimplified (all different PIMS)
> PIMS(Phi.sex.T_p.sex.T,"Phi",simplified=FALSE)
group = sexFemale
1980 1981 1982 1983 1984 1985
1980 1 2 3 4 5 6
1981 7 8 9 10 11
1982 12 13 14 15
1983 16 17 18
1984 19 20
1985 21
group = sexMale
1980 1981 1982 1983 1984 1985
1980 22 23 24 25 26 27
1981 28 29 30 31 32
1982 33 34 35 36
1983 37 38 39
1984 40 41
1985 42
Corresponding to this unsimplified PIM structure we have design data that describe very general variation in Phi and p as a function of group (sex in this case), cohort, "age" (time since release), and time (occasion). We can display this if we want from the object created earlier:
> dipper.ddl
If we display this for each parameter we will see that each component (Phi, p) has 42 rows (1 for each unsimplified PIM index) and 11 columns (containing various group, cohort, and age attributes). We will come back to this later because it contains the raw ingredients for building complex design matrices.
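For example (a sketch; the exact set of columns can vary slightly across RMark versions), we can confirm the dimensions and look at the first few rows of the Phi component:
> #design data for Phi: one row per unsimplified PIM index
> dim(dipper.ddl$Phi)
> head(dipper.ddl$Phi)
> #the p component has the same structure
> dim(dipper.ddl$p)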
However, the basic CJS model is not cohort-specific but time-specific, so the (simplified) PIMs for Phi and p will be
> #simplified (time but not cohort specific PIMS)
> PIMS(Phi.sex.T_p.sex.T,"Phi")
group = sexFemale
1980 1981 1982 1983 1984 1985
1980 1 2 3 4 5 6
1981 2 3 4 5 6
1982 3 4 5 6
1983 4 5 6
1984 5 6
1985 6
group = sexMale
1980 1981 1982 1983 1984 1985
1980 7 8 9 10 11 12
1981 8 9 10 11 12
1982 9 10 11 12
1983 10 11 12
1984 11 12
1985 12
> PIMS(Phi.sex.T_p.sex.T,"p")
group = sexFemale
1981 1982 1983 1984 1985 1986
1980 13 14 15 16 17 18
1981 14 15 16 17 18
1982 15 16 17 18
1983 16 17 18
1984 17 18
1985 18
group = sexMale
1981 1982 1983 1984 1985 1986
1980 19 20 21 22 23 24
1981 20 21 22 23 24
1982 21 22 23 24
1983 22 23 24
1984 23 24
1985 24
>
That is, we have 24 parameters: 12 for Phi (2 groups × 6 time intervals) and 12 for p (2 groups × 6 recapture occasions).
When we build a model in RMark, we put together 2 things.
First, depending on what we specify, RMark will initialize a PIM structure for Phi and p. If we include both group and time, the PIM structure will look like the one above (e.g., 24 parameters)
Then, depending on the specific model formulation, RMark applies a Design Matrix to this PIM structure. This can be viewed as an element (attribute) of the model object, in this case
> Phi.sex.T_p.sex.T$design.matrix
We will display portions of this matrix below, but let's check out its dimensions
> dim(Phi.sex.T_p.sex.T$design.matrix)
[1] 24 24
That is, the DM has 24 rows and 24 columns. Again, the rows correspond to the PIM (real parameter) indices and the columns to the beta parameters estimated. This is exactly the same as in MARK, so nothing new here. Examine the Phi and p portions of the DM by displaying the upper-left and lower-right 12 × 12 submatrices:
> Phi.sex.T_p.sex.T$design.matrix[1:12,1:12]
and
> #Design matrix for p portion
> Phi.sex.T_p.sex.T$design.matrix[13:24,13:24]
You should note the following for each submatrix:
The off-diagonal submatrices (lower left and upper right) are all zeros, because the rows for one parameter type (e.g., Phi) do not involve the beta parameters (columns) of the other (e.g., p)
Each parameter type has "1" in the first (intercept) column of its submatrix
The other columns in each submatrix are group effects (in this case, 0 for females and 1 for males), time effects (5 effects, because there are 6 intervals and one level is absorbed into the intercept), and group × time interactions; a quick check of these points is sketched below
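Here is a quick sketch of how to check these points from R (note that RMark stores the DM as a character matrix, so the comparisons are against "0" and "1"):
> #off-diagonal blocks (Phi rows vs. p columns and vice versa) are all zero
> all(Phi.sex.T_p.sex.T$design.matrix[1:12,13:24]=="0")
> all(Phi.sex.T_p.sex.T$design.matrix[13:24,1:12]=="0")
> #each parameter type has a column of 1s for its intercept
> all(Phi.sex.T_p.sex.T$design.matrix[1:12,1]=="1")
> all(Phi.sex.T_p.sex.T$design.matrix[13:24,13]=="1")
> #the column names list the intercept, sex, time, and sex:time terms
> colnames(Phi.sex.T_p.sex.T$design.matrix)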
Finally, we can see estimates of the model parameters by looking at the object. For the design matrix (beta) parameters we have
> #parameter estimates
> Phi.sex.T_p.sex.T$results$beta
and for real parameters
> Phi.sex.T_p.sex.T$results$real
Note that there are 24 parameters estimated, and that the number of 'real' (PIM) and 'beta' (DM) parameters is the same.
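A quick way to confirm the counts (both results components are data frames, so nrow() gives the number of estimates):
> #both should return 24 for this model
> nrow(Phi.sex.T_p.sex.T$results$beta)
> nrow(Phi.sex.T_p.sex.T$results$real)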
Now let's create a model based on this PIM structure. An obvious choice is additive sex and time for both Phi and p:
> Phi.sex.t_p.sex.t=mark(dipper.processed,model.parameters=list(Phi=Phi.sex.t,p=p.sex.t))
If we check the PIM structure we notice it is the same as for the previous model (12 Phi and 12 p parameters), because we are still including both group and time effects:
> PIMS(Phi.sex.t_p.sex.t,"Phi")
> PIMS(Phi.sex.t_p.sex.t,"p")
However, the DM has changed, and now has 14 and not 24 columns.
> dim(Phi.sex.t_p.sex.t$design.matrix)
[1] 24 14
Why? Because while we kept the intercept, group, and time effect columns (7 columns for each parameter type), we dropped the 5 group × time interaction columns for each parameter type, or 10 columns in total. You can display the DM for each parameter type:
> #Design matrix for Phi portion
> Phi.sex.t_p.sex.t$design.matrix[1:12,1:7]
> #Design matrix for p portion
> Phi.sex.t_p.sex.t$design.matrix[13:24,8:14]
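To see exactly which effects remain, it can also help to list the column names of the new DM (a sketch; with R's default treatment contrasts the names include an intercept, a sexMale term, and five time terms for each of Phi and p):
> #column names of the additive-model DM (7 for Phi, 7 for p)
> colnames(Phi.sex.t_p.sex.t$design.matrix)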
Finally, we can see that while we now get 14 beta estimates:
> Phi.sex.t_p.sex.t$results$beta
We still get 24 real parameter estimates.
> Phi.sex.t_p.sex.t$results$real
That is, RMark is giving us estimates of all 24 of our real (PIM) parameters -- but these are based on a model with only 14 beta parameters (additive sex and time effects).
Suppose we drop the group (sex) effect out of our models and only incorporate time.
> #time only models
> Phi.t_p.t=mark(dipper.processed,model.parameters=list(Phi=Phi.t,p=p.t))
First, when you display the PIMs for this model you will now see that there are 12 parameters total: 6 for Phi and 6 for p. The indices vary over time but repeat between groups.
> PIMS(Phi.t_p.t,"Phi")
> PIMS(Phi.t_p.t,"p")
If we examine the dimensions of the DM for this model, it now has 12 rows and 12 columns:
> #dimensions of design matrix
> dim(Phi.t_p.t$design.matrix)
[1] 12 12
> #Design matrix for Phi portion
> Phi.t_p.t$design.matrix[1:6,1:6]
> #Design matrix for p portion
> Phi.t_p.t$design.matrix[7:12,7:12]
We can display estimates for this model, revealing that there are 12 beta and 12 real parameter estimates.
> Phi.t_p.t$results$beta
> Phi.t_p.t$results$real
As above, we could create a simplified (fewer-column) DM for this model, but it would still have to contain time effects if we want to keep the same PIM structure. An example where this would work is a time-specific covariate that models time variation in Phi or p via a linear-logit model, an approach we'll come back to later.
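As a preview (a sketch only; it relies on the numeric Time variable, with a capital T, that RMark includes in the design data alongside the factor time), a linear-logit time trend could be specified like this:
> #sketch: linear trend over time on the logit scale
> #keeps the same time-specific PIM structure, but the DM reduces to an intercept and slope
> Phi.Time=list(formula=~Time) #capital T: numeric trend, not the time factor
> p.Time=list(formula=~Time)
> Phi.Time_p.Time=mark(dipper.processed,model.parameters=list(Phi=Phi.Time,p=p.Time))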
Finally, we'll put group back in but drop time out of the model:
> #sex only models
> Phi.sex_p.sex=mark(dipper.processed,model.parameters=list(Phi=Phi.sex,p=p.sex))
Examining the PIM structure, we can see that there are now only 4 parameters (1 Phi and 1 p parameter for each group):
> PIMS(Phi.sex_p.sex,"Phi")
group = sexFemale
1980 1981 1982 1983 1984 1985
1980 1 1 1 1 1 1
1981 1 1 1 1 1
1982 1 1 1 1
1983 1 1 1
1984 1 1
1985 1
group = sexMale
1980 1981 1982 1983 1984 1985
1980 2 2 2 2 2 2
1981 2 2 2 2 2
1982 2 2 2 2
1983 2 2 2
1984 2 2
1985 2
> PIMS(Phi.sex_p.sex,"p")
group = sexFemale
1981 1982 1983 1984 1985 1986
1980 3 3 3 3 3 3
1981 3 3 3 3 3
1982 3 3 3 3
1983 3 3 3
1984 3 3
1985 3
group = sexMale
1981 1982 1983 1984 1985 1986
1980 4 4 4 4 4 4
1981 4 4 4 4 4
1982 4 4 4 4
1983 4 4 4
1984 4 4
1985 4
>
Likewise, the DM is quite simple (4 × 4). So there are now 4 'real' parameters, based on a model with 4 design (beta) parameters.
> Phi.sex_p.sex$results$beta
estimate se lcl ucl
Phi:(Intercept) 0.2231964 0.1440913 -0.0592225 0.5056153
Phi:sexMale 0.0416312 0.2041846 -0.3585705 0.4418330
p:(Intercept) 2.0111347 0.4211179 1.1857435 2.8365258
p:sexMale 0.4751639 0.6630006 -0.8243173 1.7746452
> Phi.sex_p.sex$results$real
estimate se lcl ucl fixed note
Phi gFemale c1980 c1 a0 t1980 0.5555686 0.0355779 0.4851987 0.6237780
Phi gMale c1980 c1 a0 t1980 0.5658227 0.0355404 0.4953193 0.6337593
p gFemale c1980 c1 a1 t1981 0.8819612 0.0438408 0.7659789 0.9446180
p gMale c1980 c1 a1 t1981 0.9231757 0.0363182 0.8149670 0.9704014
>
Finally, we will create one more model for now that illustrates the simplest possible structure -- no variation over either groups or time.
> Phi.dot_p.dot=mark(dipper.processed,model.parameters=list(Phi=Phi.dot,p=p.dot))
The PIMs for this model are very simple indeed:
> PIMS(Phi.dot_p.dot,"Phi")
group = sexFemale
1980 1981 1982 1983 1984 1985
1980 1 1 1 1 1 1
1981 1 1 1 1 1
1982 1 1 1 1
1983 1 1 1
1984 1 1
1985 1
group = sexMale
1980 1981 1982 1983 1984 1985
1980 1 1 1 1 1 1
1981 1 1 1 1 1
1982 1 1 1 1
1983 1 1 1
1984 1 1
1985 1
> PIMS(Phi.dot_p.dot,"p")
group = sexFemale
1981 1982 1983 1984 1985 1986
1980 2 2 2 2 2 2
1981 2 2 2 2 2
1982 2 2 2 2
1983 2 2 2
1984 2 2
1985 2
group = sexMale
1981 1982 1983 1984 1985 1986
1980 2 2 2 2 2 2
1981 2 2 2 2 2
1982 2 2 2 2
1983 2 2 2
1984 2 2
1985 2
> dim(Phi.dot_p.dot$design.matrix)
[1] 2 2
> #Design matrix for Phi portion
> Phi.dot_p.dot$design.matrix[1:2,1:2]
Phi:(Intercept) p:(Intercept)
Phi gFemale c1980 c1 a0 t1980 "1" "0"
p gFemale c1980 c1 a1 t1981 "0" "1"
Note, by the way, that we could have created this particular model with a much simpler command:
> Phi.dot_p.dot=mark(dipper.processed)
That is, by default (when no model parameter list is specified) RMark assumes that all the parameters are constant over groups and time.
Optional output to run MARK GUI
Sometimes users will want to move a problem initially created in RMark to MARK (e.g., they may be handing the problem off to someone more familiar with the MARK interface). The command export.MARK() takes a processed RMark data frame, converts it to MARK *.inp format, and creates other MARK files that can be loaded into a MARK project; optionally, it will even export a list of models already run in RMark. For the Dipper example I created a new MARK input file, "Dipper2.inp".
> #optional statements to produce input for MARK
> export.MARK(dipper.processed, "Dipper2", replace = TRUE, chat = 1, title = "Dippers Example", ind.covariates = "all")
When you run this, "NULL" is displayed in the R console, but the new files should appear in your working directory. Note that this command is essentially the reverse of convert.inp(), which can be used to convert a MARK *.inp file to an RMark data frame. For example:
> #reversing the above to get back an RMark dataframe
> dipper2<-convert.inp("Dipper2.inp",group.df=data.frame(sex=c("M","F")))
>
All the script commands to reproduce the steps to this point are saved here.
Running and saving multiple models
Often it is convenient to create and save a large number of models for use later (which as we know is a nice feature of MARK). The basic steps to do this in RMark are fairly straightforward:
Read and process the data for RMark, specifying the data/model type
Create script to run a series of models
Use clear notation
Set up parameter lists ahead of time and use them for the models in various combinations
Create the models (often this will involve a great deal of copying and pasting rather than typing!)
Run the models and collect them in a model results object
[Optional but very convenient] Assemble all of the above steps into a user-defined function that can be called and run with different data, or modified to add or change models
Query the model results object to compare models, investigate the estimates of individual models, perform tests, or (below) do model averaging
I have created a function called run.dipper() that does this for the 16 PIM-type models we ran earlier in MARK and uses the built-in RMark command collect.models() to create a model results object that is returned by the function. The function is invoked by
> dipper.results=run.dipper()
This displays a table of AIC results similar to the one in MARK; the dipper.results object also contains the specific details and results for each model. One thing to note is that, whereas the AIC values displayed by MARK and RMark agree for the simpler (top-ranked) models, they diverge for some of the lower-ranked models. The reason is that RMark always bases the parameter count used for AIC on the number of columns in the design matrix, which for the Phi(g*t)p(g*t) model, for example, is 24. MARK, by contrast, tries to determine how many parameters are actually estimable from the data, which for this model is 22 (within each group the final Phi and p are confounded, so only their product can be estimated). This should make little difference in practice but is something to be aware of. We will generally use the parameter counts based on RMark for AIC and other purposes.
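The full run.dipper() function is in the linked script rather than reproduced here, but a minimal sketch of its general form (shown with only four of the 16 models, and assuming data(dipper) has already been loaded) looks something like this:
run.dipper=function()
{
  #process the data and build design data inside the function
  dipper.processed=process.data(dipper,model="CJS",begin.time=1980,groups="sex")
  dipper.ddl=make.design.data(dipper.processed)
  #parameter specifications used in various combinations
  Phi.dot=list(formula=~1)
  Phi.t=list(formula=~time)
  p.dot=list(formula=~1)
  p.t=list(formula=~time)
  #a few of the 16 models (the full function defines all combinations)
  Phi.dot_p.dot=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.dot))
  Phi.t_p.dot=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.t,p=p.dot))
  Phi.dot_p.t=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.dot,p=p.t))
  Phi.t_p.t=mark(dipper.processed,dipper.ddl,model.parameters=list(Phi=Phi.t,p=p.t))
  #gather every mark model created in this function into a marklist with an AIC table
  return(collect.models())
}
The collect.models() call at the end is what produces the model results object (a 'marklist') with its AIC table.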
As mentioned, we can either view summary information using the results object, or drill down into it for more details on specific models:
> #get particular model
> results<-dipper.results$phi.dot_p.dot
> summary(results)
> #see all the attributes of the object
> attributes(results)
The R script to run these models and form the results object is here.
I will assume that you've run this script and created the dipper.results object, and that you are still in the R session with that object in memory, before moving to the next section on model averaging. If you need to leave the R session, you can avoid having to redo all the models by saving the R object:
>save(dipper.results,file='dipper.results')
saves an R image file named 'dipper.results' to your working directory, and
>load('dipper.results')
restores the dipper.results object into memory.
Model comparison, model averaging, and displaying results graphically
Again, we can easily rank a collection of models run on a given data set using AIC (adjusted for small samples as AICc or, if overdispersion needs to be accounted for, as QAICc), and sometimes that will suffice to determine that a single model is the overwhelming favorite (say, 99.9% of the AIC weight). More commonly, AIC ranking will be equivocal, with several contenders for the top model and a number of models carrying non-trivial weight. Thus, more generally, we are going to want to use a multi-model inference approach, for 2 reasons:
We are unsure about which model is 'true' and don't wish to base inference on a single model
We want our measures of estimate precision (variance, CI, etc.) to incorporate both the uncertainty about which model is true and the uncertainty (due to sampling variation) in the parameter estimates under any given model.
The basic ideas of multi-model inference and model averaging are described by Burnham and Anderson (2002; excerpt) and elaborated elsewhere; note that these ideas are far more general than CMR models and apply to any models fit by maximum likelihood. Fortunately, as we have already seen, MARK has some built-in features that allow us to easily perform model averaging, and RMark has similar (and in some ways easier to use) functions.
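For reference, a brief restatement of the key formulas (standard results from Burnham and Anderson; see the text for the full treatment): the AIC weights, the model-averaged estimate, and one common form of the unconditional variance estimator are
$$ w_i=\frac{\exp(-\Delta_i/2)}{\sum_{j=1}^{R}\exp(-\Delta_j/2)}, \qquad \hat{\bar{\theta}}=\sum_{i=1}^{R} w_i\,\hat{\theta}_i, \qquad \widehat{\mathrm{var}}(\hat{\bar{\theta}})=\Bigl[\sum_{i=1}^{R} w_i\sqrt{\widehat{\mathrm{var}}(\hat{\theta}_i\mid M_i)+(\hat{\theta}_i-\hat{\bar{\theta}})^2}\Bigr]^2 $$
where Delta_i is the AIC (or AICc/QAICc) difference for model M_i and R is the number of models in the set.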
We will use the built-in RMark function model.average(), which will provide model-averaged estimates for the model real parameters. We will focus on the Phi parameters.
> mod.avg<-model.average(dipper.results,"Phi")$estimates
If you examine this object, you will see that it contains 42 rows, as if there were 42 Phi parameters. This is because the model.average() function always works from the most general PIM structure, which for these data is cohort- and time-specific ("all different"): 42 Phi and 42 p parameters. Because we fit only time-specific CJS models, all cohorts after the first release cohort share the same time-specific survival, so we can drop the redundant rows. It will also be convenient to split the object into "Female" and "Male" parts, so we make the following subsets:
> fem<-subset(mod.avg,cohort==1980 & sex=="Female")
> mal<-subset(mod.avg,cohort==1980 & sex=="Male")
We can then create tables or plot these data for a more succinct display (e.g., in a manuscript). The attached code performs the model averaging and superimposes plots of model-averaged male and female survival (with approximate 95% CIs) versus year. A second plot omits the last year (1985), which had somewhat inflated CIs because of confounding between Phi(1985) and p(1986) under some models.
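The linked code is not reproduced here, but a minimal sketch of the plotting idea (assuming the fem and mal subsets created above, that the estimates table carries the time column from the design data, and using simple normal-approximation CIs of estimate ± 1.96 × se) is:
> #sketch: model-averaged female and male survival vs. year, with rough 95% CIs
> yr<-as.numeric(as.character(fem$time))
> plot(yr,fem$estimate,type="b",pch=16,ylim=c(0,1),xlab="Year",ylab="Model-averaged Phi")
> arrows(yr,fem$estimate-1.96*fem$se,yr,fem$estimate+1.96*fem$se,angle=90,code=3,length=0.05)
> #offset the male estimates slightly so the error bars do not overlap
> points(yr+0.1,mal$estimate,type="b",pch=1)
> arrows(yr+0.1,mal$estimate-1.96*mal$se,yr+0.1,mal$estimate+1.96*mal$se,angle=90,code=3,length=0.05)
> legend("bottomleft",pch=c(16,1),legend=c("Female","Male"))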
We will return to model averaging for making covariate-based predictions in an upcoming lab.
Next: Review exercises