Formatting data to run in MARK or RMark

What do the data need to look like?

As indicated earlier, there are several possible formats for CMR data (and for input into MARK) but we will start with the most common one: the standard capture history record. Generically, the records have these features:

Each animal (or unique capture history) has a row in the data
The columns signify capture / recapture occasions
An entry of 1 signifies the animal was captured on a particular occasion; a 0 signifies it was not.
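As a minimal sketch of that layout (the animal ids and the 1/0 values below are invented purely for illustration), a tiny capture history array can be typed directly into R:

```r
# Hypothetical toy data: 3 animals (rows) by 4 capture occasions (columns)
ch.matrix <- matrix(c(1, 0, 1, 1,
                      0, 1, 0, 0,
                      1, 1, 0, 1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("A001", "A002", "A003"),
                                    paste0("occ", 1:4)))
ch.matrix

# number of occasions on which each animal was caught
rowSums(ch.matrix)

# number of animals caught on each occasion
colSums(ch.matrix)
```

Row and column sums like these are a quick sanity check that the array says what you think it says.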

These are the basics, but as noted earlier MARK and RMark have some slight differences in how they handle input data. In either case, one could create an input file (a text file with .inp extension for MARK or a csv or other data input file for RMark) by hand, using a text or other editor. This is ok for very small problems but will be tedious and prone to errors for large data sets. Pivot tables in Access or other software can work with some fiddling, but generally we are going to want to use R to organize and format our data, either for input into MARK or (more directly) into RMark. But first let's talk about how your data should be handled before this step.

Getting from field data to analysis

Proper field data recording

A first step (and key to your ultimate success) is recording the field data from the get go in a format that

contains all the information you will need for CMR analysis
allows maximum flexibility (e.g., if you change your analysis you will still have data you can use!)

I can't emphasize enough how important this is. I have seen too many instances of data entry schemes that looked clever, colorful, or were otherwise interesting, but were practically useless for analysis. Do you really want to have to re-organize your data line by painstaking line? I didn't think so. So pay attention.

First, despite the fact that your data is eventually going to look like an array (n rows by k columns and some other stuff), you almost certainly do NOT want to enter field data this way into a data base. Your data needs to retain information at the level you collected it in the field, and with the attributes you measured. You will use programs to organize the data, but only if you follow some simple rules.

Ordinarily enter animal captures one animal at a time, recording animal identifier (tag number or other identifier), date, time, animal attributes (age, sex, mass, etc.). You may also enter in relevant environmental variables (temperature, etc.) but if these are common to the entire capture period (e.g., cloud cover when you started trapping) it may be better to enter a single record for these data for the occasion, and merge it with your capture data (if needed) later.
Use consistent symbols and terms and date / time formats. This is particularly important if you have multiple people entering data. Statistical programs do not do spell check; R is case sensitive.
- Examples that will cause you problems:
  - You call an animal A001 one time and 001 another
  - You put the time as 1230 one day and 12:30pm another
  - You record date both as 1/2/2014 (is this January or February??) and 2 Feb 2014.
  - You call the habitat Oak one time and red oak another
- If you have a lot of these types of choices, it may be better to use a data base program or other means that provides choices through templates, drop down lists, etc. Otherwise, just be careful!
Avoid blanks in data sheets; fill them with NA instead (or have a program do this).
Avoid scattering your data around different parts of a spreadsheet, or entering observations horizontally (across the columns) rather than vertically (down the rows).
In any case, it is always a good idea to run some simple summary statistics on your data before any serious analysis. For instance, a frequency table (table function in R) of "habitat" could show that you have 99 instances of "Red Oak" and 1 instance of "red oak", so you'd want to combine these. Or it could show that one of your date entries was for a year before you were born.
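Here is a rough sketch of such checks. The data frame, its column names, and the assumption that the study ran in 2014 are all made up for illustration, and the dates are written year-month-day so that nothing depends on how month names are spelled on your system:

```r
# Hypothetical field records with the kinds of inconsistencies described above
field <- data.frame(
  animal_id = c("A001", "001", "A002", "A003"),
  habitat   = c("Red Oak", "red oak", "Red Oak", "Red Oak"),
  date      = c("2014-02-02", "2014-02-02", "2014-02-03", "1914-02-03"),
  stringsAsFactors = FALSE
)

# frequency tables expose stray spellings and odd-length identifiers
table(field$habitat)
table(nchar(field$animal_id))

# one possible clean-up: standardize case, then tabulate again
field$habitat <- tolower(field$habitat)
table(field$habitat)

# flag dates that fall outside the study period (assumed here to be 2014)
field$date <- as.Date(field$date, format = "%Y-%m-%d")
suspect <- field[format(field$date, "%Y") != "2014", ]
suspect
```

Whether to fix a flagged record or go back to the field sheet is a judgment call; the point is only to find these before the analysis does.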

Programs in R to format data for software programs (MARK and RMark)

Here I illustrate, by way of a simple, artificial example, the conversion of field data in a format like that above to the capture history format usable in programs MARK or RMark. The first few lines of the data file, which is called "field.data.csv", look like this:

date       animal_id
1-Jul-14   7
1-Jul-14   11
1-Jul-14   14
1-Jul-14   22
1-Jul-14   24

The first column is a character string representing date (day-mo-year) and the second column is a 2-digit code for the animal's id (e.g., tag number). Our first step is to read the data into an R data frame

>require(reshape)

>require(RMark)

>#set the directory you are using

>setwd("c:/users/mike/dropbox/teaching/software course")

>################################################################

>#read from data file and create input for RMark

>input.data<-read.csv("field.data.csv")

The first couple of lines load a couple of packages that will be needed down the road; if these packages are not installed you will need to download and install them. Then we set the working directory where the data are located. Finally, the data are read into an R data frame called input.data.
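If you want to try the reading step without the course file, a small stand-in can be supplied inline. This is only a sketch: the three records below are invented, and read.csv() is given the text directly instead of a file name.

```r
# Hypothetical stand-in for field.data.csv, supplied inline so the sketch runs anywhere
csv.text <- "date,animal_id
1-Jul-2014,7
1-Jul-2014,11
2-Jul-2014,7"

input.data <- read.csv(text = csv.text, stringsAsFactors = FALSE)

str(input.data)     # column types as R understood them
head(input.data)    # first few records
anyNA(input.data)   # TRUE would signal blanks or unreadable entries
```

Looking at str() right after reading is a cheap way to catch a date column that came in as something unexpected.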

The next step is to convert the date codes used on the field data sheet to a date object recognized by R as a numerical date-time value. The function strptime() does this and operates as follows for these data:

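> #convert to R date format
> #data look like this 1-jul-2014
> #resulting date object is 2014-07-01 (year then month then day)
> input.data$date<-strptime(input.data$date,format="%d-%b-%Y")
> #just creating an indicator to show a capture occurred
> input.data$detect<-rep(1,nrow(input.data))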

The first few lines of the reformatted input file look like this:

> head(input.data)

date animal_id detect

1 2014-07-01 7 1

2 2014-07-01 11 1

3 2014-07-01 14 1

4 2014-07-01 22 1

5 2014-07-01 24 1

6 2014-07-01 34 1

This new date format will be more useful in many calculations in R, for instance in calculating elapsed days since an initial date (such as the start of the study) or in grouping multiple days or other intervals into single capture occasions.
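A minimal sketch of both calculations follows. The four dates and the choice of 3-day occasions are assumptions made for illustration, and as.Date() is used here in place of strptime() simply to keep the example short:

```r
# Hypothetical capture dates, already in year-month-day form
dates <- as.Date(c("2014-07-01", "2014-07-02", "2014-07-04", "2014-07-08"))

# elapsed days since the first day of the study
study.start <- min(dates)
elapsed <- as.numeric(dates - study.start)
elapsed

# pool days into 3-day capture occasions
# (days 0-2 = occasion 1, days 3-5 = occasion 2, and so on)
occasion <- elapsed %/% 3 + 1

data.frame(date = dates, elapsed = elapsed, occasion = occasion)
```

The occasion column could then take the place of date when the data are pivoted.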

We now need to pivot the data to resemble capture histories. We do this with a series of 'reshaping' functions in R. First we 'recast' the data into a matrix by means of the relationship between animal id (which will ultimately be our rows) and date (columns).

> #reshaping functions to pivot the data

> junk<-melt(input.data,id.var=c("animal_id","date"),measure.var="detect")

> y=cast(junk,animal_id ~date)

The first few lines of this matrix look like this

animal_id 2014-07-01 2014-07-02 2014-07-03 2014-07-04 2014-07-05

1 1 NA NA NA 1 NA

2 3 NA NA NA 1 NA

3 4 NA NA 1 NA NA

4 5 NA NA 1 NA NA

5 6 NA NA NA NA 1

6 7 1 NA NA 1 NA

This is beginning to look like what we need, but notice that all the days on which an animal was not captured show up as "missing" (NA), whereas we need them to be 0's. This is easy to fix with a simple command that checks each element of y to see if it is missing (is.na). If it is, the command replaces the element with a 0.

> #fill in all the days when the animal wasn't seen with zeros

> y[is.na(y)]=0

> #

> #Now y is a matrix with 82 rows (each individual animal) and 5 days

The first few rows of the revised matrix are now

animal_id 2014-07-01 2014-07-02 2014-07-03 2014-07-04 2014-07-05

1 1 0 0 0 1 0

2 3 0 0 0 1 0

3 4 0 0 1 0 0

4 5 0 0 1 0 0

5 6 0 0 0 0 1

6 7 1 0 0 1 0
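For comparison, here is a sketch of the same pivot using only base R. It is not the method used in the course code; the five capture records are invented, and the study is assumed to run from 1 to 5 July. One difference is that declaring every study day as a factor level keeps a column even for days on which nothing was caught.

```r
# Hypothetical long-format captures: one row per animal per day caught
caps <- data.frame(
  animal_id = c(1, 3, 4, 7, 7),
  date = as.Date(c("2014-07-04", "2014-07-04", "2014-07-03",
                   "2014-07-01", "2014-07-04"))
)

# make sure every study day becomes a column, even days with no captures
all.days <- seq(as.Date("2014-07-01"), as.Date("2014-07-05"), by = "day")
caps$date <- factor(as.character(caps$date), levels = as.character(all.days))

# cross-tabulate animals by day; combinations never observed are counted as 0
y.base <- unclass(table(caps$animal_id, caps$date))

# guard against the same animal being entered twice on one day
y.base[y.base > 1] <- 1
y.base
```

Because table() counts rather than looks up values, the zeros are filled in automatically and no NA replacement step is needed.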

Now we are getting close, but we are not quite there. What we have above is a series of animal ids, each followed by a list of five 1's and 0's. What we actually need is a simple capture history string for each animal that collapses the 1 0 1 1 1 etc. structure down into a single character string with no spaces. I have written a small user-defined function called pasty() to do this.

> #function to read the matrix and create the capture history strings

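> pasty<-function(x)
+ {
+ n<-nrow(x)
+ out<-character(n)
+ for (i in seq_len(n))
+ {
+ #recode any positive entry as 1, everything else as 0
+ y<-(x[i,]>0)*1
+ #paste all occasions together (works for any number of columns, not just 5)
+ out[i]<-paste(y,collapse="")
+ }
+ return(out)
+ }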
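The same collapsing step can also be written without an explicit loop. The sketch below is an alternative, not the course function: paste.rows() is a name I made up, and the small matrix is invented test data.

```r
# Hypothetical capture matrix: 3 animals by 5 occasions
y.mat <- matrix(c(0, 0, 0, 1, 0,
                  0, 0, 1, 0, 0,
                  1, 0, 0, 2, 0),   # the count of 2 is recoded to 1 below
                nrow = 3, byrow = TRUE)

# collapse each row into one string, for any number of occasions
paste.rows <- function(x) {
  x <- (as.matrix(x) > 0) * 1          # recode positive entries as 1
  apply(x, 1, paste, collapse = "")    # paste across the columns of each row
}

ch <- paste.rows(y.mat)
ch
```

Either version gives one string per animal; the loop is easier to read, the apply() call is shorter.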

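This is then used to create an RMark data frame:

> #capture history data frame for RMark (FINALLY)
> #stringsAsFactors=FALSE keeps ch as character strings rather than a factor
> capt.hist<-data.frame(ch=pasty(y[,2:6]),stringsAsFactors=FALSE)

Here's what the first few lines of this look like now: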

> head(capt.hist)

ch

1 00010

2 00010

3 00100

4 00100

5 00001

6 10010

This data frame can now be used directly for input into RMark. For example, the commands

> #run a MARK model

> ##we'll elaborate on this later and build more models!!

> results<-mark(capt.hist,model="Closed")

will run the default closed model for these data. More on this later.

If desired, we can also produce files that can be input into standard MARK. The export.MARK() function in RMark does this for you:

> #create files that can be imported into MARK

> closed.proc=process.data(data = capt.hist, model = "Closed")

> export.MARK(closed.proc, "example",

+ replace = TRUE, chat = 1, title = "Class example",

+ ind.covariates = "all")

NULL

>

It looks like nothing happened ("NULL" output), but look in your working directory. You should see a file example.inp ("example" was just a name I picked, and you can pick any other; "inp" is the extension necessary for MARK). This turns out to be just a text file:

00010 1;

00100 1;

00001 1;

10010 1;

The "1" above indicates that one animal was captured with that capture history, but note that more than one animal can have the same history (in the capt.hist listing above, the first 2 records share a history, as do the next 2).
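To see how little is involved in this record layout, here is a sketch that writes the same kind of lines by hand. It is not what export.MARK() does internally; the histories are invented, and the file goes to a temporary folder so nothing in your working directory is touched.

```r
# Hypothetical capture histories, one per animal
ch <- c("00010", "00010", "00100", "10010")

# each record: the history, a frequency of 1, and a closing semicolon
inp.lines <- paste0(ch, " 1;")

# write to a temporary file for the sketch, then read it back
inp.file <- file.path(tempdir(), "toy.inp")
writeLines(inp.lines, inp.file)
readLines(inp.file)
```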

The above code provides input for RMark and MARK as individual animal capture histories, with one line per individual animal. MARK can also read summarized capture histories, in which each record is a capture history string that represents more than one animal; the "1" column to the right of the capture history is then replaced by the frequency of animals with that history.

I have written some code to form summarized capture histories and applied it to the example capt.hist data frame:

> #the above does only individual ch records

> # here is one way to form summary records for MARK

> #make a table summarizing the capture histories

> x<-as.data.frame(table(capt.hist))

> #add a string of semicolons

> x$end<-";"

> #write this out to a file for input into MARK

> write.table(x,file="mark.summary.inp",quote=F,row.names=F,col.names=F)

This produces summarized histories, which are then exported to an .inp file for use in MARK:

> x

capt.hist Freq end

1 00001 7 ;

2 00010 11 ;

3 00011 3 ;

4 00100 15 ;

5 00101 3 ;

6 00110 4 ;

7 01000 2 ;

8 01001 1 ;

9 01010 3 ;

10 01100 3 ;

11 01101 3 ;

12 01110 3 ;

13 10000 6 ;

14 10001 1 ;

15 10010 4 ;

16 10100 3 ;

17 10110 2 ;

18 11000 5 ;

19 11010 1 ;

20 11100 2 ;

Mostly there is little need for this, and in any case RMark is usually given one record per animal (although the data frame passed to RMark may carry a freq column holding the number of animals with each history), whereas MARK reads summary records directly. However, in some cases there is a tremendous number of records with a lot of capture history repetition, so this format can be more efficient computationally. I will also show you later how to create the MARK input format files from within an RMark session.
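As a compact sketch of the idea (with six invented histories), the summary records can be built from a frequency table, and the individual records can be recovered from the summary again if needed:

```r
# Hypothetical capture histories, one per animal
ch <- c("00010", "00010", "00100", "10010", "00100", "00010")

# count how many animals share each history
freq <- table(ch)

# summary records: history, frequency, closing semicolon
summary.lines <- paste0(names(freq), " ", as.vector(freq), ";")
summary.lines

# going the other way: expand the summary back to one record per animal
expanded <- rep(names(freq), times = as.vector(freq))
expanded
```

The round trip loses the original order of the animals but nothing else, which is why the two formats carry the same information when there are no individual covariates.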

Group factors and individual covariates

As we proceed with examples we will see a number in which the capture data are stratified by groups (e.g., sex or age) or continuous attributes are recorded (such as length or mass) that are later used in modeling. RMark and MARK somewhat overlap in how they handle groups or continuous covariates. We will start out with a simple group covariate (sex = female or male) for illustration and then in a future exercise get into more complex examples (multiple attributes, continuous covariates, or both).

To illustrate, take the previous example data, to which a sex field has been added; the data file is here.

I have written code to read these data into a data frame and format them for RMark. As before, we do the date conversion:

> #let's take a case where we have a group covariate (say, sex)

> input.data<-read.csv("field.data.groups.csv")

> #convert to R date format

> #data look like this 1-jul-2014

> #resulting date object is 2014-07-01 (year then month then day)

> input.data$date<-strptime(input.data$date,format="%d-%b-%Y")

Here we are going to create a small data frame that contains the 'sex' value for each unique animal id. We do this because we are going to need to merge this back with capture history strings when we build the RMark data frame, and we need just 1 'sex' variable per animal (not repeated across capture occasions like it is in the input).

> group.df<-as.data.frame(with(input.data,table(animal_id,sex)))

> group.df<-subset(group.df,Freq>0,select=c(sex))
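Another way to get one sex value per animal, sketched here with invented records, is to take the unique animal/sex pairs. This version keeps the animal id alongside the sex and makes it easy to spot an animal that was recorded with two different sexes.

```r
# Hypothetical capture records with sex repeated on every capture
caps <- data.frame(
  animal_id = c(7, 11, 7, 14, 11),
  sex = c("female", "male", "female", "male", "male"),
  stringsAsFactors = FALSE
)

# one row per distinct animal/sex pairing
lookup <- unique(caps[, c("animal_id", "sex")])

# an animal listed with two sexes would appear twice; flag any such ids
conflicts <- lookup$animal_id[duplicated(lookup$animal_id)]
conflicts

# sort by animal id so the table lines up with an id-ordered capture matrix
lookup <- lookup[order(lookup$animal_id), ]
lookup
```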

We form the capture history strings as before

> #just creating an indicator to show a capture occurred

> input.data$detect<-rep(1,nrow(input.data))

> #reshaping functions to pivot the data

> junk<-melt(input.data,id.var=c("animal_id","date"),measure.var="detect")

> y=cast(junk,animal_id ~date)

> #fill in all the days when the animal wasn't seen with zeros

> y[is.na(y)]=0

but now these are merged in a data frame with the group variable sex

> capt.hist<-data.frame(ch=pasty(y[,2:6]),sex=group.df$sex,stringsAsFactors=FALSE)

The first several records look like this

> head(capt.hist)

ch sex

1 00010 female

2 00010 female

3 00100 female

4 00100 female

5 00001 female

6 10010 female

and the last several like this

> tail(capt.hist)

ch sex

77 11010 male

78 00110 male

79 00101 male

80 00100 male

81 01110 male

82 00010 male
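Placing two columns side by side like this relies on both objects listing the animals in the same order. A sketch of a more defensive approach is to carry the animal id in both tables and match on it; the two small data frames below are invented, and the second is deliberately shuffled.

```r
# Hypothetical capture histories, keyed by animal id
ch.df <- data.frame(animal_id = c(1, 3, 4),
                    ch = c("00010", "00100", "10010"),
                    stringsAsFactors = FALSE)

# hypothetical sex lookup, deliberately in a different order
sex.df <- data.frame(animal_id = c(4, 1, 3),
                     sex = factor(c("male", "female", "male")))

# match on animal_id, keeping every capture history
merged <- merge(ch.df, sex.df, by = "animal_id", all.x = TRUE)
merged

# TRUE here would mean some animal has no sex record
anyNA(merged$sex)
```

The sex column stays a factor after the merge, which is what a grouping variable needs to be.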

As before, we can run these data through some RMark models; here is a simple one for starters.

> #run a MARK model

> ##we'll elaborate on this later and build more models!!

> results<-mark(capt.hist,model="Closed",groups=("sex"))

Finally, we can turn this into MARK input with the built-in function:

> #create files that can be imported into MARK

> closed.proc=process.data(data = capt.hist, model = "Closed",groups=("sex"))

> export.MARK(closed.proc, "example",

+ replace = TRUE, chat = 1, title = "Class example",

+ ind.covariates = "all")

NULL

This produces input strings

00010 1 0;

00100 1 0;

00001 1 0;

for the first few histories and

11010 0 1;

00110 0 1;

00101 0 1;

00100 0 1;

01110 0 1;

00010 0 1;

for the last several, with 1 in the first trailing column indicating female and 1 in the second indicating male. We could also write code to produce MARK summary input (frequencies) by extension of what we did for ungrouped data.
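The group layout can be mimicked by hand in a few lines. This is only a sketch of the record format, not of export.MARK() itself, and the three animals are invented:

```r
# Hypothetical capture histories with a two-level group factor
capt <- data.frame(ch = c("00010", "00100", "11010"),
                   sex = factor(c("female", "female", "male"),
                                levels = c("female", "male")),
                   stringsAsFactors = FALSE)

# one 0/1 indicator column per group level, in the order of the factor levels
ind <- sapply(levels(capt$sex), function(g) as.integer(capt$sex == g))

# history, then the indicator columns, then a closing semicolon
grp.lines <- paste0(capt$ch, " ", apply(ind, 1, paste, collapse = " "), ";")
grp.lines
```

The order of the trailing columns follows the order of the factor levels, so it is worth setting the levels explicitly.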

Other formats

We will encounter (no pun) other formats for CMR data as we proceed that offer variations on the above theme. A main one that we will see later is the "LDLD" format, in which each capture occasion has 2 columns instead of 1: the first (L) column is "1" if a live capture or recapture occurs, and the second (D) column is "1" if a dead recovery occurs in that occasion. This format is needed for tag (band) recovery data and joint live-dead (mark recapture-recovery) data. We will see more about how to build such a format from input data later.
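As a preview, here is a rough sketch of how the two-column layout could be assembled. It assumes you already have separate live and dead 0/1 matrices with the same animals in the rows and the same occasions in the columns; the numbers are invented.

```r
# Hypothetical data: 2 animals, 3 occasions
live <- matrix(c(1, 0, 1,
                 1, 1, 0), nrow = 2, byrow = TRUE)
dead <- matrix(c(0, 0, 0,
                 0, 1, 0), nrow = 2, byrow = TRUE)

# interleave the columns: L1 D1 L2 D2 L3 D3
k <- ncol(live)
ld <- matrix(0, nrow = nrow(live), ncol = 2 * k)
ld[, seq(1, 2 * k, by = 2)] <- live   # odd positions hold live captures
ld[, seq(2, 2 * k, by = 2)] <- dead   # even positions hold dead recoveries

# collapse each row into a single LDLD string
ldld <- apply(ld, 1, paste, collapse = "")
ldld
```

Each string is twice as long as the number of occasions, which is a quick check that the interleaving worked.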

Other formats tend to be more specific to particular situations, and some of these are implemented in MARK but not RMark. These include:

Summarized encounter data (mentioned earlier)
Recapture and recovery in 'm-array' format
Special nest success study format
Multi-state format

We will return to some of these later.

The R code to read in the examples and perform the formatting is here. Note that this code, after running a model in RMark, uses the export.MARK function in RMark to reformat and save the data in MARK input format.
