Week 5: More data tricks-- working with dates, for-loops and if-else
Welcome to week 5. Thus far we have learned how to read and write data in several formats; the importance of databases and their attributes; and how to create and manipulate various types of R objects, including querying, coercing, merging, and appending objects. In this lesson, we will learn how to deal with one more special type of object...(cue sinister music)... dates, and how to use “for loops” and comparison (a.k.a. logical) operators to manipulate data. Script for today's lesson.
WORKING WITH DATES
Dates are the last type of object that we will be using in this course. We save this one for last because they can be a real pain to work with, well… at least for one of us. Before we begin, download the file dates.csv to your working directory. It is a comma separated text file. Open it in your favorite text file reader (e.g., notepad, excel, word) and examine. Create a data frame “dater” by reading dates.csv into R.
## don’t forget to set YOUR working directory
setwd("C:/Users/Jims XXXXX ")
## read example data file into a data frame
dater<-read.csv("dates.csv")
## let’s see what is in the data frame
head(dater)
Just like other objects, we can query the objects (think: columns) in the data frame using the class function. For example,
## what are the type of objects in the data frame
class(dater$date.m.d.y)
Hmmm… let’s think a bit about this. You’ve used the “is.” functions to find out whether something was numeric (is.numeric) or character (is.character). We also learned that we can coerce objects using the “as.” functions. What can we do to change date.m.d.y into a date? If you guessed as.Date you get a gold star. The as.Date function coerces character and factor objects into dates based on their syntax. Let’s look at the syntax of date.m.d.y.
#print out columns
dater$date.m.d.y
You should have noticed that the values were arranged in month, day, year order and that the values were delimited using a slash “/”. You need to specify the correct order and the delimiter in the as.Date function using the “format =” option. The function uses specific codes to refer to the format of the day, month, and year shown in the table below.
We see from the table that %d represents day, %m is the decimal number for month, and %Y is used for year in the 4 digit format, e.g., 2013. OK now we’re ready to specify the format of the date in date.m.d.y. We have month in decimal format, day, and year in 4 digit format, so format="%m/%d/%Y". Now we have:
#create Date1 using values in dater$date.m.d.y
Date1 <- as.Date(dater$date.m.d.y, format="%m/%d/%Y")
# print it out
Date1
# first coerce dater$date.m.d.y to become character
# then coerce to become date
Date1 <- as.Date(as.character(dater$date.m.d.y), format="%m/%d/%Y")
#check on the type of object we created
class(Date1)
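To see how the format codes map onto a date string without relying on the course file, here is a minimal sketch. The vector x holds made-up values and is not part of dates.csv.

```r
# hypothetical month/day/year strings, not taken from dates.csv
x <- c("1/15/2013", "12/3/2012")
# %m = month number, %d = day, %Y = 4 digit year, "/" is the delimiter
d <- as.Date(x, format = "%m/%d/%Y")
d
class(d)
# once coerced, the dates can be written out in any order we like
format(d, format = "%Y-%m-%d")
```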
Ok, what if my data are in day, month, year format separated by a forward slash “/”? It just so happens that the data in dater$date.d.m.y are in that format. We have day, month in a decimal format, and year in a 4 digit format, all separated with a forward slash. To coerce into a date we should:
# first check the class of the object (this is good practice)
class(dater$date.d.m.y)
# create Date2 by coercing dater$date.d.m.y notice the format is day
# month year
Date2 <- as.Date(dater$date.d.m.y, format="%d/%m/%Y")
#print it out
Date2
# check the class again for grins
class(Date2)
Try coercing the remaining objects in dater.
##First let’s try dater$date.y.m.d print it out and examine
dater$date.y.m.d
## looks like year (4 digit) month, day try coercing using the code below
as.Date(dater$date.y.m.d, format="%Y/%m/%d")
What happened when you submitted the above code? You should have obtained a bunch of NA and some nonsense dates. This is because we misspecified the delimiter. We used a forward slash instead of a dash or minus sign “-”. This type of output, wrong dates and NAs, is usually a good sign that you misspecified the date format. Let’s do it correctly this time using a dash for the delimiter.
# create Date3 by coercing dater$date.y.m.d notice the use of dashes
Date3 <- as.Date(dater$date.y.m.d, format="%Y-%m-%d")
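A quick way to catch a misspecified format is to count the NAs after coercing. The sketch below is only an illustration: the vector x holds made-up values and is not part of dates.csv. With these values the wrong delimiter turns every element into NA.

```r
# hypothetical year-month-day strings delimited with dashes
x <- c("2013-01-15", "2012-12-03")
# wrong delimiter in the format: nothing matches, so the values are NA
bad <- as.Date(x, format = "%Y/%m/%d")
sum(is.na(bad))
# correct delimiter: no NAs
good <- as.Date(x, format = "%Y-%m-%d")
sum(is.na(good))
```

If the count of NAs is larger than the number of missing values you expected in the raw data, the format string is the first thing to check.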
### print out next object. Notice the values are delimited using a period
dater$date.d.m.y.2
# create Date4 by coercing dater$date.d.m.y.2 notice the use of periods
Date4 <- as.Date(dater$date.d.m.y.2, format="%d.%m.%Y")
### print out next object dater$date.m.d.y.2.
dater$date.m.d.y.2
Notice that the values in date.m.d.y.2 are delimited using dashes and that the month is now an abbreviation. In the table above, the abbreviated month format is %b, so we change the format in the as.Date function like:
# create Date5 by coercing dater$date.m.d.y.2 notice the use of %b and dashes
Date5 <- as.Date(dater$date.m.d.y.2, format="%d-%b-%Y")
Once dates are in the proper format, you can perform a variety of R functions, such as summarizing data by date, calculating the mean, median, minimum and maximum dates (more next week) or simple functions like addition or subtraction. For example,
# difference in days between Date1 and Date2
Date1 - Date2
Another useful function is difftime, which calculates the difference between 2 dates. Let’s see what this function can do using the help function.
help(difftime)
Here we see that we can specify the difference units in terms of seconds, minutes, hours, days, or weeks. For example, we can calculate the difference between Date1 and Date2 in terms of weeks.
difftime(Date1,Date2, units = 'weeks')
Hmmm… you’re wondering: why can’t I simply take the difference between Date1 and Date2 and divide by 7? Go ahead, try it and find out what happens (don't be sceered): e.g.,
(Date1 - Date2)/7
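To see the two approaches side by side, here is a small sketch that assumes two made-up dates, d1 and d2, exactly 14 days apart (they are not from dates.csv). It also shows one way to turn the result into a plain number for later calculations.

```r
# hypothetical dates exactly 14 days apart
d1 <- as.Date("2000-01-15")
d2 <- as.Date("2000-01-01")
# dividing by 7 keeps the "days" label even though the value is now in weeks
(d1 - d2) / 7
# difftime labels the units correctly
wk <- difftime(d1, d2, units = "weeks")
wk
# drop the units to get a plain number
as.numeric(wk)
```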
Now, why would I want to calculate intervals between dates? Well maybe you are going to fit a hazards model or maybe you need to calculate a common time interval for an open population capture-recapture model. There are several reasons to do this in ecological studies, so it’s important to know that this is possible.
Quite often, we want to extract a part of a date, such as a year or a month. We can do this using the format function as shown below. Be aware, however, that the objects produced by format are characters, so they need to be coerced if you want them in numeric format, e.g.,
# create year by stripping the year value in Date1 and make it numeric
year <-as.numeric(format(Date1, format = "%Y"))
# create month by stripping the month value in Date1 and make it numeric
month <-as.numeric(format(Date1, format = "%m"))
## create month.char by stripping the month value in Date1 and output
## in abbreviated character format
month.char <- format(Date1, format = "%b")
# create day by stripping the day value in Date1 and make it numeric
day <-as.numeric(format(Date1, format = "%d"))
We can also create an object with day of year (1-366), a.k.a. Julian date, using strftime and specifying format = “%j”. As above, this object must be coerced to make it numeric:
## create Julian date from Date1 and coerce into numeric
Julian<-as.numeric(strftime(Date1, format = "%j"))
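As a cross-check on the format approach, the parts of a date can also be read from a POSIXlt object, which stores them as numbers. This is a sketch with a made-up date d rather than the lesson data. Note the offsets: year counts from 1900, while mon and yday count from 0.

```r
# hypothetical date, not from dates.csv
d <- as.Date("2013-03-01")
# format returns character strings, so coerce to numeric
yr  <- as.numeric(format(d, format = "%Y"))
doy <- as.numeric(format(d, format = "%j"))
# as.POSIXlt stores the parts as numbers, but with offsets
lt <- as.POSIXlt(d)
c(lt$year + 1900, lt$mon + 1, lt$mday, lt$yday + 1)
```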
The strftime function has several other uses, such as extracting hours, minutes, and seconds from objects with date time values. To find out more, we use the help function.
Ok-- so we can coerce a date from a character or factor format and we can extract parts of dates. What about creating a date from objects containing day, year, and month? Fortunately, this is easy using the ISOdate function. For example, we can create a date object using the year, month, and day objects.
# create object my.date using year, month, and day created above
my.date<-ISOdate(year,month,day)
#print it out
my.date
The order of the year, month, and day arguments matters in the ISOdate function, so you should always check the function format and options using help. You should have noticed that ISOdate creates an object that contains times too. This is useful if you have date and time data, but if you do not it can be annoying. We can reformat my.date using strptime. The format should match the year month day format of my.date:
## just extract the year month and day from my.date
strptime(my.date, "%Y-%m-%d")
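Another option, offered here as a sketch rather than the lesson's method, is to coerce the ISOdate result with as.Date, which returns a Date object with no time component. The year, month, and day values below are made up for illustration.

```r
# hypothetical year, month, and day values
dt <- ISOdate(2013, 5, 12)
dt            # a date-time; ISOdate fills in 12:00 GMT by default
# coerce to a Date, which carries no time component
d <- as.Date(dt)
d
class(d)
```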
There are several other neat tricks that we learned earlier working with characters and numbers that will also work with dates. For example, we can create a sequence of dates from June 1, 2000 to August 1, 2000 at a specified interval using the seq function:
#output dates every 2 weeks
seq(as.Date('2000-6-1'),to=as.Date('2000-8-1'),by='2 weeks')
#output dates each day
seq(as.Date('2000-6-1'),to=as.Date('2000-8-1'),by='1 days')
Here we have only scratched the surface of what can be done with dates in R. This should be enough for most applications. If not, you will need to look into using one of several available packages, such as the date and lubridate packages.
COMPARISON (LOGICAL) OPERATORS
When manipulating data or simulating ecological processes, we often need to use a comparison operator to compare two or more values and perform an operation if the values meet certain specifications. We do this using an “if” statement. The first step is to determine what sort of comparison we want to make. Let’s say I want to know if the value of an object, abundance, is greater than, say, 0.
# create abundance and assign value of 10
abundance = 10
# determine if abundance is greater than zero
abundance > 0
If the condition is met, the comparison operator will return TRUE. Next we want to create a new variable, present, and assign it a 1 if abundance is greater than 0 using an if statement:
# make comparison
if(abundance > 0) present = 1
## print out
present
What do you think would have happened if abundance was equal to zero? Go ahead and try it out, but first we need to remove the present object using….. come on, think back to week 1….. the remove or rm function.
#remove present
rm(present)
#set abundance to zero
abundance = 0
# make comparison
if(abundance > 0) present = 1
## print out
present
What happened? Well if you did it right you should have gotten the following error message:
Error: object 'present' not found
This is because the condition in the if statement was not met (= FALSE) so the action that came after the if statement was not executed (it wasn’t run). We need a way to modify the if statement to include what else to do if the comparison is FALSE. What else to do, what ELSE to do? How about use an else statement? Let’s try:
# make comparison
if(abundance > 0) present = 1 else present = 0
## print out
present
The above is the basic format for the if-else statement. However, it does not cover the instances where you want to do multiple things if the condition is, or is not, met. To do that, we need to use curly brackets “{ }” to delineate the actions we want to accomplish. For example, let’s say that we want to create another variable, occupied, that equals 'yes' if abundance is greater than 0 and 'no' if abundance is less than or equal to zero.
# make comparison
if(abundance > 0){ present = 1
occupied = 'yes'
} else{ present = 0
occupied = 'no'
}
## combine and print out
c(present, occupied)
Set abundance equal to 10 and re-run the above code. Notice that the program executes everything inside one set of brackets or the other, depending on whether the comparison is TRUE or FALSE. What if we had 2 comparisons to make? For example, let’s say we wanted to create a variable season based on the months of the year. First create a date object new.date as 5/13/2011:
new.date = as.Date("5/13/2011", format = "%m/%d/%Y")
Now let’s assign season = spring if month is May to June. So think about it.... May is the 5th month and June is the 6th month, so if month is between (not including) 4 and 7 the day in new.date should be in the spring. We can use the "&" operator to select the instance where month is greater than April (month 4) and less than July (month 7)
### make comparison and assign season
if(as.numeric(format(new.date, format = "%m"))> 4 & as.numeric(format(new.date, format = "%m")) < 7) { season = 'spring'
## if its not spring it is another season
} else{season = 'other'}
#print it out
season
Now let’s assign a new date to new.date as Feb 11 and assign season = winter if the month is between November and April.
new.date = as.Date("2/11/2013", format = "%m/%d/%Y")
# just modify the above code
if(as.numeric(format(new.date, format = "%m"))< 4 & as.numeric(format(new.date, format = "%m")) > 11) { season = 'winter'
} else{season = 'other'}
#print it out
season
Did that work? Hopefully you said NO! That’s because we used the & and asked if the month was less than 4 (April) AND greater than 11 (November), which can never be true. What we really want to ask is if the month was less than 4 (April) OR greater than 11 (November). The OR is symbolized by the pipe “|” so we have:
if(as.numeric(format(new.date, format = "%m")) < 4 | as.numeric(format(new.date, format = "%m")) > 11) { season = 'winter'
## assign other to non-winter months
} else{season = 'other'}
#print out
season
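The same comparison is easier to read if the month is extracted once and stored. The sketch below does that; the object name mon is our own choice for illustration.

```r
new.date <- as.Date("2/11/2013", format = "%m/%d/%Y")
# extract the month once and reuse it in both comparisons
mon <- as.numeric(format(new.date, format = "%m"))
if (mon < 4 | mon > 11) {
  season <- 'winter'
} else {
  season <- 'other'
}
season
```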
There are several operators for making comparisons in R and these are shown in the table below.
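The most common comparison and logical operators can also be tried directly at the console. In the short sketch below, x is a made-up value used only for illustration, and each line returns TRUE or FALSE.

```r
x <- 5   # hypothetical value used only for illustration
x == 5            # equal to
x != 5            # not equal to
x <  5            # less than
x <= 5            # less than or equal to
x >  3            # greater than
x >= 6            # greater than or equal to
x > 3 & x < 7     # AND: both comparisons must be TRUE
x < 3 | x > 4     # OR: at least one comparison must be TRUE
!(x > 3)          # NOT: reverses TRUE and FALSE
```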
Cool beans! Now let’s try some comparisons using the Date1 vector. First, let’s see which dates in Date1 fall in the second half of the month, that is, the days greater than or equal to 15.
# identify days greater than equal to 15 (TRUE)
as.numeric(format(Date1, format = "%d")) >= 15
Now let’s create an object month.part that is assigned a value of 2 if the day is in the second half of the month and otherwise equals 1 using the if else function.
# identify days greater than equal to 15 (TRUE)
if(as.numeric(format(Date1, format = "%d")) >= 15) month.part = 2 else month.part = 1
What happened after you submitted the statements? You should have gotten the following warning (in R 4.2.0 and later, a condition with more than one value stops with an error instead):
Warning message:
In if (as.numeric(format(Date1, format = "%d")) >= 15) :
the condition has length > 1 and only the first element will be used
This is because we tried to use a scalar statement, if-else, on a vector. In other words, an if-else statement can only evaluate one comparison at a time. Recall that we can use the bracket notation [] to refer to specific elements in a vector. For example, the first date in Date1:
Date1[1]
We could use that notation to make the comparison one at a time. For example
# identify if first day greater than equal to 15 (TRUE)
if(as.numeric(format(Date1[1], format = "%d")) >= 15) month.part = 2 else month.part = 1
We could do that as many times as there are elements in Date1 one at a time. This would be a real pain for large datasets and would make for very large programs. This also is a repetitive task. As you may have guessed, there is a function that we can use to perform repetitive tasks, for loops.
FOR LOOPS
The basic syntax of a for loop in R is
for(i in min:max) {
# do something (max - min + 1) times
}
The for loop is delineated by curly brackets and anything inside of the brackets is executed the number of times specified by the min and max. For example, let’s assign a value of 10 to an object Y and add 2 to it for 10 iterations:
Y = 10
for(i in 1:10) {
# add 2 to Y
Y = Y + 2
# print it out?
Y
}
#print it out last value
Y
Notice that the for loop does not print out each of the 10 steps to the console. To do that, we need to add a print statement like below.
Y = 10
for(i in 1:10) {
# add 2 to Y
Y = Y + 2
# print it out!!
print(Y)
}
Now we are able to see the changes in Y through each step. What about saving the values of Y in an object? That should be easy. We simply declare a vector variable Z and add values of Y to Z at each step. For example,
Z=c()
Y = 10
for(i in 1:10) {
# add 2 to Y
Y = Y + 2
# save values of Y each step
Z = c(Z,Y)
}
# print it out
Z
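Growing Z with c() works well for short loops. When the number of iterations is known in advance, a common alternative is to create the vector at its final length and fill it by position. The sketch below shows that approach; n.steps is a name we made up for the number of iterations.

```r
n.steps <- 10            # hypothetical name for the number of iterations
Z <- numeric(n.steps)    # numeric vector of zeros at the final length
Y <- 10
for (i in 1:n.steps) {
  # add 2 to Y
  Y <- Y + 2
  # store the value in position i
  Z[i] <- Y
}
# print it out
Z
```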
For loops really come in handy when we do ecological simulation a few weeks from now. For example, let’s take an initial population size (N) of 100 and model populations for 10 years assuming a population growth rate lambda = 1.05:
#set initial population size
N = 100
# create a place to put the simulated time series data
# start with year = 0 and initial population size N
time.series=c(0,N)
## growth rate
lambda = 1.05
# Yearly time step loop for 10 years
for(year in 1:10) {
# grow the population
N = N*lambda
# save values
time.series = rbind(time.series,c(year,N))
}
# print it out
time.series
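Because the population is multiplied by the same lambda every year, the size in year t should equal the initial size times lambda raised to the power t. The sketch below checks the loop against that formula; the object names N0, N.closed, and N.loop are our own and are used only for this illustration.

```r
N0 <- 100        # initial population size
lambda <- 1.05   # growth rate
years <- 0:10
# geometric growth: N in year t equals N0 * lambda^t
N.closed <- N0 * lambda^years
# the same series built with a for loop
N <- N0
N.loop <- N
for (year in 1:10) {
  N <- N * lambda
  N.loop <- c(N.loop, N)
}
# compare the two series
all.equal(N.loop, N.closed)
cbind(years, N.closed)
```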
Notice that year increases from 1 to 10 and that year is an R object known as the for loop index. We can actually use the for loop index in calculations or in other ways a scalar value is used. For example, we can use the index to refer to a specific column or row in a matrix or vector. Let’s say that we wanted to compare the population size from one year to the next in our time.series. This would entail comparing values in one row with values in an adjacent row. The script below does just that. It compares values in row i with values in i-1 and assigns a value to trend. Run the script.
## create place to hold trend assessment
trend= c()
for(i in 2:11){
## population size is in column 2
## compare pop size to previous pop size and create trend object
if(time.series[i,2] > time.series[i-1,2]) trend[i] = "increasing" else trend[i] = "decreasing"
}
# print it out
trend
Getting back to our motivation for modeling repetitive tasks, we originally wanted to create an object month.part that is assigned a value of 2 if the day in Date1 is in the second half of the month and otherwise equals 1 using an if-else statement. To do this, we could use a for loop like we did with the time.series data and use the for loop index to refer to the individual values in Date1:
#how many elements in Date1
length(Date1)
# create the object month.part
month.part = c()
for(i in 1:length(Date1)){
# identify if day i is greater than or equal to 15 (TRUE)
if(as.numeric(format(Date1[i], format = "%d")) >= 15) month.part[i] = 2 else month.part[i] = 1
}
# print it out
month.part
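One small refinement, shown here as a sketch with made-up dates rather than the lesson data, is to build the index with seq_along. For an empty vector, 1:length(x) gives the values 1 and 0 and the loop body would still run, whereas seq_along(x) gives nothing and the loop is skipped.

```r
# hypothetical dates, not from dates.csv
dts <- as.Date(c("2013-01-03", "2013-01-15", "2013-02-28"))
# create the object at its final length
month.part <- numeric(length(dts))
# seq_along(dts) gives 1, 2, 3
for (i in seq_along(dts)) {
  if (as.numeric(format(dts[i], format = "%d")) >= 15) month.part[i] <- 2 else month.part[i] <- 1
}
# print it out
month.part
```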
Well that worked! It was a bit clunky but demonstrates one of the many uses of for loops. We will be using for loops for the remainder of the course so you will get plenty of opportunities to use them and learn more tricks-o-the trade. But, here’s one quick one before we leave this subject today. We mentioned that for loops execute until the index reaches the maximum value indicated in the parentheses. There are times that we want to escape from a for loop before it finishes. For example, let’s say we’re simulating a population and it goes extinct, N = 0. We don’t want to continue simulating because nothing is going to happen, so we need to jump out of the for loop. Here’s one way to do that using break.
## initial population size
N = 10
## create place for population time series
popn = c(0,N)
## population growth rate
lambda = 0.85
# for loop with index year
for(year in 1:100){
N = N*lambda
# save population data
popn = rbind(popn,c(year,N))
# if population size is less than 1 break out of loop
if (N < 1) break
}
# print it out
popn
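To confirm that break really stopped the loop early, we can look at how many years were simulated and compare that with the year predicted by solving 10 * 0.85^t < 1 for t. The sketch below repeats the loop so that it runs on its own.

```r
N <- 10
lambda <- 0.85
popn <- c(0, N)
for (year in 1:100) {
  N <- N * lambda
  popn <- rbind(popn, c(year, N))
  # if population size is less than 1 break out of loop
  if (N < 1) break
}
# number of years simulated (rows minus the starting row)
nrow(popn) - 1
# the index keeps the value it had when break was reached
year
# first whole year t with 10 * 0.85^t < 1
ceiling(log(1 / 10) / log(0.85))
```

All three values are 15, far short of the 100 years the loop was allowed to run.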
VECTOR OPERATORS
The preceding use of a for loop to make comparisons with vector data was a bit clunky and with large datasets it would be sloooooow. A more efficient way to make comparisons is by using a vector operator. Here the function is ifelse. The syntax for ifelse is:
ifelse(comparison, value if true, value if false)
Going back to our previous example, we have:
ifelse(as.numeric(format(Date1, format = "%d")) >= 15, 2,1)
That is much more efficient than creating a for loop to make comparisons. R has several matrix and vector operators that are faster and will make your programs shorter and easier to code and run. We will learn more about some of these next week.
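As a check that the two approaches agree, the sketch below runs ifelse and a for loop on the same made-up dates (not the lesson data) and compares the results. The object names part.vec and part.loop are our own.

```r
# hypothetical dates, not from dates.csv
dts <- as.Date(c("2013-01-03", "2013-01-15", "2013-02-28"))
days <- as.numeric(format(dts, format = "%d"))
# one vectorized call handles every element
part.vec <- ifelse(days >= 15, 2, 1)
# the same result built one element at a time
part.loop <- numeric(length(days))
for (i in seq_along(days)) {
  if (days[i] >= 15) part.loop[i] <- 2 else part.loop[i] <- 1
}
# TRUE if both approaches give the same answer
identical(part.vec, part.loop)
```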
Week 5 Assignment
Due 1 week from today by 5pm Pacific. Download “habitat.csv” and “critter catch.csv” and complete the following:
1) Combine the two datasets into a single dataset. (Hint... think merge, also note that “Date” is the only common column in the two data frames)
2) Using a for loop and an if statement, create a new variable, “presence”, in the combined data set that consists of 0 when the species is absent, 1 when the catch is greater than 0 but less than 10, and 2 when the catch is greater than or equal to 10.
3) Create another new variable “presence2” using the ifelse() function and the same rules as #2 above.
4) Write the resulting data frame to a tab delimited file.
5) Write R code that begins with a scalar object “X” that equals zero, adds 1.2 to the object and multiplies the sum by 0.75. The process should be repeated (i.e., add 1.2 to the object and multiply the sum by 0.75) for 10 iterations. Please create a vector object that contains the value of X after each iteration and write the vector to a comma separated file.
Please save all of the code that you used in a single script and submit the script in an email attachment to Jim.