R for fish and wildlife grads - Week 4: Selecting, subsetting, appending data

Week 4: Selecting and subsetting data, appending, merging, sorting and duplicating fun

Learning objectives

using bracket notation
selecting and subsetting operators
getting R to do your data manipulation (merging, sorting, appending, etc.) bidding

Selecting and subsetting what is the difference?

Well they are related. In R you can select data and view it manipulate it, and so on. Subsetting takes selecting a step further and makes a new object. Remember that R is an object oriented language. So you can select parts of an object and you can subset objects to make a new object. We will start out by selecting parts of an object and this will lead into subsetting. The script for this weeks lesson is here.

Selecting

What can we select when it comes to R objects? Lets give it a shot. First we will generate some data to play with. The code below makes 3 vectors, 2 dataframes, and 1 matrix.

# lets make 3 vectors

lengths<- runif(100,30,500)

wghts<- 0.0002*lengths^3

relative_weight<- wghts/300

# lets make 2 data.frames

data1<- data.frame(year=rep(2009:2012,25), len=lengths, wghts=wghts, rel_weight=relative_weight)

data2<- data.frame(year=sample(2009:2012, 50, replace=TRUE), len=runif(50,100,800))

data2$wghts<- 0.0002*data2$len^3.01

data2$rel_weight<- data2$wghts/500

# what are the field names we can select?

names(data1)

# select the field length

data1$len # returns the data in the length field

data1$wghts # returns the date from the weight field

# lets make a matrix of random numbers

x<- matrix(runif(120,0,1),nrow=10, ncol=12,byrow=TRUE)

What about lists? I heard about them last week, aren't they kind of like having multiple objects (vectors, dataframes, matrices) in a single object?

Yep...lets make one that combines the 3 vectors, 2 dataframes, and 1 matrix from above in a list! Exciting...

mylist<- list(lengths=lengths, weights=wghts,

relative_weight=relative_weight,

data1=data1, data2=data2, matrix1=x) # now we have a list!

OK great, now how do I select that data from this infernal list? Columns from objects within a list can be selected using the $ notation similar to what we used above. Lets give it a shot---

mylist$length # the vector from above

cbind(mylist$lengths,lengths)# they are the same!

mylist$weights # the vector from above

cbind(mylist$weights,wghts)# they are the same!

mylist$data1$year # the vector from above

data1$year

cbind(mylist$data1$year,data1$year)# they are the same!

mylist$data2$len

data2$len

cbind(mylist$data2$len, data2$len)# they are the same!

Pretty easy, right?

Bracket notation

What about bracket notation? You may have seen the cool kids using it...last week

I can guaranty that if you stick with R you will use brackets at some point, and probably be glad you did. Brackets are a way to select rows or columns of a dataframe or list. We will start with a dataframe. The first thing to know is how brackets work. Brackets are really nothing to be afraid of as we learned last week, they index rows first then columns (i.e., dataframe[row,column]). Lets compare this with our $ notation from above.

Dataframes

What if we wanted to select the length field just like we did above? The following bit of code does the same as above for length but with brackets!

# SELECT THE 3RD FIELD (LENGTH) FROM THE DATA

data2[ , 3] # THIS RETURNS THE SAME AS data2$len

names(data2) # note that length is the 3rd field

data2$len

data2[ , 4]

data2$rel_weight

Where bracket notation gets convenient is when you want to select a couple of fields

data2[ ,c(2,3)] # SELECT LENGTH AND WEIGHT FIELDS

data2[ ,c(2,3,4)] # SELECT LENGTH, WEIGHT, AND RELATIVE WEIGHT

data2[ ,c(2:4)] # this is the same as the above line of code

data2[ ,c(1,3:4)] # pretty flexible

data2[c(1,4,7,19),c(1,3,4)]

In the above bit of code there is nothing in the location for rows. This tells R to just get all the rows. In this case there are 30 rows in the dataframe.

data2[c(1:30),3] # returns the same as dat[ ,3]

data2[ , 3] # same as above

Just as we did with the column indexes we can select rows that we want. For example, lets select the first 10 rows of the length and field columns.

data2[c(1:10),c(3,4)]

we can also select arbitrary rows

data2[c(1,4,7,19),c(1,3,4)]

Fun huh? You are likely curious how bracket notation might make your life easier? What if you needed to get the mean length for the rows 1, 3, 5, 7, 9, and 11.

mean(data2[c(1,3,5,7,9,11),3])

Lists

You can also use bracket notation with lists, however we won't cover these in too much detail in this lesson in this lesson. In my experience lists are rarely used in typical day to day R programming but you may like to use them as you get more proficient or if you are extracting elements from a list.

# Lists

mylist$data1$log_transformed_len<- log(mylist$data1$len)

mylist$l_weight<- log(mylist$data1$wght) # for weight

mylist$data1$l_weight

names(mylist)

names(mylist$data1)

Now we can get the various bits of data within the list using either $ or bracket notation. Lets get the means we just calculated for log length and log weight.

mylist$data1$l_weight<- log(mylist$data1$wght) # for weight

names(mylist$data1)

# adding some more to the list

mylist$mean_l_weight<- mean(mylist$data1$l_weight)

mylist$mean_l_weight

We used bracket notation before with dataframes, can we use them with list? You betcha... Lets get the first dataframe in the list, this is done using a special 'double' brackets:

mylist[[1]]# get the first object

mylist[[5]]# get the 5th object

Now to index rows and columns within the list element all we need to do is use brackes just as we did for dataframe but with the special double brackets to tell us what object we want.

mylist[[5]][1,]

mylist[[5]][,2]

Think of vectors as a single column from a dataframe. For example we could extract a column of data from our dataframe dat and assign it to object a.

victor<- data1$len

victor # the vector

We can select things from the vector. Lets select the 3rd value

victor[3]

How about selecting all the values greater than 300.

victor>300 # selects all the values greater than 300

we can even use the brackets in a function

mean(victor[victor>300])# not the same as

mean(victor)

new_victor<- victor[victor>300]

mean(new_victor)

Operators

Some useful tricks application of operators

dat<- mylist$data2

# LETS ADD A FEW VALUES

dat[rep(c(TRUE,FALSE),25),2]<- -99

dat[sort(rep(c(TRUE,FALSE),25)),3]<- -99

Assigning NA to missing values.—In the lecture for week 3, we used a -99 to fill missing values in the lesson (we didn't get to this in lecture but it is in the narrative). This next line of code selects all the cells in the dataframe that equal -99 and assigns them as NA.

dat[dat==-99]<-NA

Well that was super easy to take care of those pesky missing values. This also brings up another way bracket notation can be useful—dealing with NA values.

Dealing with NA values.—Most data will have missing values, denoted as NA in R. Probably one of the most infuriating things to do in R—besides deal with dates—is dealing with NA values. NA values are identified using the is.na() command. This command returns a TRUE/FALSE if a cell is NA. Using this command we can select the NA values. So if a cell is NA then the function returns TRUE.

is.na(dat$len)

we can use this in bracket notation as well to select the rows where length data are missing

dat[is.na(data$len),]

We can also use a conditional expression to select the rows where there length data is not NA.

dat_no_na<- dat[!(is.na(dat$len),] # super intuitive right?

dat # not the same as dat_no_na

What if we want to just drop the missing values in a vector or dataframe?

dat$len

na.omit(dat$len)

We can also use this to drop the missing value in a dataframe using vector notation

dat[na.omit(dat$length),]

A "!" is use for 'not' (see table above). A similarly useful function is complete.cases(). This function returns a True/False if all the cells in a row (i.e., record) contains values.

complete.cases(dat) # returns a vector of true/falses

dat_complete<- dat[complete.cases(dat),] # select rows with no NA values

dat_complete # look at new data without NAs

dat # the data with NAs

Subsetting: subset()

Subsetting is essentially the same as selecting, but you think about doing more sophisticated subsets—but you could also do this using bracket notation.

dat2010<- subset(data1, year==2010)

dat_after_2010<- subset(data1, year>2010)

dat_all_but_2010<- subset(data1, year != 2010)

dat_even_years<- subset(data1, year %in% c( 2008, 2010, 2012))

Using and in subsetting-R treats and using the "&" sign

dat_2010_length_100<- subset(data1, year==2010 & length > 100)

dat_2010_length_100 # FAIL! X#amp;*^, WHY?

names(data1)

dat_2010_length_100<- subset(data1, year==2010 & len > 100)

Using and/or in subsetting-R treats and using the "&" sign and "|"

new_dat<- subset(data1, len > 400 | len < 50 )

new_dat<- subset(data1, year == 2010 & (len > 100 | len < 60))

Merging: merge()

Merging is taking two dataframes and combining them based on a common index or indexes (kinda like the relational databases we talked about in week 2!). Lets make a dataset to merge with data1.

yeardata<- data.frame(year=c(2009:2012),type=c("wet","dry","extra dry", "bone dry"))

yeardata

OK done, we have a dataset that has some yearly data we could use in an analysis. What is common between yeardata and data2?

names(data1)

names(yeardata)

Well they both have year. It makes intuitive sense to merge year data with year data!

mergedata<- merge(data1, yeardata, by="year")

mergedata # cool there is our merged datasets

But what happens if all the values of one dataset are not in the other? Lets drop the 2009 year from the yeardata we set up before. We can do this using subset!

mergedata<- subset(mergedata, year>2009)

mergedata # no 2009 data!

Ok, lets see what happens when we merge that datasets again.

mergedata<- merge(data1, yeardata, by="year")

mergedata # uncool, there is no data for 2009

unique(mergedata$year)# lets make sure

We can tell R to include those records that do not have year data by specifying all.x=TRUE in the merge function.

mergedata<- merge(data1, yeardata, by="year",all.x=TRUE)

table(mergedata$year)

head(mergedata,30)# whew that is what we wanted!

We can merge based on 2 fields as well

# make some data to merge

data<- data.frame(year=sample(2009:2013, 8, replace=TRUE),

month=sample(7:9, 8, replace=TRUE),

weather=sample(letters[1:3],8,replace=TRUE))

# make some data to merge with that data

lots_o_data<- data.frame(year=sample(2009:2013, 80, replace=TRUE),

month=sample(7:9, 80, replace=TRUE),

length=runif(80, 100,600))

data

lots_o_data

lots_o_lots_o_data<- merge(data, lots_o_data, by=c("year","month"),all=TRUE)

lots_o_lots_o_data

There is a dataset that was merged based on year and month!

Appending data

Adding columns to the side: cbind()

This can be done using the cbind command to making a new object (remember we used it above to look at and compare 2 vectors?). We can do more than 2 if we want

all_vectors<- cbind(mylist$length, mylist$weight, mylist$relative_weight)

Just be careful you can cbind anything as long as the number of rows are equal!

Adding rows to the bottom: rbind()

Similarly we can append rows to a dataframe or matrix

Dataframes

names(data1)

names(data2)

fulldata<-rbind(data1,data2)

dim(data1)

dim(data2)

dim(fulldata)

Matrices

class(x) # make sure we can use a matrix

bigmatrix<- rbind(matrix1, matrix1)

dim(matrix1)

dim(bigmatrix)

Adding rows to the bottom when column names don't match: rbind.fill()

There are times when you would like to append 2 datasets that do not have every column in common. Solution? rbind.fill()

# install.packages("plyr")

require(plyr)

data1_no_len<- data1[,-2]

names(data1_no_len)

names(data2) # names don't exactly match

fulldat<-rbind(data1_no_len,data2)# fail.... errror

fulldat<-rbind.fill(data1_no_len,data2)

Sorting: using the sort() and order() functions

You can think about sorting as taking values and putting them in ascending or descending order. For example

sort(data2$year, decreasing=TRUE)

sort(data2$year, decreasing=FALSE)

year_sorted<- sort(data2$year, decreasing=FALSE)

cbind(year_sorted, data2$year) # not the same!

Ordering is similar but returns the element ids in order. Lets look at what happens when we use order on data2$year

order(data2$year, decreasing=TRUE)

order(data2$year, decreasing=FALSE)

order(data2$year,data2$len, decreasing=TRUE)

Well those most certainly are not years. They are the order of the element numbers. We can use it to make a new object that will index the rows and make a dataset that is ordered or sorted by some value.

orderid<-order(data2$year,data2$len, decreasing=TRUE)

We can take that vector and use it as a row index

data2_sorted<- data2[orderid,]

Or we can feed the function directly in.

data2_sorted<- data2[order(data2$year),]

data2_sorted

We can also order a dataframe on more then 1 variable...

data2_sorted<- data2[order(data2$year, data2$len),]

data2_sorted

Now you should be able to get R to do most anything you want it to with data!

Exercises

This exercise requires 2 input files.

1) wk4data.Rdata- review week 1 materials if you need a refresher on how to load() a *.Rdata file

2) elevationData.csv

OK to the the fun stuff. Work through the following exercises and answer the numbered items in comments within the R script.

Heads up! Name these files as follows: Lastname_wk4.R and email them to Jim by next week at 5 pm.

Exercise 1) wk4list contains 2 dataframes: siteData and catchData. Merge these two datasets based on site into a new object named megadata. Using megadata report the following:

1.1) How many rows and columns does megadata have?

1.2) What is sum of the utme column?

1.3) What is the sum of the first 20 elements of the length column in megadata (hint-brackets aren't just for holding a shelf)

Exercise 2) Doh! we forget to get elevations for each site when we were out in the field. Luckily we did record the location using our fancy GPS (which could have given us the elevation to, but hindsight it 20/20). Fortunately, we were at Bombs Away and convinced our friend Jenny GISAllStar to extract the elevation for each coordinate! She sent the data back in the file elevationData.csv. We did not send her the site names, so we cannot merge the data back with our dataset using site (double DOH!). But it can be merged by same utme and utmn coordinates. For this exercise, read in elevationData.csv as an object named elevation_data and merge it with megadata created in exercise 1 into a new object megadata_ele. megadata_ele needs to be in ascending order (smallest at top and largest at the bottom) by elevation and length.

2.1) Now how many rows and columns does megadata_ele have?

2.2) How many unique elevation values are there in megadata_ele?

2.3) What is sum of for the first 25 elements of the length column in megadata_ele?

Exercise 3) We are only interested in organisms with moderate lengths from low elevations. Make a new object megadata_mod_small that includes only the rows of megadata_ele that have an elevation less than or equal to (<=) 1500 and lengths greater than 200 and less than or equal to 600.

3.1) How many rows and columns does megadata_mod_small have?

3.2) What is the sum for the weight column (hint if you say it is NA then you are more then likely not correct)

Exercise 4) The last object in wk4list is a matrix of values varying from 0 to 1 (probs) (hint- names(wk4list)). The matrix is 30 rows by 10 columns. Wanting to be awesome an at randomization in the field we used this table to randomly sample as subset of from 10 possible quadrats in each site. Any cell in the matrix with a value greater than 0.5 was sampled and assigned a 1 and values less than 0.5 were not sampled. Using the matrix probs found in wk4list do the following:

4.1) create a new object myprobs that contains the the matrix probs from wk4list and report the number of rows and columns for this new object

4.2) assign values greater than or equal to 0.5 as 1s and less than 0.5 as 0s and report the sum for the 5th row of the matrix

Bonus material

Finding rows that contain the minimum or maximum

Finding and subsetting min and maximum records (can be helpful if you are doing grid searches...)

which.min(data2$len) # returns the row index of the minimum value of len

which.max(data2$len) # returns the row index of the maximum value of len

data2[which.min(data2$len),] # use the index to get that row of data

data2[which.max(data2$len),] # use the index to get that row of data

Duplicating records

We can duplicate rows of a dataframe by duplicated/repeating the row index. For example,

data_dup <- data2[c(1,1,1,2,2,2),]

many times we get data that has counts that we may want to duplicate each row so there is a row for a single count

weighted_data<- data.frame(spp=c("Critter 1","Critter 2","Critter 3"),counts=c(10,12,5))

weighted_data[ rep(c(1:3),weighted_data$counts),]

Page updated

Google Sites

Report abuse