Week 4: Selecting and subsetting data, appending, merging, sorting and duplicating fun
Week 4: Selecting and subsetting data, appending, merging, sorting and duplicating fun
Learning objectives
using bracket notation
selecting and subsetting operators
getting R to do your data manipulation (merging, sorting, appending, etc.) bidding
Selecting and subsetting what is the difference?
Well they are related. In R you can select data and view it manipulate it, and so on. Subsetting takes selecting a step further and makes a new object. Remember that R is an object oriented language. So you can select parts of an object and you can subset objects to make a new object. We will start out by selecting parts of an object and this will lead into subsetting. The script for this weeks lesson is here.
Selecting
What can we select when it comes to R objects? Lets give it a shot. First we will generate some data to play with. The code below makes 3 vectors, 2 dataframes, and 1 matrix.
# lets make 3 vectors
lengths<- runif(100,30,500)
wghts<- 0.0002*lengths^3
relative_weight<- wghts/300
# lets make 2 data.frames
data1<- data.frame(year=rep(2009:2012,25), len=lengths, wghts=wghts, rel_weight=relative_weight)
data2<- data.frame(year=sample(2009:2012, 50, replace=TRUE), len=runif(50,100,800))
data2$wghts<- 0.0002*data2$len^3.01
data2$rel_weight<- data2$wghts/500
# what are the field names we can select?
names(data1)
# select the field length
data1$len # returns the data in the length field
data1$wghts # returns the date from the weight field
# lets make a matrix of random numbers
x<- matrix(runif(120,0,1),nrow=10, ncol=12,byrow=TRUE)
What about lists? I heard about them last week, aren't they kind of like having multiple objects (vectors, dataframes, matrices) in a single object?
Yep...lets make one that combines the 3 vectors, 2 dataframes, and 1 matrix from above in a list! Exciting...
mylist<- list(lengths=lengths, weights=wghts,
relative_weight=relative_weight,
data1=data1, data2=data2, matrix1=x) # now we have a list!
OK great, now how do I select that data from this infernal list? Columns from objects within a list can be selected using the $ notation similar to what we used above. Lets give it a shot---
mylist$length # the vector from above
cbind(mylist$lengths,lengths)# they are the same!
mylist$weights # the vector from above
cbind(mylist$weights,wghts)# they are the same!
mylist$data1$year # the vector from above
data1$year
cbind(mylist$data1$year,data1$year)# they are the same!
mylist$data2$len
data2$len
cbind(mylist$data2$len, data2$len)# they are the same!
Pretty easy, right?
Bracket notation
What about bracket notation? You may have seen the cool kids using it...last week
I can guaranty that if you stick with R you will use brackets at some point, and probably be glad you did. Brackets are a way to select rows or columns of a dataframe or list. We will start with a dataframe. The first thing to know is how brackets work. Brackets are really nothing to be afraid of as we learned last week, they index rows first then columns (i.e., dataframe[row,column]). Lets compare this with our $ notation from above.
Dataframes
What if we wanted to select the length field just like we did above? The following bit of code does the same as above for length but with brackets!
# SELECT THE 3RD FIELD (LENGTH) FROM THE DATA
data2[ , 3] # THIS RETURNS THE SAME AS data2$len
names(data2) # note that length is the 3rd field
data2$len
data2[ , 4]
data2$rel_weight
Where bracket notation gets convenient is when you want to select a couple of fields
data2[ ,c(2,3)] # SELECT LENGTH AND WEIGHT FIELDS
data2[ ,c(2,3,4)] # SELECT LENGTH, WEIGHT, AND RELATIVE WEIGHT
data2[ ,c(2:4)] # this is the same as the above line of code
data2[ ,c(1,3:4)] # pretty flexible
data2[c(1,4,7,19),c(1,3,4)]
In the above bit of code there is nothing in the location for rows. This tells R to just get all the rows. In this case there are 30 rows in the dataframe.
data2[c(1:30),3] # returns the same as dat[ ,3]
data2[ , 3] # same as above
Just as we did with the column indexes we can select rows that we want. For example, lets select the first 10 rows of the length and field columns.
data2[c(1:10),c(3,4)]
we can also select arbitrary rows
data2[c(1,4,7,19),c(1,3,4)]
Fun huh? You are likely curious how bracket notation might make your life easier? What if you needed to get the mean length for the rows 1, 3, 5, 7, 9, and 11.
mean(data2[c(1,3,5,7,9,11),3])
Lists
You can also use bracket notation with lists, however we won't cover these in too much detail in this lesson in this lesson. In my experience lists are rarely used in typical day to day R programming but you may like to use them as you get more proficient or if you are extracting elements from a list.
# Lists
mylist$data1$log_transformed_len<- log(mylist$data1$len)
mylist$l_weight<- log(mylist$data1$wght) # for weight
mylist$data1$l_weight
names(mylist)
names(mylist$data1)
Now we can get the various bits of data within the list using either $ or bracket notation. Lets get the means we just calculated for log length and log weight.
mylist$data1$l_weight<- log(mylist$data1$wght) # for weight
names(mylist$data1)
# adding some more to the list
mylist$mean_l_weight<- mean(mylist$data1$l_weight)
mylist$mean_l_weight
We used bracket notation before with dataframes, can we use them with list? You betcha... Lets get the first dataframe in the list, this is done using a special 'double' brackets:
mylist[[1]]# get the first object
mylist[[5]]# get the 5th object
Now to index rows and columns within the list element all we need to do is use brackes just as we did for dataframe but with the special double brackets to tell us what object we want.
mylist[[5]][1,]
mylist[[5]][,2]
Think of vectors as a single column from a dataframe. For example we could extract a column of data from our dataframe dat and assign it to object a.
victor<- data1$len
victor # the vector
We can select things from the vector. Lets select the 3rd value
victor[3]
How about selecting all the values greater than 300.
victor>300 # selects all the values greater than 300
we can even use the brackets in a function
mean(victor[victor>300])# not the same as
mean(victor)
new_victor<- victor[victor>300]
mean(new_victor)
Operators
Some useful tricks application of operators
dat<- mylist$data2
# LETS ADD A FEW VALUES
dat[rep(c(TRUE,FALSE),25),2]<- -99
dat[sort(rep(c(TRUE,FALSE),25)),3]<- -99
Assigning NA to missing values.—In the lecture for week 3, we used a -99 to fill missing values in the lesson (we didn't get to this in lecture but it is in the narrative). This next line of code selects all the cells in the dataframe that equal -99 and assigns them as NA.
dat[dat==-99]<-NA
Well that was super easy to take care of those pesky missing values. This also brings up another way bracket notation can be useful—dealing with NA values.
Dealing with NA values.—Most data will have missing values, denoted as NA in R. Probably one of the most infuriating things to do in R—besides deal with dates—is dealing with NA values. NA values are identified using the is.na() command. This command returns a TRUE/FALSE if a cell is NA. Using this command we can select the NA values. So if a cell is NA then the function returns TRUE.
is.na(dat$len)
we can use this in bracket notation as well to select the rows where length data are missing
dat[is.na(data$len),]
We can also use a conditional expression to select the rows where there length data is not NA.
dat_no_na<- dat[!(is.na(dat$len),] # super intuitive right?
dat # not the same as dat_no_na
What if we want to just drop the missing values in a vector or dataframe?
dat$len
na.omit(dat$len)
We can also use this to drop the missing value in a dataframe using vector notation
dat[na.omit(dat$length),]
A "!" is use for 'not' (see table above). A similarly useful function is complete.cases(). This function returns a True/False if all the cells in a row (i.e., record) contains values.
complete.cases(dat) # returns a vector of true/falses
dat_complete<- dat[complete.cases(dat),] # select rows with no NA values
dat_complete # look at new data without NAs
dat # the data with NAs
Subsetting: subset()
Subsetting is essentially the same as selecting, but you think about doing more sophisticated subsets—but you could also do this using bracket notation.
dat2010<- subset(data1, year==2010)
dat_after_2010<- subset(data1, year>2010)
dat_all_but_2010<- subset(data1, year != 2010)
dat_even_years<- subset(data1, year %in% c( 2008, 2010, 2012))
Using and in subsetting-R treats and using the "&" sign
dat_2010_length_100<- subset(data1, year==2010 & length > 100)
dat_2010_length_100 # FAIL! X#amp;*^, WHY?
names(data1)
dat_2010_length_100<- subset(data1, year==2010 & len > 100)
Using and/or in subsetting-R treats and using the "&" sign and "|"
new_dat<- subset(data1, len > 400 | len < 50 )
new_dat<- subset(data1, year == 2010 & (len > 100 | len < 60))
Merging: merge()
Merging is taking two dataframes and combining them based on a common index or indexes (kinda like the relational databases we talked about in week 2!). Lets make a dataset to merge with data1.
yeardata<- data.frame(year=c(2009:2012),type=c("wet","dry","extra dry", "bone dry"))
yeardata
OK done, we have a dataset that has some yearly data we could use in an analysis. What is common between yeardata and data2?
names(data1)
names(yeardata)
Well they both have year. It makes intuitive sense to merge year data with year data!
mergedata<- merge(data1, yeardata, by="year")
mergedata # cool there is our merged datasets
But what happens if all the values of one dataset are not in the other? Lets drop the 2009 year from the yeardata we set up before. We can do this using subset!
mergedata<- subset(mergedata, year>2009)
mergedata # no 2009 data!
Ok, lets see what happens when we merge that datasets again.
mergedata<- merge(data1, yeardata, by="year")
mergedata # uncool, there is no data for 2009
unique(mergedata$year)# lets make sure
We can tell R to include those records that do not have year data by specifying all.x=TRUE in the merge function.
mergedata<- merge(data1, yeardata, by="year",all.x=TRUE)
table(mergedata$year)
head(mergedata,30)# whew that is what we wanted!
We can merge based on 2 fields as well
# make some data to merge
data<- data.frame(year=sample(2009:2013, 8, replace=TRUE),
month=sample(7:9, 8, replace=TRUE),
weather=sample(letters[1:3],8,replace=TRUE))
# make some data to merge with that data
lots_o_data<- data.frame(year=sample(2009:2013, 80, replace=TRUE),
month=sample(7:9, 80, replace=TRUE),
length=runif(80, 100,600))
data
lots_o_data
lots_o_lots_o_data<- merge(data, lots_o_data, by=c("year","month"),all=TRUE)
lots_o_lots_o_data
There is a dataset that was merged based on year and month!
Appending data
Adding columns to the side: cbind()
This can be done using the cbind command to making a new object (remember we used it above to look at and compare 2 vectors?). We can do more than 2 if we want
all_vectors<- cbind(mylist$length, mylist$weight, mylist$relative_weight)
Just be careful you can cbind anything as long as the number of rows are equal!
Adding rows to the bottom: rbind()
Similarly we can append rows to a dataframe or matrix
Dataframes
names(data1)
names(data2)
fulldata<-rbind(data1,data2)
dim(data1)
dim(data2)
dim(fulldata)
Matrices
class(x) # make sure we can use a matrix
bigmatrix<- rbind(matrix1, matrix1)
dim(matrix1)
dim(bigmatrix)
Adding rows to the bottom when column names don't match: rbind.fill()
There are times when you would like to append 2 datasets that do not have every column in common. Solution? rbind.fill()
# install.packages("plyr")
require(plyr)
data1_no_len<- data1[,-2]
names(data1_no_len)
names(data2) # names don't exactly match
fulldat<-rbind(data1_no_len,data2)# fail.... errror
fulldat<-rbind.fill(data1_no_len,data2)
Sorting: using the sort() and order() functions
You can think about sorting as taking values and putting them in ascending or descending order. For example
sort(data2$year, decreasing=TRUE)
sort(data2$year, decreasing=FALSE)
year_sorted<- sort(data2$year, decreasing=FALSE)
cbind(year_sorted, data2$year) # not the same!
Ordering is similar but returns the element ids in order. Lets look at what happens when we use order on data2$year
order(data2$year, decreasing=TRUE)
order(data2$year, decreasing=FALSE)
order(data2$year,data2$len, decreasing=TRUE)
Well those most certainly are not years. They are the order of the element numbers. We can use it to make a new object that will index the rows and make a dataset that is ordered or sorted by some value.
orderid<-order(data2$year,data2$len, decreasing=TRUE)
We can take that vector and use it as a row index
data2_sorted<- data2[orderid,]
Or we can feed the function directly in.
data2_sorted<- data2[order(data2$year),]
data2_sorted
We can also order a dataframe on more then 1 variable...
data2_sorted<- data2[order(data2$year, data2$len),]
data2_sorted
Now you should be able to get R to do most anything you want it to with data!
Exercises
This exercise requires 2 input files.
1) wk4data.Rdata- review week 1 materials if you need a refresher on how to load() a *.Rdata file
OK to the the fun stuff. Work through the following exercises and answer the numbered items in comments within the R script.
Heads up! Name these files as follows: Lastname_wk4.R and email them to Jim by next week at 5 pm.
Exercise 1) wk4list contains 2 dataframes: siteData and catchData. Merge these two datasets based on site into a new object named megadata. Using megadata report the following:
1.1) How many rows and columns does megadata have?
1.2) What is sum of the utme column?
1.3) What is the sum of the first 20 elements of the length column in megadata (hint-brackets aren't just for holding a shelf)
Exercise 2) Doh! we forget to get elevations for each site when we were out in the field. Luckily we did record the location using our fancy GPS (which could have given us the elevation to, but hindsight it 20/20). Fortunately, we were at Bombs Away and convinced our friend Jenny GISAllStar to extract the elevation for each coordinate! She sent the data back in the file elevationData.csv. We did not send her the site names, so we cannot merge the data back with our dataset using site (double DOH!). But it can be merged by same utme and utmn coordinates. For this exercise, read in elevationData.csv as an object named elevation_data and merge it with megadata created in exercise 1 into a new object megadata_ele. megadata_ele needs to be in ascending order (smallest at top and largest at the bottom) by elevation and length.
2.1) Now how many rows and columns does megadata_ele have?
2.2) How many unique elevation values are there in megadata_ele?
2.3) What is sum of for the first 25 elements of the length column in megadata_ele?
Exercise 3) We are only interested in organisms with moderate lengths from low elevations. Make a new object megadata_mod_small that includes only the rows of megadata_ele that have an elevation less than or equal to (<=) 1500 and lengths greater than 200 and less than or equal to 600.
3.1) How many rows and columns does megadata_mod_small have?
3.2) What is the sum for the weight column (hint if you say it is NA then you are more then likely not correct)
Exercise 4) The last object in wk4list is a matrix of values varying from 0 to 1 (probs) (hint- names(wk4list)). The matrix is 30 rows by 10 columns. Wanting to be awesome an at randomization in the field we used this table to randomly sample as subset of from 10 possible quadrats in each site. Any cell in the matrix with a value greater than 0.5 was sampled and assigned a 1 and values less than 0.5 were not sampled. Using the matrix probs found in wk4list do the following:
4.1) create a new object myprobs that contains the the matrix probs from wk4list and report the number of rows and columns for this new object
4.2) assign values greater than or equal to 0.5 as 1s and less than 0.5 as 0s and report the sum for the 5th row of the matrix
Bonus material
Finding rows that contain the minimum or maximum
Finding and subsetting min and maximum records (can be helpful if you are doing grid searches...)
which.min(data2$len) # returns the row index of the minimum value of len
which.max(data2$len) # returns the row index of the maximum value of len
data2[which.min(data2$len),] # use the index to get that row of data
data2[which.max(data2$len),] # use the index to get that row of data
Duplicating records
We can duplicate rows of a dataframe by duplicated/repeating the row index. For example,
data_dup <- data2[c(1,1,1,2,2,2),]
many times we get data that has counts that we may want to duplicate each row so there is a row for a single count
weighted_data<- data.frame(spp=c("Critter 1","Critter 2","Critter 3"),counts=c(10,12,5))
weighted_data[ rep(c(1:3),weighted_data$counts),]