Week 6: How do I love R? Let me count the ways-- counts, means, sums, ranks, and ... functions
Week 6: How do I love R? Let me count the ways-- counts, means, sums, ranks, and ... functions
Good day-- eh, and welcome to week 6. At this point in the course, you ought to be familiar with R objects, how to manipulate them (i.e., create, coerce, append, merge, subset, query), and how to use comparisons and for loops. These are the basics you will need to get your data ready for summarization and analyses. Here's the script for today's lesson. First a quick if, ifelse, for review.
SUMMARIZING DATA
Before today's lesson, download the following 3 comma separated text files:
Fish.catch.csv: contains fish collection data from 8 streams as part of the Coweeta LTER
Stream.habitat.csv: habitat measurements made concurrently with fish collections
Water.measures.csv: water quality measures made concurrently with fish sampling
Read the files into R and examine. Be sure to set your working directory to the proper location.
# set working directory
setwd("C:/Users/Jim/Desktop")
# read in commas separated files
catch<-read.csv("Fish.catch.csv")
habitat<-read.csv("stream.habitat.csv")
water<-read.csv("Water.measures.csv")
#whats in these data frames
head(catch)
head(habitat)
head(water)
Data summaries are always a good way to check your data prior to any analysis. R has several built-in functions for performing data summaries. Below we have several functions that you can use to summarize values in numeric vectors. Note that the result of all of these functions can be output to R objects. Before we try a few, let’s find out what objects are in the water data frame using the names function.
## what are the variables in the water data frame
names(water)
mean(water$Water_Temp)
median(water$Water_Temp)
sd(water$Water_Temp)
var(water$Water_Temp)
min(water$Water_Temp)
max(water$Water_Temp)
sum(water$Water_Temp)
## how many elements are in this object?
length(water$Water_Temp)
## complete summary of data in water
summary(water)
## what are median 50% and the lower and upper 95%
# quantiles of water temperatures
quantile(water$Water_Temp,probs=c(0.025,0.5,0.975))
# note that you can specify other quantiles by putting different
# he we specify the 20, 40, 60, and 80th percentiles
quantile(water$Water_Temp,probs=c(0.2,0.4,0.6,0.8))
We could combine each one of the objects into a single vector. For example:
## combine mean, standard deviation(sd), min and max of water temperature
water.temp<- c(mean(water$Water_Temp),sd(water$Water_Temp),min(water$Water_Temp),max(water$Water_Temp))
## combine mean, sd, min and max of conductivity
conduct <-c(mean(water$Conductivity),sd(water$Conductivity),min(water$Conductivity),max(water$Conductivity))
## combine mean, sd, min and max of dissolved oxygen
DO<-c(mean(water$DO),sd(water$DO),min(water$DO),max(water$DO))
## Create a table with the combined summary data using rbind function
sum.data <- rbind(water.temp,rbind(conduct,DO))
#print it out
sum.data
# what type of object is it?
class(sum.data)
We also could combine the sum.data with a vector containing the names of the variables. Give it a try on your own. Don’t forget that a matrix cannot contain character and numeric values only a data frame can.
Now that (above) was a bit wasteful. We had to perform the same operation 3 times, once for each object (column) in the data frame. We could be more efficient if we create a for loop to perform the operations. Remember that we can subset data in a data frame using numbers for the rows and columns. In the water data frame, data in columns 3 to 6 contain the measurements that we are interested in summarizing (Water_Temp, Conductivity, DO, Turbidity). If we wanted to calculate the mean of Turbidity, it is in column 6 so:
## calculate mean Turbidity in column 6
mean(water[,6])
If you recall, we can use the for loop index (the number) to select rows or columns using their numbers. Below is a for loop for creating a matrix containing the mean, standard deviation, minimum, and maximum for Water_Temp, Conductivity, DO, Turbidity (columns 3 to 6). Examine it carefully, copy it to your R script and run.
## save the names of the columns in a file, water.name
water.name = names(water)
## create place to put the summary data
sum.table = c()
## for loop for calculating and combining summary data by column
for(i in 3:6){
## calculate summary statistics for each column from 3 to 6
## use rbind to stack the values into a single matrix
sum.table<- rbind(sum.table,c(mean(water[,i]),sd(water[,i]),min(water[,i]),max(water[,i])))
}
To create a nice table, we combine summary data matrix sum.table with the names of the water columns we used. These names are elements 3 to 6 in water.names. Note that we are combining characters and numbers so we must convert the names to a data frame before the cbind. This is why as.data.frame(water.name[3:6]) is inside the cbind function.
## combine variable names with corresponding summary statistics
sum.table <-cbind(as.data.frame(water.name[3:6]),sum.table)
## the columns need names, so let’s do it
colnames(sum.table) = c("Characteristic","Mean","SD","Min","Max")
Creating a data summary table like we did above is probably not a onetime thing that you will do. In fact, we suspect that most of you will be creating data summaries throughout your careers. Wouldn’t it be wonderful if you didn’t have to go through all of those steps each time you wanted to create a similar table? You could if you created a function. So far you have been using R built in functions, but you don’t have to. You could create your own. It’s actually very easy in R. Stay tuned to for next week....
Lets try the same thing but with the habitat dataframe.
## let’s try to summarize the habitat file
## first, see what in it
head(habitat)
## now replace the water dataframe name with habitat
## save the names of the columns in a file, water.name
habitat.name = names(habitat)
## create place to put the summary data
sum.table = c()
## for loop for calculating and combining summary data by column
for(i in 3:6){
## calculate summary statistics for each column from 3 to 6
## use rbind to stack the values into a single matrix
sum.table<- rbind(sum.table,c(mean(habitat[,i]),sd(habitat[,i]),min(habitat[,i]),max(habitat[,i])))
}
## combine variable names with corresponding summary statistics
sum.table <-cbind(as.data.frame(habitat.name[3:6]),sum.table)
## the columns need names, so let’s do it
colnames(sum.table) = c("Characteristic","Mean","SD","Min","Max")
sum.table
Something went wrong above. What was it? Before freaking out... (learning a programming language is as much about failure as about success), let’s see if we can calculate a simple mean on the 4th column:
mean(habitat[,4])
The function should return an NA. This is because the habitat data contain missing values! That was the problem. While we’re here…
We could have avoided the problems with missing data above if we had examined the data first. For example we could have used the summary function to see if the data contained NA.
summary(habitat)
There are several ways to handle missing data when conducting statistical analyses or fitting models. One way is to eliminate missing data and as we learned earlier we can use na.omit to select a data frame with no missing data. Another method is to replace NA with some default value, usually the mean value. Thinking back to last week we can use the ifelse function to make a comparison in a vector and replace values if the comparison is true or false.
Remember that we can use the “is.na” function to identify where data were missing
#identify the missing length data
is.na(habitat$Length)
#calculate the mean length for use below
mean(habitat$Length,na.rm = T)
Let’s put these things together and replace missing data, is.na = TRUE, with the mean calculated above, otherwise (else) if not missing keep the non-missing value
#replace the missing data with the mean length
ifelse(is.na(habitat$Length),11.479,habitat$Length)
We could get really fancy and put the mean function inside of the ifelse function.
## replace missing length data but put function inside ifelse function
ifelse(is.na(habitat$Length),mean(habitat$Length,na.rm = T),habitat$Length)
PERFORMING SUMMARIES BY GROUPS
Awesome! Thus far we learned how to summarize data in vectors and elements on a data frame. More often than not, we wish to summarize data by some groups, say by year or study site or geographic region. To do this, we need to learn how to use array (matrix, list) operators. That is, operators that use data in more than one column. One useful function is tapply. Use the help() function to examine the tapply syntax. Below we use tapply to calculate mean sample unit length (Length) by habitat type (Habitat) in habitat data frame.
## first examine contents of habitat
head(habitat)
Now we use tapply listing (in order): the variable to summarize (or analyze) , the grouping variable, the function to use (hear the mean) and function options (here remove NA):
## use tapply to summarize by groups
by.hab.mean <- tapply(habitat$Length,habitat$Habitat, mean, na.rm = T)
#print it out
by.hab.mean
If we want to use the object created by tapply, it’s important to know what type of object was created.
# what kind of object is it
class(by.hab.mean)
str(by.hab.mean)
by.hab.mean[1]
Here, we see that by.hab.mean is an array that consists of 3 numeric variables and 3 headings (character variables). To put this in a more manageable and familiar form, we can coerce the list into a data frame.
## coerce into a data frame
as.data.frame(by.hab.mean)
One thing about this coercing an array is that the headings were turned into rownames. As far as we’re concerned, rownames are useless because you can’t use or manipulate them like the elements of a data frame, so let’s extract the rownames and put them in a column in the data frame:
## coerce into a data frame with row names now in a column
data.frame(rownames(by.hab.mean),by.hab.mean)
We end our discussion of tapply here because there is a better and more useful function for summarizing data: aggregate, which requires users to specify the columns to summarize and the groups using a list if column numbers inside of brackets. For example, columns 4 and 5 in the habitat data frame contain Length and Width measurements of sample units and the 3rd column of habitat contains habitat type so to calculate summary statistics for Length and Width “habitat[c(4,5)]” by habitat type “by = habitat[c(3)]”, the syntax of aggregate is:
### a new function for summarizing data by groups
## values inside [c()] correspond to column numbers
aggregate(habitat[c(4,5)], by = habitat[c(3)],mean,na.rm = T)
Copy the code and paste in R script, run and examine the output. We can aggregate by more than 1 group using the aggregate function. For example, to get means by stream and habitat type, "by = habitat[c(1,3)]”. Copy the code below to your R script and examine. Interpret the code, run and examine the output.
## we can do by more than 1 group for example stream and habitat
aggregate(habitat[c(4,5)], by = habitat[c(1,3)],mean,na.rm = T)
## lets create an object with all of the means
hab.means <-aggregate(habitat[c(4:9)], by = habitat[c(1,3)],mean,na.rm = T)
## what kind of object did we create
class(hab.means)
As you can see, aggregate is very useful and produces an object (a data frame) that is easy to work with.
SUMMARIES WITH CATEGORICAL DATA
The previous sections dealt with summarizing numeric data. Here, we cover how to summarize categorical data (characters, strings, and factors). For example, you may want to count up the numbers of times that a species was caught during your research. To demonstrate, we will use the catch data read in earlier. First, examine the (partial) contents of the data frame.
head(catch)
You should have noticed that the data consisted of electrofishing catch data from stream habitats. Each fish that was collected was entered in the database along with the fish’s total length (TL.mm) and the place date and habitat type where it was captured. The built-in function table is useful for summarizing categorical data and will count up the number of instances that a category occurs in the data frame or vector. For example, the code below counts up the number of times each species (catch$Species) occurs in the data frame. Examine the code, run it, and examine the output.
## new function for summarizing data
spc.collect <-table(catch$Species)
spc.collect
The numbers below each species name is the number of times the species is listed in catch$Species. Because each fish that was captured is listed by species, this number corresponds to the total number of individual fish of that species that were captured. This number could be used in an analysis so let’s see what kind of object is spc.collect.
## that type of object is this
class(spc.collect)
str(spc.collect)
Hmmm, it’s a table…another type of R object (similar to an array) that has limited usefulness for further analyses. Below, we will coerce the table into a data frame and rename the columns. Examine the code and interpret each step, then run the code and examine the output.
spc.collect<- as.data.frame(spc.collect)
spc.collect
colnames(spc.collect) = c("Species","Total.catch")
spc.collect
The spc.collect data frame we created could be written to an external file and included in a thesis or report or could be used for further analysis in R.
As you may have guessed, we can include more than one column (or object) in a table function. For example, the below code counts up the number of instances of each combination of species (catch$Species), stream (catch$Stream), and habitat type (catch$Habitat). Examine the syntax. Notice that table function is nested within the as.data.frame function. This means that table will create an object that will be coerced into a data frame and then written to the data frame tot.catch. Run the code and examine the output.
## create an object that
tot.catch<-as.data.frame(table(catch$Stream,catch$Habitat,catch$Species))
## print it out
tot.catch
You should have noticed that table includes values (zeros) where there were no instances of various combinations of the species, stream and habitat type. For example:
1 Howards Branch Pool Blacknose Dace 0
2 L Ball Crk Pool Blacknose Dace 0
3 L Drymans Fork Pool Blacknose Dace 0
4 M Ball Crk Pool Blacknose Dace 0
5 Sheppard Crk Pool Blacknose Dace 0
These are instances where the data contains no observations of Pool habitat type and Blacknose Dace at these 5 streams. Be absolutely sure that you understand that table will produce summaries that include combinations of the variables that were never observed in the data frame. This may be desirable, for instance-- when you want to know samples or locations where a species was not collected for occupancy modeling, or may not be desirable. Just be sure that you know what you want.
Can you think of a way we could eliminate the zero observations?
Ok, lets create a similar table (as above) using the aggregate function. Here we are counting up the number of elements of TL.mm (the number of fish) by stream, habitat, and species: Examine the code and try to interpret. Run the code and examine the output.
## lets create a similar object using aggregate and counting the number of
## elements using length
tot.catch.nozero<- aggregate(catch[c(5)], by = catch [c(1,3,4)],length)
#print it out
tot.catch.nozero
You should have noticed that the code produced a data frame with the total number of fish collected at each site and habitat type, by species but with no zeros. (Unfortunately, the heading to the column is TL.mm – this would have to be changed to Total.collected or similar to avoid confusion).
Put your thinking caps on. Let’s say that we needed to calculate the total number of species captured in a sample. How could we do this?
There are probably a large number of ways to do this. What we need is a data frame with the species that were collected at each site, date, habitat type. The catch data frame contains multiple observations corresponding to individual fish. We need to reduce the data frame down to just the species that were collected. We can do this using the unique() function. Here we want the unique combinations of stream, date, habitat type, and species, columns 1 to 4 in catch. Examine the code below and submit to R.
# create species using unique combinations of stream, date, habitat type, and species
species<-unique(catch[,1:4])
# print it out
species
# create a new object (column in species) and assign a 1 to species that were
# captured aka detected
species$detect = 1
## calculate richness but summing the total number of species detected in a sample
richness<- aggregate(species[c(5)], by = species[c(1:3)],sum)
We now have the total number of species detected (aka species richness) in the detect column. Let’s see what habitat variables were correlated with species richness. We first need to combine the richness data with the habitat data using the merge() function. Note that fish may not have been collected (catch = zero fish) in all samples, so we use the "all = TRUE" option to include all stream, dates, and habitats. Reminder: the all option in merge() indicates that all of the observations in both data frames should be included in the new data frame.
# create new combined data frame
for.correl <- merge(richness,habitat, all = TRUE)
# print it out
for.correl
After examining the contents of for.correl, you should see that richness estimates (the detect column) are NA for several samples. There are the samples where no fish were collected hence species richness = 0 (zero). To change these values to zero, we can use ifelse() like we did above, except we change the NA to zero. Examine and run the code below. Did it take care of the “missing” richness data?
# replace NA with zero
for.correl$detect<- ifelse(is.na(for.correl$detect),0,for.correl$detect)
# print it out
for.correl
To calculate a correlation between habitat variables and species richness, we can use the cor() function. The cor function uses 2 or more vectors (i.e., matrices) and calculated the correlations among these objects. For example, the code below specifies a correlation between detect (richness) and sample unit length, both in the for.correl data frame. Examine the code and run.
cor(for.correl$detect,for.correl$Length)
What happened after you submitted the code? You got an NA, didn’t you? Why do you think this happened? Well, we got NA when we tried to calculate a mean with missing data another reason for NA could be that we were trying to calculate a correlation using a factor or character variable. So let’s first use summary to see what’s in for.correl.
# summarize for.correl
summary(for.correl)
Submitting the above should give you:
Stream Date Habitat detect Length Width Depth
Howards Branch :9 22-Jul-09:16 Pool :22 Min. :0.000 Min. : 3.00 Min. :1.200 Min. :0.06000
L Ball Crk :9 23-Jul-09:29 Riffle:18 1st Qu.:0.000 1st Qu.: 8.55 1st Qu.:1.600 1st Qu.:0.09525
L Drymans Fork :9 5-Aug-09 : 6 Run :11 Median :2.000 Median :11.30 Median :2.265 Median :0.13000
Sheppard Crk :6 Mean :2.078 Mean :11.48 Mean :2.903 Mean :0.16594
WTSD 7 :6 3rd Qu.:4.000 3rd Qu.:14.00 3rd Qu.:4.000 3rd Qu.:0.18500
High White Crk (WTSD 14):5 Max. :8.000 Max. :23.40 Max. :6.800 Max. :0.60000
(Other) :7 NA's :3 NA's :3 NA's :3
Velocity Cobble Gravel
Min. :0.0300 Min. : 0.00 Min. : 0.00
1st Qu.:0.1325 1st Qu.:15.00 1st Qu.: 5.00
Median :0.2140 Median :20.00 Median :20.00
Mean :0.2381 Mean :25.71 Mean :18.47
3rd Qu.:0.3000 3rd Qu.:30.00 3rd Qu.:30.00
Max. :0.6600 Max. :90.00 Max. :50.00
NA's :1 NA's :2 NA's :2
Looks like detect and Length are numeric (summary calculated means and percentiles), however Length has 3 NA’s. This is probably the reason we got the NA when we tried to calculate a correlation. You may be tempted to use na.rm = T DO NOT. First check the syntax of cor() using the help() function.
help(cor)
The help file indicates that we handle NA with the “use =” option. There are several options available for use, here we will specify use = "pairwise.complete.obs". Below is the code for calculating Pearson correlations between richness (detect) and sample unit length and among data in columns 4 to 10: detect, Length, Width, Depth, Velocity, Cobble, and Gravel.
cor(for.correl$detect,for.correl$Length, use = "pairwise.complete.obs")
cor(for.correl[,4:10], use = "pairwise.complete.obs")
This Weeks Assignment
Due 1 week from today by 5pm Pacific. Use the 3 data frames (above): catch, habitat, and water to complete the following:
1) Fill in any missing data in habitat using the average column values.
2) Combine revised habitat (from #1 above) and water data frames into a single data frame.
3) Create a summary table of the mean and standard deviation for all of the variables in the combined ( from #2) data frame (Water_Temp, Conductivity, DO, Turbidity, Length, Width, Depth, Velocity, Cobble, and Gravel), grouped by habitat type (Habitat).
4) Using the catch data frame, calculate the total number of fish collected in each sample (a sample is a unique combination of Stream, Date, and Habitat) and put the results in a new data frame.
5) Combine the data frames created in #2 and #4 and calculate the correlation between total catch and Water_Temp, Length, Width, Depth, and Velocity. (Hint: don’t forget that there were zero fish collected in a several samples).
Please save all of the code that you used in a single script and submit the script in an email attachment to Jim.