Read in the data and summarize
Set the working directory
# setwd('.../your directory here/...')
dat <- read.csv("dat.csv")
What the data looks like.
head(dat)
## X ID year length weight stage
## 1 1 1 2009 701.0 30.30 Adult
## 2 2 2 2012 593.9 52.78 Adult
## 3 3 3 2012 729.5 101.98 Adult
## 4 4 4 2010 563.2 65.93 Adult
## 5 5 5 2011 231.1 10.94 Juv
## 6 6 6 2012 723.0 77.13 Adult\
str(dat)
## 'data.frame': 1000 obs. of 6 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ year : int 2009 2012 2012 2010 2011 2012 2011 2010 2012 2009 ...
## $ length: num 701 594 729 563 231 ...
## $ weight: num 30.3 52.8 102 65.9 10.9 ...
## $ stage : Factor w/ 2 levels "Adult","Juv": 1 1 1 1 2 1 1 1 1 1 ...
summary(dat)
## X ID year length
## Min. : 1 Min. : 1 Min. :2009 Min. :200
## 1st Qu.: 251 1st Qu.: 251 1st Qu.:2009 1st Qu.:338
## Median : 500 Median : 500 Median :2010 Median :489
## Mean : 500 Mean : 500 Mean :2010 Mean :493
## 3rd Qu.: 750 3rd Qu.: 750 3rd Qu.:2011 3rd Qu.:644
## Max. :1000 Max. :1000 Max. :2012 Max. :800
## weight stage
## Min. : 0.4 Adult:857
## 1st Qu.: 10.2 Juv :143
## Median : 31.3
## Mean : 76.6
## 3rd Qu.: 90.8
## Max. :700.5
names(dat)
## [1] "X" "ID" "year" "length" "weight" "stage"
Just how much data are we dealing with?
dim(dat)
## [1] 1000 6
nrow(dat)
## [1] 1000
ncol(dat)
## [1] 6
Ok, all looks good. Let’s start plotting.
Scatter plots
The following code will make the figure below.
plot(weight ~ length, dat)
Figure 1.—Scatter plot figures produced by R code in this section.
Lets start with panel a. This is your basic plot of length versus weight. The default R code to do this is
plot(weight ~ length, dat)
Note the formula notation which is y variable ~ x variable you could as well, but I prefer formula notation.
plot(dat$length,dat$weight)
Super, we have a plot of some data! Very satisfying! But are you satisfied?… Or rather is your advisor satisfied?
No, your advisor wants those circles to be filled…ugh!
We can do this by changing the pch and we can reproduce panel b from Figure 1. See the material at the end of the lab for the pch values that correspond to the plotting values.
plot(weight ~ length, dat, pch = 19) # filled circles!
Oh, I forgot to mention you have co-advisors - and your other advisor thinks they should be filled in squares (her thought process is more angular!). Easy peasy, just use pch again but this time with a value of 15.
plot(weight ~ length, dat, pch = 15) # filled squares
Wildcard! What about filled in triangles? Again easy peasy, just use pch again but this time with a value of 17.
plot(weight ~ length, dat, pch = 17) # filled triangles
Adding custom x- and y-axis labels
That was a fun exercise, but we come full circle and have decided that filled in circles will work best (full circle… get it?). But, being good researchers we want to change the x and y-axis labels to incorporate the units. We need to do this because R by default uses the name of the column. So we could rename the column in the data.frame, but there has to be a better way. All we need to add a custom axis label by adding a xlab and ylab argument. The following code demonstrates this.
plot(weight ~ length, dat, pch = 19, xlab = "Length (mm)", ylab = "Weight (g)")
We can also use the xlim and ylim argument to change the graphic area plotted. The code below produces a figure that limits the x- and y-axis to be between 0 and 400.
plot(weight ~ length, dat, pch = 19, xlab = "Length (mm)", ylab = "Weight (g)", xlim = c(0, 400), ylim = c(0, 400))
Wow! The data looks like it has some structure in it that we could illustrate by grouping some of the data and plotting it using different colors for hears in the data.
First we are going to start with a 'blank canvas to plot on by specifying type='n' which will give us the plot below.
plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", xlim = c(200, 800), ylim = c(0, 800), type = "n")
Now we can add things to the plot. This is accomplished using the points() function.
Lets add some data for 2009. The points() function has a subset argument that we can specify a column in the data and some sort of subsetting argument (Remember back to Week 4? all that applies here)
plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "weight (g)", xlim = c(200,
800), ylim = c(0, 800), type = "n")
points(weight ~ length, dat, subset = year == 2009, pch = 19)
points(weight ~ length, dat, subset = year == 2010, pch = 1)
points(weight ~ length, dat, subset = year == 2011, pch = 10)
points(weight ~ length, dat, subset = year == 2012, pch = 15)
But the above figure is a bit difficult to differentiate, and I certainly wouldn't use in a presentation. How about we jazz it up with some color by adding a col argument?
plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "weight (g)", xlim = c(200,
800), ylim = c(0, 800), type = "n")
points(weight ~ length, dat, subset = year == 2009, pch = 19, col = "red") # add data for 2009
points(weight ~ length, dat, subset = year == 2010, pch = 1, col = "black") # add data for 2010
points(weight ~ length, dat, subset = year == 2011, pch = 10, col = "blue") # add data for 2011
points(weight ~ length, dat, subset = year == 2012, pch = 15, col = "green") # add data for 2012
Wow, that really pops! Now how do we know what means what? We need a legend, not a mythical story but a way to make the colors and symbols mean something.
The bare minimum that we need to specify to the legend() function is where it should be located, what is the legend, what are the symbols used, and what color are those symbols. Be careful here, order is important! Lets recreate the plot above and add a legend.
plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "weight (g)", xlim = c(200,
800), ylim = c(0, 800), type = "n")
points(weight ~ length, dat, subset = year == 2009, pch = 19, col = "red") # add data for 2009
points(weight ~ length, dat, subset = year == 2010, pch = 1, col = "black") # add data for 2010
points(weight ~ length, dat, subset = year == 2011, pch = 10, col = "blue") # add data for 2011
points(weight ~ length, dat, subset = year == 2012, pch = 15, col = "green") # add data for 2012
## Adding a default legend
legend("top", legend = c("2009", "2010", "2011", "2012"), pch = c(19, 1, 10,
15), col = c("red", "black", "blue", "green"))
The use of 'top' as the first argument tells R where to plot the legend. The rest gives the text legend, symbol and color. The location argument can be: “topleft, top, topright, left, right, bottomleft, bottom, bottomright
The code below makes the same previous plot but with the legend in the top left hand corner.
plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", xlim = c(200,
800), ylim = c(0, 800), type = "n")
points(weight ~ length, dat, subset = year == 2009, pch = 19, col = "red") # add data for 2009
points(weight ~ length, dat, subset = year == 2010, pch = 1, col = "black") # add data for 2010
points(weight ~ length, dat, subset = year == 2011, pch = 10, col = "blue") # add data for 2011
points(weight ~ length, dat, subset = year == 2012, pch = 15, col = "green") # add data for 2012
## Adding a default legend
legend("topleft", legend = c("2009", "2010", "2011", "2012"), pch = c(19, 1,
10, 15), col = c("red", "black", "blue", "green"))
Some personal preferences.
I prefer my y-axis labels to be parallel to the x-axis which can be specified by the argument las=1. I also prefer my legends to not have a box which is taken care of with the argument bty='n'
plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", xlim = c(200,
800), ylim = c(0, 800), type = "n", las = 1)
points(weight ~ length, dat, subset = year == 2009, pch = 19, col = "red") # add data for 2009
points(weight ~ length, dat, subset = year == 2010, pch = 1, col = "black") # add data for 2010
points(weight ~ length, dat, subset = year == 2011, pch = 10, col = "blue") # add data for 2011
points(weight ~ length, dat, subset = year == 2012, pch = 15, col = "green") # add data for 2012
## Adding a default legend
legend("topleft", legend = c("2009", "2010", "2011", "2012"), pch = c(19, 1,
10, 15), col = c("red", "black", "blue", "green"), bty = "n")
Line plots
Most of what we have learned for scatterplots applies to line plots! Lets plot the mean weight by year for the data as a line plot. First we need to aggregate the dataset in the code below and make a new dataframe lindat.
lindat <- aggregate(weight ~ year, dat, mean)
lindat$length <- aggregate(length ~ year, dat, mean)$length
lindat$n <- aggregate(weight ~ year, dat, length)$weight
lindat$var <- aggregate(weight ~ year, dat, var)$weight
lindat$lci <- lindat$weight - 1.96 * sqrt(lindat$var)/sqrt(lindat$n)
lindat$uci <- lindat$weight + 1.96 * sqrt(lindat$var)/sqrt(lindat$n)
The code we will be working with will step through lindat created by the code above to plot panels A-D below.
Line plots are done by specifying the argument type='l'. By default it is type='p', which returns a scatter plot as we just worked through. Lets plot mean weight by year using the code below
plot(weight ~ year, lindat, type = "l", las = 1, xlab = "Year", ylab = "Mean weight (g)")
Well that plot is a bit misleading, we don't actually have data for all points on the line but it illustrates trends really well. How about a mix of points for data and a line to illustrate trend. This can be done using the argument type='b' where the b is short for both as in both points and lines.
plot(weight ~ year, lindat, type = "b", las = 1, xlab = "Year", ylab = "Mean weight (g)")
OK now I am just nit picking this thing to death, but that is what we do right? What is year 2010.5? Especially if we don't have any data? We can deal with this by turning off the x-axis in the plot using the xaxt='n' argument and specifying our own custom one using the axis() function.
# fix the x-axis (xaxt='n')
plot(weight ~ year, lindat, type = "b", las = 1, xlab = "Year", ylab = "Mean weight (g)", xaxt = "n")
axis(side = 1, at = c(2009, 2010, 2011, 2012), labels = c("2009", "2010", "2011", "2012"))
In the axis() function the arguments specify the following:
side: what side to add the axis to (1=bottom, 2=left, 3=top, 4=left)
at: where to put tick marks
labels: what to label those ticks marks
Lines illustrating confidence intervals (or any sort of line you may want) can be added using the segments() function. This function requires two points, a start (x0, y0) and an end (x1, y1). To add lines representing a confidence interval all you need to do is specify the x location and then the lower and upper CI.
Figure 3.—The code below was used to construct this figure demonstrating how confidence intervals can be illustrated on plots using the segments() and arrows() functions
plot(weight ~ year, lindat, type = "b", las = 1, xlab = "Year", ylab = "Mean weight (g)", xaxt = "n", ylim = c(0, 200), pch = 19)
axis(side = 1, at = c(2009, 2010, 2011, 2012), labels = c("2009", "2010", "2011", "2012"))
segments(x0 = lindat$year, y0 = lindat$lci, x1 = lindat$year, y1 = lindat$uci)
We can get more traditional 'whiskers' using the arrows() function and specifying that the arrow angle using the arguement angle=90. The length of the end arrow is how wide it will be on the graph (measured in inches).
# whiskers
plot(weight ~ year, lindat, type = "b", las = 1, xlab = "Year", ylab = "Mean weight (g)", xaxt = "n", ylim = c(0, 200), pch = 19)
axis(side = 1, at = c(2009, 2010, 2011, 2012), labels = c("2009", "2010", "2011", "2012"))
arrows(x0 = lindat$year, y0 = lindat$lci, x1 = lindat$year, y1 = lindat$uci, angle = 90, length = 0.1, code = 3)
In the arrows() function the arguments specify the following:
angle: angle of the arrow…90 are 90 degree angles
length: how long should those arrows be?
code: what kind of arrow (1=arrow head at beginning, 2 = arrowhead at end, 3 = arrowhead at beginning and end)
Boxplots are created by the boxplot() function. R takes a continuous variable and then factors the conditioning variable if it is not already. Let look at length by stage in our data.
boxplot(length ~ stage, dat, xlab = "Stage", ylab = "Length (mm)")
As before, we had some strong variation among years, so it might be good to look at that! Lets plot boxplots for each stage and year combination. This is done by subsetting and adding the plots. But we have to be careful to plot them at the right location on the x-axis. Basically we have to shift them left or right from the plotting location. This can be done using the at argument. We also have to resize the boxes so that width is not so large that the boxes overlap.
boxplot(length ~ stage, dat, xlab = "Stage", ylab = "Length (mm)", subset = year ==
2009, at = c(1:2) - 0.25, boxwex = 0.1)
boxplot(length ~ stage, dat, xlab = "Stage", ylab = "Length (mm)", subset = year ==
2010, add = TRUE, at = c(1:2) - 0.125, boxwex = 0.1)
boxplot(length ~ stage, dat, xlab = "Stage", ylab = "Length (mm)", subset = year ==
2011, add = TRUE, at = c(1:2) + 0.125, boxwex = 0.1)
boxplot(length ~ stage, dat, xlab = "Stage", ylab = "Length (mm)", subset = year ==
2012, add = TRUE, at = c(1:2) + 0.25, boxwex = 0.1)
In the boxplot() function the arguments specify the following:
at: where the groups should be plotted
”boxwex: width of the box
add=TRUE: should the boxplot be added to the existing plot?
Well that was fun but that x-axis is atrocious. We can fix this with xaxt as we did before to suppress plotting of the x-axis.
boxplot(length ~ stage, dat, xlab = "Stage", ylab = "Length (mm)", subset = year ==
2009, at = c(1:2) - 0.15, boxwex = 0.08, xaxt = "n", col = "red")
boxplot(length ~ stage, dat, xlab = "Stage", ylab = "Length (mm)", subset = year ==
2010, add = TRUE, at = c(1:2) - 0.05, boxwex = 0.08, xaxt = "n", col = "yellow")
boxplot(length ~ stage, dat, xlab = "Stage", ylab = "Length (mm)", subset = year ==
2011, add = TRUE, at = c(1:2) + 0.05, boxwex = 0.08, xaxt = "n", col = "green")
boxplot(length ~ stage, dat, xlab = "Stage", ylab = "Length (mm)", subset = year ==
2012, add = TRUE, at = c(1:2) + 0.15, boxwex = 0.08, xaxt = "n", col = "grey")
axis(side = 1, at = c(1, 2), labels = c("Adults", "Juveniles"))
legend("topright", legend = c(2009:2012), fill = c("red", "yellow", "green",
"grey"))
fill: what color to fill boxes in the legend
What's a snakes favorite graphic? A hissssssstogram….
Histograms of a vector of data are easily made using the histogram() function. All you have to do is feed it a vector of data and it will bin it for you and you are off and running!
But the default is not as pretty as it could be.
hist(dat$length)
Lets clean it up some by adding a box and add some labels.
hist(dat$length, xlab = "Length (mm)", ylab = "Number of fish", las = 1)
box()
Uggg that title is atrocious, lets just get rid of it all together. We can do this by the argument main="" or main=NULL.
hist(dat$length, xlab = "Length (mm)", ylab = "Number of fish", las = 1, main = "")
box()
This code produces the same thing as the above code.
hist(dat$length, xlab = "Length (mm)", ylab = "Number of fish", las = 1, main = NULL)
box()
Lets add some color to up the ink ratio. But now we are going plot the density and not the frequency (the default is to plot the frequency).
hist(dat$length, xlab = "Length (mm)", ylab = "Density", las = 1, main = "", freq = FALSE, col = "grey")
box()
Whoa Nelly… Now I am having a hard time reading my y-axis label. We really need to customize it. We can do this using the mtext() function to plot text in the margins. Lets give it a shot. One thing we need to do is use the par() function to change the margins a bit to give us some room to work.
par(oma = c(1, 2, 1, 1)) # add an outer margin to plot in
hist(dat$length, xlab = "Length (mm)", ylab = "", las = 1, main = "", freq = FALSE, col = "grey")
box()
mtext(side = 2, "Density", outer = TRUE, line = 0)
That is much better.
Code breakdown:
oma: specifies the margin widths (in lines ~12pt), you need to enter 4 values going from bottom, left, top, right… this mathces how side is used in the axis() function
side=2: tells R to plot the text in the left margin
outer=TRUE: tells R to plot in the outer margin
line=0: plots the text on line 0
No, this is not hanging out at Bombs Away making plots… Lets go ahead and make a default barplot using our summary data lindat with the mean weights.
barplot(lindat$weight)
The plot above is almost as satisfying as plotting in Excel. Lets clean it up some and make it presentable by adding a box around it and some labels.
barplot(lindat$weight, names = lindat$year, las = 1, yaxt = "n", ylim = c(0, 250), ylab = "Mean weight (g)", width = 0.9, space = 0.1)
axis(side = 2, at = seq(0, 250, 50), labels = TRUE, las = 1)
box()
Code breakdown:
width: how wide to make the bars
space: how much space between bars
Lets tweak it some more to get it just right…our advisor wants us to add some whiskers that illustrate the uncertainty. With the function, we can take this plot to boomtown.
barplot(lindat$weight, names = lindat$year, las = 1, yaxt = "n", ylim = c(0, 250), ylab = "Mean weight (g)", width = 0.9, space = 0.1)
axis(side = 2, at = seq(0, 250, 50), labels = TRUE, las = 1)
box()
# dynamite plot
arrows(x0 = c(0.475, 1.475, 2.475, 3.475), y0 = lindat$weight, x1 = c(0.475,
1.475, 2.475, 3.475), y1 = lindat$uci, angle = 90)
Remember when we did some plots and changed the type from points, to lines to both? Well we can use 'h' to plot a barplot!
plot(weight ~ length, lindat, type = "h", lwd = 5, lend = 2, las = 1, ylab = "Mean weight (g)",
xlab = "Mean Length (mm)")
Argument breakdown:
type='h': plot histogram-type bars
lwd=5: the width of those bars should be 5
lend=2: the cap of those bars should be square (1=rounded, 2=square)
A lot of this stuff is pretty repetitive huh? Making use of functions can be very handy when there are multiple repetitive tasks.
addpoints <- function(plotyear, plotsymbol, plotcol) {
## make a funct ion to plot points for a given year
points(weight ~ length, dat, subset = year == plotyear, pch = plotsymbol, col = plotcol)
}
plot(weight ~ length, dat, type = "n", xlab = "Length (mm)", ylab = "Weight (g)")
addpoints(plotyear = 2009, plotsymbol = 1, plotcol = "red")
addpoints(plotyear = 2010, plotsymbol = 2, plotcol = "blue")
addpoints(plotyear = 2011, plotsymbol = 2, plotcol = "black")
addpoints(plotyear = 2012, plotsymbol = 2, plotcol = "black")
Is there a way to convert the above to a for loop? Challenge accepted?
We can use the par() function to also set up the plotting lay out using the mfrow argument. It is pretty simple we can specify a plot that is 2 rows by 2 columns as par(mfrow=c(2,2)). Lets give it a shot.
par(mfrow = c(2, 2), mar = c(4, 4, 0, 0.2))
plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", las = 1)
plot(length ~ weight, dat, ylab = "Length (mm)", xlab = "Weight (g)", las = 1)
boxplot(weight ~ year, dat, xlab = "Year", ylab = "Weight (g)", las = 1)
boxplot(weight ~ stage, dat, xlab = "Year", ylab = "Weight (g)", las = 1)
What about a 4 rows with 1 column? Yep we can do that! Just need to specify par(mfrow=c(4,1))
par(mfrow = c(4, 1), mar = c(4, 4, 0, 0.2))
plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", las = 1)
plot(length ~ weight, dat, ylab = "Length (mm)", xlab = "Weight (g)", las = 1)
boxplot(weight ~ year, dat, xlab = "Year", ylab = "Weight (g)", las = 1)
boxplot(weight ~ stage, dat, xlab = "Stage", ylab = "Weight (g)", las = 1)
This Weeks Assignment
Due 1 week from today by 5pm Pacific. Use hw_data.csv to do the following. Be sure to have informative axis and plot labels (NOTE: length was measured in millimeters and weight was in kilograms).
1. Make a plot of length on the x-axis and weight on the y-axis.
2. The data was collected over 4 years. Make a 2 row and 2 column multipanel plot for each year of data.
3. Make a 3 row by 1 column multipanel plot of the following:
1) histogram of lengths,
2) boxplot of weights by year (x-axis), and
3) a scatter plot with weight on the x-axis and length on the y-axis with the following plotting criteria:
2009 data should plotted as green plus signs
2010 data should plotted as red open diamonds
2011 data should be plotted as black filled circles
2012 data should be plotted as blue open circles
add an appropriate legend.
4. Make a function to add the yearly data used to make the plot above (plot 3.3), and use it in a for loop to recreate the figure (hint, in the function you may have to feed it a vector of pch and a vector of color values...).
5. Summarize hw_data.csv by year. Summary should include the following summary values for the weight field: mean, variance, standard deviation, and the number of observations. Make a 2 row by 1 column multipanel plot that plots the following:
1) A point/line plot of mean weight with whiskers illustrating minimum and maximum values within year.
2) A barplot of mean weights.
Submit your R code one week from today at 5pm. Remember to name each file as "lastname.*"
Additional material
NEW: This script contains code for automatically saving your figures to external files in 5 formats.
Plot of pch plotting symbols.
## LAYOUT() FUNCTION Figure 8.2 A: a 2x2 panel (2 rows and 2 columns)
layout(matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)) # same as
par(mfrow = c(2, 2))
layout.show(4) ## show the layout that has been set up for the 4 plots
plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", las = 1)
plot(length ~ weight, dat, ylab = "Length (mm)", xlab = "Weight (g)", las = 1)
boxplot(weight ~ year, dat, xlab = "Year", ylab = "Weight (g)", las = 1)
boxplot(weight ~ stage, dat, xlab = "Year", ylab = "Weight (g)", las = 1)
## a 4x1 panel (4 rows and 1 column)
layout(matrix(c(1, 2, 3, 4), 4, 1, byrow = TRUE))
layout.show(4) ## show the layout that has been set up
par(mar = c(4, 4, 0, 0.2))
plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", las = 1)
plot(length ~ weight, dat, ylab = "Length (mm)", xlab = "Weight (g)", las = 1)
boxplot(weight ~ year, dat, xlab = "Year", ylab = "Weight (g)", las = 1)
boxplot(weight ~ stage, dat, xlab = "Stage", ylab = "Weight (g)", las = 1)
## a different layout for 2 plots
layout(matrix(c(1, 1, 0, 2), 2, 2, byrow = TRUE))
layout.show(2) ## show the layout that has been set up
plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", las = 1)
boxplot(weight ~ year, dat, xlab = "Year", ylab = "Weight (g)", las = 1)
## Clean up the white space by adjusting the margins
layout(matrix(c(1, 1, 0, 2), 2, 2, byrow = TRUE))
layout.show(2) ## show the layout that has been set up
par(mar = c(4, 4, 0.2, 0.2))
plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", las = 1)
par(mar = c(4, 4, 0.5, 0.2))
boxplot(weight ~ year, dat, xlab = "Year", ylab = "Weight (g)", las = 1)