Week 8: R graphics-how to make your data look pretty...

By the end of this lecture you will know how to make:

Scatter plots
Line plots
Bar plots
Histograms
Boxplots (typical, side by side, grouped)
Multi-panel plots and layouts

The script for today's lesson is here and the data file is here.

Note --- We will not be making fancy "infovis" type graphics... Just plain ole' boring graphics you will see in reports and publications.

Read in the data and summarize

Set the working directory

# setwd('.../your directory here/...')

dat <- read.csv("dat.csv")

What the data looks like.

head(dat)

## X ID year length weight stage

## 1 1 1 2009 701.0 30.30 Adult

## 2 2 2 2012 593.9 52.78 Adult

## 3 3 3 2012 729.5 101.98 Adult

## 4 4 4 2010 563.2 65.93 Adult

## 5 5 5 2011 231.1 10.94 Juv

## 6 6 6 2012 723.0 77.13 Adult

str(dat)

## 'data.frame': 1000 obs. of 6 variables:

## $ X : int 1 2 3 4 5 6 7 8 9 10 ...

## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...

## $ year : int 2009 2012 2012 2010 2011 2012 2011 2010 2012 2009 ...

## $ length: num 701 594 729 563 231 ...

## $ weight: num 30.3 52.8 102 65.9 10.9 ...

## $ stage : Factor w/ 2 levels "Adult","Juv": 1 1 1 1 2 1 1 1 1 1 ...
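The str() output above reports stage as a Factor. From R 4.0.0 onward, read.csv() leaves text columns as character unless told otherwise, so your own output may show chr instead. Below is a minimal sketch of two ways to get a factor back. It uses a small made-up data frame written to a temporary file; the name `demo` and its values are illustrative, not the course data.

```r
# Made-up stand-in for dat.csv; values are illustrative only
demo <- data.frame(length = c(701.0, 231.1, 593.9),
                   stage  = c("Adult", "Juv", "Adult"))
tmp <- tempfile(fileext = ".csv")
write.csv(demo, tmp, row.names = FALSE)

# Option 1: ask read.csv to convert every text column to a factor
d1 <- read.csv(tmp, stringsAsFactors = TRUE)

# Option 2: read as-is, then convert the one column explicitly
d2 <- read.csv(tmp)
d2$stage <- factor(d2$stage)

str(d1)
```

Either route should work for the plots in this lesson; option 2 is handy when only some text columns should become factors.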

summary(dat)

## X ID year length

## Min. : 1 Min. : 1 Min. :2009 Min. :200

## 1st Qu.: 251 1st Qu.: 251 1st Qu.:2009 1st Qu.:338

## Median : 500 Median : 500 Median :2010 Median :489

## Mean : 500 Mean : 500 Mean :2010 Mean :493

## 3rd Qu.: 750 3rd Qu.: 750 3rd Qu.:2011 3rd Qu.:644

## Max. :1000 Max. :1000 Max. :2012 Max. :800

## weight stage

## Min. : 0.4 Adult:857

## 1st Qu.: 10.2 Juv :143

## Median : 31.3

## Mean : 76.6

## 3rd Qu.: 90.8

## Max. :700.5

names(dat)

## [1] "X" "ID" "year" "length" "weight" "stage"

Just how much data are we dealing with?

dim(dat)

## [1] 1000 6

nrow(dat)

## [1] 1000

ncol(dat)

## [1] 6

Ok, all looks good. Let’s start plotting.

Scatter plots

The following code will make the figure below.

plot(weight ~ length, dat)

Figure 1.—Scatter plot figures produced by R code in this section.

Making a default scatterplot

Let's start with panel a. This is your basic plot of weight versus length. The default R code to do this is

plot(weight ~ length, dat)

Note the formula notation, which is y variable ~ x variable. You could also pass the x and y vectors directly, as in the line below (x first, then y), but I prefer formula notation.

plot(dat$length,dat$weight)

Changing the plotting character

Super, we have a plot of some data! Very satisfying! But are you satisfied?… Or rather is your advisor satisfied?

No, your advisor wants those circles to be filled…ugh!

We can do this by changing the pch argument, which reproduces panel b from Figure 1. See the material at the end of the lab for the pch values and the plotting symbols they correspond to.

plot(weight ~ length, dat, pch = 19) # filled circles!

Oh, I forgot to mention you have co-advisors - and your other advisor thinks they should be filled in squares (her thought process is more angular!). Easy peasy, just use pch again but this time with a value of 15.

plot(weight ~ length, dat, pch = 15) # filled squares

Wildcard! What about filled in triangles? Again easy peasy, just use pch again but this time with a value of 17.

plot(weight ~ length, dat, pch = 17) # filled triangles
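If you want to preview the symbols right away, the sketch below draws pch values 0 through 25 on a grid with each number printed underneath. The grid layout and the object names (`pch_vals`, `grid_x`, `grid_y`) are my own choices for illustration.

```r
pch_vals <- 0:25                    # the standard numbered plotting symbols
grid_x <- (pch_vals %% 7) + 1       # column position, 7 symbols per row
grid_y <- 4 - (pch_vals %/% 7)      # row position, first row on top

plot(grid_x, grid_y, pch = pch_vals, cex = 2, bg = "grey",
     xlim = c(0.5, 7.5), ylim = c(0.5, 4.5),
     axes = FALSE, xlab = "", ylab = "", main = "pch values")
# print each pch number just below its symbol
text(grid_x, grid_y, labels = pch_vals, pos = 1, offset = 1)
```

The bg argument only affects symbols 21 to 25, which have a separate fill color.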

Adding custom x- and y-axis labels

That was a fun exercise, but we come full circle and have decided that filled in circles will work best (full circle… get it?). But, being good researchers, we want to change the x- and y-axis labels to incorporate the units. We need to do this because R by default uses the name of the column. We could rename the column in the data.frame, but there has to be a better way. All we need to do to add custom axis labels is supply the xlab and ylab arguments. The following code demonstrates this.

plot(weight ~ length, dat, pch = 19, xlab = "Length (mm)", ylab = "Weight (g)")

Changing the plotting x- and y-axis limits

We can also use the xlim and ylim arguments to change the region that is plotted. The code below produces a figure that limits both the x- and y-axis to between 0 and 400.

plot(weight ~ length, dat, pch = 19, xlab = "Length (mm)", ylab = "Weight (g)", xlim = c(0, 400), ylim = c(0, 400))

Plotting groups: layers on a blank canvas

Wow! The data looks like it has some structure in it that we could illustrate by grouping the data and plotting the groups using different colors for the years in the data.

First we are going to start with a 'blank canvas' to plot on by specifying type='n', which will give us the plot below.

plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", xlim = c(200, 800), ylim = c(0, 800), type = "n")

Now we can add things to the plot. This is accomplished using the points() function.

Let's add some data, starting with 2009. The formula method of points() has a subset argument, where we can give a condition on a column in the data (remember back to Week 4? All of that applies here). The code below adds each year with its own plotting character.

plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", xlim = c(200,

800), ylim = c(0, 800), type = "n")

points(weight ~ length, dat, subset = year == 2009, pch = 19)

points(weight ~ length, dat, subset = year == 2010, pch = 1)

points(weight ~ length, dat, subset = year == 2011, pch = 10)

points(weight ~ length, dat, subset = year == 2012, pch = 15)

But the symbols in the above figure are a bit difficult to tell apart, and I certainly wouldn't use it in a presentation. How about we jazz it up with some color by adding a col argument?

plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", xlim = c(200,

800), ylim = c(0, 800), type = "n")

points(weight ~ length, dat, subset = year == 2009, pch = 19, col = "red") # add data for 2009

points(weight ~ length, dat, subset = year == 2010, pch = 1, col = "black") # add data for 2010

points(weight ~ length, dat, subset = year == 2011, pch = 10, col = "blue") # add data for 2011

points(weight ~ length, dat, subset = year == 2012, pch = 15, col = "green") # add data for 2012
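The four points() calls differ only in the year, symbol, and color, so they can also be written as a loop over lookup vectors. This is a sketch under assumptions: `sim` is simulated stand-in data (not the course file), its length-weight curve is made up, and `yrs`, `pchs`, and `cols` are hypothetical helper names.

```r
set.seed(1)
# Simulated stand-in for dat: 40 fish spread over four years
sim <- data.frame(year   = rep(2009:2012, each = 10),
                  length = runif(40, 200, 800))
# Made-up length-weight relationship with a little noise
sim$weight <- 1e-6 * sim$length^3 * exp(rnorm(40, 0, 0.1))

yrs  <- 2009:2012
pchs <- c(19, 1, 10, 15)
cols <- c("red", "black", "blue", "green")

plot(weight ~ length, sim, xlab = "Length (mm)", ylab = "Weight (g)",
     xlim = c(200, 800), ylim = c(0, 800), type = "n")
for (i in seq_along(yrs)) {
  # draw one year per pass, using the i-th symbol and color
  points(weight ~ length, sim[sim$year == yrs[i], ], pch = pchs[i], col = cols[i])
}
```

Keeping the years, symbols, and colors in parallel vectors also means there is only one place to edit if a fifth year shows up.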

Wow, that really pops! Now how do we know what means what? We need a legend, not a mythical story but a way to make the colors and symbols mean something.

Adding a legend to a plot

The bare minimum that we need to give the legend() function is where it should be located, the legend text, the symbols used, and the colors of those symbols. Be careful here, order is important: the text, symbols, and colors are matched by position! Let's recreate the plot above and add a legend.

plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", xlim = c(200,

800), ylim = c(0, 800), type = "n")

points(weight ~ length, dat, subset = year == 2009, pch = 19, col = "red") # add data for 2009

points(weight ~ length, dat, subset = year == 2010, pch = 1, col = "black") # add data for 2010

points(weight ~ length, dat, subset = year == 2011, pch = 10, col = "blue") # add data for 2011

points(weight ~ length, dat, subset = year == 2012, pch = 15, col = "green") # add data for 2012

## Adding a default legend

legend("top", legend = c("2009", "2010", "2011", "2012"), pch = c(19, 1, 10,

15), col = c("red", "black", "blue", "green"))

The use of 'top' as the first argument tells R where to plot the legend. The rest gives the legend text, symbols, and colors. The location argument can be one of: "topleft", "top", "topright", "left", "center", "right", "bottomleft", "bottom", "bottomright".
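Because legend() matches labels, symbols, and colors purely by position, one way to guard against a mismatch is to keep all three in a single small table and read from it for both the points and the legend. A sketch, with `key`, `toy`, and `idx` as hypothetical names and a few made-up points standing in for the data:

```r
# One row per group: the legend and the points read from the same table
key <- data.frame(year = 2009:2012,
                  pch  = c(19, 1, 10, 15),
                  col  = c("red", "black", "blue", "green"),
                  stringsAsFactors = FALSE)

# Made-up points, one per year, just to have something to draw
toy <- data.frame(year   = 2009:2012,
                  length = c(300, 450, 600, 750),
                  weight = c(50, 150, 300, 600))
idx <- match(toy$year, key$year)   # row of key that applies to each point

plot(toy$length, toy$weight, pch = key$pch[idx], col = key$col[idx],
     xlab = "Length (mm)", ylab = "Weight (g)",
     xlim = c(200, 800), ylim = c(0, 800))
legend("topleft", legend = key$year, pch = key$pch, col = key$col)
```

With this layout, reordering or editing a row of `key` changes the plot and the legend together.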

The code below makes the same previous plot but with the legend in the top left hand corner.

plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", xlim = c(200,

800), ylim = c(0, 800), type = "n")

points(weight ~ length, dat, subset = year == 2009, pch = 19, col = "red") # add data for 2009

points(weight ~ length, dat, subset = year == 2010, pch = 1, col = "black") # add data for 2010

points(weight ~ length, dat, subset = year == 2011, pch = 10, col = "blue") # add data for 2011

points(weight ~ length, dat, subset = year == 2012, pch = 15, col = "green") # add data for 2012

## Adding a default legend

legend("topleft", legend = c("2009", "2010", "2011", "2012"), pch = c(19, 1,

10, 15), col = c("red", "black", "blue", "green"))

Some personal preferences.

I prefer my y-axis tick labels to be horizontal (parallel to the x-axis), which can be specified by the argument las=1. I also prefer my legends to not have a box, which is taken care of with the argument bty='n'.

plot(weight ~ length, dat, xlab = "Length (mm)", ylab = "Weight (g)", xlim = c(200,

800), ylim = c(0, 800), type = "n", las = 1)

points(weight ~ length, dat, subset = year == 2009, pch = 19, col = "red") # add data for 2009

points(weight ~ length, dat, subset = year == 2010, pch = 1, col = "black") # add data for 2010

points(weight ~ length, dat, subset = year == 2011, pch = 10, col = "blue") # add data for 2011

points(weight ~ length, dat, subset = year == 2012, pch = 15, col = "green") # add data for 2012

## Adding a default legend

legend("topleft", legend = c("2009", "2010", "2011", "2012"), pch = c(19, 1,

10, 15), col = c("red", "black", "blue", "green"), bty = "n")

Line plots

Most of what we have learned for scatter plots applies to line plots! Let's plot the mean weight by year for the data as a line plot. First we need to aggregate the dataset, using the code below, into a new data frame, lindat.

Aggregating the dataset

lindat <- aggregate(weight ~ year, dat, mean)

lindat$length <- aggregate(length ~ year, dat, mean)$length

lindat$n <- aggregate(weight ~ year, dat, length)$weight

lindat$var <- aggregate(weight ~ year, dat, var)$weight

lindat$lci <- lindat$weight - 1.96 * sqrt(lindat$var)/sqrt(lindat$n)

lindat$uci <- lindat$weight + 1.96 * sqrt(lindat$var)/sqrt(lindat$n)
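To check what the lines above compute, here is the same recipe on a tiny made-up dataset where the answer can be worked by hand. The data frame `toy` and the extra column `se` are hypothetical. The interval is the normal-approximation 95% interval: mean plus or minus 1.96 times the standard error, where the standard error is sqrt(variance / n).

```r
toy <- data.frame(year   = c(2009, 2009, 2009, 2009, 2010, 2010, 2010, 2010),
                  weight = c(10, 12, 14, 16, 20, 20, 20, 20))

out     <- aggregate(weight ~ year, toy, mean)             # mean weight per year
out$n   <- aggregate(weight ~ year, toy, length)$weight    # sample size per year
out$var <- aggregate(weight ~ year, toy, var)$weight       # sample variance per year
out$se  <- sqrt(out$var / out$n)                           # standard error of the mean
out$lci <- out$weight - 1.96 * out$se                      # lower 95% limit
out$uci <- out$weight + 1.96 * out$se                      # upper 95% limit
out
```

For 2009 the mean is 13 and the sample variance is 20/3, so the standard error is sqrt(5/3), about 1.29, and the interval runs from roughly 10.47 to 15.53. For 2010 every fish weighs the same, so the interval collapses to 20.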

The code we will be working with will step through lindat created by the code above to plot panels A-D below.

Default line plots

Line plots are made by specifying the argument type='l'. The default is type='p', which returns a scatter plot like the ones we just worked through. Let's plot mean weight by year using the code below.

plot(weight ~ year, lindat, type = "l", las = 1, xlab = "Year", ylab = "Mean weight (g)")

Well, that plot is a bit misleading: we don't actually have data for all points on the line, but it illustrates trends really well. How about a mix of points for the data and a line to illustrate the trend? This can be done using the argument type='b', where the b is short for both, as in both points and lines.

plot(weight ~ year, lindat, type = "b", las = 1, xlab = "Year", ylab = "Mean weight (g)")

OK, now I am just nitpicking this thing to death, but that is what we do, right? What is year 2010.5, especially when we don't have any data for it? We can deal with this by turning off the x-axis in the plot using the xaxt='n' argument and specifying our own custom one using the axis() function.

# fix the x-axis (xaxt='n')

plot(weight ~ year, lindat, type = "b", las = 1, xlab = "Year", ylab = "Mean weight (g)", xaxt = "n")

axis(side = 1, at = c(2009, 2010, 2011, 2012), labels = c("2009", "2010", "2011", "2012"))

In the axis() function the arguments specify the following:

side: what side to add the axis to (1=bottom, 2=left, 3=top, 4=right)
at: where to put tick marks
labels: what to label those tick marks
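Hard-coding the tick positions works for four years, but the same axis can be built from the data so that it keeps up if years are added. A sketch using a made-up summary table (`toy`, its values, and `ticks` are hypothetical):

```r
# Made-up yearly means standing in for lindat
toy <- data.frame(year = 2009:2012, weight = c(60, 85, 70, 95))

plot(weight ~ year, toy, type = "b", las = 1,
     xlab = "Year", ylab = "Mean weight (g)", xaxt = "n")
ticks <- sort(unique(toy$year))    # one tick per year that actually has data
axis(side = 1, at = ticks, labels = ticks)
```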

Adding confidence intervals

Lines illustrating confidence intervals (or any sort of line you may want) can be added using the segments() function. This function requires two points, a start (x0, y0) and an end (x1, y1). To add lines representing a confidence interval all you need to do is specify the x location and then the lower and upper CI.

Figure 3.—The code below was used to construct this figure demonstrating how confidence intervals can be illustrated on plots using the segments() and arrows() functions

plot(weight ~ year, lindat, type = "b", las = 1, xlab = "Year", ylab = "Mean weight (g)", xaxt = "n", ylim = c(0, 200), pch = 19)

axis(side = 1, at = c(2009, 2010, 2011, 2012), labels = c("2009", "2010", "2011", "2012"))

segments(x0 = lindat$year, y0 = lindat$lci, x1 = lindat$year, y1 = lindat$uci)

We can get more traditional 'whiskers' using the arrows() function and setting the arrowhead angle with the argument angle=90. The length argument sets the length of each arrowhead edge in inches, which at a 90 degree angle controls how far the whisker cap extends to each side of the line.

# whiskers

plot(weight ~ year, lindat, type = "b", las = 1, xlab = "Year", ylab = "Mean weight (g)", xaxt = "n", ylim = c(0, 200), pch = 19)

axis(side = 1, at = c(2009, 2010, 2011, 2012), labels = c("2009", "2010", "2011", "2012"))

arrows(x0 = lindat$year, y0 = lindat$lci, x1 = lindat$year, y1 = lindat$uci, angle = 90, length = 0.1, code = 3)

In the arrows() function the arguments specify the following:

angle: angle between the arrow shaft and each edge of the arrowhead, in degrees; 90 gives a flat cap
length: length of the edges of the arrowhead, in inches
code: what kind of arrow (1=arrow head at beginning, 2 = arrowhead at end, 3 = arrowhead at beginning and end)

Boxplots

Boxplots are created by the boxplot() function. R takes a continuous variable and a conditioning variable, converting the conditioning variable to a factor if it is not one already. Let's look at length by stage in our data.

boxplot(length ~ stage, dat, xlab = "Stage", ylab = "Length (mm)")

As before, we had some strong variation among years, so it might be good to look at that! Let's plot boxplots for each stage and year combination. This is done by subsetting the data and adding the plots one at a time. We have to be careful to plot them at the right location on the x-axis: basically, each box is shifted left or right from its default plotting location, which can be done using the at argument. We also have to resize the boxes so that they are not so wide that they overlap.