ggplot2

The total length of the videos in this section is approximately 19 minutes. Feel free to do this in multiple sittings! There is also a lot of text to read in this tutorial.

You can also view all the videos in this section at the YouTube playlist linked here.

This lecture was created by Katharine Liang '17, so that's whose voice you will hear.

Introduction

ggplot2 is a popular R package that makes it easy to produce complex, multi-layered graphics. You are probably already familiar with plotting in R using the functions plot(), hist(), boxplot(), and lines(). With those functions, all the parameters of the graph are specified in a single function call. In ggplot2, different parameters of the plot are specified in different functions that are literally added together with a plus sign (+). Each function acts like a layer, and the layers combine to create more complicated graphs. The syntax of ggplot2 looks a bit strange at first, but it is simple and easy to learn.
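As a rough sketch of the "adding layers" idea, the code below builds one plot step by step. It uses the diamonds data set that ships with ggplot2; the choice of layers, the straight-line smoother, and the title are illustrative assumptions, not taken from the videos.

```r
library(ggplot2)

# Start with the data and the aesthetic mappings; no geoms yet.
p <- ggplot(diamonds, aes(x = carat, y = price))

# Each "+" adds a layer or another component to the plot object.
p <- p + geom_point(alpha = 0.2)                          # layer 1: scatterplot
p <- p + geom_smooth(method = "lm", formula = y ~ x)      # layer 2: fitted line
p <- p + labs(title = "Diamond price by carat")           # labels, not a layer

# Geoms and stats are stored as layers; labs() only changes the labels.
n_layers <- length(p$layers)

# Typing p (or calling print(p)) in an interactive session draws the plot.
```

Because the plot is an ordinary R object, you can store it in a variable and keep adding to it later, which is harder to do with the base plotting functions.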

There are two videos on this page. The first is an introduction to ggplot2. It covers the basics of building a plot in ggplot2 and gives you a general idea of how the syntax works. The intro video follows PowerPoint slides that you can download just below this text. It is not meant to enable you to code a plot well; it is merely an introduction to the general ideas.

The second video on the page is a walkthrough of the R code file that you can download just below this text. It is not meant to be a comprehensive tutorial on how to edit all the elements of a plot. It is meant to guide you through the R file, which shows you the nitty-gritty details. Before watching the second video, you should open the R file called ggplot2 tutorial.R and take a look.

ggplot2.1.Introduction.mp4

Now is a good time to briefly look over the ggplot2 cheatsheet created by RStudio (search "ggplot2" to find the right cheatsheet).

Question 1: Which of the following lines produces the graph above?

  • ggplot(diamonds, aes(x= carat, y=price, color=color))+geom_point()

  • ggplot(diamonds, aes(x= carat, y=price, color=color))+geom_point(color="light blue")

  • ggplot(diamonds, aes(x= carat, y=price, color="light blue"))+geom_point()

  • ggplot(diamonds, aes(x= carat, y=price), color="light blue")+geom_point( aes(color=color))

Show answer

The second option is correct. Please click this link to see an explanation and visualization of what each of the wrong options does produce - this is worth a look even if you answered the question correctly.
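The difference between setting a color and mapping a color can also be checked without looking at the plots. The sketch below compares the second and third options using layer_data(), which returns the data a layer will draw. The 2,000-row random subset and the variable names are our own choices to keep things fast, not part of the quiz.

```r
library(ggplot2)

# Small random subset so the example runs quickly (an assumption for speed).
set.seed(1)
d <- diamonds[sample(nrow(diamonds), 2000), ]

# Option 2: color is set outside aes(), so every point gets that one color
# and the color mapping inherited from ggplot() is overridden for this layer.
p_set <- ggplot(d, aes(x = carat, y = price, color = color)) +
  geom_point(color = "light blue")

# Option 3: inside aes(), "light blue" is treated as data (a constant
# category), so ggplot2 assigns a color from its default palette instead.
p_map <- ggplot(d, aes(x = carat, y = price, color = "light blue")) +
  geom_point()

# Colors that will actually be drawn in each plot.
set_colours <- unique(layer_data(p_set)$colour)
map_colours <- unique(layer_data(p_map)$colour)
```

Inspecting set_colours and map_colours shows that only the version with color outside aes() actually draws light blue points.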

Question 2: What is wrong with this line of code?

ggplot(diamonds, aes(x= carat, y=price))+geom_histogram(binwidth = 0.1, bins =15)

Show answer

The argument y=price needs to be removed. If you try to run this code in R you will get an error message such as "Error: stat_bin() must not be used with a y aesthetic." (the exact wording depends on your version of ggplot2).

The geom_histogram function cannot take an aes mapping for the y variable here because a histogram represents the distribution of one continuous variable, with the y axis showing the frequency.
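A corrected version of the line might look like the sketch below. Note that the quiz line passes both binwidth and bins; the ggplot2 documentation says that bins is overridden by binwidth, so only binwidth is kept here. The variable names are our own.

```r
library(ggplot2)

# Map only x and let the histogram stat compute the counts for the y axis.
p_hist <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

# Inspect the bins that the stat computed.
bins <- layer_data(p_hist)
bin_widths <- bins$xmax - bins$xmin   # width of each bin
total_count <- sum(bins$count)        # every diamond falls into some bin
```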

Question 3: Which line of code does NOT produce the plot above?

  • ggplot(diamonds, aes(x=carat, y=price ))+geom_point(aes(color=cut))

  • ggplot(diamonds, aes(x=carat, y=price, color=cut ))+geom_point()

  • ggplot(diamonds, aes(x=carat, y=price ))+geom_point(color=cut)

  • ggplot()+geom_point(data=diamonds, aes( x=carat, y=price, color=cut))

Show answer

The third line does not produce the plot because color=cut is not inside the aes function. We get an error because R tries to set the color to an object called cut in the workspace, not to the variable called cut in the data set.
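The failure can be reproduced safely with tryCatch(), as in the sketch below. We do not show the error text because it differs between ggplot2 versions; the call to ggplot_build() is there only in case a given version raises the error when the plot is built rather than when the layer is created.

```r
library(ggplot2)

# Third quiz line: color = cut is outside aes(), so R looks for an object
# named cut in the calling environment and finds the base R function cut(),
# not the column diamonds$cut.
result <- tryCatch(
  {
    p_bad <- ggplot(diamonds, aes(x = carat, y = price)) +
      geom_point(color = cut)
    ggplot_build(p_bad)   # force evaluation in case the error is lazy
    "no error"
  },
  error = function(e) "error"
)

# The fix: put the column name inside aes() so it is looked up in the data.
p_ok <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(aes(color = cut))

# One color per level of cut.
n_colours <- length(unique(layer_data(p_ok)$colour))
```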

Running the R code file

ggplot2.2.RFileWalkthrough.mp4

If you would like a more detailed explanation of the functions than provided in the video above, check out this video and this video by Roger Peng.

Our video suggests a link listing various geoms, but that link is no longer active. Here is a current one.

Also, our video and code refer to the package plotly, which is great for creating interactive graphics. Our plotly tutorial is in progress, but feel free to explore plotly on your own. It is sometimes slow, as you saw when we tried to use it for ggplot2. It is not complicated to make an interactive graphic, though: it can be as simple as making a ggplot, installing the plotly package, and opening your ggplot via a function called ggplotly. See here.
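A minimal sketch of that workflow is below. It assumes the plotly package has already been installed with install.packages("plotly"), and it uses a 500-row random subset of diamonds (our choice) because large data sets can make plotly slow.

```r
library(ggplot2)

# Small random subset to keep the interactive plot responsive (an assumption).
set.seed(1)
d <- diamonds[sample(nrow(diamonds), 500), ]

# Step 1: make an ordinary ggplot.
p <- ggplot(d, aes(x = carat, y = price, color = cut)) +
  geom_point()

# Step 2: hand it to ggplotly() if plotly is available.
interactive_plot <- NULL
if (requireNamespace("plotly", quietly = TRUE)) {
  interactive_plot <- plotly::ggplotly(p)
} else {
  message("plotly is not installed; run install.packages(\"plotly\") first.")
}

# In an interactive session, type interactive_plot to open the graphic.
```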

Question 4: What kind of layer would I want to add to change the color of the background of the plot?

  • annotate()

  • scale()

  • the aes mapping in geom()

  • theme()

Show answer

The last choice, theme(), is the correct answer. Adding a theme layer allows you to control the non-data elements of the plot. The background color of the plot is a non-data element that can be specified as a theme element inside the theme function.

For example, we can make the background color of the plot black by running this code. The graphic is shown below.

ggplot(data=diamonds, aes(x=carat, y=price))+geom_point(color="white")+theme(panel.background=element_rect(fill="black"))

Question 5: Which line of code will produce the graphic below?

  • ggplot(diamonds, aes(carat, price))+stat_smooth(geom="line",color="blue")

  • ggplot(diamonds, aes(carat,price))+geom_smooth(stat="summary")

  • ggplot(diamonds, aes(carat, price))+geom_smooth(stat="smooth")

  • ggplot(diamonds, aes(carat, price))+stat_smooth(geom="point", level=0, color="blue")

Show answer

Question 6: Which line of code will produce the graphic below?

  • ggplot(diamonds, aes(carat, price))+geom_point(color="grey")+annotate("text", x=3, y=5000, label = "sum(hat(p['i']),i==1, n)", parse=TRUE, cex=10)

  • ggplot(diamonds, aes(carat, price))+geom_point(color="grey")+annotate("text", x=3, y=5000, label = "sum(hat(p['i']),i==1, n)", cex=10)

  • ggplot(diamonds, aes(carat, price))+geom_point(color="grey")+annotate("text", x=3, y=5000, label = "sum(hat(p_i), n)", parse=TRUE, cex=10)

  • ggplot(diamonds, aes(carat, price))+geom_point(color="grey")+annotate("text", x=3, y=5000, label = "sum((p_i), n)", cex=10)

Show answer

The first line of code creates the correct plot. The label matches the annotation on the plot and we specified that we want to parse it.

The second line of code is missing the parse=TRUE argument, so the label is drawn as literal text rather than as a mathematical expression.

The third line of code creates a plot with a different label. The i==1 is missing.

The last line creates a plot with an incorrect label and is missing the parse=TRUE argument.
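To see why parse=TRUE matters, it can help to parse the label yourself. The sketch below uses base R's parse() to show that the label string is a valid R (plotmath) expression, then builds the annotated plot. It uses size in place of the quiz's cex, and the variable names are our own.

```r
library(ggplot2)

label_text <- "sum(hat(p['i']), i == 1, n)"

# With parse = TRUE the label is interpreted as an R expression and drawn
# with plotmath rules, similar to what parse(text = ...) produces in base R.
parsed <- parse(text = label_text)[[1]]
top_level_function <- as.character(parsed[[1]])   # the outermost call

p_math <- ggplot(diamonds, aes(carat, price)) +
  geom_point(color = "grey") +
  annotate("text", x = 3, y = 5000, label = label_text,
           parse = TRUE, size = 10)

# Typing p_math in an interactive session draws the plot with the formula.
```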

That's all here. However, this gg-based approach to visualization in R is popular, and there are additional, related packages and techniques that you can find if you are interested.