06. Plotting with R

Basic Graphs

Most people start using R for its graphs and they are probably right to do so. R has a huge toolbox of functions and packages for plotting simple data, time series, big data, complex data structures, maps etc. Most importantly this toolbox is ever-expanding due to R being open source, which means that there is a constant flow of new functions for intricate plots and graphics in the repositories.

R's way of producing graphs differs from programs you may have used before (such as Excel or Graphpad) in the sense that it is not so much based on a graphical user interface as it is on the command line. This may sound like a drawback in the beginning, but once you get accustomed to the process of calling plot functions and customizing them on the command line, you can store everything in custom scripts that will produce elegant, complicated graphs at the press of a button.

Nevertheless, even the most elegant graphs are based on some very simple rules. We will get to know them starting from the most basic plots and will gradually add layers of "elegance" step by step. Let's get started assuming you have a table of numerical data and you want to see how each piece of data fluctuates against the rest. In such a case a simple barplot will be enough to give us an idea.

barplot

the basics

Let's start by first getting some data into R. At the bottom of this page you will find a short data file called "genome_partition.tsv" containing some data on a genomic partition of the mouse genome. After you download it (and store it in a folder from which you start R) try loading it in R with the command you already know

data<-read.delim("genome_partition.tsv", sep="\t", header=T)

This will read the data into a data frame, separating columns at tabs (it is a "tsv" after all) and keeping the first row of the file as a header. A brief inspection of the dataset (see the sketch below) will show you that it consists of four columns: the first contains names of genomic regions (such as "proximal_promoters_2k_TSS" and "genebody") and the three numerical columns carry a certain score, the number of elements in each region and the total size each region occupies in the genome (don't try to figure out what all these correspond to for the time being).
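A few quick commands are enough for this first inspection (nothing here assumes specific column names, we only look at the structure):

head(data) # the first few rows of the data frame

str(data) # the names and types of the columns

dim(data) # the number of rows and columns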

Now let's see how the scores (the first numerical column, that is column 2 of the data frame) fluctuate for each region in a simple R barplot. To do this we simply need to tell R to use barplot() on the 2nd column of data. This includes one subsetting step

x<-data[,2]

and one plotting step

barplot(x)

or if you like, two steps at once

barplot(data[,2])

Upon calling the barplot() function (or any plotting function) a new window will pop out of R carrying the created plot (if you are using RStudio the plot may instead appear in a separate pane). Inspect the plot you just created. It looks a bit bare of information. Other than the value axis it carries little more than the heights of the bars. Most importantly, you can't say what is what. To see that you will need to fill in the names option in barplot(). The names will be the first column of data. Information such as names of bars, axis labels, titles and legends may be introduced in the plotting function as arguments. In the case of names of bars it's simply names="". In this case, simply write

barplot(data[,2], names=data[,1])

now the names of the genomic regions are attributed to the corresponding bars, only the names are too long and, rather than overlap them, R has placed only some of the names on the plot. There are two ways around this. We can either change the orientation of the names on the plot from horizontal to perpendicular. This is done with the las=2 option, which rotates the axis labels so that they run perpendicular to the axes.

barplot(data[,2], names=data[,1], las=2)

but a more elegant way would be simply making the bars horizontal and putting the names on the side

barplot(data[,2], names=data[,1], horiz=T, las=1)

where we switch to las=1 for the numerical values to be shown horizontally as well. Even so, the names are too long to be displayed in full. We need to increase the margin on the left to see them. We can do this through the par() function. par() takes a lot of arguments that are related to how we use the space, the colouring and the layout of plots (among other things). One of these arguments is mar=, which controls the margins of the plot. mar accepts a four-element vector of numbers that correspond to the bottom, left, top and right margin sizes (in this order). We can thus increase the left margin (where the names of the barplot are written) by passing a suitable vector to mar within par(). Don't be confused. It's as easy as

par(mar=c(5,15,5,5))

barplot(data[,2], names=data[,1], horiz=T, las=1)

see how by setting a very high value for the left margin (15 as opposed to 5 for all the rest) we were able to fit the names in the plot. Now let's complete it by adding a title to the value axis and to the whole plot, like this

barplot(data[,2], names=data[,1], horiz=T, las=1, main="Score Barplot", xlab="raw score")

see how main= receives the title of the plot and xlab= receives the label for the values on the x axis. Both arguments are simple strings of characters enclosed in double quotation marks "".

advanced barplots

Now let's move on to explore the potential of barplots a bit more. For this we will make use of a built-in dataset that comes with R's vcd package. vcd is a package that specializes in the visualization of categorical data (this is what vcd stands for). Out of this package we will load the Arthritis dataset, simply by

library(vcd) # assumes vcd is installed

Arthritis

This loads a data frame that carries the ID numbers of 84 Arthritis patients receiving a treatment or a placebo, for whom we are given the treatment, the sex, the age and the outcome of the treatment on a three-level scale (Marked, Some and None). Before plotting, it is worth taking a quick look at the data itself, as sketched below, and then counting how many of the patients showed each outcome.
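A quick structural check (the column names below are the ones documented for the vcd Arthritis data):

str(Arthritis) # 84 observations of 5 variables: ID, Treatment, Sex, Age, Improved

head(Arthritis) # a peek at the first rows

Now let's count the treatment outcomes and plot them in one go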

par(mar=c(5,5,5,5))

barplot(table(Arthritis$Improved), las=1, main="Treatment Outcome", xlab="Number of Patients")

in these two commands we first restore the margins to symmetrical values and then plot, in one step, the output of a table created for the outcome of the treatment in the Arthritis patients (the Improved column).

Now let's see how we can visualize more information by combining the outcome of the treatment with the sex of the patients (Female or Male). For this we can use the table() function on two attributes (columns) of the Arthritis dataset. Try this with

table(Arthritis$Improved, Arthritis$Sex)

You will see that the output is a contingency table listing how many patients of each sex showed each treatment outcome. Now let's plot this with barplot.

barplot(table(Arthritis$Improved, Arthritis$Sex), las=1, main="Treatment Outcome", xlab="Number of Patients")

This produces two bars, each of which corresponds to a sex and is proportionally split into the corresponding treatment outcomes. One obvious problem has to do with the lack of a colour legend. We don't know which coloured band corresponds to which outcome. Another problem is that, the way the bands are stacked, we cannot get a very clear idea of the individual numbers. In order to see these we can ask barplot() to plot all data in separate bars by adding the beside=T option

barplot(table(Arthritis$Improved, Arthritis$Sex), beside=T, las=1, main="Treatment Outcome", xlab="Number of Patients")

Now each piece of data has its own separate bar but we still need to figure out which is which. We can visualize a legend by asking barplot to use the rownames of the table() output passing them to the legend option.

barplot(table(Arthritis$Improved, Arthritis$Sex), beside=T, las=1, main="Treatment Outcome", xlab="Number of Patients", legend=rownames(table(Arthritis$Improved, Arthritis$Sex)))

but let's make it a little less complicated by first storing the output of table() in a variable (say tab) and then calling barplot on tab.

tab<-table(Arthritis$Improved, Arthritis$Sex)

barplot(tab, beside=T, las=1, main="Treatment Outcome", xlab="Number of Patients", legend=rownames(tab))

and finally, let's give the barplot some more vivid colouring instead of the dull greyscale by passing a vector of three colours to the col= option

barplot(tab, beside=T, las=1, main="Treatment Outcome", xlab="Number of Patients", legend=rownames(tab), col=c("dark red", "orange", "yellow"))

We will see more about the use of colouring in plots as we move to more complex plotting types and options. Our barplot now looks both nice and informative. We have two blocks of bars, one for each sex, carrying three bars each, one for each treatment outcome. From these we can say that there are more female than male patients (this is true for Arthritis as there is a 3-4:1 incidence ratio between women and men). It also looks as if there are more patients with a "Marked" treatment outcome among women than among men, while there are more patients with "None" of an outcome among men. The fact, however, that there are more women than men does not allow us to safely say so. How can we get a clearer idea about this? What we need is, instead of the absolute numbers, the proportion of treatment outcomes for each sex separately. We need, as we say, to normalize the data with regard to patient number. How can we do this? There are many ways we can go about it. The most straightforward is to take the table() output tab and divide the elements of each column (one for female and one for male) by their sum, thus transforming the absolute numbers to relative ratios (normalizing). First let's take a look at the tab variable

tab
         Female Male
  None       25   17
  Some       12    2
  Marked     22    6

Now check that the first column of tab represents female data and the second male data

tab[,1]
  None   Some Marked
    25     12     22

tab[,2]
  None   Some Marked
    17      2      6

And now see how sum() can act on each column

sum(tab[,1])
[1] 59

which is the sum of 25+12+22, that is, of the female column

Now simply by dividing tab[,1] with its sum() we get

tab[,1]/sum(tab[,1])
     None      Some    Marked
0.4237288 0.2033898 0.3728814

which we can store in a variable called female

female<-tab[,1]/sum(tab[,1])

and do the exact same thing for male (column 2 of tab)

male<-tab[,2]/sum(tab[,2])

We now have two vectors, one for each sex holding the relative ratios. Before passing them to barplot as we did for tab we need to combine them in one matrix. We will do this by binding them by column in a data.frame

sexes<-data.frame(female,male)

Now we can plot sexes the same way we did with tab

barplot(sexes, beside=T, las=1, main="Treatment Outcome", xlab="Number of Patients", legend=rownames(sexes), col=c("dark red", "orange", "yellow"))

This comes back with an error message saying it won't accept data that are not vectors or matrices. barplot() only works with those, while sexes was built as a data.frame. Don't panic. As we know that our data.frame contains only numeric data we can safely coerce it to a matrix with the as.matrix() function. Simply typing

barplot(as.matrix(sexes), beside=T, las=1, main="Treatment Outcome", xlab="Number of Patients", legend=rownames(sexes), col=c("dark red", "orange", "yellow"))

provides what we wanted in the first place. Notice how the y-axis now runs from 0 to a value that cannot be greater than one, since it refers to the ratio of treatment outcomes per sex. We can now see clearly that there are more non-responding patients among men and more responding patients among women, even after controlling (normalizing) for the number of patients. Let's see this even more clearly with one last trick, a barplot that shows the ratios piled one upon the other instead of as separate bars. Simply setting the beside= option to F, we get

barplot(as.matrix(sexes), beside=F, las=1, main="Treatment Outcome", xlab="Number of Patients", legend=rownames(sexes), col=c("dark red", "orange", "yellow"))

that is two bars, one for each sex, where the ratio of each outcome is delineated as a stripe with a height proportional to the ratio. As the data are normalized, both bars go up to 1 (the total of all three outcome ratios), but it is exactly this normalization that permits us to see the differences in treatment outcomes. (Data scientists like to call this sort of barplot a "spinogram" in the sense that its striped pattern represents a series of spines.) We have now covered more than the basics of barplots. One last shortcut is sketched below, and then, before moving to more complicated two-dimensional plots, we will take a look at a similar approach that many people (or simply those with a greater love for food and economics) seem to prefer. Pies!
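The shortcut: base R's prop.table() performs the column-wise normalization we just did by hand in a single call (a minimal sketch; the name sexes_prop is only chosen here to avoid overwriting our data.frame):

sexes_prop<-prop.table(tab, margin=2) # divides each column of tab by its column sum

barplot(sexes_prop, beside=F, las=1, main="Treatment Outcome", legend=rownames(sexes_prop), col=c("dark red", "orange", "yellow"))

Since prop.table() returns a table (a kind of matrix), no as.matrix() coercion is needed here.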

Pies (for the lazy ones)

Economists (or more precisely old-fashioned economists) and the business world in general love pies, but the truth is that we perceive heights and distances more easily than areas, and for this reason pies are generally worse for describing data than, say, barplots or point plots. Still, R has a basic function called pie() for those old-fashioned types that like to see their data depicted as such. In the case of the table formed in the previous section, we could choose to plot the treatment outcomes for the females in the form of a pie like this

pie(sexes[,1], labels=rownames(sexes), col=c("dark red","orange","yellow"))
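If you also want the actual proportions written on the slices, one small optional tweak (a sketch using only base functions) is to build the labels yourself with paste():

pie(sexes[,1], labels=paste(rownames(sexes), round(100*sexes[,1]), "%"), col=c("dark red","orange","yellow"))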

As one may understand, it is not possible to visualize the data for both sexes in one pie, so a pie can only hold the data of a single vector. What if we wanted to see the data for male and female in two pies at once? We would then have to split the screen in two and plot one pie in each half. This is done by setting a different option of the par() function (remember we used par(mar=...) to set the margins of the plot). The option for splitting the screen is called mfrow and it accepts a vector of length 2 corresponding to the number of a) rows and b) columns into which the screen will be split. In this case, if we wanted to plot two pies side by side, we would split the screen into two columns on a single row

par(mfrow=c(1,2))

where c(1,2) corresponds to 1 row, 2 columns. By calling the pie function twice, once for each column of sexes, we get

pie(sexes[,1], labels=rownames(sexes), col=c("dark red","orange","yellow"))

pie(sexes[,2], labels=rownames(sexes), col=c("dark red","orange","yellow"))

which is what we wanted. Had we desired to see the pies one below the other we would have split the screen in two rows and one column by

par(mfrow=c(2,1))

You may see that in the two-pie plot we still cannot say which pie is which. We may understand intuitively that the female one is on the left, since we called the pie() function first on the female data, but this doesn't help someone who sees the plot for the first time. We need to somehow tag the pies with the corresponding name for each dataset. We do this by filling in the main= option of pie(), which (just like for barplot and other plots) adds a main title.

pie(sexes[,1], labels=rownames(sexes), col=c("dark red","orange","yellow"), main="Female")

pie(sexes[,2], labels=rownames(sexes), col=c("dark red","orange","yellow"), main="Male")

And now we are done. Remember to reset the screen back to 1 row, 1 column after your job is done with

par(mfrow=c(1,1))

And since there's not much more to say about the (boring) pies, let's move on to more serious functions.

Various simple plots with plot

Barplots are good for singular pieces of data, in the sense that we are looking at numerical values referring to some categorical cases. In the previous examples we saw how numerical values for categorical data can be visualized in the form of bars and pies. Let's now move on to see how we can visualize relationships between numerical data. One of the most common, yet informative and insightful, ways to look into your dataset's structure and relationships is the simple scatterplot, that is points in a two-dimensional space. R's plot() function is the most straightforward solution for this. We will now demonstrate some of its capabilities using the "faithful" built-in dataset, which holds the duration of the Old Faithful Geyser eruptions vs the waiting time between them for 272 consecutive eruptions.

head(faithful)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

In the simplest approach possible we would like to see if there is a connection between the two pieces of data, that is if the duration of the eruption is somehow related to waiting times. We can plot these two values against each other with

plot(faithful[,1], faithful[,2])

or (in a classier way)

plot(faithful$eruptions, faithful$waiting)

This produces a simple scatterplot with the eruptions' duration on the x axis and the waiting times on the y axis (in the order they appear in plot()). Simple means that we only see points as white circles with minimal annotation of axes, labels etc. We can add all of those to our plot by using specific arguments of the plot() function. Here I will try to show several of them at once

plot(faithful$eruptions, faithful$waiting, col="dark grey", pch=19, xlim=c(1,6), ylim=c(40,100), xlab="Eruption Duration", ylab="Waiting Time", main="Old Faithful Eruptions", cex.axis=1.2, cex.lab=1.3, cex.main=1.5)

Too much at once? Let's take a look at them one by one. First of all we change the colour of the white circles to dark grey using the col= option. Notice that we turned the hollow circles into solid ones by setting the pch= option to 19. We could have plotted stars instead of circles by setting pch=8, or squares by setting pch=15. Next we extend the limits of the plot by manually setting the lower/upper boundaries of both axes. In this sense xlim=c(1,6) means that the x-axis will run from 1 to 6. R by default sets axis boundaries to the minimum and maximum values of the dataset, unless told otherwise. Next we set names for the axis labels with xlab= and ylab= as well as for the whole plot with main= . Finally we set the font size for the numerical values of the axes with cex.axis=, the axis labels with cex.lab= and the main title of the plot with cex.main= . All cex.*= values are numbers greater than 0, with 1 equalling the default. Use values <1 for smaller fonts and >1 for bigger.
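If you want a quick visual reminder of what these symbols look like, a throwaway sketch such as the following (any small set of coordinates will do) prints the three pch values mentioned above side by side:

plot(1:3, rep(1,3), pch=c(19,8,15), cex=3, xlab="", ylab="", main="pch 19, 8 and 15")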

This way, we have covered most of the basic aspects of a scatterplot. Of course there is much more to plots than simple scatterplots and, in fact, there is much more to scatterplots themselves. R carries a great number of options for plotting even the simplest graphs, making them precise, elegant and informative at the same time. Remember that you can get a glimpse of an R function's options by seeking help with "?". If you simply type:

?plot

you will get a listing of all the options the plot function carries plus a number of interesting examples of its use at the end of the help file.

Before we move on let's see some different types of plots that are formed with two-dimensional data and which are not scatterplots. Let's assume the (not very unlikely) case that you have two-dimensional data in which the first variable may be seen as a series of something. This can be space or time data or even simply enumerations of a ranked variable (1, 2, 3 ...etc). In such cases we would often want to plot the two variables in a way that permits us to see the fluctuation of one against the other. Consider the faithful dataset above. We may be asked to plot the duration of each eruption for a series of, say, the first 100 eruptions. The durations would be equal to

durations<-faithful[1:100,1]

or alternatively

durations<-faithful$eruptions[1:100]

The two commands above have identical outcomes, just a different way of subsetting. Now the series we want is a sequence of numbers from 1 to 100. We can simply create this with

c(1:100)

in the simplest case. A more elaborate way is with the use of the seq() function. seq() creates a sequence of not only integers but also real numbers given a lower and upper limit alongside a step. If we want the integers from 1 to 100 we simply call

x<-seq(from=1, to=100, by=1)

but seq() also works with decimal numbers or even backwards like

seq(from=1000, to=500, by=-2.5)

Coming back to our problem we can simply create a series called x that would be the rank of the eruption

x<-seq(from=1, to=100, by=1)

And now we can plot the durations against the series with plot

plot(x, durations)

But this plot doesn't look very informative. In fact it looks more like a scatterplot. What if we connected the dots with a line by setting type="l"?

plot(x, eruptions, type="l")

We can also modify the width of the line (lwd) and its colour (col) like this. lwd stands for line width, the default being 1. col is just a way to set the colour

plot(x, eruptions, type="l", col="red", lwd=3)

Better? What if we connected the dots but kept them embedded in the line like beads on a string with type="o", or drew a vertical line from the baseline up to each point (a histogram-like, filled look) with type="h"? We can even colour this at will with col="any colour you like"

plot(x, eruptions, type="h", col="dark green")

By now we've seen there are quite a few options even with the simple plot() function. Last but not least, what if one wanted to see more than one dataset in the same graph, say plot the first 50 and the second 50 eruption durations next to each other and see how they differ? R has an easy way to do this by combining a plot() function with a lines() function. lines() is called after a plot has already been drawn to add an additional dataset to the graph. It is important though that the x-coordinates are of the same range. In this case we simply need to call

x<-seq(from=1, to=50, by=1)

y1<-faithful$eruptions[1:50]

y2<-faithful$eruptions[51:100]

and then call plot() first

plot(x, y1, type="l", col="red", lwd=3, ylab="Eruption Duration", xlab="Duration No", ylim=c(1,6))

and lines() after that

lines(x, y2, type="l", col="blue", lwd=3)

The plot looks rather fuzzy since the durations are very similar. One last thing before we wrap up has to do with annotation of the plots. Which is the red and which is the blue line? What do they correspond to? We should always make sure our graphs are self-explanatory. After all, we are plotting data to convey information. We need a legend to denote that the red line corresponds to the 1st-50th eruptions and the blue to the 51st-100th ones. R has a function just for that with the very well-suited name legend(). In our case it could be something like

legend("topright", c("first 50", "second 50"), col=c("red", "blue"), lwd=c(3,3), lty=c(1,1), bty="n", cex=1.3)

In brief, legend() needs one argument for placement (one of "topright", "top", "bottomleft" etc) followed by vectors that contain the text, the line colours, the line types (lty=c(1,1)) and the line widths in the correct order (notice how "first 50" is first in the c() argument and "red" is also first in the col= argument). bty="n" tells R not to draw a box around the legend and cex gives a notion of its size (default is 1). Legends are very important and you should always include them in all your graphs. Remember that even if no one else is to look at them, it is very likely that you yourselves will need them next time you take a look at your data.

Histograms with hist

In a previous section we saw how we can calculate simple statistics on datasets. Let's now get started with histograms. Histograms are the easiest way to graphically describe a distribution of variables, be they continuous or discrete. Let's assume a normally distributed variable x.

x<-rnorm(1000)

Calling hist() on a vector x is done with:

hist(x)

Histograms are formed through the discretization of the variable space into a specified number of bins and the calculation of either the number or the frequency of the distribution's elements (the pieces of your data) falling within the range of values defined by each bin. You may imagine the bins as a set of equally sized boxes covering the entire range (min to max) of values in your dataset; the more boxes you ask for, the narrower each one becomes. Enumeration of values occurs automatically with the hist() function in R. What hist() does is that, given a vector of numerical values, it will split the range of the variable into a number of bins specified by the user with the breaks option. Compare

hist(x, breaks=10)

with

hist(x, breaks=100)

hist() then calculates a) the counts, that is the actual number of elements in each bin, and b) the densities, that is the probability density of each bin. hist() plots counts or densities depending on the freq option. If freq is set to FALSE, densities will be plotted, which means that the integral of the histogram will be equal to one. Notice the difference in the height of the y-axis in the two following commands

hist(x, breaks=100, freq=T)

hist(x, breaks=100, freq=F)

Histograms accept a number of options that we have already covered in other plotting functions, regarding labels and colouring. For example, let's plot the horsepower of the cars in the built-in mtcars dataset.

cars<-mtcars

hist(cars$hp, breaks=10, col="red", freq=F, xlab="Horse Power")

In the example above we take the hp variable from the mtcars data.frame and plot its histogram in red colour with 10 breaks. We also ask that the densities, and not the counts, be plotted.
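Since freq=F puts the histogram on a density scale, one natural (optional) addition is to overlay a kernel density estimate on top of it; a minimal sketch:

hist(cars$hp, breaks=10, col="red", freq=F, xlab="Horse Power")

lines(density(cars$hp), lwd=2) # a smooth density curve drawn over the histogram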

Histograms are not only good for depicting/plotting but can also serve the analysis per se, in the sense that one can obtain the data produced for a certain binning and for a certain density/count. Let's assume that you need the data produced by hist() in the form of an x,y table for further use. This can be done by harvesting the output of hist() on a certain vector, like this:

data<-rnorm(1000, mean=0, sd=1) # creating random data

x<-hist(data, breaks=10)$mids

y1<-hist(data, breaks=10)$counts

y2<-hist(data, breaks=10)$density

What we did in the last three commands is that we asked for specific variables created by hist() to be passed on to new variables: mids stands for the midpoints of the bins we created, while counts and density have already been discussed. We can now use x, y1 and y2 freely for plotting in our own way. Let's say that we need to plot the histogram not in columns but in connected dots of red colour. We can simply use plot() on the data we have just created with hist() like this:

plot(x, y1, type="o", col="dark red", lwd=6)

In this way we have made use of hist() only to create the histogram data, but we can now plot them as we like. Before we move on, one last function that comes coupled with hist() (but not only with it): the rug() function. rug() is called on a vector, just like hist, to produce, well..., a rug of the data that lies at the bottom of plots like hist or plot. To get an idea of how rug works simply call it on the same vector used above for the creation of a histogram

hist(data, breaks=10)

rug(data)

You see that rug() adds a strip of tick marks at the bottom of the graph that is representative of the density of elements. Each tick of the rug corresponds to one instance of the vector. In this way you can get an extra view of your dataset coupled with your histogram. rug() can be called next to a number of plots including plot() as long as they refer to the same coordinates. In this case it would be like:

plot(x, y1, type="o", col="dark red", lwd=6); rug(data)

Notice in the line above how we can call multiple commands on the same line simply by putting a semicolon between them.
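One side note on the harvesting we did above: calling hist() three times re-draws the plot three times. If you only want the numbers, hist() can be called once with plot=FALSE and its output stored in a variable (a small sketch of the same idea; the name h is arbitrary):

h<-hist(data, breaks=10, plot=FALSE) # compute the histogram without drawing it

x<-h$mids # the bin midpoints

y1<-h$counts # the counts per bin

y2<-h$density # the densities per bin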

Boxplots

Histograms are great for describing the distribution of a quantity's values in a simple and informative manner, yet most of the time we need to work with comparisons. That is, we need ways to plot distributions side by side and compare them visually, in a direct way. R has many variations for this that are all based on boxplots, or "box-and-whiskers" plots. Boxplots may be seen as visual summarizations of a distribution of values, in the way that the summary() function captures some main attributes of a vector in numbers. Remember how the summary() function, when called on a vector, returns the min and max values alongside the values for the lower (25%) and the upper (75%) quartiles as well as the mean and the median. A boxplot partly builds on these values, creating a structure where a) a main box is drawn with the lower and upper quartiles as boundaries and an inner line corresponding to the median, b) "whiskers" extend up and down to the most extreme values that still lie within 1.5 times the interquartile range (IQR, that is the distance between the 75% and 25% percentiles, corresponding to the middle 50% of the values) from the box, and c) values that fall beyond the whiskers, which we call outliers, are plotted as individual dots. Here's a simple example of a boxplot.

x<-rnorm(1000, mean=5, sd=0.5)

boxplot(x)

In the first command we create a random vector x which is a sample from a normal distribution with mean=5 and standard deviation=0.5. Then we call boxplot on this vector and see how the box is roughly centered around 5 with "whiskers" extending to values further outwards. Some values are randomly scattered outside the whiskers and are plotted as dots. (Notice that rnorm() is random, so the number and positions of outliers may change every time you call it; the same, to a lesser extent, holds for most of the boxplot's other attributes.)

Boxplots can also be plotted horizontally, in which case it may be nice to couple them with rug

x<-rnorm(1000, mean=5, sd=0.5)

boxplot(x, horizontal=T); rug(x)

Boxplots can be modified in many ways, accepting a number of options including colour (col=) and line width (lwd=) that modify their appearance. Two interesting features that are also quite informative are varwidth= and notch=, both of which accept logical values (TRUE/FALSE). Setting varwidth to TRUE modifies the width of each box by a factor relative to the number of observations in the vector (proportional to its square root, to be exact). In this way we can visualize differences in the sizes of the vectors compared. For instance consider

x1<-rnorm(1000, mean=5, sd=0.5)

x2<-rnorm(100, mean=5, sd=0.5)

In this case x1 and x2 are sampled from the same distribution but the sample size differs by an order of magnitude. Visualization with boxplot, with the varwidth option set to TRUE, will convey this information visually

boxplot(x1, x2, varwidth=T, col="grey", lwd=3)

One last attribute of boxplots that is worth mentioning is the notch argument. notch=T draws a notch (like a bowtie) around the median of the boxplot, whose extent depends on the IQR and the sample size and acts as a rough confidence interval for the median. In this way it gives an idea of the area in which the data have the greatest density. Although you cannot simply count on notches, you can say that two distributions' medians are likely different if the notches of their corresponding boxplots do not overlap

x1<-rnorm(1000, mean=5, sd=0.5)

x2<-rnorm(1000, mean=5, sd=1)

boxplot(x1, x2, col="grey", notch=T, lwd=3)

In case you want to know more about the data behind a boxplot, R has a nice function called boxplot.stats. When called on a vector it returns the number of observations (n), the five values used to draw the box and whiskers (stats), a confidence interval around the median (conf) and the exact values of the outliers (out).
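A minimal sketch of harvesting these values (the element names are those returned by boxplot.stats):

bs<-boxplot.stats(x1)

bs$n # the number of observations

bs$stats # lower whisker, lower hinge, median, upper hinge, upper whisker

bs$conf # a rough confidence interval around the median

bs$out # the values plotted as outlier dots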

vioplot/beanplot

vioplot() and beanplot() are alternatives to boxplots. Both come with additional packages, so those packages need to be installed and loaded first. They produce a box-like plot without whiskers that mimics the shape of a histogram. beanplot in particular has useful options that allow you to see the quartiles and a rug of the values visualized on the same plot. beanplot can be loaded by first installing the package

install.packages("beanplot)

In the window that pops up you should choose the mirror closest to you. After installation is finished you need to load the library with

library("beanplot")

This is basically the way we import new functions and packages to R. But wait for more in the Programming R section about this and other uses of the extensive R repository.

Now you are ready to use beanplots like this:

x1<-rnorm(1000, mean=1, sd=1)

x2<-rnorm(1000, mean=0, sd=1)

beanplot(x1, x2, what=c(1,1,1,1), col=c("black", "red", "blue", "light grey"))

What you get is a plot that has 4 main features for every vector. These are the ones corresponding to the 4 1s in the what= option and they are (in this order): 1: a dotted line traversing the plot that corresponds to the overall mean (or the median; this can be set with the overallline="median" option), 2: the beanplot itself, 3: the mean of each of the beanplots (vectors), which can also be set to refer to the median or even the quantiles with beanlines="median" or beanlines="quantiles", and 4: a rug of the values of each vector. The colouring also follows as a vector of 4 elements, with each colour corresponding to one of the four aforementioned features.
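For completeness, here is how the vioplot alternative mentioned at the start of this section could be called, assuming the vioplot package installs and loads cleanly from your mirror:

install.packages("vioplot")

library(vioplot)

vioplot(x1, x2, col="light grey") # violin-shaped density plots, one per vector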

Three-dimensional data with persp and filled.contour

Plotting data in three dimensions is not so common and in many cases not very practical either. Still, sometimes we would like to see how a variable z fluctuates against two others, x and y. R has a number of functions for this purpose, which may be divided into two main categories: a) "volume" plots like the ones produced by persp() and b) "colour-coded" plots like the ones produced with contour(). Both persp() and contour() receive similar arguments, that is three objects: the first two are vectors (x, y) that define a "grid", the main surface, and the third object z has to be a matrix of dimensions [x, y] that represents some sort of "height" for each 2D coordinate of the grid. It is impossible to plot the data if these conditions are not fulfilled. For starters, let's begin by creating these objects:

x<-seq(from=1, to=10, by=1)

y<-seq(from=1, to=100, by=10)

We have now created two vectors x and y of equal length (1 to 10 by 1 and 1 to 100 by 10). Thus the dimensions of the grid are 10x10. We should now have the data z in a matrix with the same dimensions. We can create this matrix by first setting all values equal to zero like this:

z<-matrix(0, nrow=10, ncol=10)

The command above tells R to create a matrix with 10 rows and 10 columns, setting each of the 10x10 elements equal to 0. Inspect the z matrix by typing z and see for yourselves. Now what we need is to fill in the matrix with numbers. I will simply create a vector of 100 random numbers sampled from a uniform distribution

data<-runif(100, min=1, max=100)

in fact I will make it a bit more structured by ordering the 100 random numbers from the smallest to the largest. I can do this with the sort() function. Simply call sort() on data and then feed the output to matrix

sort(data)->temp

z<-matrix(temp, nrow=10, ncol=10)

We are now ready to produce a simple 3D plot with persp like this

persp(x,y,z, phi=15, theta=30, col="lightblue4", main="My 3D plot", xlab="some x", ylab="some y", zlab="a sorted random z")

Take a minute to see the grammar of persp. The first three arguments are x, y and z, in this order. It is crucial that it is this order, since otherwise the dimensions of the vectors/matrix will not fit in the function. Next, the two values for phi= and theta= are related to the viewpoint of the plot: phi and theta are the angles by which the plot (which is a 3D volume) is tilted. Modify these accordingly to get different (and better) views of your plots. You should by now be very aware of the colour and axis label arguments that follow.
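To see what the viewpoint angles do, the same data can simply be re-drawn with different values (an illustrative sketch, the angles are arbitrary):

persp(x, y, z, phi=40, theta=120, col="lightblue4", main="My 3D plot, rotated", xlab="some x", ylab="some y", zlab="a sorted random z")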

contour() and its companion filled.contour() may also be used for the visualization of 3D data. The difference from persp is that this is not a volume representation but a colour-coded one. That is, R takes the values of the matrix (here z) and, instead of plotting them as heights in a 3D plot, it transforms them into a detailed colour code (something like red being the highest and blue the lowest). In this sense it is important that a palette of colours (a series of different colours or colour gradients) is supplied to the function. One example may be:

filled.contour(x,y,z, color.palette=rainbow, main="My Contour plot", xlab="some x", ylab="some y")

Inspect the created plot. It is basically a 2D plot with x and y on the two axes, but the grid is colour-coded with a certain degree of resolution (depending on the dimensions of z), and the legend for the colour code is given right next to the plot. The main argument to keep an eye on here is color.palette, which takes as its value a series of colours that can either be pre-built in R or created by ourselves. Here I chose to use a pre-built palette called rainbow (self-explanatory); others are heat.colors, topo.colors etc. Later on we will see how we can build our own palettes.
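Two quick variations you can try on the same data (sticking to base functions only): plain contour() draws iso-lines instead of filled areas, and swapping the palette is just a matter of naming another built-in one.

contour(x, y, z, main="My Contour plot", xlab="some x", ylab="some y")

filled.contour(x, y, z, color.palette=topo.colors, main="My Contour plot", xlab="some x", ylab="some y")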

Saving and Printing plots

Last but not least, we have reached the point where our plots look awesome but we can only sit and stare at them on our screen. Is there a way to extract them as files, store them on our computer and share them with friends or colleagues? The answer is obviously yes. R employs a simple routine to export plots in various image formats such as jpeg, png, postscript, tiff or pdf. The names of the functions are (again) self-explanatory. The grammar is the same in all of them and follows the simple structure

nameoffunction("filename"); plotting commands; dev.off()

where the name of the function is one of png, pdf, tiff etc. and "filename" is the name that the file will be given in the folder you are currently working in. This command opens a "writer device", an environment that accepts everything plotted in R and stores it in a file named "filename". While issuing plotting commands with the "device" on (that is, having called one of the printing commands tiff, pdf etc.) you will not be able to see anything on your screen. This means that the output of the plotting commands has been redirected from the screen to the file. Once you are done, you need to finalize the file and quit the writer device. This is what the dev.off() function does. It simply tells R to print whatever has been plotted to the file named "filename", finalize it and exit the device. Then, once again, printing of plots will be redirected to the screen. An example printing of a plot to a pdf file may be:

pdf("contour.pdf")

filled.contour(x,y,z, color.palette=rainbow, main="My Contour plot", xlab="some x", ylab="some y", key.title=title(main="ordered random z"))

dev.off()

Notice how it also works with the commands issued on separate lines.

All printing functions take a series of arguments related to the size of the file (both its dimensions and its size in memory), the desired resolution and the compression type (for tiff files). These differ depending on the function (and so on the format of the file to be saved). Make sure you get your options right.
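As a closing sketch, here is how such arguments might look for png(); the file name is arbitrary, width and height are given in pixels and res in pixels per inch:

png("faithful_scatter.png", width=800, height=600, res=120)

plot(faithful$eruptions, faithful$waiting, pch=19, col="dark grey", xlab="Eruption Duration", ylab="Waiting Time", main="Old Faithful Eruptions")

dev.off()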