1. R Practical #1. Introduction

R Practical #1

Christoforos Nikolaou

December 2016

Simple Graphs and Stats in R

Getting Started In order to start using R, simply run R by selecting it from the list of programs in windows or type “R” in a terminal window (under Linux). Once into the R environment you can perform simple mathematical operations, import data and analyze them with built-in functions or functions contained in external packages that you can download and install very easily from specific package repositories.

Running R and setting working directory in Windows Choose R or R-studio from your programs. In the screen that comes up type

getwd()

## [1] "/home/christoforos/Dropbox/Teaching/My_BIO204"

The results that comes up on your screen is the directory you are currently working in. In case you want to change the working directory to e.g. your desktop you can use this something like this command

setwd("C:/user/Desktop")

Try keeping in mind the directory you are working in as all data will be fed in better this way and you will be in better control of things. Installing and Loading Packages Let’s see one such example of downloading and installing a package called MASS. MASS stands for Modern Applied Statistics with S/R and contains a number of useful functions including some example datasets. Installing a package is as easy as typing:

install.packages("MASS")

in your R environment. In the pop-up window that appears you may select “Greece” from the list of repositories and hit enter. Then wait while the package is being downloaded. Once any package is downloaded and before it may be used, it should be loaded into the running environment. This is something that needs to be done every time R is started and is performed with the use of the library() function:

library("MASS")

A simple dataset

Now we have MASS installed and running in R. Lets look into an example dataset called “iris”. Simply type:

iris

##     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species ## 1            5.1         3.5          1.4         0.2     setosa ## 2            4.9         3.0          1.4         0.2     setosa ## 3            4.7         3.2          1.3         0.2     setosa ## 4            4.6         3.1          1.5         0.2     setosa ## 5            5.0         3.6          1.4         0.2     setosa ## 6            5.4         3.9          1.7         0.4     setosa ## 7            4.6         3.4          1.4         0.3     setosa ## 8            5.0         3.4          1.5         0.2     setosa ## 9            4.4         2.9          1.4         0.2     setosa ## 10           4.9         3.1          1.5         0.1     setosa ## 11           5.4         3.7          1.5         0.2     setosa ## 12           4.8         3.4          1.6         0.2     setosa ## 13           4.8         3.0          1.4         0.1     setosa ## 14           4.3         3.0          1.1         0.1     setosa ## 15           5.8         4.0          1.2         0.2     setosa ## 16           5.7         4.4          1.5         0.4     setosa ## 17           5.4         3.9          1.3         0.4     setosa ## 18           5.1         3.5          1.4         0.3     setosa ## 19           5.7         3.8          1.7         0.3     setosa ## 20           5.1         3.8          1.5         0.3     setosa ## 21           5.4         3.4          1.7         0.2     setosa ## 22           5.1         3.7          1.5         0.4     setosa ## 23           4.6         3.6          1.0         0.2     setosa ## 24           5.1         3.3          1.7         0.5     setosa ## 25           4.8         3.4          1.9         0.2     setosa ## 26           5.0         3.0          1.6         0.2     setosa ## 27           5.0         3.4          1.6         0.4     setosa ## 28           5.2         3.5          1.5         0.2     setosa ## 29           5.2         3.4          1.4         0.2     setosa ## 30           4.7         3.2          1.6         0.2     setosa ## 31           4.8         3.1          1.6         0.2     setosa ## 32           5.4         3.4          1.5         0.4     setosa ## 33           5.2         4.1          1.5         0.1     setosa ## 34           5.5         4.2          1.4         0.2     setosa ## 35           4.9         3.1          1.5         0.2     setosa ## 36           5.0         3.2          1.2         0.2     setosa ## 37           5.5         3.5          1.3         0.2     setosa ## 38           4.9         3.6          1.4         0.1     setosa ## 39           4.4         3.0          1.3         0.2     setosa ## 40           5.1         3.4          1.5         0.2     setosa ## 41           5.0         3.5          1.3         0.3     setosa ## 42           4.5         2.3          1.3         0.3     setosa ## 43           4.4         3.2          1.3         0.2     setosa ## 44           5.0         3.5          1.6         0.6     setosa ## 45           5.1         3.8          1.9         0.4     setosa ## 46           4.8         3.0          1.4         0.3     setosa ## 47           5.1         3.8          1.6         0.2     setosa ## 48           4.6         3.2          1.4         0.2     setosa ## 49           5.3         3.7          1.5         0.2     setosa ## 50           5.0         3.3          1.4         0.2     setosa ## 51           7.0         3.2          4.7         1.4 versicolor ## 52           6.4         3.2          4.5         1.5 versicolor ## 53           6.9         3.1          4.9         1.5 versicolor ## 54           5.5         2.3          4.0         1.3 versicolor ## 55           6.5         2.8          4.6         1.5 versicolor ## 56           5.7         2.8          4.5         1.3 versicolor ## 57           6.3         3.3          4.7         1.6 versicolor ## 58           4.9         2.4          3.3         1.0 versicolor ## 59           6.6         2.9          4.6         1.3 versicolor ## 60           5.2         2.7          3.9         1.4 versicolor ## 61           5.0         2.0          3.5         1.0 versicolor ## 62           5.9         3.0          4.2         1.5 versicolor ## 63           6.0         2.2          4.0         1.0 versicolor ## 64           6.1         2.9          4.7         1.4 versicolor ## 65           5.6         2.9          3.6         1.3 versicolor ## 66           6.7         3.1          4.4         1.4 versicolor ## 67           5.6         3.0          4.5         1.5 versicolor ## 68           5.8         2.7          4.1         1.0 versicolor ## 69           6.2         2.2          4.5         1.5 versicolor ## 70           5.6         2.5          3.9         1.1 versicolor ## 71           5.9         3.2          4.8         1.8 versicolor ## 72           6.1         2.8          4.0         1.3 versicolor ## 73           6.3         2.5          4.9         1.5 versicolor ## 74           6.1         2.8          4.7         1.2 versicolor ## 75           6.4         2.9          4.3         1.3 versicolor ## 76           6.6         3.0          4.4         1.4 versicolor ## 77           6.8         2.8          4.8         1.4 versicolor ## 78           6.7         3.0          5.0         1.7 versicolor ## 79           6.0         2.9          4.5         1.5 versicolor ## 80           5.7         2.6          3.5         1.0 versicolor ## 81           5.5         2.4          3.8         1.1 versicolor ## 82           5.5         2.4          3.7         1.0 versicolor ## 83           5.8         2.7          3.9         1.2 versicolor ## 84           6.0         2.7          5.1         1.6 versicolor ## 85           5.4         3.0          4.5         1.5 versicolor ## 86           6.0         3.4          4.5         1.6 versicolor ## 87           6.7         3.1          4.7         1.5 versicolor ## 88           6.3         2.3          4.4         1.3 versicolor ## 89           5.6         3.0          4.1         1.3 versicolor ## 90           5.5         2.5          4.0         1.3 versicolor ## 91           5.5         2.6          4.4         1.2 versicolor ## 92           6.1         3.0          4.6         1.4 versicolor ## 93           5.8         2.6          4.0         1.2 versicolor ## 94           5.0         2.3          3.3         1.0 versicolor ## 95           5.6         2.7          4.2         1.3 versicolor ## 96           5.7         3.0          4.2         1.2 versicolor ## 97           5.7         2.9          4.2         1.3 versicolor ## 98           6.2         2.9          4.3         1.3 versicolor ## 99           5.1         2.5          3.0         1.1 versicolor ## 100          5.7         2.8          4.1         1.3 versicolor ## 101          6.3         3.3          6.0         2.5  virginica ## 102          5.8         2.7          5.1         1.9  virginica ## 103          7.1         3.0          5.9         2.1  virginica ## 104          6.3         2.9          5.6         1.8  virginica ## 105          6.5         3.0          5.8         2.2  virginica ## 106          7.6         3.0          6.6         2.1  virginica ## 107          4.9         2.5          4.5         1.7  virginica ## 108          7.3         2.9          6.3         1.8  virginica ## 109          6.7         2.5          5.8         1.8  virginica ## 110          7.2         3.6          6.1         2.5  virginica ## 111          6.5         3.2          5.1         2.0  virginica ## 112          6.4         2.7          5.3         1.9  virginica ## 113          6.8         3.0          5.5         2.1  virginica ## 114          5.7         2.5          5.0         2.0  virginica ## 115          5.8         2.8          5.1         2.4  virginica ## 116          6.4         3.2          5.3         2.3  virginica ## 117          6.5         3.0          5.5         1.8  virginica ## 118          7.7         3.8          6.7         2.2  virginica ## 119          7.7         2.6          6.9         2.3  virginica ## 120          6.0         2.2          5.0         1.5  virginica ## 121          6.9         3.2          5.7         2.3  virginica ## 122          5.6         2.8          4.9         2.0  virginica ## 123          7.7         2.8          6.7         2.0  virginica ## 124          6.3         2.7          4.9         1.8  virginica ## 125          6.7         3.3          5.7         2.1  virginica ## 126          7.2         3.2          6.0         1.8  virginica ## 127          6.2         2.8          4.8         1.8  virginica ## 128          6.1         3.0          4.9         1.8  virginica ## 129          6.4         2.8          5.6         2.1  virginica ## 130          7.2         3.0          5.8         1.6  virginica ## 131          7.4         2.8          6.1         1.9  virginica ## 132          7.9         3.8          6.4         2.0  virginica ## 133          6.4         2.8          5.6         2.2  virginica ## 134          6.3         2.8          5.1         1.5  virginica ## 135          6.1         2.6          5.6         1.4  virginica ## 136          7.7         3.0          6.1         2.3  virginica ## 137          6.3         3.4          5.6         2.4  virginica ## 138          6.4         3.1          5.5         1.8  virginica ## 139          6.0         3.0          4.8         1.8  virginica ## 140          6.9         3.1          5.4         2.1  virginica ## 141          6.7         3.1          5.6         2.4  virginica ## 142          6.9         3.1          5.1         2.3  virginica ## 143          5.8         2.7          5.1         1.9  virginica ## 144          6.8         3.2          5.9         2.3  virginica ## 145          6.7         3.3          5.7         2.5  virginica ## 146          6.7         3.0          5.2         2.3  virginica ## 147          6.3         2.5          5.0         1.9  virginica ## 148          6.5         3.0          5.2         2.0  virginica ## 149          6.2         3.4          5.4         2.3  virginica ## 150          5.9         3.0          5.1         1.8  virginica

and see that an object is listed in your window. How can you find out the class of this object? Use the class() function:

class(iris)

## [1] "data.frame"

to see that this is a data.frame. The iris dataset contains measurements for morphological characteristics of plants belonging to different species. Use the str() function:

str(iris)

## 'data.frame':    150 obs. of  5 variables: ##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

to inspect the structure and components of this data.frame. str() also helps you find out how many characteristics are measured, what kind of data these are (numeric, character, logical etc) and how many different species are investigated. Can you tell all this by your str() calling?

Importing your own data

Iris is a built-in dataset which we can use as an example for many R function but what if we wanted to analyze our own data in R? This can be done by importing your data into the environment, converting them into an R-compatible form/object and then proceed to the analysis. Importing and converting is done automatically with suitable functions in R but first the data need to be pre-formatted in a way that is readable. This is the basic ASCII text format (simple txt-like files that you could read with applications like notepad or textpad). You can import such files with functions like

read.table() # for generic tables read.delim() # for tables, whose lists are separated by various delimiters (characters like comma, tab, space etc)

read.csv() # for csv (comma-separated-values) format

Nonetheless most of the times, Windows users tend to have their data in some sort of spreadsheet file like MS-Office-Excel or LibreOffice-Calc. These can be pre-formatted by exporting them to csv format using the File-Export command in MS-Office or saving as “Text-CSV” in LibreOffice. You can try this yourself: 1. Go to this (https://sites.google.com/site/uocdataanalysis/sharedfiles/irisdataset.xls?attredirects=0&d=1) link and download this *xls (Excell) file to your computer. 2. Open it with a spreadsheet editor and save it as csv keeping the same name (just change the extension to .csv) 3. Open R and run:

myiris<-read.csv("irisdataset.csv")

What you have just done is load the iris dataset from an excell file to a data.frame called myiris, which is identical to the built-in iris dataset. Lets use this from now on.

Simple Statistics with R

Having used str() we now know that we have data on the width and the length of the sepals and pedals of 50 plants from three different species of Iris. Lets start making use of the full potential of R by obtaining some descriptive statistics. We can do this with summary():

summary(myiris)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width    ##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   ##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   ##  Median :5.800   Median :3.000   Median :4.350   Median :1.300   ##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199   ##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800   ##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500   ##        Species   ##  setosa    :50   ##  versicolor:50   ##  virginica :50   ##                  ##                  ##

Inspect the output. We have all the information we want for all numeric values, including the mean, the median, the maximum and minimum values etc. In case we wanted some more info on basic statistics we could ask for the standard deviation with sd(). This function however can only be called upon vectors and so we need to choose which vector we want to analyze. This is done with subsetting, that is by choosing a part of an object before calling a function. Our myiris data.frame has names for each column so we can use the $ operator with the name of the column to subset:

sd(myiris$Sepal.Length)

## [1] 0.8280661

will return the standard deviation of the first column entitled Sepal.Length.

More Subsetting

Subsetting is very important in R because it allows complex analysis in meaningful splits of the data. In our case a meaningful such split is to analyze the data for each species separately. We would like to see for instance the data for Iris setosa and Iris versicolor separately. How can we do this? One very useful function is which(). It can be used in various contexts (lists, data.frames, vectors and on numerical or character classes) while it also allows combinations of conditions (e.g. we want our subset to fulfill this AND that condition or this OR that etc). which() is used like this:

which(myiris$Species=="setosa")->set which(myiris$Species=="versicolor")->vers

In the commands above we ask for a condition to be specified in the column of the myiris data.frame that is entitled species and thus extract the positions of the corresponding subsets of the data.frame. Remember that the output of which() is a vector containing the positions of the elements fulfilling the condition asked.

Simple plotting. Scatterplots

We now have a way to get the data separately for each species. In this sense

myiris$Sepal.Length[set]

##  [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 ## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 ## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0

will bring back only the Sepal Lengths of the setosa species, while

myiris$Petal.Width[vers]

##  [1] 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5 1.0 1.4 1.3 1.4 1.5 ## [18] 1.0 1.5 1.1 1.8 1.3 1.5 1.2 1.3 1.4 1.4 1.7 1.5 1.0 1.1 1.0 1.2 1.6 ## [35] 1.5 1.6 1.5 1.3 1.3 1.3 1.2 1.4 1.2 1.0 1.3 1.2 1.3 1.3 1.1 1.3

will bring back only the Petal Widths of the versicolor species etc. Lets see now how we can visualize the data. We can use the simple plot() function to visualize two-dimensional data. If for instance we would like to see how the two sepal dimensions vary with each other in the setosa species we could invoke a simple plot() command like:

plot(myiris$Sepal.Length[set],myiris$Sepal.Width[set])

plot() however need not be always so simple and a number of attributes related to points, colors, names of axes etc can be modified. An example would be:

plot(myiris$Sepal.Length[set],myiris$Sepal.Width[set], type="p", pch=19, col="red", xlab="Sepal Length", ylab="Sepal Width", xlim=c(4,7), ylim=c(1,5));

where attributes of the plot are set accordingly for better aesthetic results. Consider now that we want to plot the same data for another species in the same plot for a direct comparison. We can do this with invoking a lines() function, right after a plot(). lines() takes the same arguments only without the general specifics having to do with axis names, sizes etc. Plotting of the same data for the versicolor species would be done with:

plot(myiris$Sepal.Length[set],myiris$Sepal.Width[set], type="p", pch=19, col="red", xlab="Sepal Length", ylab="Sepal Width", xlim=c(4,7), ylim=c(1,5)); lines(myiris$Sepal.Length[vers],myiris$Sepal.Width[vers], type="p", pch=19, col="dark green", xlab="Sepal Length", ylab="Sepal Width");

The plot would not be complete without a legend that tells us which set of points refer to what. legend() does the job:

plot(myiris$Sepal.Length[set],myiris$Sepal.Width[set], type="p", pch=19, col="red", xlab="Sepal Length", ylab="Sepal Width", xlim=c(4,7), ylim=c(1,5)); lines(myiris$Sepal.Length[vers],myiris$Sepal.Width[vers], type="p", pch=19, col="dark green", xlab="Sepal Length", ylab="Sepal Width"); legend("topright", c("I. setosa", "I. versicolor"), fill=c("red","dark green"), bty="n", cex=1.2)

Looking for dependence in data. Linear Correlation

The plot we have just created shows a clear association between sepal lengths and widths for both species. By association we mean that it looks as if the higher the width of a sepal the higher its length is expected to be. This implies a dependence of the two variables, or, in other words, it tells us that we may be able to predict (to some extent) one from the other. Dependences of this kind is the first thing we look for in data. The rigorous term is correlation, and the simplest form of a correlation is a linear relationship between two values. We can calculate the linear correlation coefficient by invoking R’s cor() function on two equally sized, aligned vectors. In our case:

cor(myiris$Sepal.Length[set],myiris$Sepal.Width[set])

## [1] 0.7425467

The command above will return one numerical value in the range between -1 (perfect negative (or anti-) correlation) and 1 (perfect correlation). The closest this value is to the extremes (-1 or 1) the strongest the association between the variables and the better we will be able to predict one from the other. Compare the value for the setosa sepals with the one for the petals

cor(myiris$Petal.Length[set],myiris$Petal.Width[set])

## [1] 0.33163

and notice how the correlation for the sepals is almost twice as high as the sepals one. What does this tell us? The cor() function need not be strictly invoked in pairs of vectors. It works equally fast for numerical matrices were it returns a matrix of all 1-to-1 correlations of the columns in the matrix. If we restrict the myiris data.frame to its numerical matrix myiris[,1:4] (remember the structural subsetting) we get:

which(myiris$Species=="virginica")->virg  cor(myiris[virg,1:4])->virgCors

virgCors is a 4x4 matrix that contains the linear (Pearson) correlation coefficients for all the data of the virginica species. A quick way to inspect matrices like this one is by plotting them in a heatmap. An example of a heatmap called with the heatmap() function is:

heatmap(virgCors, cexRow=1.5, cexCol=1.5, mar=c(12,12), scale="none")

Notice how a) all the diagonal elements are very bright as the corresponding correlation coefficients are =1 and b) the part corresponding to the sepals is brighter than the one of the petals as it corresponds to higher correlation coefficients. Compare this image with the corresponding for the versicolor species:

which(myiris$Species=="versicolor")->vers  cor(myiris[vers,1:4])->versCors heatmap(versCors, cexRow=1.5, cexCol=1.5, mar=c(12,12), scale="none")

And see that the pattern is relatively different. The questions is: how different? How can we say if two datasets differ?

Basic statistical inference. Boxplots and Simple Hypotheses testing.

Simple statistical analysis of a data set is the basis for what we call hypothesis testing. In this case we want to, say, test the hypothesis that the two datasets for setosa and versicolor are not the same, that they differ in some aspect. We will thus test the hypothesis that they are NOT different and try to see if we can disprove it using simple statistics. Before we go on with a simple statistics test, lets see if we can visually infer the result of our test. Suppose we want to test whether setosa and versicolor differ in the width of their sepals. We can directly visualize the distribution of their values with a boxplot() function call. Here’s an example

boxplot(myiris$Sepal.Width[set], myiris$Sepal.Width[vers], notch=T, col="lightblue", lwd=3, names=c("I. setosa", "I. versicolor"), main="Sepal Width")

boxplot() plots a summarized representation of the distribution of values for a vector, with the box extending for the middle 50% of the values and the “whiskers” extending up to the edges of the distribution. Simply by looking at the boxplot we can see that the distributions of the two species differ substantially. However, there is a more rigorous way to say so and this is by invoking a simple T-test statistic calculation. T-test is a statistical test that tests the hypothesis that the means of two distributions are equal. It can only be applied on normal distributions but for our purposes we can assume that this holds. We can call the t.test() function like this:

t.test(myiris$Sepal.Width[set], myiris$Sepal.Width[vers], paired=T)

##  ##  Paired t-test ##  ## data:  myiris$Sepal.Width[set] and myiris$Sepal.Width[vers] ## t = 8.8506, df = 49, p-value = 9.857e-12 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ##  0.508597 0.807403 ## sample estimates: ## mean of the differences  ##                   0.658

out of the values it returns we are mostly interested in the one called p-value. The p-value corresponds to the probability that our initial hypothesis holds. In this case a very small p-value means that we can reject our initial hypothesis (the means of the two distributions are equal) and therefore we can say that their is a “statistically significant difference” between the two sets.

Now you can try the same sort of analysis for a comparison of two other species:

boxplot(myiris$Sepal.Width[virg], myiris$Sepal.Width[vers], notch=T, col="lightblue", lwd=3, names=c("I. virginica", "I. versicolor"), main="Sepal Width");

Try to see if there are differences in the t.tests that may be reflected in the boxplot.

Questions

Create a scatterplot (using plot() and lines()) that will contain the Petal Lengths (x-axis) vs Petal Widths(y-axis) of all three species in different colors. IMPORTANT: Set the axis limits carefully with xlim=c(…) and ylim=c(…) so that all values are included. Which of the species look similar? Which seem different?
Create two boxplots of the same measurements (Petal Length and Width). Try to see which of the species appears to be significantly different from the rest looking at the boxes.
From the boxplots you created in 2 take the pairs of datasets that look significantly different and compare them with the t.test function. Report the p-values and the confidence intervals for each comparison that you make.
Discuss your results. How are the three species different or similar in terms of Petal size?

Google Sites

Report abuse