Christoforos Nikolaou
December 2016
Getting Started In order to start using R, simply run R by selecting it from the list of programs in windows or type “R” in a terminal window (under Linux). Once into the R environment you can perform simple mathematical operations, import data and analyze them with built-in functions or functions contained in external packages that you can download and install very easily from specific package repositories.
Running R and setting working directory in Windows Choose R or R-studio from your programs. In the screen that comes up type
getwd()
## [1] "/home/christoforos/Dropbox/Teaching/My_BIO204"
The results that comes up on your screen is the directory you are currently working in. In case you want to change the working directory to e.g. your desktop you can use this something like this command
setwd("C:/user/Desktop")
Try keeping in mind the directory you are working in as all data will be fed in better this way and you will be in better control of things. Installing and Loading Packages Let’s see one such example of downloading and installing a package called MASS. MASS stands for Modern Applied Statistics with S/R and contains a number of useful functions including some example datasets. Installing a package is as easy as typing:
install.packages("MASS")
in your R environment. In the pop-up window that appears you may select “Greece” from the list of repositories and hit enter. Then wait while the package is being downloaded. Once any package is downloaded and before it may be used, it should be loaded into the running environment. This is something that needs to be done every time R is started and is performed with the use of the library() function:
library("MASS")
Now we have MASS installed and running in R. Lets look into an example dataset called “iris”. Simply type:
iris
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa ## 7 4.6 3.4 1.4 0.3 setosa ## 8 5.0 3.4 1.5 0.2 setosa ## 9 4.4 2.9 1.4 0.2 setosa ## 10 4.9 3.1 1.5 0.1 setosa ## 11 5.4 3.7 1.5 0.2 setosa ## 12 4.8 3.4 1.6 0.2 setosa ## 13 4.8 3.0 1.4 0.1 setosa ## 14 4.3 3.0 1.1 0.1 setosa ## 15 5.8 4.0 1.2 0.2 setosa ## 16 5.7 4.4 1.5 0.4 setosa ## 17 5.4 3.9 1.3 0.4 setosa ## 18 5.1 3.5 1.4 0.3 setosa ## 19 5.7 3.8 1.7 0.3 setosa ## 20 5.1 3.8 1.5 0.3 setosa ## 21 5.4 3.4 1.7 0.2 setosa ## 22 5.1 3.7 1.5 0.4 setosa ## 23 4.6 3.6 1.0 0.2 setosa ## 24 5.1 3.3 1.7 0.5 setosa ## 25 4.8 3.4 1.9 0.2 setosa ## 26 5.0 3.0 1.6 0.2 setosa ## 27 5.0 3.4 1.6 0.4 setosa ## 28 5.2 3.5 1.5 0.2 setosa ## 29 5.2 3.4 1.4 0.2 setosa ## 30 4.7 3.2 1.6 0.2 setosa ## 31 4.8 3.1 1.6 0.2 setosa ## 32 5.4 3.4 1.5 0.4 setosa ## 33 5.2 4.1 1.5 0.1 setosa ## 34 5.5 4.2 1.4 0.2 setosa ## 35 4.9 3.1 1.5 0.2 setosa ## 36 5.0 3.2 1.2 0.2 setosa ## 37 5.5 3.5 1.3 0.2 setosa ## 38 4.9 3.6 1.4 0.1 setosa ## 39 4.4 3.0 1.3 0.2 setosa ## 40 5.1 3.4 1.5 0.2 setosa ## 41 5.0 3.5 1.3 0.3 setosa ## 42 4.5 2.3 1.3 0.3 setosa ## 43 4.4 3.2 1.3 0.2 setosa ## 44 5.0 3.5 1.6 0.6 setosa ## 45 5.1 3.8 1.9 0.4 setosa ## 46 4.8 3.0 1.4 0.3 setosa ## 47 5.1 3.8 1.6 0.2 setosa ## 48 4.6 3.2 1.4 0.2 setosa ## 49 5.3 3.7 1.5 0.2 setosa ## 50 5.0 3.3 1.4 0.2 setosa ## 51 7.0 3.2 4.7 1.4 versicolor ## 52 6.4 3.2 4.5 1.5 versicolor ## 53 6.9 3.1 4.9 1.5 versicolor ## 54 5.5 2.3 4.0 1.3 versicolor ## 55 6.5 2.8 4.6 1.5 versicolor ## 56 5.7 2.8 4.5 1.3 versicolor ## 57 6.3 3.3 4.7 1.6 versicolor ## 58 4.9 2.4 3.3 1.0 versicolor ## 59 6.6 2.9 4.6 1.3 versicolor ## 60 5.2 2.7 3.9 1.4 versicolor ## 61 5.0 2.0 3.5 1.0 versicolor ## 62 5.9 3.0 4.2 1.5 versicolor ## 63 6.0 2.2 4.0 1.0 versicolor ## 64 6.1 2.9 4.7 1.4 versicolor ## 65 5.6 2.9 3.6 1.3 versicolor ## 66 6.7 3.1 4.4 1.4 versicolor ## 67 5.6 3.0 4.5 1.5 versicolor ## 68 5.8 2.7 4.1 1.0 versicolor ## 69 6.2 2.2 4.5 1.5 versicolor ## 70 5.6 2.5 3.9 1.1 versicolor ## 71 5.9 3.2 4.8 1.8 versicolor ## 72 6.1 2.8 4.0 1.3 versicolor ## 73 6.3 2.5 4.9 1.5 versicolor ## 74 6.1 2.8 4.7 1.2 versicolor ## 75 6.4 2.9 4.3 1.3 versicolor ## 76 6.6 3.0 4.4 1.4 versicolor ## 77 6.8 2.8 4.8 1.4 versicolor ## 78 6.7 3.0 5.0 1.7 versicolor ## 79 6.0 2.9 4.5 1.5 versicolor ## 80 5.7 2.6 3.5 1.0 versicolor ## 81 5.5 2.4 3.8 1.1 versicolor ## 82 5.5 2.4 3.7 1.0 versicolor ## 83 5.8 2.7 3.9 1.2 versicolor ## 84 6.0 2.7 5.1 1.6 versicolor ## 85 5.4 3.0 4.5 1.5 versicolor ## 86 6.0 3.4 4.5 1.6 versicolor ## 87 6.7 3.1 4.7 1.5 versicolor ## 88 6.3 2.3 4.4 1.3 versicolor ## 89 5.6 3.0 4.1 1.3 versicolor ## 90 5.5 2.5 4.0 1.3 versicolor ## 91 5.5 2.6 4.4 1.2 versicolor ## 92 6.1 3.0 4.6 1.4 versicolor ## 93 5.8 2.6 4.0 1.2 versicolor ## 94 5.0 2.3 3.3 1.0 versicolor ## 95 5.6 2.7 4.2 1.3 versicolor ## 96 5.7 3.0 4.2 1.2 versicolor ## 97 5.7 2.9 4.2 1.3 versicolor ## 98 6.2 2.9 4.3 1.3 versicolor ## 99 5.1 2.5 3.0 1.1 versicolor ## 100 5.7 2.8 4.1 1.3 versicolor ## 101 6.3 3.3 6.0 2.5 virginica ## 102 5.8 2.7 5.1 1.9 virginica ## 103 7.1 3.0 5.9 2.1 virginica ## 104 6.3 2.9 5.6 1.8 virginica ## 105 6.5 3.0 5.8 2.2 virginica ## 106 7.6 3.0 6.6 2.1 virginica ## 107 4.9 2.5 4.5 1.7 virginica ## 108 7.3 2.9 6.3 1.8 virginica ## 109 6.7 2.5 5.8 1.8 virginica ## 110 7.2 3.6 6.1 2.5 virginica ## 111 6.5 3.2 5.1 2.0 virginica ## 112 6.4 2.7 5.3 1.9 virginica ## 113 6.8 3.0 5.5 2.1 virginica ## 114 5.7 2.5 5.0 2.0 virginica ## 115 5.8 2.8 5.1 2.4 virginica ## 116 6.4 3.2 5.3 2.3 virginica ## 117 6.5 3.0 5.5 1.8 virginica ## 118 7.7 3.8 6.7 2.2 virginica ## 119 7.7 2.6 6.9 2.3 virginica ## 120 6.0 2.2 5.0 1.5 virginica ## 121 6.9 3.2 5.7 2.3 virginica ## 122 5.6 2.8 4.9 2.0 virginica ## 123 7.7 2.8 6.7 2.0 virginica ## 124 6.3 2.7 4.9 1.8 virginica ## 125 6.7 3.3 5.7 2.1 virginica ## 126 7.2 3.2 6.0 1.8 virginica ## 127 6.2 2.8 4.8 1.8 virginica ## 128 6.1 3.0 4.9 1.8 virginica ## 129 6.4 2.8 5.6 2.1 virginica ## 130 7.2 3.0 5.8 1.6 virginica ## 131 7.4 2.8 6.1 1.9 virginica ## 132 7.9 3.8 6.4 2.0 virginica ## 133 6.4 2.8 5.6 2.2 virginica ## 134 6.3 2.8 5.1 1.5 virginica ## 135 6.1 2.6 5.6 1.4 virginica ## 136 7.7 3.0 6.1 2.3 virginica ## 137 6.3 3.4 5.6 2.4 virginica ## 138 6.4 3.1 5.5 1.8 virginica ## 139 6.0 3.0 4.8 1.8 virginica ## 140 6.9 3.1 5.4 2.1 virginica ## 141 6.7 3.1 5.6 2.4 virginica ## 142 6.9 3.1 5.1 2.3 virginica ## 143 5.8 2.7 5.1 1.9 virginica ## 144 6.8 3.2 5.9 2.3 virginica ## 145 6.7 3.3 5.7 2.5 virginica ## 146 6.7 3.0 5.2 2.3 virginica ## 147 6.3 2.5 5.0 1.9 virginica ## 148 6.5 3.0 5.2 2.0 virginica ## 149 6.2 3.4 5.4 2.3 virginica ## 150 5.9 3.0 5.1 1.8 virginica
and see that an object is listed in your window. How can you find out the class of this object? Use the class() function:
class(iris)
## [1] "data.frame"
to see that this is a data.frame. The iris dataset contains measurements for morphological characteristics of plants belonging to different species. Use the str() function:
str(iris)
## 'data.frame': 150 obs. of 5 variables: ## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ... ## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ... ## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ... ## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ... ## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
to inspect the structure and components of this data.frame. str() also helps you find out how many characteristics are measured, what kind of data these are (numeric, character, logical etc) and how many different species are investigated. Can you tell all this by your str() calling?
Iris is a built-in dataset which we can use as an example for many R function but what if we wanted to analyze our own data in R? This can be done by importing your data into the environment, converting them into an R-compatible form/object and then proceed to the analysis. Importing and converting is done automatically with suitable functions in R but first the data need to be pre-formatted in a way that is readable. This is the basic ASCII text format (simple txt-like files that you could read with applications like notepad or textpad). You can import such files with functions like
read.table() # for generic tables read.delim() # for tables, whose lists are separated by various delimiters (characters like comma, tab, space etc)
or
read.csv() # for csv (comma-separated-values) format
Nonetheless most of the times, Windows users tend to have their data in some sort of spreadsheet file like MS-Office-Excel or LibreOffice-Calc. These can be pre-formatted by exporting them to csv format using the File-Export command in MS-Office or saving as “Text-CSV” in LibreOffice. You can try this yourself: 1. Go to this (https://sites.google.com/site/uocdataanalysis/sharedfiles/irisdataset.xls?attredirects=0&d=1) link and download this *xls (Excell) file to your computer. 2. Open it with a spreadsheet editor and save it as csv keeping the same name (just change the extension to .csv) 3. Open R and run:
myiris<-read.csv("irisdataset.csv")
What you have just done is load the iris dataset from an excell file to a data.frame called myiris, which is identical to the built-in iris dataset. Lets use this from now on.
Having used str() we now know that we have data on the width and the length of the sepals and pedals of 50 plants from three different species of Iris. Lets start making use of the full potential of R by obtaining some descriptive statistics. We can do this with summary():
summary(myiris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width ## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 ## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300 ## Median :5.800 Median :3.000 Median :4.350 Median :1.300 ## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199 ## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800 ## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500 ## Species ## setosa :50 ## versicolor:50 ## virginica :50 ## ## ##
Inspect the output. We have all the information we want for all numeric values, including the mean, the median, the maximum and minimum values etc. In case we wanted some more info on basic statistics we could ask for the standard deviation with sd(). This function however can only be called upon vectors and so we need to choose which vector we want to analyze. This is done with subsetting, that is by choosing a part of an object before calling a function. Our myiris data.frame has names for each column so we can use the $ operator with the name of the column to subset:
sd(myiris$Sepal.Length)
## [1] 0.8280661
will return the standard deviation of the first column entitled Sepal.Length.
Subsetting is very important in R because it allows complex analysis in meaningful splits of the data. In our case a meaningful such split is to analyze the data for each species separately. We would like to see for instance the data for Iris setosa and Iris versicolor separately. How can we do this? One very useful function is which(). It can be used in various contexts (lists, data.frames, vectors and on numerical or character classes) while it also allows combinations of conditions (e.g. we want our subset to fulfill this AND that condition or this OR that etc). which() is used like this:
which(myiris$Species=="setosa")->set which(myiris$Species=="versicolor")->vers
In the commands above we ask for a condition to be specified in the column of the myiris data.frame that is entitled species and thus extract the positions of the corresponding subsets of the data.frame. Remember that the output of which() is a vector containing the positions of the elements fulfilling the condition asked.
We now have a way to get the data separately for each species. In this sense
myiris$Sepal.Length[set]
## [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 ## [18] 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 ## [35] 4.9 5.0 5.5 4.9 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0
will bring back only the Sepal Lengths of the setosa species, while
myiris$Petal.Width[vers]
## [1] 1.4 1.5 1.5 1.3 1.5 1.3 1.6 1.0 1.3 1.4 1.0 1.5 1.0 1.4 1.3 1.4 1.5 ## [18] 1.0 1.5 1.1 1.8 1.3 1.5 1.2 1.3 1.4 1.4 1.7 1.5 1.0 1.1 1.0 1.2 1.6 ## [35] 1.5 1.6 1.5 1.3 1.3 1.3 1.2 1.4 1.2 1.0 1.3 1.2 1.3 1.3 1.1 1.3
will bring back only the Petal Widths of the versicolor species etc. Lets see now how we can visualize the data. We can use the simple plot() function to visualize two-dimensional data. If for instance we would like to see how the two sepal dimensions vary with each other in the setosa species we could invoke a simple plot() command like:
plot(myiris$Sepal.Length[set],myiris$Sepal.Width[set])
plot() however need not be always so simple and a number of attributes related to points, colors, names of axes etc can be modified. An example would be:
plot(myiris$Sepal.Length[set],myiris$Sepal.Width[set], type="p", pch=19, col="red", xlab="Sepal Length", ylab="Sepal Width", xlim=c(4,7), ylim=c(1,5));
where attributes of the plot are set accordingly for better aesthetic results. Consider now that we want to plot the same data for another species in the same plot for a direct comparison. We can do this with invoking a lines() function, right after a plot(). lines() takes the same arguments only without the general specifics having to do with axis names, sizes etc. Plotting of the same data for the versicolor species would be done with:
plot(myiris$Sepal.Length[set],myiris$Sepal.Width[set], type="p", pch=19, col="red", xlab="Sepal Length", ylab="Sepal Width", xlim=c(4,7), ylim=c(1,5)); lines(myiris$Sepal.Length[vers],myiris$Sepal.Width[vers], type="p", pch=19, col="dark green", xlab="Sepal Length", ylab="Sepal Width");
The plot would not be complete without a legend that tells us which set of points refer to what. legend() does the job:
plot(myiris$Sepal.Length[set],myiris$Sepal.Width[set], type="p", pch=19, col="red", xlab="Sepal Length", ylab="Sepal Width", xlim=c(4,7), ylim=c(1,5)); lines(myiris$Sepal.Length[vers],myiris$Sepal.Width[vers], type="p", pch=19, col="dark green", xlab="Sepal Length", ylab="Sepal Width"); legend("topright", c("I. setosa", "I. versicolor"), fill=c("red","dark green"), bty="n", cex=1.2)
The plot we have just created shows a clear association between sepal lengths and widths for both species. By association we mean that it looks as if the higher the width of a sepal the higher its length is expected to be. This implies a dependence of the two variables, or, in other words, it tells us that we may be able to predict (to some extent) one from the other. Dependences of this kind is the first thing we look for in data. The rigorous term is correlation, and the simplest form of a correlation is a linear relationship between two values. We can calculate the linear correlation coefficient by invoking R’s cor() function on two equally sized, aligned vectors. In our case:
cor(myiris$Sepal.Length[set],myiris$Sepal.Width[set])
## [1] 0.7425467
The command above will return one numerical value in the range between -1 (perfect negative (or anti-) correlation) and 1 (perfect correlation). The closest this value is to the extremes (-1 or 1) the strongest the association between the variables and the better we will be able to predict one from the other. Compare the value for the setosa sepals with the one for the petals
cor(myiris$Petal.Length[set],myiris$Petal.Width[set])
## [1] 0.33163
and notice how the correlation for the sepals is almost twice as high as the sepals one. What does this tell us? The cor() function need not be strictly invoked in pairs of vectors. It works equally fast for numerical matrices were it returns a matrix of all 1-to-1 correlations of the columns in the matrix. If we restrict the myiris data.frame to its numerical matrix myiris[,1:4] (remember the structural subsetting) we get:
which(myiris$Species=="virginica")->virg cor(myiris[virg,1:4])->virgCors
virgCors is a 4x4 matrix that contains the linear (Pearson) correlation coefficients for all the data of the virginica species. A quick way to inspect matrices like this one is by plotting them in a heatmap. An example of a heatmap called with the heatmap() function is:
heatmap(virgCors, cexRow=1.5, cexCol=1.5, mar=c(12,12), scale="none")
Notice how a) all the diagonal elements are very bright as the corresponding correlation coefficients are =1 and b) the part corresponding to the sepals is brighter than the one of the petals as it corresponds to higher correlation coefficients. Compare this image with the corresponding for the versicolor species:
which(myiris$Species=="versicolor")->vers cor(myiris[vers,1:4])->versCors heatmap(versCors, cexRow=1.5, cexCol=1.5, mar=c(12,12), scale="none")
And see that the pattern is relatively different. The questions is: how different? How can we say if two datasets differ?
Simple statistical analysis of a data set is the basis for what we call hypothesis testing. In this case we want to, say, test the hypothesis that the two datasets for setosa and versicolor are not the same, that they differ in some aspect. We will thus test the hypothesis that they are NOT different and try to see if we can disprove it using simple statistics. Before we go on with a simple statistics test, lets see if we can visually infer the result of our test. Suppose we want to test whether setosa and versicolor differ in the width of their sepals. We can directly visualize the distribution of their values with a boxplot() function call. Here’s an example
boxplot(myiris$Sepal.Width[set], myiris$Sepal.Width[vers], notch=T, col="lightblue", lwd=3, names=c("I. setosa", "I. versicolor"), main="Sepal Width")
boxplot() plots a summarized representation of the distribution of values for a vector, with the box extending for the middle 50% of the values and the “whiskers” extending up to the edges of the distribution. Simply by looking at the boxplot we can see that the distributions of the two species differ substantially. However, there is a more rigorous way to say so and this is by invoking a simple T-test statistic calculation. T-test is a statistical test that tests the hypothesis that the means of two distributions are equal. It can only be applied on normal distributions but for our purposes we can assume that this holds. We can call the t.test() function like this:
t.test(myiris$Sepal.Width[set], myiris$Sepal.Width[vers], paired=T)
## ## Paired t-test ## ## data: myiris$Sepal.Width[set] and myiris$Sepal.Width[vers] ## t = 8.8506, df = 49, p-value = 9.857e-12 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## 0.508597 0.807403 ## sample estimates: ## mean of the differences ## 0.658
out of the values it returns we are mostly interested in the one called p-value. The p-value corresponds to the probability that our initial hypothesis holds. In this case a very small p-value means that we can reject our initial hypothesis (the means of the two distributions are equal) and therefore we can say that their is a “statistically significant difference” between the two sets.
Now you can try the same sort of analysis for a comparison of two other species:
boxplot(myiris$Sepal.Width[virg], myiris$Sepal.Width[vers], notch=T, col="lightblue", lwd=3, names=c("I. virginica", "I. versicolor"), main="Sepal Width");
Try to see if there are differences in the t.tests that may be reflected in the boxplot.