INTRODUCTION TO BASIC COMPUTING USING R
The R language is a freely-available, object-oriented language for programming and statistical analysis. In lab we will give you a “tour” through some of the main features of R that are most relevant to the course. However, this is by no means a comprehensive coverage.
All of the files referred to below are (or will be ) available via linked downloads. You should download script and data files as needed to your local (laptop or desktop) working directory as needed.
Basic R language and syntax
When we say R is object oriented what we mean is that everything that R does or contains is based on objects. Objects in R include numerical values; letters or other symbols used to stand for numerical values; entire data arrays or other structures; functions; and the results of statistical or other calculations. Object orientation lends itself to a hierarchical way of thinking about data and problems that, while extremely powerful in practice, can initially be confusing to those used to a more “traditional” format. Initially the structure of objects will be very simple and straightforward; as we get more familiar with R and get into more complex problem the power of the object-oriented approach will hopefully become clearer.
Obtaining, installing, and running R
R can be obtained for free from the CRAN site http://cran.r-project.org/
where you will find files that can be downloaded and installed as appropriate for your system (e.g., Windows XP or 7, Mac). Installation files can also be downloaded by a third party, placed on a memory stick or CD, and installed from these devices, in case you do not have access to the internet.
Once installed, R is run simply by clicking on the R icon (or selecting R from the program task bar) to open the R-Gui console screen. This screen will have a series of menu items followed by a blank window with a
<
on the left margin. The “<” is a prompt for R commands, which can be entered directly into the console, or as we shall see below, via script files.
R can also be run from within the R studio framework, which provides a nice Windows-like environment for managing script files, data, and other objects. You can download and install R studio here.
Basic R commands
There are a large number of commands in R, but to get started we will use a few essential ones.
First, if we enter a value (say the number 1) on the prompt line and it return, we get a result: namely, we get back what we entered:
> 1
[1] 1
Second, that many familiar operations such as addition, subtraction, multiplication and division have obvious R counterparts. Furthermore, R follows the usual rules of operation priority. We can illustrate these with a few simple examples
The line starting in [1] indicates that this is a result following the input (>).
> 1+1
[1] 2
> 1
[1] 1
> 1+1+2
[1] 4
> 1+2*7
[1] 15
> 3/2+4+8
[1] 13.5
> 3/2+4+8*3
[1] 29.5
> 3/2+(4+8)*3
[1] 37.5
> 4-5/3+4.4*(7+6)
[1] 59.53333
>
Many mathematical functions are available in R, as illustrated by a few examples. Note that the function “log” refers to “base e or natural logarithm”; if you need base 10 logs then the appropriate function is log10.
> log(2)
[1] 0.6931472
> exp(.6931472)
[1] 2
> log10(100)
[1] 2
> 10**2
[1] 100
> sqrt(16)
[1] 4
> pi
[1] 3.141593
> sin(pi)
[1] 1.224606e-16
> cos(pi)
[1] -1
> tan(pi)
[1] -1.224606e-16
> abs(-5)
[1] 5
> factorial(10)
[1] 3628800
>
Assignment and relational operators
We have already introduced several operators above, namely the arithmetic operators +, -, * , / and ** for addition, subtraction, multiplication, division, and exponentiation. Additional arithmetic operators include ^ , which is the same as ** for exponentiation.
An assignment operator assigns a value or a result of an operation to an object. The standard R assignment operator is <- (less than followed by minus). So for example
> a<-2
assigns the value 2 to the object a. Notice that the operator works in either direction, as long as what is being assigned (in this case 2) is on the “-“ end and the object that it is assigned to (a in this case) is on the “<” side. So,
> 2->a
is equivalent to the above. Finally, notice that it makes no sense to assign values to 2, so that
a->2
Error in 2 <- a : invalid (do_set) left-hand side to assignment
>
returns an error. The operator “=” performs like <- so that “a=2” makes sense but “2=a” does not. In general it is preferable to use the -> or <- operators instead of = because the direction of the arrow makes clear the origin and destination of assignment, and to avoid confusion with the ‘is equal to’ comparison operator below.
Note that assignment is not a permanent, immutable condition. For example above we assigned
> a<-2
If we now assign
> a<-100
then 100 replaces 2 as the value for a. This is important and can be a cause for errors, for the opposite reason: the object a now retains the value 100 until we assign it something else. If we now use the object a in another computation, the value of 100 will go with it. So if we (say later in the R session or in a new session where we have not erased our workspace (see below) from the previous session, we invoke
>a*2
we should not be surprised to see “200” returned, since a’s value is currently 100.
Relational operators work differently, in that rather than assigning a value to an object, they compare 2 or more objects. Relational operators return a value of TRUE (or 1) if the comparison is true and FALSE or 0 if it is not true. The standard relational operators are >, >=, <, <=, ==, and !=. All of these except the last 2 should be obvious (but are illustrated below). The last 2 are the equality comparison (as distinct from assignment (read ‘is equal to’) and the inequality comparison (read “not equal to’). Note that certain combinations of these comparisons can be true
> a=2
> b=3
> 2<5
[1] TRUE
> a>b
[1] FALSE
> a<=b
[1] TRUE
> a==b
[1] FALSE
> a<b
[1] TRUE
> a!=b
[1] TRUE
> b!=a
[1] TRUE
>
Getting help in R
Before getting too far into the R weeds, we want to point out that, while R is relatively easy to use and powerful, everyone – and that includes us- gets stuck and needs help on occasion. Fortunately, there is a large and growing community of support for R, and many ways to get help quickly.
There are a number of references, both online and traditional print, that describe R in detail, and we recommend several below. Some “online” help in built directly into R, either in local html files provided with the R installation, or by reference to the R site. Finally, because R is used by a large community of practice worldwide, much information is available simply by entering a sufficiently descriptive question into an internet search engine.
If you have started the R-Gui and are in the R-console, help is available by selecting “help” from the main menu. Here, you will be routed to a number of sub-menus, including some manuals in Adobe format, as well has html-format help and a “search for” help that allows you to insert a question and search for guidance. If you have a question about a particular R command or function, often help can be obtained by typing ? followed by the function name (no spaces) on the command line. For example, suppose you want to know how the built-in function mean works. The command
>?mean
will send you to an html help page that fully explains the function and provides examples.
If you don’t know how a command is spelled or whether it is capitalized or not (R is case-sensitive!) then you can try a more generic search. For example
>??Mean
would list a large number of functions and procedures related to “mean”, including the specific one called mean.
Command line vs. script
In all of the above examples, we entered R commands after the < prompt, one command at a time. An equivalent (and often more efficient) approach is write a series of commands and save them to a script file (usually this will be a file with a .R or .txt extension, which can be edited like a text file). Script files can be created using the R gui (“New Script File”) or any standard text editor (like Notepade). The file commands can then be submitted to R using the R-gui, either all at once or by selecting a range of lines to submit. Most of our later, more complicated R examples will involve script files.
R workspace
R saves commands, object values/ attributes in a section of computer memory called “workspace”. When you exit and R you will be asked “Do you want to save workspace
image.” Usually you should respond (type) “yes”. Then the next time you open R (assuming you are operating on the same compute you ran R on last time)you will get the statement
[Previously saved workspace restored]
Many things (commands, data objects) that you created the last session
will still be there. Often this is a desirable feature, but occasionally can be confusing, particularly if you use different objects with similar names between sessions, since the “old” values will still be there. If you do not wish to save the workspace, you can type “no” when prompted. You can also be sure that the workspace is empty by typing the command
>rm(list=ls())
before resuming with new commands; this will remove (delete) any commands or objects that may still be in memory. On the other hand, if you have just loaded and manipulated a complex data set and need to temporarily quit the session, it may be most convenient to save the workspace, in which case you can essentially pick up where you left off.
Tips for running R in Windows
Setting the working directory
One of the most aggravating things you will run into is having to tell R each time you run it where your data files are. You don’t want to hunt for these files every time your run R. By assigning a working directory for your session you tell R where to look for files (and where to write output) without having to assign the full file path each time. To
illustrate, suppose I want to use the directory
C:/R_stuff
as my working directory (note: I am assuming that you have already created a directory that will be your working directory and perhaps already placed files there. If not, you
need to). If I am running R from the R gui, I can easily set the working directory to this location by going to “File” then “Change dir” and then navigating to the desired location. The same results can be achieve by typing the setwd(“directory name”) command, for this example
setwd("C:/R_stuff")
This will now be the location where R expects to see input and will write output unless specifically directed otherwise.
Both of the above require commands at each R session to set the working directory. You instead may wish to change the default directory that R opens in to your working directory. This is done by going to the Windows task bar, right clicking on the R icon, selecting Properties, and changing the “Start In” directory as desired.
Note that the working directory can also be set (without script commands) from within R studio, by browsing to the desired directory or (if a script file is already open) specifying that the working directory is the the same as the script file.
Data input and output
Getting data into R
We have already seen very simple of how to input data values into R, using the R console or equivalently script line commands. Again, for example
>a<-2
sets the value of the object a as 2. A more complicated example occurs when we have a list of values, also known as a vector. So for example suppose that we observe 10 sample values (say, bird point counts). We can define list object as
>counts=c(4,3,4,5,7,2,3,4,5,6)
where “c()” is a collection operator that tells R to expect a list of numbers.
If we type “counts” on the command line, we get the 10 values displayed, separated by spaces.
> counts
[1] 4 3 4 5 7 2 3 4 5 6
A dataframe is an R object in which the data entries are in rows and columns, with the rows being individual objects and the columns being the different variables that are recorded for each observation. Dataframes can be built interactively, or “coerced” (converted) from other objects. It will often be convenient to read data from an external file (or files) and convert them into a dataframe.
We can illustrate both the use of external files and the creation of dataframes by a simple elaboration of the point count example. Suppose that in addition to just recording the number of birds at each point, we also record the forest type and the elevation (m). Our example data now look like this.
and are saved in this column format in a comma-delimited text or csv file; csv files are easily created using Excel or other spreadsheet program, or as output from most common data management programs (dBase, Access, etc.).
The command
>counts<-read.table("counts.csv",header=T,sep=",")
will read the data from the csv file. The specification header=T tells R to expect the first line of the file to be a line designating the variable name; sep=”,” tells R that the column values are separated by commas (versus spaces or other characters). Entering the command “counts” confirms that the data were entered in properly. Finally the command
>attach(counts)
will make the variable names and data structure available to the current R session. Dataframes will be the standard way we will input and manipulate data in R, and are readily used by most of the programs we will use for analysis. Finally, the attached script file saves the R commands used to read the data and create the data frame, and also produces some simple summary statistics that we’ll return to.
Saving data/ results
There are a number of ways to save data, computations, and other results from R sessions. We’ve already seen how R objects can be created, and can be referred to in later R sessions if the workspace is saved and restored. There are many other formats under which data can be saved using R, but we will focus on simple formats similar to the ones we just used to read in data: the delimited text file. The command “write.table” or even more simply “write.csv” will take a dataframe as input and produce a comma-delimited text file as output. For example the command
>write.csv(counts,file='countsnew.csv')
will take the current dataframe counts and write it to a new file (countnew.csv), preserving the column headers and the comma separated format. Of course we aren’t usually going to care about simply cloning copies of the data this way, so a more practical use occurs when we create new computation in our program and wish to save the results. So for example, suppose that we wished to create a new variable called elev_ft, in which we convert elevation from meters to feed. We could create the new variable and include it in the revised output file by the commands
>counts$elev_ft<-elevation*3.28
>write.csv(counts,file='countsnew.csv')
By default write.csv (and write.table) will write new data to the specified file name, so if data are present these will be overwritten. Data can be appended to existing data by using the write.table command and the “append=TRUE” option. The write.table command also permits different types of delimiters and options for headers. We will use write.table or other approaches for writing data to files as these more specific needs arise.
Summarizing and manipulating data
Here we will use a slightly larger data example to illustrate some basic methods for summarizing and manipulating data. The data example involves 101 samples of insect counts along an elevation gradient from 0 to 1000m in 10-m intervals; the data are contained in a csv file. We read the data into R using the commands
>insects<-read.table("insects.csv",header=T,sep=",")
>attach(insects)
First, we can obtain some simple summary statistics either for all the variables at once, or for a single variable at time. For example
>summary(insects)
provides
plot elevation count
Min. : 1 Min. : 0 Min. : 156
1st Qu.: 26 1st Qu.: 250 1st Qu.: 528
Median : 51 Median : 500 Median : 1856
Mean : 51 Mean : 500 Mean : 4452
3rd Qu.: 76 3rd Qu.: 750 3rd Qu.: 6390
Max. :101 Max. :1000 Max. :21991
> mean(elevation)
[1] 500
> sd(elevation)
[1] 293.0017
> range(elevation)
[1] 0 1000
> mean(count)
[1] 4451.911
> sd(count)
[1] 5572.118
> range(count)
[1] 156 21991
>
A variety of other summary statistics such as median, mode, percentiles, and others are obtained similarly. For example
>quantile(count,c(.01,.5,.9))
provides the 1%, 50%, and 90% quantiles (percentiles) of the variable ‘count’.
Likewise,
>median(count)
provides the median value for count, which you can confirm is the same as the 50%th percentile.
All of the above commands are saved in an R script file (NOTE: rather than using "attach" to reference the dataframe, I have used the with() function or in some cases the data= argument in the statistical function. I recommend these approaches over the "attach" command, which can create confusion particularly if multiple data frames are in use that may have conflicting variable names.
Basics of graphing
Graphing is a very important feature of R, and one that is very useful at both the early exploratory stages of analysis and later when we need to do model diagnostics. While there are many types of graphs possible in R, 2 of the most common and useful ones are histograms and scatterplots. Histograms graphically show the relative frequencies of values of a variable. So for example the insect data above has 2 variables that are recorded: elevation and count. We can produce histograms for this by the commands
>hist(elevation)
>hist(count)
which produces the graphs
and
Thus we can quickly see that the data is pretty evenly spread over the range of elevations, but that most of the data falls below counts of 5000, with a very small number >2000.
Scatterplots, by contrast, plot one variable against another in an attempt to discern patterns or correlations. Using the insect example
>plot(elevation,count)
produces
indicating a strong negative relation between elevation and insect abundance. Many times we will want to transform the data to better display relationships or meet model assumptions. For example for count data it is common to use a logarithmic transformation, so that
>plot(elevation,log(count))
produces a plot but this time with count on the natural log scale.
Writing functions in R
Although R has many built-in functions that will perform a wide variety of tasks, sooner or later most R users will want to write their own functions in R. Essentially a function is just a small program that produces a defined output given a specified input. To take a very simple example, supposed we want to write our own function to compute the sample mean. We do this by writing a new object called (for example) my_mean, which is now a function.
my_mean<-function(x){sum(x)/length(x)}
The input to the function is a list of numbers x, and the function computes the mean by first computing the sum and then dividing by the number of elements (both from built-in functions). Notice that if we simple type the function on the input line (or input it from script) we just get the function description
> my_mean
function(x){sum(x)/length(x)}
To actually use the function we must provide inputs (x), so a list of numbers. For example,
> data<-c(1,2,3,4,5,5,6)
> my_mean(data)
[1] 3.714286
>
applies the value of a list (“data”) to the function and returns the mean.
More complex functions can be written that involve 2 or more inputs. For example the function
> my_fun<-function(x,y){x*2+log(y)}
produces the function
You confirm that the function produces correct results for sample data by performing these same calculations on a calculator, e.g.,
> my_fun(2,3)
[1] 5.098612
> 2*2+log(3)
See the attached script file.
Built in statistical functions and basic statistical analysis
As we have seen, R has a number of built in statistical functions for simple summary statistics (mean, standard deviation, median, mode, etc.). R is also capable of performing many (if not most or all) standard statistical tests. We will illustrate with an example from a 2-sample t-test using the insect count data. To do so we will pretend that the counts above and below mean elevation are based on a random sample of these 2 elevation strata (which of course they are not). To implement the t-test we first define a grouping variable based on elevation
>elev_levels<-(elevation>mean(elevation))*1
which produces values of 1 for elevations above mean elevation and zero otherwise, effectively creating 2 groups. We then use the grouping variable to model the counts by the statement
>t.test(data=insects,count~elev_levels)
Producing the output
Welch Two Sample t-test
data: count by factor
t = 9.3005, df = 50.629, p-value = 1.544e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
5906.382 9158.900
sample estimates:
mean in group 0 mean in group 1
8180.941 648.300
Analysis of variance (ANOVA) performs tests based on more than 2 groups or combinations of groups (factors). We can illustrate a simple (1-way) ANOVA with 3 groups by first creating a grouping factor based on thirds of elevation by a trick similar to the one above
>q<-quantile(elevation,c(.333,.666))
> thirds<-(elevation>q[1])+(elevation>q[2])
>thirds<-factor(thirds)
Here we create numerical levels associated with thirds of elevation, and then inform R that these are factor levels (not continuous levels).
The command
>insect_anova<-lm(formula = count ~ thirds)
creates a linear model with count as the response and “thirds” as the predictor, and assigns this to the object “insect_anova”
Finally, the command
> anova(insect_anova)
creates and displays the actual ANOVA table (contained as an attribute within the object)
Analysis of Variance Table
Response: count
Df Sum Sq Mean Sq F value Pr(>F)
thirds 2 2153614853 1076807426 110.94 < 2.2e-16 ***
Residuals 98 951235448 9706484
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
More advanced statistical models
More generally, we can use linear models and the lm procedure to fit models between a response such as insect counts and either grouping or continuous variables. To illustrate the latter, consider the relationship between insect counts and elevation, now treated as a continuous measure.
> insect_regression<-lm(formula = count ~ elevation)
> summary(insect_regression)
Call:
lm(formula = count ~ elevation)
Residuals:
Min 1Q Median 3Q Max
-3187.4 -2426.0 -685.7 1862.7 9391.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12599.9918 570.2368 22.10 <2e-16 ***
elevation -16.2962 0.9852 -16.54 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2887 on 99 degrees of freedom
Multiple R-squared: 0.7343, Adjusted R-squared: 0.7316
F-statistic: 273.6 on 1 and 99 DF, p-value: < 2.2e-16
The above illustrates an example of simple linear regression.
A somewhat more sophisticated analysis is enabled using generalized linear regression and the R procedure glm. This procedure allows for specification of error distributions other than Normal (the only choice under lm) and standard transformations such as logarithm, logit, etc. For the insect example
insect_glm<-glm(formula = count ~ elevation,family=poisson(link="log"))
> summary(insect_glm)
Call:
glm(formula = count ~ elevation, family = poisson(link = "log"))
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6766 -0.6819 0.0493 0.6355 2.4080
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.000e+01 2.144e-03 4665.3 <2e-16 ***
elevation -4.990e-03 8.153e-06 -612.1 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 590583.172 on 100 degrees of freedom
Residual deviance: 84.113 on 99 degrees of freedom
AIC: 1031.6
Number of Fisher Scoring iterations: 3
******************************
Note that the glm procedure, in addition to allowing for a wide range of statistical distributions, also produces AIC values under maximum likelihood. We will discuss both distributions and AIC model comparison in the labs to come.
Generating data in R
We will often wish to generate sample data in R for a variety of reasons. The object-oriented nature of R makes it easy to generate large amounts of data quickly, as we will see.
Deterministic simulation
The first and simplest case to deal with involves deterministic simulation, in which the results are always the same given the same input and initial values. For example, suppose that we postulate that log insect density always decreases linearly with increasing elevation over the range 0 to 1000 m according to the model
log(N) = 10 - 0.005elevation
We can easily simulate data in R under this model using the following steps.
· Generate a list of elevations from 0 to 1000 m by
>elev<-0:1000
· Generate log(N) by
>log_N<-10-0.005*elev
· Obtain N as
>N<-exp(log_N)
· Round N to the nearest integer value
>N<-round(N)
Stochastic simulation
In stochastic simulation, at least some of the output is dependent on random variables, so that the simulation results are different from run to run (stochastically).
· Generate a list of elevations from 0 to 1000 m by
>elev<-0:1000
· Compute log(lambda) as
>log_lambda<-10-0.005*elev
· Obtain lambda as
>lambda<-exp(log_lambda)
· Generate random N’s from normal distribution with means lambda and sd=1000
>N<-rnorm(1001,lambda,sd=1000)
Notice that the first 3 steps are deterministic and simply establish the values of lambda (the mean of the normal distribution. Then given these values, N is generated stochastically. So in this example the first 3 steps would only have to be executed once, with the last step repeated for each random sample of N desired.
Sampling from data
In the above example, random values are generated from scratch based on a statistical distribution (in this case Poisson). Sometimes it is more appropriate to obtain random samples from a given (fixed) population and then use these samples to compute statistics. The “sample” function in R performs random sampling from a define population (list) of values. To illustrate, take the first example, where we generated 1001 values of N deterministically.
>elev<-0:1000
>log_N<-10-0.005*elev
>N<-exp(log_N)
>N<-round(N)
Now, let’s take a sample of 100 of these.
>N_sample<-sample(N,100)
provides a sample (without replacement) of 100 from N, under a uniform sampling probability (every value of N is equally likely to be drawn). If we repeat this
>N_sample<-sample(N,100)
our second sample will be different.
R-script for the above simulation examples is saved here.
Additional examples (optional, if you feel you need more practice)
Read in data for nestling Black-throated Blue Warblers captured during the 2011 field season at Coweeta, and do the following:
· Summarize the data (means, quantiles, and ranges for all numeric values)
· Report the mean, variance, and range for DAY6MASS
· Plot the histogram for DAY6MASS
· Create a new dataset called nest2 that contains only observations where R_TARSUS>0
· HINT: Use the command subset to do this. For example “nest2<-subset(nestlings,R_TARSUS>0)”
· Using the new data set
· Report the mean, variance, and range for DAY6MASS and R_TARSUS
· Compute histograms for each variable
· Plot DAY6MASS versus R_TARSUS
Even more!
Using the stochastic simulation example (insect abundance along an elevation gradient):
1. Generate 1001 stochastic values of N
a. Calculate the population mean and variance (mean and variance over all 1001 observation)
b. Create new variable that is the difference between the population mean and the count at each observation
c. Write the data set (plot number, elevation, count, and count-mean) to a csv file
d. Correctly read the above output data set back into an R data frame
2. Take 5 samples each of 100 from the N you generated (above) and
a. Calculate the mean and variance from the sample
b. Compare the sample mean to the population mean and variance for each sample
c. For each sample, fit a linear model of elevation to count. Discuss how the estimates of model coefficients vary from sample to sample
3. Write and test an original R function containing either a list input or at least 2 input variables.
Additional References
Bolker, B.M. 2008. Ecological models and data in R. Princeton University Press.
Crawley, M.J. 2007. The R book. Wiley