R basics

Install R on your computer from http://www.r-project.org/. Windows: click CRAN, choose a “mirror site,” usually one in the country where you are; then click “windows” and “base.” Mac: choose the version that matches your operating system version. Linux: do I need to tell?

A working session

(1) Open R and set the working directory. Windows: on the menu bar, File > “Change dir”. Mac: under Misc. Linux: set the directory before opening R.

(2) Usually, you will (i) import (or create) data, (ii) compute a few things and/or make images, and (iii) save your results. Below, you’ll find code that shows how to do this.

(3) Close with quit(). R will now ask whether you want to save the “workspace image”; don’t save it, since restoring it can cause problems in the next session. Just in case, clear any objects left over from a previous session before you start, by typing remove(list=ls())

Elementary. Try all commands below.

2+3 # write comments after a hash sign

2*3 # with spaces, 2 * 3, it also works

2/4

2^2

sqrt(4)

exp(1) # Euler's number e

log(exp(1)) # compare to log10(exp(1)) or log10(100)

Check what happens if you forget the closing parenthesis above. If a parenthesis or other information is missing, R shows a “+” prompt to tell you it wants more input. In case of a typo, rather than retyping a command, use the arrow-up key to recall it and edit the typo.

Compute something and assign the result to an object which you give a name (without spaces; instead, use underscore), e.g.,

X <- 2*3 # X = 2*3 also works, but <- is the R convention

The result of the computation[1] is not visible to the user. To see the content of an object, type its name. Notice that R is case sensitive by typing x in lower case.

Once an object is created, extra elements can be appended

X <- append(X,51) # now X above is overwritten by newly defined X

Remove, say, the first element

X <- X[-1] # note difference with Python; R starts counting at 1

Remove the last element

X <- X[-length(X)]
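As a small sketch of the append and remove steps above, on a separate object so that X is left alone (the name v and its values are made up for illustration):

```r
# Hypothetical vector, for illustration only
v <- c(6, 51, 7)
v <- append(v, 99)             # add an element at the end
first_removed <- v[-1]         # a negative index drops that position
last_removed <- v[-length(v)]  # drops the last element
v[1]                           # R counts from 1, so this is the first element
```

Note that a negative index returns a shortened copy; v itself only changes if you assign the result back to v.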

R distinguishes between numeric and character (or string) data types; the latter, such as names, are written between quotes; try class(X). This constrains the operations you can do on the object. There is also the logical type, saying that something is TRUE or FALSE, for example

2 == 2 # logically correct expression, which is also TRUE

2 = 2 # error: a single = means assignment, not comparison
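To see how the data type constrains what you can do, here is a short sketch; the object names a, b and d are arbitrary choices for this example.

```r
a <- 2.5        # numeric
b <- "Alice"    # character; note the straight quotes
d <- 2 == 2     # logical
class(a); class(b); class(d)

a * 2                             # arithmetic works on numeric objects
res <- try(b * 2, silent = TRUE)  # but fails on character objects
inherits(res, "try-error")        # TRUE: the multiplication raised an error
```

The try() wrapper is only there so that the example keeps running after the error.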

Creating data

Opening a data editor in R is possible on Windows and Linux, but not always on Mac. On Mac, you can use Excel to write the data to a file, and save it in csv format.

In Windows and Linux, fill out 3 rows and 3 columns with numbers, just to practice:

F <- edit(data.frame())

Close the editor. To re-open for some extra edits or data input:

F <- edit(data.frame(F))

Save data; never use spaces in file names; use underscore instead:

write.table(F, "Filename.txt") # for csv format: write.csv(F, "Filename.csv")

Data import

The following works well on Windows and Linux, but not always on Mac (see the troubleshooting notes below)

F <- read.table("Filename.txt") # or F <- read.csv("Filename.csv")

csv files sometimes require specification of the column separator, e.g., comma, semicolon, or space. Use the "sep" argument:

F <- read.csv("filename.csv", sep=",") # or sep=";" or sep=" "

In case of the error message "line 1 appears to contain embedded nulls", add this argument: fileEncoding = "UTF-16"

If there are variables in columns with variable names at the top, also use header=TRUE, else leave it out:

F <- read.table("Filename.txt", header = TRUE)

If rows have unequal numbers of fields (e.g., because of empty cells at the end of a line), add the argument fill=TRUE, which pads the short rows with blank fields.
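The export and import steps can be tried together in one round trip. The sketch below uses a made-up data frame and a temporary file (both assumptions for this example), so that nothing in your working directory is overwritten.

```r
# Hypothetical data, written to a temporary file
df_out <- data.frame(id = 1:3, score = c(2.5, NA, 4.0))
path <- tempfile(fileext = ".csv")

write.csv(df_out, path, row.names = FALSE)   # comma-separated, with header
df_in <- read.csv(path, sep = ",", header = TRUE)
df_in    # the missing value is written as NA and read back as NA
```

With row.names = FALSE the row numbers are not written to the file, so they do not come back as an extra column on import.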

Wrangling data frames

Check the dimensions; the first, say 2, lines; and the last 2 lines of a data frame,

dim(F); head(F,2); tail(F,2) # a semicolon separates commands on one line

Try head(F) without 2.

nrow(F); ncol(F) # numbers of rows and columns in F

Which different values are there in F, or in a row or column of F, for example the first column?

unique(F[ ,1])

How many different values are there in the first column?

length(unique(F[ ,1]))

Also try the following: summary(F)

Change cell values, e.g., turn everything smaller than 0 into 0. Before doing this, make a copy Z to work on, so that F itself stays unchanged in the working session

Z <- F

Z[Z < 0] <- 0 # type Z to check the result

Z[Z != 0] <- 1

Notice that by now, we have recoded the entire data frame to zeros and ones

Z[2,3] # second row and third column in data frame.

Z[ ,3] # third column.

Z[ ,2:3] # for all rows, second-till-third column, e.g.,

Z[ ,2:3] * 5

Y <- Z[ ,-2] # remove second column and keep the rest.

For other sorts of subset taking, check out the subset command online; Google “R subset”.
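As a brief sketch of row selection by a condition, with and without subset(); the data frame houses and its values are invented for this example.

```r
# Hypothetical data frame, for illustration only
houses <- data.frame(size  = c(80, 120, 65, 150),
                     price = c(200, 340, 150, 410))

big  <- houses[houses$size > 100, ]                 # rows that meet a condition
big2 <- subset(houses, size > 100, select = price)  # same rows, one column
big; big2
```

Inside subset() you can write column names directly, without the houses$ prefix.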

Take (a chunk of) a data frame, and glue (a chunk of) another data frame to the right[2],

X <- data.frame(F[ ,2], Y) ; X

Recode certain values of a variable, e.g., 9 (missings in another program) to NA (missings in R), for the variable in column 1 of data frame F:

for (i in 1:nrow(F)) {

  # skip cells that are already NA; comparing NA inside if() gives an error

  if (!is.na(F[i,1]) && F[i,1] == 9) {F[i,1] <- NA}

}

There are shorter ways to recode in R but in this way it is very clear what happens, and you learn to use the for-loop, which you will need anyway. As said above, the entire data frame can be recoded briefly by

F[F == 9] <- NA

F[F == ""] <- NA # recode empty character cells to NA; use " " if the cells contain a single space
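To recode a single column without a loop, one possible sketch is the following; the data frame d and its values are invented, and %in% is used here as one option because it never returns NA.

```r
# Hypothetical data; column a already contains one NA
d <- data.frame(a = c(1, 9, NA), b = c(9, 2, 9))

d$a[d$a %in% 9] <- NA   # recode 9 to NA in column a only
d                       # column b is untouched
```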

If missing data cause trouble when calculating, for example, mean(F[ ,1]), you can ignore the missing data in the computation,

mean(F[ ,1], na.rm = TRUE)

The number of missing data in data frame F equals

sum(is.na(F))
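Counting missing data per column, or dropping incomplete rows, can be sketched as follows; the data frame d is made up for this example.

```r
# Hypothetical data with missing values
d <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6))

colSums(is.na(d))                    # number of NAs per column
complete <- d[complete.cases(d), ]   # keep only rows without any NA
mean(d$a, na.rm = TRUE)              # mean of the non-missing values of a
```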

Make variable names, e.g., "size_house" and "wealth". You may use spaces and accents in names, but these frills could make data incompatible with other software, so don’t do this. Beware that R can only read straight quotation marks, ", not the curly ones copied and pasted from Word.

colnames(F) <- c("size_house", "wealth", "income"); F

You can use variable names to refer to columns, e.g., instead of F[ ,2] you can write F$wealth. Beware that a misspelled name after $ returns NULL without an error message, which can cause confusing errors later on. If you have many variables, however, names are the better choice, because column numbers are easily confused.
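The different ways of referring to a column can be compared in a short sketch; the data frame d and the misspelling welth are made up for this example.

```r
# Hypothetical data frame, for illustration only
d <- data.frame(size_house = c(80, 120), wealth = c(3, 5))

d$wealth          # column by name
d[ , "wealth"]    # same column
d[ , 2]           # same column, by position

is.null(d$welth)                          # TRUE: a misspelled name gives NULL, no error
res <- try(d[ , "welth"], silent = TRUE)  # the bracket form does raise an error
inherits(res, "try-error")
```

The bracket form with a quoted name is therefore a little safer in scripts, because a typo stops the script instead of passing NULL along.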

If you want to order the rows along a variable, for example "wealth" in the 2nd column,

F.ordered <- F[order(-F[ ,2]), ] # minus sign for descending

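Note that order() only returns a permutation of the row numbers, so it neither drops nor duplicates rows; tied values keep their original order and NA values are placed last by default. Don't forget the comma after order(...), which tells R that you are selecting rows.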
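A quick way to check what order() returns is to look at the index vector itself. The sketch below uses an invented data frame with a tie and a missing value.

```r
# Hypothetical data with a tie (5, 5) and a missing value
d <- data.frame(name = c("a", "b", "c", "d"), wealth = c(5, 2, 5, NA))

idx <- order(-d$wealth)   # row numbers in descending order of wealth
idx
d_ordered <- d[idx, ]     # the comma means: reorder rows, keep all columns
any(duplicated(idx))      # FALSE: each row number occurs exactly once
```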

To merge two data frames D and E on the basis of one variable, e.g., "ID", possibly with the rows in different order and possibly with missing values, first install the dplyr package: install.packages("dplyr")

library(dplyr)

F <- full_join(D, E, by="ID") # consider left_join or right_join, depending on the priority of, respectively, D or E
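If you prefer to stay within base R, the merge() function can do a comparable job. This is a sketch with invented data, not a replacement for the dplyr code above; all = TRUE is what keeps the unmatched rows of both data frames.

```r
# Hypothetical data frames with partly different IDs, in different order
D <- data.frame(ID = c(1, 2, 3), age = c(30, 41, 25))
E <- data.frame(ID = c(3, 1, 4), income = c(50, 60, 70))

M <- merge(D, E, by = "ID", all = TRUE)   # keep all rows of D and of E
M    # cells without a match are filled with NA
```

Use all.x = TRUE or all.y = TRUE instead of all = TRUE to keep only the unmatched rows of D or of E, respectively.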

Sequences of numbers, plot, and regression

X <- rnorm(50, mean = 0, sd = 1); mean(X); sd(X)

boxplot(X); hist(X) # try also: sample(X, 3)

Y <- seq(from=1, to=50, by=1)

Z <- sample(1:50, 50) # is the sample biased? Check hist(Z)

Construct a data frame

D <- data.frame(X,Y,Z)

Regression:

mo <- lm(Y ~ X, data=D); summary(mo)

Meaning of the significance codes:

code      p-value
***       [0, 0.001]
**        (0.001, 0.01]
*         (0.01, 0.05]
.         (0.05, 0.1]
(blank)   (0.1, 1]

plot(Y ~ X, xlab="X", ylab="Y", data=D, las = 1)

abline(lm(Y ~ X, data=D), col="red")
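Because X above is pure noise, the regression has little to find. As a sketch of how to pull numbers out of a fitted model, the following simulates data with a known slope; the seed, the names x, y and fit, and the true values 3 and 2 are all arbitrary choices for this example.

```r
set.seed(1)                            # arbitrary seed, for reproducibility
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 0.5)   # simulated data with known slope 2

fit <- lm(y ~ x)
coef(fit)                              # estimated intercept and slope
summary(fit)$r.squared                 # share of variance explained
predict(fit, newdata = data.frame(x = 25))   # prediction for a new x value
```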

Save your plot as pdf or png; par( ) sets the white margins on each side and can be left out if you accept the defaults.

pdf(file = "Filename.pdf") # png(...) for png format

par(mar=c(5,5,1,1)+0.1)

plot(...); abline(...)

dev.off()

Adjust the sizes of the values on the axes, the labels, and the line width with cex.axis, cex.lab and lwd arguments.
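Put together, saving a plot could look like the sketch below. The file name demo_plot.pdf, the temporary folder, the data, and the chosen sizes are assumptions made for this example.

```r
out <- file.path(tempdir(), "demo_plot.pdf")   # hypothetical file name
x <- 1:10
y <- x^2

pdf(file = out, width = 6, height = 4)   # open the file; sizes in inches
par(mar = c(5, 5, 1, 1) + 0.1)           # margins: bottom, left, top, right
plot(y ~ x, xlab = "x", ylab = "y", las = 1, cex.axis = 1.2, cex.lab = 1.4)
abline(lm(y ~ x), col = "red", lwd = 2)
dev.off()                                # close the file, else it stays incomplete
file.exists(out)
```

Everything drawn between pdf() and dev.off() goes into the file rather than to the screen.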

Matrix versus data frame

In R, there is a distinction between data types, of which numeric and character (or string) are the most widely used. In a data frame, you can put both numeric and character data, but you can't do matrix algebra. A data frame that contains only numeric data may look just like a matrix to you, but not to R, so you have to transform it into a matrix object in order to do matrix algebra. For example,

is.matrix(D) # if R says “FALSE” then:

F <- as.matrix(D) # necessary for matrix operations

Make a column vector of ones with as many rows as F has columns:

I <- matrix(1, ncol = 1, nrow = ncol(F))

F %*% I # matrix multiplication. Notice that F %*% I equals rowSums(F)

t(F) is the transpose of F

Note the difference between X %*% Y and X*Y (the latter is the element-wise, or Hadamard, product; it also works for data frames that are not matrices).
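The two kinds of product can be compared on a small matrix. In this sketch, the matrix A and the vector ones are made up for illustration.

```r
A <- matrix(1:6, nrow = 2)                   # 2 x 3 matrix, filled column by column
ones <- matrix(1, nrow = ncol(A), ncol = 1)  # 3 x 1 column of ones

A %*% ones    # matrix product: a 2 x 1 matrix holding the row sums
rowSums(A)    # the same numbers as a plain vector
A * A         # element-wise (Hadamard) product, same shape as A
A %*% t(A)    # matrix product with the transpose: a 2 x 2 matrix
```

For %*% the number of columns of the first matrix must equal the number of rows of the second, whereas * requires both to have the same shape.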

To check the current working directory, type getwd(). To set it, use the menu File > Change dir, or the command setwd("/discname/foldername/and_so_on")

To check the files in your working directory, type list.files()

To check the objects in the workspace of the current working session, type objects(), and clean up objects that are not necessary with, e.g., remove(X,Y,Z)

Practice all above over and over till the material becomes routine.

General

In sum, R works with objects (coherent packages of information) to which you give names of your choice and assign values, data frames, or other content. Objects are manipulated by functions, which compute or plot things and which you can recognize by the parentheses that enclose their arguments. Functions are also objects, and you can often nest them a couple of (but not too many) times, e.g., log10(log10(100)). For specialized tasks you will use additional packages; for instance, see the data.table package to import and handle big data files. To check the number of available packages,

nrow(available.packages(repos = "http://cran.us.r-project.org"))

Your questions have in all likelihood been answered by others; just Google.

If each piece of code has been tested, all code together can be kept in one script, "myscript.txt", including the setting of the working directory, and can be run in one go by the command source("myscript.txt") while you spend your time on other things.
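To try source() without touching your own files, you can write a tiny script to a temporary file first. In this sketch, the script content and the names a and b are invented for the example.

```r
script <- tempfile(fileext = ".R")   # hypothetical script file
writeLines(c("a <- 2 * 3",
             "b <- sqrt(a + 3)"), script)

source(script)   # runs the lines of the script in order
b                # objects created by the script are now in the workspace
```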

Good practices

Your program will only survive the test of time, and be accepted by your reviewers, if it is understandable beyond your short-term memory. After all, your co-authors and reviewers want to understand it, too, and years after you publish a paper based on it, someone may challenge your results. Hence document your code carefully: use comments to point out what each piece of code is supposed to do; put titles above each block, or module, of code, and certainly at the top; put empty lines between the blocks; indent loops and if blocks; use variable names that are either self-explanatory or a single letter, in which case a comment should say what the letter means. Finally, note down the versions of R and of the packages you use.
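The versions of R and of the packages can be read from within R. This is a sketch; the stats package is used here only because it is always installed with R.

```r
R.version.string          # the R release you are running
packageVersion("stats")   # version of a single package
sessionInfo()             # R version, platform, and attached packages
```

Copying the output of sessionInfo() into a comment at the top of your script is one simple way to keep this record.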

These guidelines apply to all computer languages. For R in particular, it is good practice to use the base version as much as possible, and only use packages if you could otherwise not do your work, because the base version is much better tested than most of the packages, which sometimes contain bugs. Exceptions are widely used packages made by highly skillful programmers.

Footnotes

[1] Rounding: computers are no mathematicians, and if, for example, the result of a computation is a vector of identical numbers, their standard deviation is mathematically zero, but because of rounding noise in the computer, the standard deviation can be slightly higher than zero.

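[2] The shorter cbind() command also glues columns together, but applied to plain vectors it returns a matrix in which all values share one type, so use it only for numeric variables. To stack data frames that have the same columns on top of each other, use rbind()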
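The rounding noise mentioned in footnote [1] also affects comparisons of decimal numbers. A short sketch, in which the name a is arbitrary:

```r
a <- 0.1 + 0.2
a == 0.3                    # FALSE, because of binary rounding
isTRUE(all.equal(a, 0.3))   # TRUE: equal up to a small tolerance
```

When comparing computed decimal numbers, all.equal() is therefore usually a safer test than ==.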