R tutorial 1: reading and manipulating data

[I originally posted this on the Raptor's Nest blog, so there are references to Blogger here; just ignore them.]

I had been preparing a comprehensive tutorial on how to plot in R (The R Project) with different groups differentiated in different colours, but Blogger stupidly erased my post and decided to automatically save my empty draft at that precise moment. Since I cannot reproduce the original post, I decided to break it up into a series of smaller topics.

There are plenty of R resources available in various places, but I found that they tend to fall into one of two extremes: either too basic or too advanced. I think of myself as an intermediate user (i.e., I can comfortably handle canned packages but want a bit more control than the default settings allow), so the type of info I find is not too helpful. So I thought it would benefit others like me if I summed up some of the simple things I learned over the last year or two.

As the first of these posts, I will deal with reading in and manipulating data. These topics may be very simple and basic, but some of the things I wanted to do required a bit more than reading a manual. I will try to explain things as simply as I can so that beginners can also find some use in these posts.

So here we go.

First, we should set up the working directory. This is the directory (or folder) where you want R to read in data from and write out results to. You don't have to do this but it's sometimes useful to do so.

In Windows, you can find a menu item "Change dir..." under the "File" menu. On a Mac this would be under the "Misc" (Miscellaneous) menu. This prompts you to select a directory. I don't particularly like this approach because it takes time to navigate through many levels of directories to get to the one you are looking for; e.g. select "C Drive", select "Users", select "YOUR USERNAME", select "Documents"… etc… or whatever your pathway is.

An alternative is to use the setwd() function, for instance like this:

setwd("C:/Users/User Name/Documents/FOLDER")

Note that the pathway (C:/…) has to be within quotes ("…") and the pathway separators are forward slashes (/) instead of the backslashes (\) shown in Windows pathway displays. If you are unsure whether you have set your working directory correctly, you can check it with getwd(), which returns the current working directory.

Now that you have set your working directory, we can start reading in our data. This requires your data to be stored as a tab-delimited txt file or something similar, such as a comma-delimited csv file. For this example, I will use my published dataset of theropod biting performance measures. The txt file looks roughly as follows:
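As a rough sketch of setting and checking the working directory: here tempdir() is only a stand-in for your own folder, and old_wd is a name I made up to remember where R started.

```r
# Remember the current working directory so it can be restored afterwards.
old_wd <- getwd()

# tempdir() stands in for your own folder,
# e.g. "C:/Users/User Name/Documents/FOLDER".
setwd(tempdir())
getwd()          # prints the directory R is now working in

# On Windows a backslash path also works if every backslash is doubled,
# e.g. "C:\\Users\\User Name\\Documents\\FOLDER".

setwd(old_wd)    # go back to where we started
```

Restoring the old directory at the end is optional, but it keeps the example from changing your session.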

Taxa                 B0           B1            B2        Family
Acrocanthosaurus     0.307931296  -0.00329298   3.28E-05  Allosauroidea
Allosaurus           0.302008604  -0.002847656  2.04E-05  Allosauroidea
Archaeopteryx        0.142338967  -0.000870802  2.98E-06  Aves
Bambiraptor          0.181541103  -0.001606     1.10E-05  Dromaeosauridae
Baryonychid          0.189377202  -0.00237557   2.20E-05  Basal_Tetanurae
Carcharodontosaurus  0.368623687  -0.005015715  5.82E-05  Allosauroidea
...

The first column contains the names of the theropods, second to fourth the data and the fifth column the family names, as evident from the first row. We want to keep this structure so we will read in the data telling R to acknowledge the first row as the header and the first column as the row names:

data <- read.table("FILENAME.txt", header=T, row.names=1)

Here the data is read in and stored as an object called “data”. The FILENAME has to be within quotes. The bit “header=T” or “header=TRUE” specifies that the first row is a header and “row.names=1” specifies the first column as row names. You can review your data by typing in “data”, which prints out your data table, or you can type “str(data)”, which shows a compact description of the structure of your object “data”. The latter prints a summary that looks like this:

> str(data)
'data.frame': 42 obs. of 4 variables:
 $ B0    : num 0.308 0.302 0.142 0.182 0.189 ...
 $ B1    : num -0.00329 -0.00285 -0.00087 -0.00161 -0.00238 ...
 $ B2    : num 3.28e-05 2.04e-05 2.98e-06 1.10e-05 2.20e-05 ...
 $ Family: Factor w/ 13 levels "Allosauroidea",..: 1 1 2 6 4 1 8 8 11 5 ...

This tells us that object “data” is of the class “data.frame” with 42 observations (our 42 dinosaurs) and 4 variables (B0, B1, B2, and Family). Variables “B0”, “B1”, and “B2” are numerical data but “Family” is a factor. (From R 4.0.0 onwards read.table() no longer converts text columns to factors by default, so “Family” will show up as “chr” unless you add “stringsAsFactors=TRUE” to the call.) For some analyses like principal components analysis, non-numerical variables like “Family” cannot be included, so we will have to exclude this variable (more on this later). The variables (or any other content of an object) are indicated by a “$” and you can always call up an individual variable within an object, e.g. “data$B0”. This is useful when you want to use specific components of an object for analyses (for instance a regression of B0 against B1) or plotting (e.g. B0 against B1) (more on plotting in my next post).

Next, I’d like to briefly explain the structure of R data tables. “data” is a table of 42 rows by 4 columns, and R addresses tables in the format [rows,columns]. So if you want to see the B2 value for Allosaurus, you would type “data[2,3]”, because Allosaurus is the second row and B2 is the third column, and R will return that value, which is “2.04e-05”. Similarly, if you want to review all the values for B0, you would type “data[,1]” to call up the entire first column (or alternatively you can type “data$B0” as I’ve described above). If you want to review all the values for a given taxon (row), let’s say Allosaurus, you would type “data[2,]”, which returns:
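To try the reading step without the full dataset, here is a minimal sketch that writes the six rows shown in the listing above to a temporary file and reads them back. The temporary file name (sample_file) is hypothetical, and the sep and stringsAsFactors arguments are my additions, not part of the original call.

```r
# Write a six-row stand-in for the real file (values copied from the listing).
sample_file <- tempfile(fileext = ".txt")   # hypothetical file name
writeLines(c(
  "Taxa\tB0\tB1\tB2\tFamily",
  "Acrocanthosaurus\t0.307931296\t-0.00329298\t3.28E-05\tAllosauroidea",
  "Allosaurus\t0.302008604\t-0.002847656\t2.04E-05\tAllosauroidea",
  "Archaeopteryx\t0.142338967\t-0.000870802\t2.98E-06\tAves",
  "Bambiraptor\t0.181541103\t-0.001606\t1.10E-05\tDromaeosauridae",
  "Baryonychid\t0.189377202\t-0.00237557\t2.20E-05\tBasal_Tetanurae",
  "Carcharodontosaurus\t0.368623687\t-0.005015715\t5.82E-05\tAllosauroidea"
), sample_file)

# First row is the header, first column becomes the row names.
# sep = "\t" states the delimiter explicitly; stringsAsFactors = TRUE
# makes Family a factor on R >= 4.0.0 as well.
data <- read.table(sample_file, header = TRUE, row.names = 1,
                   sep = "\t", stringsAsFactors = TRUE)

str(data)            # compact description of the object
data$B0              # one variable, pulled out with $
levels(data$Family)  # the family names stored in the factor
```

For a comma-delimited file the same idea should work with sep = "," (or with read.csv()).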

> data[2,]
                  B0           B1       B2        Family
Allosaurus 0.3020086 -0.002847656 2.04e-05 Allosauroidea
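The [rows,columns] indexing can be tried on a small stand-in built from the first three rows of the listing. Addressing a cell by row and column name is an extra I am adding here; it is not used elsewhere in this post.

```r
# Three-row stand-in for the full 42-row table.
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967),
  B1 = c(-0.00329298, -0.002847656, -0.000870802),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

dim(data)                  # number of rows, then number of columns
data[2, 3]                 # row 2, column 3: B2 for Allosaurus
data["Allosaurus", "B2"]   # the same cell, addressed by name
data[, 1]                  # the whole first column, same as data$B0
data[2, ]                  # the whole second row
```

Indexing by name is a bit more typing, but it keeps working if the rows or columns get reordered.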

Now we can move on to manipulating data in the simplest ways. As I’ve mentioned above, some analyses don’t like non-numerical data and we would have to eliminate the column “Family” from “data” for these analyses. One way to do this is to compile a new table using the cbind() function like this:

data2 <- cbind(data$B0, data$B1, data$B2)  # see *1 below

This will bind the vectors “data$B0”, “data$B1”, and “data$B2” together into a table. Unfortunately, the row names and column headers are stripped in the process so we have to assign them again. For row names we can simply take them from “data”:

rownames(data2) <- rownames(data)  # see *1 below

Column names on the other hand are a bit more troublesome as there are four columns in “data” and only three in “data2”. We have to directly name them like this:

colnames(data2) <- c("B0", "B1", "B2")  # see *2 below

The function cbind() also creates an object of class “matrix” here, so if you want a “data.frame” instead (which is useful if you want to use the $ operator to call individual columns) then we’d need to reassign “data2” as a data.frame object:

data2 <- data.frame(data2)  # UPDATE: this step is unnecessary if you follow *1 below
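Put together, the cbind() route looks roughly like this on a three-row stand-in for “data”. The as.matrix() line at the end is an alternative I am adding for comparison (m is a made-up name), not something from the original steps.

```r
# Three-row stand-in for the full table.
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967),
  B1 = c(-0.00329298, -0.002847656, -0.000870802),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

# Bind the three numeric vectors into a table; names are lost here.
data2 <- cbind(data$B0, data$B1, data$B2)
is.matrix(data2)                         # cbind() on vectors gives a matrix

# Put the row and column names back.
rownames(data2) <- rownames(data)
colnames(data2) <- c("B0", "B1", "B2")

# Convert to a data.frame so that $ works again.
data2 <- data.frame(data2)
data2$B0

# Alternative: if a matrix is what you want, as.matrix() on the numeric
# columns keeps both sets of names.
m <- as.matrix(data[, 1:3])
```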

Using cbind() to create a data table of desired columns is fine just as long as the number of variables is manageable. In many cases (such as large multivariate data sets) this is not practical, so we need to resort to an alternative, which is to delete columns or rows. How to delete rows/columns is not obvious in R and it took me a bit of searching before I found how to do it. Let’s start with deleting a variable, in our example, the non-numerical variable “Family”. Since Family is the fourth column in “data”, we have to somehow eliminate data[,4]. It turns out that it is actually quite simple; just put a “-” in front of the column (or row) number:

data3 <- data[,-4]

By typing in “length(data3[1,])”, which shows you the number of items in the first row of the new data set “data3”, R should return a value of “3” (“ncol(data3)” gives the same answer more directly). The command “str(data3)” should also give a short list with three variables.

The same can be done for rows; just put a “-” in front of the row number you wish to eliminate. For instance, if we want to delete Allosaurus from “data3”, then we would type:
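Here is a small sketch of dropping the “Family” column and checking the result, again on a three-row stand-in. The last step, which picks out the numeric columns with sapply() so you do not need to know the column position, is my own addition (numeric_only is a made-up name).

```r
# Three-row stand-in for the full table.
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967),
  B1 = c(-0.00329298, -0.002847656, -0.000870802),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

# Drop the fourth column (Family).
data3 <- data[, -4]

length(data3[1, ])   # items in the first row
ncol(data3)          # the same count, asked for directly
str(data3)

# Alternative: keep whichever columns are numeric, wherever they sit.
numeric_only <- data[, sapply(data, is.numeric)]
```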

data4 <- data3[-2,]

We can also delete multiple rows (or columns) at once. I will give an example first:

data5 <- data3[-c(2,7),]

Here, I specified the second and seventh rows to be deleted from “data3”. The “c(2,7)” combines the values “2” and “7” into a vector, which is how R likes to receive a set of values. So the row part of data3[row,column] is a vector containing the values “2” and “7”, and the “-” in front of it tells R to leave out the rows listed in this vector. Of course, you can always simply repeat the code used to produce “data4” (see above) and eventually get the same thing as “data5”, but that involves some tedious coding if you have a lot of rows to eliminate.
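A quick sketch of deleting rows, using the six taxa from the listing. Since this stand-in only has six rows I drop rows 2 and 6 rather than 2 and 7. Dropping rows by name with %in% is an alternative I am adding (drop_taxa and data5b are made-up names).

```r
# Six-row stand-in with two of the numeric columns.
data3 <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967,
         0.181541103, 0.189377202, 0.368623687),
  B1 = c(-0.00329298, -0.002847656, -0.000870802,
         -0.001606, -0.00237557, -0.005015715),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx",
                "Bambiraptor", "Baryonychid", "Carcharodontosaurus")
)

data4 <- data3[-2, ]          # everything except row 2 (Allosaurus)
data5 <- data3[-c(2, 6), ]    # everything except rows 2 and 6

# Alternative: name the taxa to drop instead of counting rows.
drop_taxa <- c("Allosaurus", "Carcharodontosaurus")
data5b <- data3[!(rownames(data3) %in% drop_taxa), ]

nrow(data3)                   # the original object is left untouched
```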

Multiple columns can also be deleted simultaneously in a similar manner:

data6 <- data[,-c(3,4)]

This removes columns 3 and 4 from the original data set “data” (which incidentally is still stored within R’s memory as a separate object because all the data manipulation has been stored under new names each time, i.e. “dataN”). The resulting “data6” should now have two columns, “B0” and “B1”.
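The column version can be sketched the same way on a three-row stand-in. Removing columns by name and the drop=FALSE argument are additions of mine; the latter matters when only one column is left, because R otherwise hands back a plain vector instead of a table.

```r
# Three-row stand-in for the full table.
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967),
  B1 = c(-0.00329298, -0.002847656, -0.000870802),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

data6 <- data[, -c(3, 4)]                 # drop columns 3 and 4
names(data6)

# Alternative: drop the columns by name rather than by position.
data6b <- data[, !(names(data) %in% c("B2", "Family"))]

# Dropping down to a single column returns a bare vector by default;
# drop = FALSE keeps it as a one-column data.frame with its row names.
just_values <- data[, -c(2, 3, 4)]
one_column  <- data[, -c(2, 3, 4), drop = FALSE]
```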

I think that’s enough for now. In my next post I will either explain how to deal with missing data or how to plot basic X-Y plots but with colours (families plotted in different colours).

[UPDATE]

*1. I've since found a way of binding columns and assigning row names in a single line of script:

data2 <- data.frame(data$B0, data$B1, data$B2, row.names=row.names(data))

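This should make a new data.frame object called data2 with the columns data$B0, data$B1 and data$B2, assigning row names according to those of data by the argument row.names=row.names(data). Note that the columns come out named data.B0, data.B1 and data.B2, so you will probably still want to rename them (see *2).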

*2. There is also an easy way of taking certain column names from another object, in this case data:

colnames(data2) <- colnames(data)[1:3]

This assigns the first to third elements of the vector containing the column names of data to the columns of data2 (1:3 is shorthand for the values 1, 2 and 3). Alternatively you can do:

colnames(data2) <- colnames(data)[c(1,2,3)]
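Both updates can be tried together on a three-row stand-in. The last line, which names the columns while building the data.frame, is a variation I am adding rather than something from the notes above (auto_names and data2b are made-up names).

```r
# Three-row stand-in for the full table.
data <- data.frame(
  B0 = c(0.307931296, 0.302008604, 0.142338967),
  B1 = c(-0.00329298, -0.002847656, -0.000870802),
  B2 = c(3.28e-05, 2.04e-05, 2.98e-06),
  Family = factor(c("Allosauroidea", "Allosauroidea", "Aves")),
  row.names = c("Acrocanthosaurus", "Allosaurus", "Archaeopteryx")
)

# *1: bind the columns and carry the row names over in one go.
data2 <- data.frame(data$B0, data$B1, data$B2, row.names = row.names(data))
auto_names <- colnames(data2)    # names R built from the expressions
auto_names

# *2: borrow the first three column names from data.
colnames(data2) <- colnames(data)[1:3]
colnames(data2)

# Variation: give the names while building the data.frame.
data2b <- data.frame(B0 = data$B0, B1 = data$B1, B2 = data$B2,
                     row.names = row.names(data))
```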