Reading and manipulating data

Data input and output

Getting data into R

We have already seen very simple examples of how to input data values into R, using the R console or, equivalently, commands in a script. Again, for example

>a<-2

sets the value of the object a to 2. A more complicated example occurs when we have a sequence of values, which R stores as a vector. For example, suppose that we have the days on which an animal is captured over a 5-day capture study. For an individual animal we might capture the animal on the first occasion, not on the second, and then recapture it on each of occasions 3 to 5. We could summarize the data for that animal in a vector as

>captures<- c(1,3,4,5)

where “c()” is the combine function, which joins its arguments into a single vector (here, a vector of numbers).

If we type “captures” on the command line, we get the 4 values displayed, separated by spaces.

> captures

[1] 1 3 4 5
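
As a small sketch of what can be done with such a vector, the lines below rebuild captures and turn it into a 0/1 record over the 5 occasions. The names occasions and history are illustrative only and are not used elsewhere in these notes.

```r
# Days on which the animal was captured (from the example above)
captures <- c(1, 3, 4, 5)

# Number of capture events recorded for this animal
length(captures)

# The 5 sampling occasions of the study
occasions <- 1:5

# 1 if the animal was captured on that occasion, 0 otherwise
history <- as.integer(occasions %in% captures)
history
```

Here history should display as 1 0 1 1 1, matching the description of an animal caught on occasion 1, missed on occasion 2, and recaptured on occasions 3 to 5.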

A dataframe is an R object in which the data entries are in rows and columns, with the rows being individual observations and the columns being the different variables that are recorded for each observation. Dataframes can be built interactively, or “coerced” (converted) from other objects. It will often be convenient to read data from an external file (or files) and convert them into a dataframe.
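
A minimal sketch of both routes, building a dataframe interactively and coercing one from another object, might look like the following. All names and values here (animals, id, mass, m, m_df) are made up for illustration.

```r
# Build a dataframe interactively; all values are made up for illustration
animals <- data.frame(
  id   = c("A", "B", "C"),
  mass = c(21.5, 19.8, 23.1)
)

# Coerce a matrix (3 rows, 2 named columns) into a dataframe
m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("day", "temp")))
m_df <- as.data.frame(m)

str(animals)   # structure: one row per animal, one column per variable
dim(m_df)      # number of rows and columns
```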

We can illustrate both the use of external files and the creation of dataframes by a simple elaboration of the capture example. Suppose that we capture and mark several animals (I hope it's a lot, but our example will be small). Now our records consist of 2 pieces of information: (1) a label for the animal's identity (say, "A", "B", "C", etc.) and (2) the days on which the animal was captured. We might also have information unique to each animal (such as its body mass at capture) or unique to the capture day (e.g., temperature). Our example data now might look like this

and are saved in this column format in a comma-delimited text (csv) file; csv files are easily created using Excel or another spreadsheet program, or as output from most common data management programs (dBase, Access, etc.). This is a typical (and good) way to record field data in a capture-mark-recapture (CMR) study; we will see later how to convert this into a format that can be used for our CMR programs.

In the above, the labels "NA" stand for "not available", meaning that (in this example) mass was not recorded for some animals on some of the study days. This is a standard way of denoting missing values in R (and is strongly advised over other approaches such as leaving the field blank, using dashes, 99999, or some such).
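
To see why NA is preferable to a placeholder number, here is a short sketch of how R treats missing values in a calculation. The mass values are hypothetical and are not taken from the example file.

```r
# Hypothetical body masses, with one value not recorded
mass <- c(21.5, NA, 23.1)

is.na(mass)               # flags the missing entry
mean(mass)                # returns NA: missing values propagate by default
mean(mass, na.rm = TRUE)  # drops the NA before averaging
```

Had the missing mass been entered as 99999 instead, mean(mass) would have silently returned a badly wrong number rather than NA.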

The command

>captures<-read.csv("capture_data.csv",header=T)

will read the data from the csv file (expecting that the columns are separated by commas). The specification header=T (shorthand for header=TRUE) tells R to expect the first line of the file to contain the variable names. Entering the command “captures” confirms that the data were read in properly. Dataframes will be the standard way we input and manipulate data in R, and are readily used by most of the programs we will use for analysis. Finally, the attached script file saves the R commands used to read the data and create the dataframe, and also produces some simple summary statistics.
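
The following sketch shows the whole read step in a form that can be run without the course files. It writes a tiny, made-up csv to a temporary file and reads it back. The column names ID, Day, and Mass and all values are assumptions for illustration; the real capture_data.csv may be laid out differently.

```r
# Write a tiny, made-up csv to a temporary file so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "ID,Day,Mass,Temp",
  "A,1,21.5,18",
  "A,3,NA,20",
  "B,2,19.8,17"
), csv_path)

# header = TRUE: the first line of the file holds the variable names
captures <- read.csv(csv_path, header = TRUE)

str(captures)            # column names and types
head(captures)           # first rows of the dataframe
summary(captures$Mass)   # simple summary; missing values are counted separately
```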

Saving data/results

There are a number of ways to save data, computations, and other results from R sessions. We’ve already seen how R objects can be created, and can be referred to in later R sessions if the workspace is saved and restored. There are many other formats under which data can be saved using R, but we will focus on simple formats similar to the ones we just used to read in data: the delimited text file. The command “write.table” or, even more simply, “write.csv” will take a dataframe as input and produce a delimited text file as output (comma-delimited in the case of write.csv).

For example the command

>write.csv(captures,file='captures.new.csv')

will take the current dataframe captures and write it to a new file (captures.new.csv), preserving the column headers and the comma-separated format. Of course we aren’t usually going to care about simply cloning copies of the data this way, so a more practical use occurs when we compute new quantities in our program and wish to save the results. For example, suppose that we wished to create a new variable called Temp.F, in which we convert Temp from degrees Celsius to degrees Fahrenheit. We could create the new variable and include it in the revised output file by the commands

temp.F<-function(C){32+9/5*C}

#apply to the temp observations in dataframe

captures$Temp.F<-temp.F(captures$Temp)

write.csv(captures,file="captures.new.csv")

A few things are worth noting here:

I snuck in a user-defined function to convert temperature. More on this later.
R is case sensitive! So temp and Temp are 2 different objects. This can be a huge source of errors if you're not careful.
By default write.csv (and write.table) will write new data to the specified file name, so if data are present these will be overwritten. Data can be appended to existing data by using the write.table command and the “append=TRUE” option. The write.table command also permits different types of delimiters and options for headers. We will use write.table or other approaches for writing data to files as these more specific needs arise.
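
A sketch of the overwrite and append behavior described above, using a temporary file and a made-up stand-in for the captures dataframe (the ID column and all values are assumptions):

```r
# Made-up dataframe standing in for captures
captures <- data.frame(ID = c("A", "B"), Temp = c(18, 20))
captures$Temp.F <- 32 + 9 / 5 * captures$Temp

out_path <- tempfile(fileext = ".csv")

# row.names = FALSE omits the extra column of row numbers
# that write.csv adds by default
write.csv(captures, file = out_path, row.names = FALSE)

# Append one more record with write.table;
# col.names = FALSE avoids repeating the header line
extra <- data.frame(ID = "C", Temp = 15, Temp.F = 32 + 9 / 5 * 15)
write.table(extra, file = out_path, append = TRUE, sep = ",",
            row.names = FALSE, col.names = FALSE)

# Read the file back to confirm it now holds all three records
read.csv(out_path)
```

Calling write.csv a second time on the same file name would replace its contents, which is why the appended record goes through write.table instead.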

Summarizing and manipulating data

Here we will use a slightly larger data example to illustrate some basic methods for summarizing and manipulating data. The data example involves 101 samples of insect counts along an elevation gradient from 0 to 1000 m in 10-m intervals; the data are contained in a csv file. We read the data into R using the commands

>insects<-read.table("insects.csv",header=T,sep=",")

>attach(insects)
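
The attach command lets us refer to the columns by bare name (elevation, count), but an attached column can be masked by a workspace object of the same name. As a hedged alternative, columns can be referenced explicitly. The sketch below uses a stand-in dataframe that rebuilds only the plot and elevation columns from the description above; the counts are not reproduced.

```r
# Stand-in for the insects data: only plot number and the elevation
# gradient (0 to 1000 m in 10-m steps) are rebuilt here
insects <- data.frame(plot = 1:101, elevation = seq(0, 1000, by = 10))

# Refer to a column explicitly with $
mean(insects$elevation)

# Or evaluate an expression inside the dataframe with with()
with(insects, range(elevation))
```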

First, we can obtain some simple summary statistics either for all the variables at once, or for a single variable at a time. For example

>summary(insects)

provides

      plot       elevation        count
 Min.   :  1   Min.   :   0   Min.   :  156
 1st Qu.: 26   1st Qu.: 250   1st Qu.:  528
 Median : 51   Median : 500   Median : 1856
 Mean   : 51   Mean   : 500   Mean   : 4452
 3rd Qu.: 76   3rd Qu.: 750   3rd Qu.: 6390
 Max.   :101   Max.   :1000   Max.   :21991

> mean(elevation)

[1] 500

> sd(elevation)

[1] 293.0017

> range(elevation)

[1] 0 1000

> mean(count)

[1] 4451.911

> sd(count)

[1] 5572.118

> range(count)

[1] 156 21991

A variety of other summary statistics, such as medians and percentiles, are obtained similarly. (Note that R's built-in mode() function reports how an object is stored, not the statistical mode.) For example

>quantile(count,c(.01,.5,.9))

provides the 1%, 50%, and 90% quantiles (percentiles) of the variable ‘count’.

Likewise,

>median(count)

provides the median value for count, which you can confirm is the same as the 50th percentile.
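
The same check can be tried without the data file on the elevation variable, which can be rebuilt from its description (0 to 1000 m in 10-m steps). This is only a sketch; the object q is an illustrative name.

```r
# Rebuild the elevation variable: 0 to 1000 m in 10-m steps
elevation <- seq(0, 1000, by = 10)

length(elevation)   # number of samples
sd(elevation)       # compare with the value reported above

# Quantiles are returned as a named vector ("1%", "50%", "90%")
q <- quantile(elevation, c(0.01, 0.5, 0.9))
q

# The 50% quantile and the median agree
isTRUE(all.equal(unname(q["50%"]), median(elevation)))
```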

All of the above commands are saved in an R script file.

