Advanced Data Handling

We have seen some examples of how to read data into R and create a data frame, our basic workhorse for manipulating and analyzing data. Here we are going to cover some slightly more advanced topics in data management to handle situations that I guarantee you are going to encounter over and over again in your work. These include:

Sorting on one or more variables in a data set
Selecting a subset of data based on a range of data values or keeping only a subset of the variables (fields)
Summarizing data to create frequency tables and lists, means, sums, and other statistics
Merging data sets that have overlapping fields (variables)

We'll go through these one by one, but first let's review some basics of getting the data into a data frame. You will need:

the R- code to read the data and perform analyses (you need to set your working directory)
the example occupancy, weather, and site data (put these in your working directory)

I'll illustrate with 3 simple data sets that we will read in and manipulate. There are 1) an occupancy data set with presence/absence detections on 5 sites over 5 days , 2) a weather data set with temperature readings each day, and 3) a list of measurements of habitat (cover) on each site. Notice that the data sets are different sizes, reflecting that presence can change over sites and days, temperature over days but not sites (in this example), and cover over sites but not over days.

First read the data in

> #SET THE WORKING DIRECTORY

> #set up your working directory. Here's mine:

> data.dir<-"C:/Users/mike/Dropbox/teaching/FANR4460/spring2014"

> setwd(data.dir)

> #READ IN OCCUPANCY WEATHER AND SITE DATA

> occupancy<-read.csv(file="occupancy.csv")

> weather<-read.csv(file="weather.csv")

> site<-read.csv(file="site_variables.csv")

> #DISPLAY THE 3 DATA FRAMES

> occupancy

day site presence

1 1 1 0

2 1 2 0

3 1 3 1

4 1 4 1

5 1 5 1

6 2 1 1

7 2 2 1

8 2 3 1

9 2 4 0

10 2 5 0

11 3 1 0

12 3 2 1

13 3 3 0

14 3 4 0

15 3 5 1

16 4 1 1

17 4 2 1

18 4 3 0

19 4 4 1

20 4 5 0

21 5 1 1

22 5 2 1

23 5 3 0

24 5 4 1

25 5 5 1

> weather

day temp

1 1 19.14794

2 2 15.50146

3 3 12.33415

4 4 10.19110

5 5 13.77786

> site

site cover

1 1 49.05201

2 2 44.43004

3 3 54.62078

4 4 49.47311

5 5 45.74449

>

It's usually a good idea to perform some simple summaries (for one thing this will diagnose wacky values, which are often errors in data entry)

> summary(occupancy)

day site presence

Min. :1 Min. :1 Min. :0.0

1st Qu.:2 1st Qu.:2 1st Qu.:0.0

Median :3 Median :3 Median :1.0

Mean :3 Mean :3 Mean :0.6

3rd Qu.:4 3rd Qu.:4 3rd Qu.:1.0

Max. :5 Max. :5 Max. :1.0

> summary(weather)

day temp

Min. :1 Min. :10.19

1st Qu.:2 1st Qu.:12.33

Median :3 Median :13.78

Mean :3 Mean :14.19

3rd Qu.:4 3rd Qu.:15.50

Max. :5 Max. :19.15

> summary(site)

site cover

Min. :1 Min. :44.43

1st Qu.:2 1st Qu.:45.74

Median :3 Median :49.05

Mean :3 Mean :48.66

3rd Qu.:4 3rd Qu.:49.47

Max. :5 Max. :54.62

Some R functions will require us to access the variable names in a data frame directly. This can be done with 'attach' but I have found that this can lead to confusioo and advise against it. One method is to use the data frame name followed by $ and the variable name. For example,

>occupancy$day

will provide a list of all the "day" observations in the data set. Another very useful method is the 'with' command as illustrated below:

> #I PREFER TO USE "WITH" TO DESIGNATE THAT I AM USING A PARTICULAR DATA SET THAN 'INCLUDE' THE INCLUDE THE VARIABLES AS LISTS. I FIND THIS LESS CONFUSING OVERALL

>

> #SORTING ON A VARIABLE

> with(occupancy,occupancy[order(presence),])

day site presence

1 1 1 0

2 1 2 0

9 2 4 0

10 2 5 0

11 3 1 0

13 3 3 0

14 3 4 0

18 4 3 0

20 4 5 0

23 5 3 0

3 1 3 1

4 1 4 1

5 1 5 1

6 2 1 1

7 2 2 1

8 2 3 1

12 3 2 1

15 3 5 1

16 4 1 1

17 4 2 1

19 4 4 1

21 5 1 1

22 5 2 1

24 5 4 1

25 5 5 1

>

> #2 VARIABLES

> with(occupancy,occupancy[order(site,presence),])

day site presence

1 1 1 0

11 3 1 0

6 2 1 1

16 4 1 1

21 5 1 1

2 1 2 0

7 2 2 1

12 3 2 1

17 4 2 1

22 5 2 1

13 3 3 0

18 4 3 0

23 5 3 0

3 1 3 1

8 2 3 1

9 2 4 0

14 3 4 0

4 1 4 1

19 4 4 1

24 5 4 1

10 2 5 0

20 4 5 0

5 1 5 1

15 3 5 1

25 5 5 1

> #if you want, use the sorted data to create a new dataframe, or replace the old one:

> occupancy2<-with(occupancy,occupancy[order(presence),])

>

Creating subsets of the data

A very common need is to be able to choose portions of the data known as subsets. These can be based on ranges of variables in the data, by selection of only certain variable columns, or both. Some examples follow

> #CREATE A SUBSET OF THE DATA

> #select a range of a variable

> subset(weather,temp<15)

day temp

3 3 12.33415

4 4 10.19110

5 5 13.77786

> #select only certain columns

> subset(occupancy,select=c(day,presence))

day presence

1 1 0

2 1 0

3 1 1

4 1 1

5 1 1

6 2 1

7 2 1

8 2 1

9 2 0

10 2 0

11 3 0

12 3 1

13 3 0

14 3 0

15 3 1

16 4 1

17 4 1

18 4 0

19 4 1

20 4 0

21 5 1

22 5 1

23 5 0

24 5 1

25 5 1

> #combine variable and field selection

> subset(occupancy,presence>0,select=c(day,presence))

day presence

3 1 1

4 1 1

5 1 1

6 2 1

7 2 1

8 2 1

12 3 1

15 3 1

16 4 1

17 4 1

19 4 1

21 5 1

22 5 1

24 5 1

25 5 1

Again, you can use any of these to create a new or replace the old data frame, e.g.,

> occupancy.new<-subset(occupancy,presence>0,select=c(day,presence))

>

Summarizing the data

R has a number useful functions for summarizing data, and these can often be combined with other data functions for more utility.

The table() function provides a frequency table of the number of observations that fall into 1 or more categories. To illustrate

>

> #SUMMARIZING DATA

> #frequency table by attributes

> with(occupancy,table(presence))

presence

0 1

10 15

provides a table of frequencies of presence and absence for the 25 observations (10 absent, 15 present). We can get a table of presence/ absence by day as follows

> #2factors

> with(occupancy,table(presence,day))

day

presence 1 2 3 4 5

0 2 2 3 2 1

1 3 3 2 3 4

Finally, we can use this table as a data frame

> #use this as a data frame

> occup_sum<-as.data.frame(with(occupancy,table(presence,day)))

> occup_sum

presence day Freq

1 0 1 2

2 1 1 3

3 0 2 2

4 1 2 3

5 0 3 3

6 1 3 2

7 0 4 2

8 1 4 3

9 0 5 1

10 1 5 4

This can be handy, because we might later want to analyze the data in this summary fashion, or combine it with other data (see below).

Joining and merging data sets

Very often you will need to join different data sets together. This can be easy in some simple cases, or a bit more difficult in more complicated ones. The simple cases involve joining together different data sets that are already lines up by having either the same number of rows (but different columns) or the same number (and importantly, variable names) of columns. We join these by the cbind() and rbind() commands, as illustrated.

> #JOINT DATA SETS TOGETHER

> #COLUMN BINDING

> #data sets A and B have same number of rows but represent different variables

> #example

> A<-subset(weather,select=day)

> B<-subset(weather,select=temp)

> A

day

1 1

2 2

3 3

4 4

5 5

> B

temp

1 19.14794

2 15.50146

3 12.33415

4 10.19110

5 13.77786

> #put them back together

> cbind(A,B)

day temp

1 1 19.14794

2 2 15.50146

3 3 12.33415

4 4 10.19110

5 5 13.77786

> #ROW BINDING

> #data sets A and B have variables (and number of columns) and we are basically pasted them together

> #example

> A<-subset(weather,day<3)

> B<-subset(weather,day>2)

> A

day temp

1 1 19.14794

2 2 15.50146

> B

day temp

3 3 12.33415

4 4 10.19110

5 5 13.77786

> #put them back together

> rbind(A,B)

day temp

1 1 19.14794

2 2 15.50146

3 3 12.33415

4 4 10.19110

5 5 13.77786

Merging data sets by common variables

Many data joining problems are not this simple, and in fact our small data examples for occupancy, weather, and site variables are a case in point.

Here are the data again:

> #example- merge the occupancy, weather , and site data. Requires 2 steps

> occupancy

day site presence

1 1 1 0

2 1 2 0

3 1 3 1

4 1 4 1

5 1 5 1

6 2 1 1

7 2 2 1

8 2 3 1

9 2 4 0

10 2 5 0

11 3 1 0

12 3 2 1

13 3 3 0

14 3 4 0

15 3 5 1

16 4 1 1

17 4 2 1

18 4 3 0

19 4 4 1

20 4 5 0

21 5 1 1

22 5 2 1

23 5 3 0

24 5 4 1

25 5 5 1

> weather

day temp

1 1 19.14794

2 2 15.50146

3 3 12.33415

4 4 10.19110

5 5 13.77786

> site

site cover

1 1 49.05201

2 2 44.43004

3 3 54.62078

4 4 49.47311

5 5 45.74449

>

Note that nothing really lines up for these data sets. The occupancy data has 25 rows and 3 columns, the site and weather have 5 rows and 2 columns but the rows and columns don't mean the same thing (e.g., rows are days in one and sites in the other). But also notice that all 3 share at least 1 variable: day or site. We can use this fact to merge the data.

> #step 1 merge occupancy and weather by common factor (day)

> combined<-merge(occupancy,weather)

> combined

day site presence temp

1 1 1 0 19.14794

2 1 2 0 19.14794

3 1 3 1 19.14794

4 1 4 1 19.14794

5 1 5 1 19.14794

6 2 1 1 15.50146

7 2 2 1 15.50146

8 2 3 1 15.50146

9 2 4 0 15.50146

10 2 5 0 15.50146

11 3 1 0 12.33415

12 3 2 1 12.33415

13 3 3 0 12.33415

14 3 4 0 12.33415

15 3 5 1 12.33415

16 4 1 1 10.19110

17 4 2 1 10.19110

18 4 3 0 10.19110

19 4 4 1 10.19110

20 4 5 0 10.19110

21 5 1 1 13.77786

22 5 2 1 13.77786

23 5 3 0 13.77786

24 5 4 1 13.77786

25 5 5 1 13.77786

>

> #step 2 merge this data set with site by common factor (site)

> combined<-merge(combined,site)

> combined

site day presence temp cover

1 1 1 0 19.14794 49.05201

2 1 2 1 15.50146 49.05201

3 1 3 0 12.33415 49.05201

4 1 4 1 10.19110 49.05201

5 1 5 1 13.77786 49.05201

6 2 1 0 19.14794 44.43004

7 2 2 1 15.50146 44.43004

8 2 3 1 12.33415 44.43004

9 2 4 1 10.19110 44.43004

10 2 5 1 13.77786 44.43004

11 3 1 1 19.14794 54.62078

12 3 2 1 15.50146 54.62078

13 3 3 0 12.33415 54.62078

14 3 4 0 10.19110 54.62078

15 3 5 0 13.77786 54.62078

16 4 1 1 19.14794 49.47311

17 4 2 0 15.50146 49.47311

18 4 3 0 12.33415 49.47311

19 4 4 1 10.19110 49.47311

20 4 5 1 13.77786 49.47311

21 5 1 1 19.14794 45.74449

22 5 2 0 15.50146 45.74449

23 5 3 1 12.33415 45.74449

24 5 4 0 10.19110 45.74449

25 5 5 1 13.77786 45.74449

>

The end product is a combined data structure that has all 3 elements.

# Note: in the above examples it was obvious that only one column (day or site) was in common with the 2 datasets, so

# R knows to merge by it. In more complicated cases you have to specify by=day etc

merge(occupancy,weather,by="day")

#does the same thing as

merge(occupancy,weather)