We have seen some examples of how to read data into R and create a data frame, our basic workhorse for manipulating and analyzing data. Here we are going to cover some slightly more advanced topics in data management to handle situations that I guarantee you are going to encounter over and over again in your work. These include:
Sorting on one or more variables in a data set
Selecting a subset of data based on a range of data values or keeping only a subset of the variables (fields)
Summarizing data to create frequency tables and lists, means, sums, and other statistics
Merging data sets that have overlapping fields (variables)
We'll go through these one by one, but first let's review some basics of getting the data into a data frame. You will need:
the R- code to read the data and perform analyses (you need to set your working directory)
the example occupancy, weather, and site data (put these in your working directory)
I'll illustrate with 3 simple data sets that we will read in and manipulate. There are 1) an occupancy data set with presence/absence detections on 5 sites over 5 days , 2) a weather data set with temperature readings each day, and 3) a list of measurements of habitat (cover) on each site. Notice that the data sets are different sizes, reflecting that presence can change over sites and days, temperature over days but not sites (in this example), and cover over sites but not over days.
First read the data in
> #SET THE WORKING DIRECTORY
> #set up your working directory. Here's mine:
> data.dir<-"C:/Users/mike/Dropbox/teaching/FANR4460/spring2014"
> setwd(data.dir)
> #READ IN OCCUPANCY WEATHER AND SITE DATA
> occupancy<-read.csv(file="occupancy.csv")
> weather<-read.csv(file="weather.csv")
> site<-read.csv(file="site_variables.csv")
> #DISPLAY THE 3 DATA FRAMES
> occupancy
day site presence
1 1 1 0
2 1 2 0
3 1 3 1
4 1 4 1
5 1 5 1
6 2 1 1
7 2 2 1
8 2 3 1
9 2 4 0
10 2 5 0
11 3 1 0
12 3 2 1
13 3 3 0
14 3 4 0
15 3 5 1
16 4 1 1
17 4 2 1
18 4 3 0
19 4 4 1
20 4 5 0
21 5 1 1
22 5 2 1
23 5 3 0
24 5 4 1
25 5 5 1
> weather
day temp
1 1 19.14794
2 2 15.50146
3 3 12.33415
4 4 10.19110
5 5 13.77786
> site
site cover
1 1 49.05201
2 2 44.43004
3 3 54.62078
4 4 49.47311
5 5 45.74449
>
It's usually a good idea to perform some simple summaries (for one thing this will diagnose wacky values, which are often errors in data entry)
> summary(occupancy)
day site presence
Min. :1 Min. :1 Min. :0.0
1st Qu.:2 1st Qu.:2 1st Qu.:0.0
Median :3 Median :3 Median :1.0
Mean :3 Mean :3 Mean :0.6
3rd Qu.:4 3rd Qu.:4 3rd Qu.:1.0
Max. :5 Max. :5 Max. :1.0
> summary(weather)
day temp
Min. :1 Min. :10.19
1st Qu.:2 1st Qu.:12.33
Median :3 Median :13.78
Mean :3 Mean :14.19
3rd Qu.:4 3rd Qu.:15.50
Max. :5 Max. :19.15
> summary(site)
site cover
Min. :1 Min. :44.43
1st Qu.:2 1st Qu.:45.74
Median :3 Median :49.05
Mean :3 Mean :48.66
3rd Qu.:4 3rd Qu.:49.47
Max. :5 Max. :54.62
Some R functions will require us to access the variable names in a data frame directly. This can be done with 'attach' but I have found that this can lead to confusioo and advise against it. One method is to use the data frame name followed by $ and the variable name. For example,
>occupancy$day
will provide a list of all the "day" observations in the data set. Another very useful method is the 'with' command as illustrated below:
> #I PREFER TO USE "WITH" TO DESIGNATE THAT I AM USING A PARTICULAR DATA SET THAN 'INCLUDE' THE INCLUDE THE VARIABLES AS LISTS. I FIND THIS LESS CONFUSING OVERALL
>
> #SORTING ON A VARIABLE
> with(occupancy,occupancy[order(presence),])
day site presence
1 1 1 0
2 1 2 0
9 2 4 0
10 2 5 0
11 3 1 0
13 3 3 0
14 3 4 0
18 4 3 0
20 4 5 0
23 5 3 0
3 1 3 1
4 1 4 1
5 1 5 1
6 2 1 1
7 2 2 1
8 2 3 1
12 3 2 1
15 3 5 1
16 4 1 1
17 4 2 1
19 4 4 1
21 5 1 1
22 5 2 1
24 5 4 1
25 5 5 1
>
> #2 VARIABLES
> with(occupancy,occupancy[order(site,presence),])
day site presence
1 1 1 0
11 3 1 0
6 2 1 1
16 4 1 1
21 5 1 1
2 1 2 0
7 2 2 1
12 3 2 1
17 4 2 1
22 5 2 1
13 3 3 0
18 4 3 0
23 5 3 0
3 1 3 1
8 2 3 1
9 2 4 0
14 3 4 0
4 1 4 1
19 4 4 1
24 5 4 1
10 2 5 0
20 4 5 0
5 1 5 1
15 3 5 1
25 5 5 1
> #if you want, use the sorted data to create a new dataframe, or replace the old one:
> occupancy2<-with(occupancy,occupancy[order(presence),])
>
Creating subsets of the data
A very common need is to be able to choose portions of the data known as subsets. These can be based on ranges of variables in the data, by selection of only certain variable columns, or both. Some examples follow
> #CREATE A SUBSET OF THE DATA
> #select a range of a variable
> subset(weather,temp<15)
day temp
3 3 12.33415
4 4 10.19110
5 5 13.77786
> #select only certain columns
> subset(occupancy,select=c(day,presence))
day presence
1 1 0
2 1 0
3 1 1
4 1 1
5 1 1
6 2 1
7 2 1
8 2 1
9 2 0
10 2 0
11 3 0
12 3 1
13 3 0
14 3 0
15 3 1
16 4 1
17 4 1
18 4 0
19 4 1
20 4 0
21 5 1
22 5 1
23 5 0
24 5 1
25 5 1
> #combine variable and field selection
> subset(occupancy,presence>0,select=c(day,presence))
day presence
3 1 1
4 1 1
5 1 1
6 2 1
7 2 1
8 2 1
12 3 1
15 3 1
16 4 1
17 4 1
19 4 1
21 5 1
22 5 1
24 5 1
25 5 1
Again, you can use any of these to create a new or replace the old data frame, e.g.,
> occupancy.new<-subset(occupancy,presence>0,select=c(day,presence))
>
Summarizing the data
R has a number useful functions for summarizing data, and these can often be combined with other data functions for more utility.
The table() function provides a frequency table of the number of observations that fall into 1 or more categories. To illustrate
>
> #SUMMARIZING DATA
> #frequency table by attributes
> with(occupancy,table(presence))
presence
0 1
10 15
provides a table of frequencies of presence and absence for the 25 observations (10 absent, 15 present). We can get a table of presence/ absence by day as follows
> #2factors
> with(occupancy,table(presence,day))
day
presence 1 2 3 4 5
0 2 2 3 2 1
1 3 3 2 3 4
Finally, we can use this table as a data frame
> #use this as a data frame
> occup_sum<-as.data.frame(with(occupancy,table(presence,day)))
> occup_sum
presence day Freq
1 0 1 2
2 1 1 3
3 0 2 2
4 1 2 3
5 0 3 3
6 1 3 2
7 0 4 2
8 1 4 3
9 0 5 1
10 1 5 4
This can be handy, because we might later want to analyze the data in this summary fashion, or combine it with other data (see below).
Joining and merging data sets
Very often you will need to join different data sets together. This can be easy in some simple cases, or a bit more difficult in more complicated ones. The simple cases involve joining together different data sets that are already lines up by having either the same number of rows (but different columns) or the same number (and importantly, variable names) of columns. We join these by the cbind() and rbind() commands, as illustrated.
> #JOINT DATA SETS TOGETHER
> #COLUMN BINDING
> #data sets A and B have same number of rows but represent different variables
> #example
> A<-subset(weather,select=day)
> B<-subset(weather,select=temp)
> A
day
1 1
2 2
3 3
4 4
5 5
> B
temp
1 19.14794
2 15.50146
3 12.33415
4 10.19110
5 13.77786
> #put them back together
> cbind(A,B)
day temp
1 1 19.14794
2 2 15.50146
3 3 12.33415
4 4 10.19110
5 5 13.77786
> #ROW BINDING
> #data sets A and B have variables (and number of columns) and we are basically pasted them together
> #example
> A<-subset(weather,day<3)
> B<-subset(weather,day>2)
> A
day temp
1 1 19.14794
2 2 15.50146
> B
day temp
3 3 12.33415
4 4 10.19110
5 5 13.77786
> #put them back together
> rbind(A,B)
day temp
1 1 19.14794
2 2 15.50146
3 3 12.33415
4 4 10.19110
5 5 13.77786
Merging data sets by common variables
Many data joining problems are not this simple, and in fact our small data examples for occupancy, weather, and site variables are a case in point.
Here are the data again:
> #example- merge the occupancy, weather , and site data. Requires 2 steps
> occupancy
day site presence
1 1 1 0
2 1 2 0
3 1 3 1
4 1 4 1
5 1 5 1
6 2 1 1
7 2 2 1
8 2 3 1
9 2 4 0
10 2 5 0
11 3 1 0
12 3 2 1
13 3 3 0
14 3 4 0
15 3 5 1
16 4 1 1
17 4 2 1
18 4 3 0
19 4 4 1
20 4 5 0
21 5 1 1
22 5 2 1
23 5 3 0
24 5 4 1
25 5 5 1
> weather
day temp
1 1 19.14794
2 2 15.50146
3 3 12.33415
4 4 10.19110
5 5 13.77786
> site
site cover
1 1 49.05201
2 2 44.43004
3 3 54.62078
4 4 49.47311
5 5 45.74449
>
Note that nothing really lines up for these data sets. The occupancy data has 25 rows and 3 columns, the site and weather have 5 rows and 2 columns but the rows and columns don't mean the same thing (e.g., rows are days in one and sites in the other). But also notice that all 3 share at least 1 variable: day or site. We can use this fact to merge the data.
> #step 1 merge occupancy and weather by common factor (day)
> combined<-merge(occupancy,weather)
> combined
day site presence temp
1 1 1 0 19.14794
2 1 2 0 19.14794
3 1 3 1 19.14794
4 1 4 1 19.14794
5 1 5 1 19.14794
6 2 1 1 15.50146
7 2 2 1 15.50146
8 2 3 1 15.50146
9 2 4 0 15.50146
10 2 5 0 15.50146
11 3 1 0 12.33415
12 3 2 1 12.33415
13 3 3 0 12.33415
14 3 4 0 12.33415
15 3 5 1 12.33415
16 4 1 1 10.19110
17 4 2 1 10.19110
18 4 3 0 10.19110
19 4 4 1 10.19110
20 4 5 0 10.19110
21 5 1 1 13.77786
22 5 2 1 13.77786
23 5 3 0 13.77786
24 5 4 1 13.77786
25 5 5 1 13.77786
>
> #step 2 merge this data set with site by common factor (site)
> combined<-merge(combined,site)
> combined
site day presence temp cover
1 1 1 0 19.14794 49.05201
2 1 2 1 15.50146 49.05201
3 1 3 0 12.33415 49.05201
4 1 4 1 10.19110 49.05201
5 1 5 1 13.77786 49.05201
6 2 1 0 19.14794 44.43004
7 2 2 1 15.50146 44.43004
8 2 3 1 12.33415 44.43004
9 2 4 1 10.19110 44.43004
10 2 5 1 13.77786 44.43004
11 3 1 1 19.14794 54.62078
12 3 2 1 15.50146 54.62078
13 3 3 0 12.33415 54.62078
14 3 4 0 10.19110 54.62078
15 3 5 0 13.77786 54.62078
16 4 1 1 19.14794 49.47311
17 4 2 0 15.50146 49.47311
18 4 3 0 12.33415 49.47311
19 4 4 1 10.19110 49.47311
20 4 5 1 13.77786 49.47311
21 5 1 1 19.14794 45.74449
22 5 2 0 15.50146 45.74449
23 5 3 1 12.33415 45.74449
24 5 4 0 10.19110 45.74449
25 5 5 1 13.77786 45.74449
>
The end product is a combined data structure that has all 3 elements.
# Note: in the above examples it was obvious that only one column (day or site) was in common with the 2 datasets, so
# R knows to merge by it. In more complicated cases you have to specify by=day etc
merge(occupancy,weather,by="day")
#does the same thing as
merge(occupancy,weather)