Now that we have learned the basics about data types, data structures, and the meaning of data merging, we are going to introduce advanced data manipulations commonly found in natural resource management. In this chapter, we are going to learn different strategies for summarizing data. But first, we need to introduce some basics about storing data in a relational database.
There are several strategies related to data storage when managing natural resource information. To illustrate the different types of data, we can picture a continuum that goes from very basic information levels to more advanced, data-rich frameworks.
At the very bottom of the data organization ladder is observational data. It is characterized by multiple observations about a phenomenon. For example, we could have heights and diameters measured in a particular forest. We can assume these observations to be independent and identically distributed (iid), but depending on the level of grouping we would like to make, we could observe data that is correlated within a region and uncorrelated between regions. Other examples are soil traits like particle size distribution, organic matter content, bulk density, and so on. We could also have climatic observations for one particular day in different parts of the world: radiation, temperature, rainfall, wind speed. When dealing with insect populations, we could have counts of insects infesting a given tree, with a male and female proportion, traits related to those trees, and some environmental traits at each location. In wildlife, we could have counts of certain species at different locations, numbers of males and females, and so on. These are all observational data. We can only hope to find correlations between different variables in our observational data, and sometimes, if we are lucky, we can test a correlation model based on some kind of rule (population dynamics, allometry, biophysics). This type of data is typically analyzed through linear or nonlinear regression. When dealing with multiple variables, an analysis like factor analysis, principal components, or any data reduction algorithm will give insight into the structure of the variation within the data. Finally, a more contemporary way to analyze vast amounts of observational data, without worrying about their structure, is to use machine learning algorithms like k-nearest neighbors or artificial neural networks.
A second level of information is observational data carried over time. In this case, we have repeated measurements over some period of hours, days, months, or years. Our focus will be on looking for trend changes. Typical examples involve forest inventories over time (permanent sample plots), weather trends, species counts over time, or bird surveys. This type of information provides insight into the processes behind the phenomenon under study. In such cases, we can describe trends over time, change over time, and change as related to some other variable. This type of data provides the basis for population dynamics modeling. Analysis methods involve systems of differential equations, difference equations, algebraic equations, or any method that helps describe the population over the range of the repeated measurements.
A third level of information is experimental data. In this case, we are interested in testing hypotheses beyond simple correlation. With experimental data, we can test for cause-effect relations. We are able to do so by carefully controlling our sources of variation to make meaningful comparisons between our treatments and a control. Here we can find any simple experiment with treatments applied over random samples, replicated a certain number of times to ensure enough power for the experiment.
This type of data is very common in forest experiments. We call it longitudinal because we measure a set of randomly assigned treatments at more than one point in time. Methods like repeated-measures ANOVAs are the common way to determine the effect of treatments over time. We could also analyze longitudinal experimental data using a modeling framework that involves change over time, given treatments. In that case, we would be trying to summarize our experimental findings into useful tools, allowing us to extrapolate our experimental results to different areas.
We are going to start by applying some basic analysis types to observational data. In this case, our data set (Eucalyptus_Inventory.csv) is an Eucalyptus globulus inventory. The data set is comprised of four plots, with measurements of tree diameter, height, and tree volume for each plot. Our goal is to summarize this data, creating an estimate of volume per acre, determining diameter-height relations, and fitting a simple volume equation to be applied under other circumstances. We use the names function to determine column names in the database, str to determine the type of data stored in the database, and head to display the first 6 rows in the database.
my_inventory <- read.csv("./DATA/Eucalyptus_Inventory.csv")
names(my_inventory)
## [1] "SITE" "STAND" "VOL" "TREENUMBER" "DAC" ## [6] "DBH" "HEIGHT" "PLOTSIZE" "VIGOR"
str(my_inventory)
## 'data.frame': 400 obs. of 9 variables:
##  $ SITE      : Factor w/ 1 level "Log Hills": 1 1 1 1 1 1 1 1 1 1 ...
##  $ STAND     : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ VOL       : num NA 0.0991 0.086 0.084 0.0692 ...
##  $ TREENUMBER: int 1 2 3 4 5 6 7 8 9 10 ...
##  $ DAC       : num NA 17.7 17 16.7 15.3 ...
##  $ DBH       : num NA 14.5 13.9 13.7 12.7 16.4 13.9 NA 12.3 13.1 ...
##  $ HEIGHT    : num NA 18 17 17.1 16.4 18.8 17.4 NA NA 17 ...
##  $ PLOTSIZE  : int 400 400 400 400 400 400 400 400 400 400 ...
##  $ VIGOR     : int 11 NA NA NA 8 NA NA 1 3 13 ...
head(my_inventory)
##        SITE STAND        VOL TREENUMBER      DAC  DBH HEIGHT PLOTSIZE
## 1 Log Hills     1         NA          1       NA   NA     NA      400
## 2 Log Hills     1 0.09907789          2 17.73660 14.5   18.0      400
## 3 Log Hills     1 0.08598977          3 16.95682 13.9   17.0      400
## 4 Log Hills     1 0.08402442          4 16.66434 13.7   17.1      400
## 5 Log Hills     1 0.06924996          5 15.31717 12.7   16.4      400
## 6 Log Hills     1 0.13237739          6 20.46435 16.4   18.8      400
##   VIGOR
## 1    11
## 2    NA
## 3    NA
## 4    NA
## 5     8
## 6    NA
We are going to start inspecting some of the relations in this database using graphical displays. There are two very popular packages for this: lattice and ggplot2. We are going to use the latter. The ggplot function builds graphics in a step-by-step procedure. First, you call the function providing the data source argument and an aesthetics (aes) argument, specifying what should be on your x and y coordinates.
library(ggplot2)
ggplot(data = my_inventory, aes(x = DBH, y = HEIGHT)) + geom_point()
## Warning: Removed 38 rows containing missing values (geom_point).
We could have assigned the result of the ggplot call to a new object. The following yields the same result:
my_plot <- ggplot(data = my_inventory, aes(x = DBH, y = HEIGHT))
my_plot + geom_point()
## Warning: Removed 38 rows containing missing values (geom_point).
This is a somewhat more complex formulation than the simple plot function; however, the added functionality will become evident soon. The first argument of the function refers to the data frame, then we specify what the x and y variables should be. The + sign is important: it tells ggplot there is more to come. Finally, geom_point() indicates that the points should be drawn on the graph. Notice the scales are not labeled with units, nor can we identify what the different series are. Therefore, we are going to try this again, adding colors to the symbols.
ggplot(data = my_inventory, aes(x=DBH, y = HEIGHT, color = STAND)) + geom_point()
## Warning: Removed 38 rows containing missing values (geom_point).
Now we can see different colors for each one of the four stands. However, the scale is somewhat odd. This is because the STAND variable is an ordinal class but is being interpreted as a numeric variable. If we turn it into a factor, we will see the difference. We do that by telling R to interpret the STAND field as a factor using the as.factor function.
my_inventory$STAND <- as.factor(my_inventory$STAND)
ggplot(data = my_inventory, aes(x = DBH, y = HEIGHT, color = STAND)) + geom_point()
## Warning: Removed 38 rows containing missing values (geom_point).
It is tough to see all stands this way. We might as well arrange every stand in a different plot. So instead of assigning a color to each stand, we will arrange the stands side by side, using the STAND number as the header for each panel. In order to do so, we delete the color argument inside aes, add a + sign to tell R that there is more to come after geom_point(), and add the facet_grid function call. Inside facet_grid we specify the variable that divides the data set (STAND in this case), preceded by a ~ sign.
ggplot(data= my_inventory, aes(x=DBH, y = HEIGHT))+ geom_point() + facet_grid(~STAND)
## Warning: Removed 38 rows containing missing values (geom_point).
We can select the observations we want by subsetting the database for a given trait. The subset function helps us filter the data set to keep only the records we want. For example, let's say we want to work only with those records whose VIGOR observation is different from 1 (dead in this database). We can inspect the number of records for each VIGOR value by creating a table with the table function. Next, we create a new data frame with only the records we want. The syntax is as follows:
table(my_inventory$VIGOR)
##
##  1  2  3  5  7  8  9 11 12 13 15 16
## 27  4  8  3  6 16  7  2  1 41 11  3
my_inventory2 <- subset(my_inventory, VIGOR!=1)
You can see the table function produces a count for each value in the VIGOR field. Next, we created the new data frame (my_inventory2), and we plot it as we did before; in that case, notice that we replace the data argument from my_inventory to my_inventory2. The subset function takes two arguments: the data frame and a Boolean comparison. In our case, we are selecting only those records whose VIGOR is different from 1. The following table gives a list of comparison operators. Some of these conditions can be concatenated using or (|) and and (&). For example, all the trees whose VIGOR value is equal to 1 or equal to 3 would give the expression VIGOR == 1 | VIGOR == 3. We can also specify more than one field: all trees whose HEIGHT is smaller than 15 and whose VIGOR is different from 13 will be HEIGHT < 15 & VIGOR != 13. Remember that the equal sign has to be typed twice (==) for a comparison; a single = is an assignment, not a test.

Subsetting with the subset function is one way to select portions of the data frame. Another way is indexing. With indexing, we specify the indices of the components of the data frame we want to select. Here we can apply the square bracket notation that we learned when managing matrices. The first argument inside the square brackets indicates the index or list of indices for the selected rows. The second argument, after the comma, indicates the column number or numbers. For example, a subset with the first six records and the first three columns is written as
my_inventory[1:6,1:3]
##        SITE STAND        VOL
## 1 Log Hills     1         NA
## 2 Log Hills     1 0.09907789
## 3 Log Hills     1 0.08598977
## 4 Log Hills     1 0.08402442
## 5 Log Hills     1 0.06924996
## 6 Log Hills     1 0.13237739
This is equivalent to using my_inventory[c(1,2,3,4,5,6), c(1,2,3)]. Likewise, we could use the variable names to refer to certain columns: my_inventory[, c("DBH", "HEIGHT")].
Leaving the first argument empty will select all the records in this case. If we want to select using conditional arguments, we could use:
my_inventory[my_inventory$VIGOR != 1 & my_inventory$DBH<15, ]
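For comparison, a minimal sketch of the same selection written with the subset function (the name my_inventory4 is just illustrative); note that subset silently drops rows where the condition evaluates to NA, while bracket indexing returns them as rows of NAs.

# same selection of records written with subset()
my_inventory4 <- subset(my_inventory, VIGOR != 1 & DBH < 15)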
When we have very long expressions, there are two shortcuts. One is the attach / detach pair of functions. If we attach a data frame, we are able to access all its fields without having to specify the data frame name. Once we have finished working with that data frame, we use detach to close the environment.
attach(my_inventory)
mean(HEIGHT)
## [1] NA
sd(HEIGHT)
## [1] NA
detach(my_inventory)
Notice that both results are NA: HEIGHT contains missing values, so we would have to add na.rm = TRUE to obtain the statistics. The other way to refer to the environment is to use the with function. This is the one we might use for a short set of instructions (like conditional subsetting).
my_inventory3 <- with(my_inventory, my_inventory[VIGOR != 1 & DBH <15, ])
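with is also convenient for quick statistics on a single field; a minimal sketch (na.rm = TRUE is needed here because HEIGHT has missing values):

# mean and standard deviation of HEIGHT, skipping missing values
with(my_inventory, mean(HEIGHT, na.rm = TRUE))
with(my_inventory, sd(HEIGHT, na.rm = TRUE))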
Instead of using an inclusion index, we could also specify a negative subset or we could use the ! operator.
my_inventory3 <- my_inventory[-c(1:100), ]
my_inventory3 <- my_inventory[!my_inventory$VIGOR == 1, ]
ggplot(data= my_inventory2, aes(x=DBH, y = HEIGHT))+ geom_point()+ facet_grid(~STAND)
## Warning: Removed 11 rows containing missing values (geom_point).
Now each of the four stands appears in its own panel. We can also summarize this data in a graph using a box-and-whiskers plot for each STAND. Notice how we change the function from geom_point to geom_boxplot. The function requires us to specify a classifying variable on the x-axis.
ggplot(data= my_inventory, aes(x=STAND, y = HEIGHT))+ geom_boxplot()
## Warning: Removed 38 rows containing non-finite values (stat_boxplot).
The box-and-whiskers plot shows the median, the first and third quartiles, and the outliers present in those data sets. Now that we know how to store and manipulate our data, we would like to start summarizing our data frames. You can check some simple statistics by using the apply family of functions: sapply and lapply.
sapply(my_inventory, FUN = mean, na.rm = TRUE)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##         SITE        STAND          VOL   TREENUMBER          DAC
##           NA           NA   0.06656495  50.50000000  14.85326396
##          DBH       HEIGHT     PLOTSIZE        VIGOR
##  12.19019074  15.30718232 400.00000000   8.42635659
lapply(my_inventory, FUN = mean, na.rm = TRUE)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## $SITE
## [1] NA
##
## $STAND
## [1] NA
##
## $VOL
## [1] 0.06656495
##
## $TREENUMBER
## [1] 50.5
##
## $DAC
## [1] 14.85326
##
## $DBH
## [1] 12.19019
##
## $HEIGHT
## [1] 15.30718
##
## $PLOTSIZE
## [1] 400
##
## $VIGOR
## [1] 8.426357
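The warnings appear because SITE and STAND are not numeric, so mean returns NA for them. A minimal sketch that avoids the warnings by keeping only the numeric columns (the helper name numeric_cols is just illustrative):

# logical vector flagging the numeric columns
numeric_cols <- sapply(my_inventory, is.numeric)
# column means for the numeric columns only
sapply(my_inventory[, numeric_cols], mean, na.rm = TRUE)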
Both produce summaries for all columns in a data frame. Notice that sapply returns a vector, and lapply returns a list. Other functions that summarize data are described in the following table. These functions, however, do not summarize by a grouping variable; to get a summary for each group in the long format we need the aggregate function:
aggregate(x = list(VOL = my_inventory$VOL), by = list(STAND = my_inventory$STAND), FUN = "mean", na.rm = TRUE)
##   STAND        VOL
## 1     1 0.07617657
## 2     2 0.06560707
## 3     3 0.06718123
## 4     4 0.05854375
The first argument tells aggregate the list of response variables to summarize (there can be more than one). Notice how we label the response variable as VOL. The second argument is the list of variables used to break our table apart; in this case we only have one, STAND. Finally, we tell the function we want to use to summarize the data: mean. We can specify further arguments after this, like na.rm = TRUE, to tell the function to skip NA values. Sometimes we will have other variables like location or region. We can summarize by any grouping factor just by adding it to the by argument, for example (a fuller sketch follows below):
...
by = list(LOC = my_inventory$LOCATION,
STAND = my_inventory$STAND),
...
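A complete version of that call, using the SITE field that actually exists in this inventory (it has a single level here, so the extra grouping is purely illustrative) and summarizing two responses at once:

# mean VOL and DBH for every SITE x STAND combination
aggregate(x = list(VOL = my_inventory$VOL, DBH = my_inventory$DBH),
          by = list(SITE = my_inventory$SITE, STAND = my_inventory$STAND),
          FUN = mean, na.rm = TRUE)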
There is an even more powerful way to summarize your data. Just as you would do with a pivot table in Microsoft Excel, we can summarize several fields at the same time or get different summary statistics for the same variable. There is a special package that needs to be loaded, one of the most popular R packages: plyr. Using the same table we defined above, we can summarize as follows:
library(plyr)
ddply(my_inventory, .(STAND), summarize,
      meanDBH = mean(DBH, na.rm = TRUE),
      maxDBH = max(DBH, na.rm = TRUE),
      sdDBH = sd(DBH, na.rm = TRUE),
      minHT = min(HEIGHT, na.rm = TRUE),
      maxHT = max(HEIGHT, na.rm = TRUE),
      sdHT = sd(HEIGHT, na.rm = TRUE))
##   STAND  meanDBH maxDBH    sdDBH minHT maxHT     sdHT
## 1     1 12.78161   17.8 2.644632   6.8  19.1 2.139888
## 2     2 11.79765   19.3 3.946547   6.2  19.1 3.805569
## 3     3 12.43814   18.2 2.790290   7.9  18.0 2.085369
## 4     4 11.76020   16.2 1.945746   5.8  18.3 1.683185
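The summary columns can be anything computable from each group; a minimal sketch that also counts the trees with a measured DBH per stand (the column names n and meanVOL are just illustrative):

# number of measured trees and mean volume by stand
ddply(my_inventory, .(STAND), summarize,
      n = sum(!is.na(DBH)),
      meanVOL = mean(VOL, na.rm = TRUE))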
The simplest longitudinal data example is the one presented in a previous lecture with the Mauna Loa CO2 data (my_CO2.ManuaLoa). In order to plot that data set with ggplot we proceed as in the previous section, but instead of adding geom_point we use geom_line, and we also introduce how to add axis labels with the xlab and ylab functions.
my_CO2.ManuaLoa <- read.table("./DATA/co2_mm_mlo.txt", skip = 73)
names(my_CO2.ManuaLoa) <- c("YEAR", "MONTH", "Year.month", "CO2",
                            "CO2.spl", "TREND", "N.Days")
plt_CO2 <- ggplot(my_CO2.ManuaLoa, aes(x = Year.month, y = CO2.spl))
plt_CO2 + geom_line() + xlab("Year") + ylab("CO2 concentration")
One type of analysis done with longitudinal data is describing the overall trend. In our case it is pretty obvious, but we are going to add the trend anyway, using the geom_smooth function.
plt_CO2 + geom_line() + geom_smooth() + xlab("Year") + ylab("CO2 concentration")
A more interesting relation would be to see in which years the rate of CO2 increase itself changes. For that, we need to summarize our data on a yearly basis. In our case, we will calculate the average CO2 concentration for each year. Next, we calculate the difference between measurements t+1 and t. We accomplish this using the diff function.
my_CO2.yearly <- aggregate(list(CO2.year = my_CO2.ManuaLoa$CO2.spl),
                           by = list(year = my_CO2.ManuaLoa$YEAR),
                           mean, na.rm = TRUE)
my_CO2.yearly$CO2.diff <- c(0, diff(my_CO2.yearly$CO2.year))
plt_CO2 <- ggplot(my_CO2.yearly, aes(x = year, y = CO2.diff))
plt_CO2 + geom_line() + geom_smooth() + xlab("Year") + ylab("CO2 concentration change rate")
Now we have uncovered an interesting feature of our data. We can see that between 1960 and the late 1970s the rate of CO2 change was increasing at a fairly steady rate; between 1980 and the late 1990s the rate of increase stayed almost flat; and between 2000 and 2016 we can see a steady increase in the rate of change again. By uncovering information about the rate of change, we can learn much more about our phenomenon. The blue line with gray bands is a smoothed value. Looking at the actual change (black line), we see some years with a decreasing rate of increase. In no single year does the increase rate fall below zero; therefore, we can conclude that in no single year has the CO2 concentration been reduced. We can also observe that in some years the rate of increase has been smaller than in others.

Sometimes we would like to explore what happens over several years in our series. One very powerful tool for exploring longitudinal series is the filter. The most basic type of filter is a moving-window average. This transforms (or filters) our series to produce a summary based on past, future, or current observations. A backward linear moving-window average with a window of $k$ elements is defined as $y_t = \frac{1}{k}\sum_{i=0}^{k-1} x_{t-i}$: when calculating successive values, new observations come into the sum and old ones drop out of the window. In R, we use the filter function to accomplish this.
filter(x = c(5,10,8,4,2), filter = c(.5,.5))
## Time Series:
## Start = 1
## End = 5
## Frequency = 1
## [1] 7.5 9.0 6.0 3.0  NA
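Note that with the default sides = 2, stats::filter centers the window, so with a two-term filter each output value averages the current and the following observation (hence the trailing NA above). A minimal sketch of a strictly backward moving average applied to the yearly CO2 changes computed earlier (the three-year window is just an illustrative choice):

# backward moving average: each value averages the current and the two previous observations
filter(x = my_CO2.yearly$CO2.diff, filter = rep(1/3, 3), sides = 1)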
The most basic type of experimental data is a trial with replicated data. In the following example, we have a forest trial with four treatments and five replicates. Our response variable is volume. Despite the simplicity of this experiment, R would have a hard time understanding this type of information if we keep one column per replicate. We have to turn a table like this into a longitudinal (long) arrangement. In such a table, instead of having replicates spread across the wide side, we have one field indicating the treatment, one indicating the replicate number, and one holding the response. This might sound like a waste of space (and it is!), but since R processes all the information in a sequential way, we have to specify the sequence we are going to analyze: Treatments, Replicates, and Response. There is a quick way to switch between one data form and the other: the reshape2 library with its melt and cast functions. Say we have created a data frame in the wide format named my_trial; to turn it into the long format we use the melt function.
library(reshape2)
my_trial <- data.frame(Treatment = c("T0", "T1", "T2", "T3"),
                       Rep1 = c(150, 155, 150, 170),
                       Rep2 = c(147, 159, 152, 175),
                       Rep3 = c(155, 159, 149, 180),
                       Rep4 = c(145, 162, 139, 182),
                       Rep5 = c(139, 157, 155, 185))
my_trial2 <- melt(data = my_trial, variable.name = "Replicate",
                  value.name = "Response")
## Using Treatment as id variables
my_trial2
##    Treatment Replicate Response
## 1         T0      Rep1      150
## 2         T1      Rep1      155
## 3         T2      Rep1      150
## 4         T3      Rep1      170
## 5         T0      Rep2      147
## 6         T1      Rep2      159
## 7         T2      Rep2      152
## 8         T3      Rep2      175
## 9         T0      Rep3      155
## 10        T1      Rep3      159
## 11        T2      Rep3      149
## 12        T3      Rep3      180
## 13        T0      Rep4      145
## 14        T1      Rep4      162
## 15        T2      Rep4      139
## 16        T3      Rep4      182
## 17        T0      Rep5      139
## 18        T1      Rep5      157
## 19        T2      Rep5      155
## 20        T3      Rep5      185
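The message "Using Treatment as id variables" appears because we did not say which column identifies the observations, so melt guessed Treatment (the only non-numeric column). A minimal sketch making that choice explicit, which produces the same my_trial2 and silences the message:

# declare Treatment explicitly as the id variable
my_trial2 <- melt(data = my_trial, id.vars = "Treatment",
                  variable.name = "Replicate", value.name = "Response")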
Notice that the original melt call used three arguments: the data frame, the name for the column that identifies where each value came from (variable.name), and the name for the column holding the response values (value.name). Now we are ready to analyze the data. However, we might like to perform calculations in a wide format. In that case, we can use the dcast function in the following way:
my_trial3 <- dcast(data = my_trial2, formula = Treatment~Replicate)
## Using Response as value column: use value.var to override.
my_trial3
##   Treatment Rep1 Rep2 Rep3 Rep4 Rep5
## 1        T0  150  147  155  145  139
## 2        T1  155  159  159  162  157
## 3        T2  150  152  149  139  155
## 4        T3  170  175  180  182  185
Notice what happens when you invert the formula:
my_trial4 <- dcast(data = my_trial2, formula = Replicate~Treatment)
## Using Response as value column: use value.var to override.
my_trial4
##   Replicate  T0  T1  T2  T3
## 1      Rep1 150 155 150 170
## 2      Rep2 147 159 152 175
## 3      Rep3 155 159 149 180
## 4      Rep4 145 162 139 182
## 5      Rep5 139 157 155 185
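In both casts, the message "Using Response as value column" tells us dcast guessed which column holds the values. A minimal sketch passing value.var explicitly, which is good practice when the data frame has several value columns:

# identical to my_trial4, but stating explicitly that Response holds the cell values
dcast(data = my_trial2, formula = Replicate ~ Treatment, value.var = "Response")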
Now we are going to run a very simple ANOVA on this type of data. The function is aov. This function takes a formula as its first argument.
my_anova <- aov(Response ~ Treatment, my_trial2)
To get summaries from this analysis we need to call the summary function again.
summary(my_anova)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## Treatment    3 3068.6  1022.9   35.95 2.44e-07 ***
## Residuals   16  455.2    28.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We know this experiment is a randomized complete block design; therefore, we need to run the ANOVA in a way that reflects that. In that case, our call to aov will be:
my_anova.rcbd2 <- aov(Response ~ Treatment + Error(Replicate), my_trial2)
summary(my_anova.rcbd2)
##
## Error: Replicate
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals  4   49.5   12.38
##
## Error: Within
##           Df Sum Sq Mean Sq F value   Pr(>F)
## Treatment  3 3068.6  1022.9   30.25 7.05e-06 ***
## Residuals 12  405.7    33.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice one thing: this type of ANOVA gives us a significant p-value as long as at least one treatment differs from the control. But how does R know which one is the control treatment? To find out, we can use the contrasts function. It takes a factor (a column of a data frame) and returns the contrast matrix currently assigned to it.
my_trial2$Treatment <- factor(my_trial2$Treatment)
contrasts(my_trial2$Treatment)
##    T1 T2 T3
## T0  0  0  0
## T1  1  0  0
## T2  0  1  0
## T3  0  0  1
You can see that every treatment is going to be compared against T0 using a 0-1 coding. This can be changed, for example by making the second level (T1) the baseline, as follows:
contrasts(my_trial2$Treatment) <- contr.treatment(4, base = 2)
contrasts(my_trial2$Treatment)
##    1 3 4
## T0 1 0 0
## T1 0 0 0
## T2 0 1 0
## T3 0 0 1
We might also use a different type of contrast, such as Helmert contrasts:
contrasts(my_trial2$Treatment) <- contr.helmert(4)
contrasts(my_trial2$Treatment)
##    [,1] [,2] [,3]
## T0   -1   -1   -1
## T1    1   -1   -1
## T2    0    2   -1
## T3    0    0    3
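To see how the contrast coding changes the reported coefficients (the overall ANOVA table stays the same), we can refit the one-way model and inspect the coefficient table; a minimal sketch (the object name my_anova2 is just illustrative):

# refit with the current (Helmert) contrasts and look at the coefficients
my_anova2 <- aov(Response ~ Treatment, my_trial2)
summary.lm(my_anova2)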