Now that we have learned the basics about data types, data structures, and the meaning of data merging, we are going to introduce advanced data manipulations commonly found in natural resource management. In this chapter, we are going to learn different strategies for summarizing data. But first, we need to introduce some basics about storing data in a relational database.
There are several strategies related to data storage when managing natural resource information. To illustrate the different types of data, we can picture a continuum that goes from very basic information levels to more advanced, data-rich frameworks.
At the very bottom of the data organization ladder is observational data. It is characterized by multiple observations about a phenomenon. For example, we could have heights and diameters measured in a particular forest. We can assume these observations to be independent and identically distributed (iid), but depending on the level of grouping we would like to make, we could observe data that is correlated within a region and uncorrelated between regions. Other examples are soil traits like particle size distribution, organic matter content, bulk density, and so on. We could also have climatic observations for one particular day in different parts of the world: radiation, temperature, rainfall, wind speed. When dealing with insect populations, we could have counts of insects infesting a given tree, with a male and female proportion, traits related to those trees, and some environmental traits at each location. In wildlife, we could have counts of certain species at different locations, numbers of males and females, and so on. These are all observational data. We can only hope to find correlations between different variables in our observational data, and sometimes, if we are lucky, we can test a correlation model based on some kind of rule (population dynamics, allometry, biophysics). This type of data is typically analyzed through linear or nonlinear regression. When dealing with multiple variables, an analysis like factor analysis, principal components, or any data reduction algorithm will give insight into the structure of the variation within the data. Finally, a more contemporary way to analyze vast amounts of observational data, without worrying about their structure, is to use machine learning algorithms like k-nearest neighbors or artificial neural networks.
A second level of information is observational data carried over time. In this case, we have repeated measurements over some period of hours, days, months, or years. Our focus will be on looking for trend changes. Typical examples involve forest inventories over time (permanent sample plots), weather trends, species counts over time, or bird surveys. This type of information provides insight into the processes behind the phenomenon under study. In such cases, we can describe trends over time, change over time, and change as related to some other variable. This type of data provides the basis for population dynamics modeling. Analysis methods involve systems of differential equations, difference equations, algebraic equations, or any method that helps describe the population over the range of the repeated measurements.
A third level of information is experimental data. In this case, we are interested in testing hypotheses beyond simple correlation. With experimental data, we can test for cause-effect relations. We are able to do so by carefully controlling our sources of variation to make meaningful comparisons between our treatments and a control. Here we can find any simple experiment with treatments applied over random samples, replicated a certain number of times to ensure enough power for the experiment.
This type of data is very common in forest experiments. We call it longitudinal because we measure a set of randomly assigned treatments at more than one point in time. Methods like repeated-measures ANOVAs are the common way to determine the effect of treatments over time. We could also analyze longitudinal experimental data using a modeling framework that involves change over time, given treatments. In that case, we would be trying to summarize our experimental findings into useful tools, allowing us to extrapolate our experimental results to different areas.
We are going to start by applying some basic analysis types to observational data. In this case, our data set (Eucalyptus_Inventory.csv) is an Eucalyptus globulus inventory. The data set is comprised of four plots, with measurements of tree diameter, height, and tree volume for each plot. Our goal is to summarize this data, creating an estimate of volume per acre, determining diameter-height relations, and fitting a simple volume equation to be applied under other circumstances. We use the names function to determine column names in the database, str to determine the type of data stored in the database, and head to display the first 6 rows in the database.
my_inventory <- read.csv("./DATA/Eucalyptus_Inventory.csv")
names(my_inventory)
## [1] "SITE" "STAND" "VOL" "TREENUMBER" "DAC" ## [6] "DBH" "HEIGHT" "PLOTSIZE" "VIGOR"
str(my_inventory)
## 'data.frame': 400 obs. of 9 variables:
##  $ SITE      : Factor w/ 1 level "Log Hills": 1 1 1 1 1 1 1 1 1 1 ...
##  $ STAND     : int 1 1 1 1 1 1 1 1 1 1 ...
##  $ VOL       : num NA 0.0991 0.086 0.084 0.0692 ...
##  $ TREENUMBER: int 1 2 3 4 5 6 7 8 9 10 ...
##  $ DAC       : num NA 17.7 17 16.7 15.3 ...
##  $ DBH       : num NA 14.5 13.9 13.7 12.7 16.4 13.9 NA 12.3 13.1 ...
##  $ HEIGHT    : num NA 18 17 17.1 16.4 18.8 17.4 NA NA 17 ...
##  $ PLOTSIZE  : int 400 400 400 400 400 400 400 400 400 400 ...
##  $ VIGOR     : int 11 NA NA NA 8 NA NA 1 3 13 ...
head(my_inventory)
##        SITE STAND        VOL TREENUMBER      DAC  DBH HEIGHT PLOTSIZE
## 1 Log Hills     1         NA          1       NA   NA     NA      400
## 2 Log Hills     1 0.09907789          2 17.73660 14.5   18.0      400
## 3 Log Hills     1 0.08598977          3 16.95682 13.9   17.0      400
## 4 Log Hills     1 0.08402442          4 16.66434 13.7   17.1      400
## 5 Log Hills     1 0.06924996          5 15.31717 12.7   16.4      400
## 6 Log Hills     1 0.13237739          6 20.46435 16.4   18.8      400
##   VIGOR
## 1    11
## 2    NA
## 3    NA
## 4    NA
## 5     8
## 6    NA
We are going to start inspecting some of the relations in this database using graphical displays. There are two very popular packages for this: lattice and ggplot2. We are going to use the latter. The ggplot function builds graphics in a step-by-step procedure. First, you call the function providing the data source argument and an aesthetics (aes) argument, specifying what should be on your x and y coordinates.
library(ggplot2)
ggplot(data = my_inventory, aes(x = DBH, y = HEIGHT)) + geom_point()
## Warning: Removed 38 rows containing missing values (geom_point).
We could have assigned the result of the ggplot call to a new object. The following yields the same result:
my_plot <- ggplot(data = my_inventory, aes(x = DBH, y = HEIGHT))
my_plot + geom_point()
## Warning: Removed 38 rows containing missing values (geom_point).
This is a somewhat more complex formulation than the simple plot function; however, the added functionality will become evident soon. The first argument of the function refers to the data frame, then we specify what the x and y variables should be. The + sign is important: it tells ggplot there is more to come. Finally, geom_point() indicates that the points should be drawn on the graph. Notice the scales are not labeled with units, nor can we identify what the different series are. Therefore, we are going to try this again, adding colors to the symbols.
ggplot(data = my_inventory, aes(x=DBH, y = HEIGHT, color = STAND)) + geom_point()
## Warning: Removed 38 rows containing missing values (geom_point).
Now we can see different colors for each one of the four stands. However, the scale is somewhat odd. This is because the STAND variable is an ordinal class but is being interpreted as a numeric variable. If we turn it into a factor, we will see the difference. We do that by telling R to interpret the STAND field as a factor using the as.factor function.
my_inventory$STAND <- as.factor(my_inventory$STAND)
ggplot(data = my_inventory, aes(x = DBH, y = HEIGHT, color = STAND)) + geom_point()
## Warning: Removed 38 rows containing missing values (geom_point).
It is tough to see all stands this way. We might as well arrange every stand in a different plot. So instead of assigning a color to each stand, we will arrange the stands side by side, using the STAND number as the header for each panel. In order to do so, we delete the color argument inside aes, add a + sign to tell R that there is more to come after geom_point(), and add the facet_grid function call. Inside facet_grid we specify the variable that divides the data set (STAND in this case), preceded by a ~ sign.
ggplot(data= my_inventory, aes(x=DBH, y = HEIGHT))+ geom_point() + facet_grid(~STAND)
## Warning: Removed 38 rows containing missing values (geom_point).
We can select the observations we want by subsetting the database for a given trait. The subset function helps us filter the data set to keep only the records we want. For example, let's say we want to work only with those records whose VIGOR observation is different from 1 (dead in this database). We can inspect the number of records for each VIGOR value by creating a table with the table function. Next, we create a new data frame with only the records we want. The syntax is as follows:
table(my_inventory$VIGOR)
##
##  1  2  3  5  7  8  9 11 12 13 15 16
## 27  4  8  3  6 16  7  2  1 41 11  3
my_inventory2 <- subset(my_inventory, VIGOR!=1)
You can see the table function produces a count for each value in the VIGOR field. Next, we created the new data frame (my_inventory2), and we plot it as we did before; in that case, notice that we replace the data argument from my_inventory to my_inventory2. The subset function takes two arguments: the data frame and a Boolean comparison. In our case, we are selecting only those records whose VIGOR is different from 1. The following table gives a list of comparison operators. Some of these conditions can be concatenated using or (|) and and (&). For example, all the trees whose VIGOR value is equal to 1 or equal to 3 would give the expression VIGOR == 1 | VIGOR == 3. We can also specify more than one field: all trees whose HEIGHT is smaller than 15 and whose VIGOR is different from 13 will be HEIGHT < 15 & VIGOR != 13. Remember that the equal sign has to be typed twice (==) for a comparison; a single = is an assignment, not a test.

Subsetting with the subset function is one way to select portions of the data frame. Another way is indexing. With indexing, we specify the indices of the components of the data frame we want to select. Here we can apply the square bracket notation that we learned when managing matrices. The first argument inside the square brackets indicates the index or list of indices for the selected rows. The second argument, after the comma, indicates the column number or numbers. For example, a subset with the first six records and the first three columns is written as
my_inventory[1:6,1:3]
##        SITE STAND        VOL
## 1 Log Hills     1         NA
## 2 Log Hills     1 0.09907789
## 3 Log Hills     1 0.08598977
## 4 Log Hills     1 0.08402442
## 5 Log Hills     1 0.06924996
## 6 Log Hills     1 0.13237739
This is equivalent to using my_inventory[c(1,2,3,4,5,6), c(1,2,3)]. Likewise, we could use the variable names to refer to certain columns: my_inventory[, c("DBH", "HEIGHT")].
Leaving the first argument empty will select all the records in this case. If we want to select using conditional arguments, we could use:
my_inventory[my_inventory$VIGOR != 1 & my_inventory$DBH<15, ]
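For comparison, a minimal sketch of the same selection written with the subset function (the name my_inventory4 is just illustrative); note that subset silently drops rows where the condition evaluates to NA, while bracket indexing returns them as rows of NAs.

# same selection of records written with subset()
my_inventory4 <- subset(my_inventory, VIGOR != 1 & DBH < 15)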
When we have very long expressions, there are two shortcuts. One is the attach / detach pair of functions. If we attach a data frame, we are able to access all its fields without having to specify the data frame name. Once we have finished working with that data frame, we use detach to close the environment.
attach(my_inventory)
mean(HEIGHT)
## [1] NA
sd(HEIGHT)
## [1] NA
detach(my_inventory)
Notice that both results are NA: HEIGHT contains missing values, so we would have to add na.rm = TRUE to obtain the statistics. The other way to refer to the environment is to use the with function. This is the one we might use for a short set of instructions (like conditional subsetting).
my_inventory3 <- with(my_inventory, my_inventory[VIGOR != 1 & DBH <15, ])
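with is also convenient for quick statistics on a single field; a minimal sketch (na.rm = TRUE is needed here because HEIGHT has missing values):

# mean and standard deviation of HEIGHT, skipping missing values
with(my_inventory, mean(HEIGHT, na.rm = TRUE))
with(my_inventory, sd(HEIGHT, na.rm = TRUE))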
Instead of using an inclusion index, we could also specify a negative subset or we could use the ! operator.
my_inventory3 <- my_inventory[-c(1:100), ]
my_inventory3 <- my_inventory[!my_inventory$VIGOR == 1, ]
ggplot(data= my_inventory2, aes(x=DBH, y = HEIGHT))+ geom_point()+ facet_grid(~STAND)
## Warning: Removed 11 rows containing missing values (geom_point).
Now each of the four stands appears in its own panel. We can also summarize this data in a graph using a box-and-whiskers plot for each STAND. Notice how we change the function from geom_point to geom_boxplot. The function requires us to specify a classifying variable on the x-axis.
ggplot(data= my_inventory, aes(x=STAND, y = HEIGHT))+ geom_boxplot()
## Warning: Removed 38 rows containing non-finite values (stat_boxplot).
The box-and-whiskers plot shows the median, the first and third quartiles, and the outliers present in those data sets. Now that we know how to store and manipulate our data, we would like to start summarizing our data frames. You can check some simple statistics by using the apply family of functions: sapply and lapply.
sapply(my_inventory, FUN = mean, na.rm = TRUE)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##         SITE        STAND          VOL   TREENUMBER          DAC
##           NA           NA   0.06656495  50.50000000  14.85326396
##          DBH       HEIGHT     PLOTSIZE        VIGOR
##  12.19019074  15.30718232 400.00000000   8.42635659
lapply(my_inventory, FUN = mean, na.rm = TRUE)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA

## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
## $SITE
## [1] NA
##
## $STAND
## [1] NA
##
## $VOL
## [1] 0.06656495
##
## $TREENUMBER
## [1] 50.5
##
## $DAC
## [1] 14.85326
##
## $DBH
## [1] 12.19019
##
## $HEIGHT
## [1] 15.30718
##
## $PLOTSIZE
## [1] 400
##
## $VIGOR
## [1] 8.426357
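The warnings appear because SITE and STAND are not numeric, so mean returns NA for them. A minimal sketch that avoids the warnings by keeping only the numeric columns (the helper name numeric_cols is just illustrative):

# logical vector flagging the numeric columns
numeric_cols <- sapply(my_inventory, is.numeric)
# column means for the numeric columns only
sapply(my_inventory[, numeric_cols], mean, na.rm = TRUE)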
Both produce summaries for all columns in a data frame. Notice that sapply returns a vector, and lapply returns a list. Other functions that summarize data are described in the following table. These functions, however, do not summarize by a grouping variable; to get a summary for each group in the long format we need the aggregate function:
aggregate(x = list(VOL = my_inventory$VOL), by = list(STAND = my_inventory$STAND), FUN = "mean", na.rm = TRUE)
##   STAND        VOL
## 1     1 0.07617657
## 2     2 0.06560707
## 3     3 0.06718123
## 4     4 0.05854375
The first argument tells aggregate the list of response variables to summarize (there can be more than one). Notice how we label the response variable as VOL. The second argument is the list of variables used to break our table apart; in this case we only have one, STAND. Finally, we tell the function we want to use to summarize the data: mean. We can specify further arguments after this, like na.rm = TRUE, to tell the function to skip NA values. Sometimes we will have other variables like location or region. We can summarize by any grouping factor just by adding it to the by argument, for example (a fuller sketch follows below):
...
by = list(LOC = my_inventory$LOCATION,
STAND = my_inventory$STAND),
...
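A complete version of that call, using the SITE field that actually exists in this inventory (it has a single level here, so the extra grouping is purely illustrative) and summarizing two responses at once:

# mean VOL and DBH for every SITE x STAND combination
aggregate(x = list(VOL = my_inventory$VOL, DBH = my_inventory$DBH),
          by = list(SITE = my_inventory$SITE, STAND = my_inventory$STAND),
          FUN = mean, na.rm = TRUE)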
There is an even more powerful way to summarize your data. Just as you would do with a pivot table in Microsoft Excel, we can summarize several fields at the same time or get different summary statistics for the same variable. There is a special package that needs to be loaded, one of the most popular R packages: plyr. Using the same table we defined above, we can summarize as follows:
library(plyr)
ddply(my_inventory, .(STAND), summarize,
      meanDBH = mean(DBH, na.rm = TRUE),
      maxDBH = max(DBH, na.rm = TRUE),
      sdDBH = sd(DBH, na.rm = TRUE),
      minHT = min(HEIGHT, na.rm = TRUE),
      maxHT = max(HEIGHT, na.rm = TRUE),
      sdHT = sd(HEIGHT, na.rm = TRUE))
##   STAND  meanDBH maxDBH    sdDBH minHT maxHT     sdHT
## 1     1 12.78161   17.8 2.644632   6.8  19.1 2.139888
## 2     2 11.79765   19.3 3.946547   6.2  19.1 3.805569
## 3     3 12.43814   18.2 2.790290   7.9  18.0 2.085369
## 4     4 11.76020   16.2 1.945746   5.8  18.3 1.683185
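The summary columns can be anything computable from each group; a minimal sketch that also counts the trees with a measured DBH per stand (the column names n and meanVOL are just illustrative):

# number of measured trees and mean volume by stand
ddply(my_inventory, .(STAND), summarize,
      n = sum(!is.na(DBH)),
      meanVOL = mean(VOL, na.rm = TRUE))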
The simplest longitudinal data example is the one presented in a previous lecture with the Mauna Loa CO2 data (my_CO2.ManuaLoa). In order to plot that data set with ggplot we proceed as in the previous section, but instead of adding geom_point we use geom_line, and we also introduce how to add axis labels with the xlab and ylab functions.
my_CO2.ManuaLoa <- read.table("./DATA/co2_mm_mlo.txt", skip = 73)
names(my_CO2.ManuaLoa) <- c("YEAR", "MONTH", "Year.month", "CO2",
                            "CO2.spl", "TREND", "N.Days")
plt_CO2 <- ggplot(my_CO2.ManuaLoa, aes(x = Year.month, y = CO2.spl))
plt_CO2 + geom_line() + xlab("Year") + ylab("CO2 concentration")
One type of analysis done with longitudinal data is describing the overall trend. In our case it is pretty obvious, but we are going to add the trend anyway, using the geom_smooth function.
plt_CO2 + geom_line() + geom_smooth() + xlab("Year") + ylab("CO2 concentration")
A more interesting relation would be to see in which years the rate of CO2 increase itself changes. For that, we need to summarize our data on a yearly basis. In our case, we will calculate the average CO2 concentration for each year. Next, we calculate the difference between measurements t+1 and t. We accomplish this using the diff function.
my_CO2.yearly <- aggregate(list(CO2.year = my_CO2.ManuaLoa$CO2.spl),
                           by = list(year = my_CO2.ManuaLoa$YEAR),
                           mean, na.rm = TRUE)
my_CO2.yearly$CO2.diff <- c(0, diff(my_CO2.yearly$CO2.year))
plt_CO2 <- ggplot(my_CO2.yearly, aes(x = year, y = CO2.diff))
plt_CO2 + geom_line() + geom_smooth() + xlab("Year") + ylab("CO2 concentration change rate")
Now we have uncovered an interesting feature of our data. We can see that between 1960 and the late 1970s the rate of CO2 change was increasing at a fairly steady rate; between 1980 and the late 1990s the rate of increase stayed almost flat; and between 2000 and 2016 we can see a steady increase in the rate of change again. By uncovering information about the rate of change, we can learn much more about our phenomenon. The blue line with gray bands is a smoothed value. Looking at the actual change (black line), we see some years with a decreasing rate of increase. In no single year does the increase rate fall below zero; therefore, we can conclude that in no single year has the CO2 concentration been reduced. We can also observe that in some years the rate of increase has been smaller than in others.

Sometimes we would like to explore what happens over several years in our series. One very powerful tool for exploring longitudinal series is the filter. The most basic type of filter is a moving-window average. This transforms (or filters) our series to produce a summary based on past, future, or current observations. A backward linear moving-window average with a window of $k$ elements is defined as $y_t = \frac{1}{k}\sum_{i=0}^{k-1} x_{t-i}$: when calculating successive values, new observations come into the sum and old ones drop out of the window. In R, we use the filter function to accomplish this.
filter(x = c(5,10,8,4,2), filter = c(.5,.5))
## Time Series:
## Start = 1
## End = 5
## Frequency = 1
## [1] 7.5 9.0 6.0 3.0  NA
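Note that with the default sides = 2, stats::filter centers the window, so with a two-term filter each output value averages the current and the following observation (hence the trailing NA above). A minimal sketch of a strictly backward moving average applied to the yearly CO2 changes computed earlier (the three-year window is just an illustrative choice):

# backward moving average: each value averages the current and the two previous observations
filter(x = my_CO2.yearly$CO2.diff, filter = rep(1/3, 3), sides = 1)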
The most basic type of experimental data is a trial with replicated data. In the following example, we have a forest trial with four treatments and five replicates. Our response variable is volume. Despite the simplicity of this experiment, R would have a hard time understanding this type of information if we keep one column per replicate. We have to turn a table like this into a longitudinal (long) arrangement. In such a table, instead of having replicates spread across the wide side, we have one field indicating the treatment, one indicating the replicate number, and one holding the response. This might sound like a waste of space (and it is!), but since R processes all the information in a sequential way, we have to specify the sequence we are going to analyze: Treatments, Replicates, and Response. There is a quick way to switch between one data form and the other: the reshape2 library with its melt and cast functions. Say we have created a data frame in the wide format named my_trial; to turn it into the long format we use the melt function.
library(reshape2)
my_trial <- data.frame(Treatment = c("T0", "T1", "T2", "T3"),
                       Rep1 = c(150, 155, 150, 170),
                       Rep2 = c(147, 159, 152, 175),
                       Rep3 = c(155, 159, 149, 180),
                       Rep4 = c(145, 162, 139, 182),
                       Rep5 = c(139, 157, 155, 185))
my_trial2 <- melt(data = my_trial, variable.name = "Replicate",
                  value.name = "Response")
## Using Treatment as id variables
my_trial2
##    Treatment Replicate Response
## 1         T0      Rep1      150
## 2         T1      Rep1      155
## 3         T2      Rep1      150
## 4         T3      Rep1      170
## 5         T0      Rep2      147
## 6         T1      Rep2      159
## 7         T2      Rep2      152
## 8         T3      Rep2      175
## 9         T0      Rep3      155
## 10        T1      Rep3      159
## 11        T2      Rep3      149
## 12        T3      Rep3      180
## 13        T0      Rep4      145
## 14        T1      Rep4      162
## 15        T2      Rep4      139
## 16        T3      Rep4      182
## 17        T0      Rep5      139
## 18        T1      Rep5      157
## 19        T2      Rep5      155
## 20        T3      Rep5      185
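The message "Using Treatment as id variables" appears because we did not say which column identifies the observations, so melt guessed Treatment (the only non-numeric column). A minimal sketch making that choice explicit, which produces the same my_trial2 and silences the message:

# declare Treatment explicitly as the id variable
my_trial2 <- melt(data = my_trial, id.vars = "Treatment",
                  variable.name = "Replicate", value.name = "Response")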
Notice that the original melt call used three arguments: the data frame, the name for the column that identifies where each value came from (variable.name), and the name for the column holding the response values (value.name). Now we are ready to analyze the data. However, we might like to perform calculations in a wide format. In that case, we can use the dcast function in the following way:
my_trial3 <- dcast(data = my_trial2, formula = Treatment~Replicate)
## Using Response as value column: use value.var to override.
my_trial3
##   Treatment Rep1 Rep2 Rep3 Rep4 Rep5
## 1        T0  150  147  155  145  139
## 2        T1  155  159  159  162  157
## 3        T2  150  152  149  139  155
## 4        T3  170  175  180  182  185
Notice what happens when you invert the formula:
my_trial4 <- dcast(data = my_trial2, formula = Replicate~Treatment)
## Using Response as value column: use value.var to override.
my_trial4
##   Replicate  T0  T1  T2  T3
## 1      Rep1 150 155 150 170
## 2      Rep2 147 159 152 175
## 3      Rep3 155 159 149 180
## 4      Rep4 145 162 139 182
## 5      Rep5 139 157 155 185
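In both casts, the message "Using Response as value column" tells us dcast guessed which column holds the values. A minimal sketch passing value.var explicitly, which is good practice when the data frame has several value columns:

# identical to my_trial4, but stating explicitly that Response holds the cell values
dcast(data = my_trial2, formula = Replicate ~ Treatment, value.var = "Response")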
Now we are going to run a very simple ANOVA on this type of data. The function is aov. This function takes a formula as its first argument.
my_anova <- aov(Response ~ Treatment, my_trial2)
To get summaries from this analysis we need to call the summary function again.
summary(my_anova)
##             Df Sum Sq Mean Sq F value   Pr(>F)
## Treatment    3 3068.6  1022.9   35.95 2.44e-07 ***
## Residuals   16  455.2    28.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We know this experiment is a randomized complete block design; therefore, we need to run the ANOVA in a way that reflects that. In that case, our call to aov will be:
my_anova.rcbd2 <- aov(Response ~ Treatment + Error(Replicate), my_trial2)
summary(my_anova.rcbd2)
##
## Error: Replicate
##           Df Sum Sq Mean Sq F value Pr(>F)
## Residuals  4   49.5   12.38
##
## Error: Within
##           Df Sum Sq Mean Sq F value   Pr(>F)
## Treatment  3 3068.6  1022.9   30.25 7.05e-06 ***
## Residuals 12  405.7    33.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice one thing: this type of ANOVA gives us a significant p-value as long as at least one treatment differs from the control. But how does R know which one is the control treatment? To find out, we can use the contrasts function. It takes a factor (a column of a data frame) and returns the contrast matrix currently assigned to it.
my_trial2$Treatment <- factor(my_trial2$Treatment)
contrasts(my_trial2$Treatment)
##    T1 T2 T3
## T0  0  0  0
## T1  1  0  0
## T2  0  1  0
## T3  0  0  1
You can see that every treatment is going to be compared against T0 using a 0-1 coding. This can be changed, for example by making the second level (T1) the baseline, as follows:
contrasts(my_trial2$Treatment) <- contr.treatment(4, base = 2)
contrasts(my_trial2$Treatment)
##    1 3 4
## T0 1 0 0
## T1 0 0 0
## T2 0 1 0
## T3 0 0 1
We might also use a different type of contrast, such as Helmert contrasts:
contrasts(my_trial2$Treatment) <- contr.helmert(4)
contrasts(my_trial2$Treatment)
##    [,1] [,2] [,3]
## T0   -1   -1   -1
## T1    1   -1   -1
## T2    0    2   -1
## T3    0    0    3
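To see how the contrast coding changes the reported coefficients (the overall ANOVA table stays the same), we can refit the one-way model and inspect the coefficient table; a minimal sketch (the object name my_anova2 is just illustrative):

# refit with the current (Helmert) contrasts and look at the coefficients
my_anova2 <- aov(Response ~ Treatment, my_trial2)
summary.lm(my_anova2)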