Week 9: Statistical analysis with R --- the "for the faint-of-heart" edition

Welcome to week 9! You should be quite familiar with R by now. Today we teach you to overcome one of the most frustrating aspects of conducting analyses with R: how to access the output from statistical analyses for post-processing and table creation. We illustrate the approaches to extracting and manipulating objects using linear models because they should be the most familiar techniques to most of you. But don't worry: you will still find this useful even if you haven't had linear regression yet. Here's a script with the code for all of today's lesson.

LINEAR MODELING WITH R

Before today's lesson, download the following 2 comma-separated text files:

Fish data.csv: contains lamprey collection data from 65 stream sections as part of a population survey. The file contains the following data:

Depth.m - depth of sample unit in meters

Velocity.ms - current velocity in m/s

Season - season sample was collected

Sample.area.m2 - sample unit area in meters squared

No.caught - the number of lamprey caught

Westslope.csv: contains presence and absence data from streams in watersheds within the Interior Columbia River Basin.

The file contains the following data:

PRESENCE - species presence (1) or absence (0)

WSHD - Watershed ID

SOIL_PROD - percent of watershed with productive soils

GRADIENT - gradient (%) of the stream

WIDTH - mean width of the stream in meters


Don’t forget to set your working directory before you import the two files!

# set working directory

setwd("C:/Users/Jim/Desktop") #set working directory

# read in fish data and create data frame fish

fish<- read.csv("Fish data.csv")

# what’s in the data frame

head(fish)

# always a good idea to do a quick summary of the data before analyses

summary(fish)

# always a good idea to calculate correlations between potential independent

# variables in a regression; here we use depth and velocity (columns 1 and 2)

cor(fish[,1:2])

We want to analyze fish density (lamprey per unit area), so we first transform the data and create a new variable.

## create density in fish data frame

fish$density.no.m2 <- with(fish,No.caught/Sample.area.m2)

We can conduct linear regression using several built-in R functions. The first we will use is lm(). The syntax for lm() is: the response variable, then the tilde (~), which means "as a function of", then the linear equation using the names of the independent variables, followed by the name of the data frame (supplied via the data argument). For example, let's fit a model to examine the relation between depth, current velocity, and lamprey density.

## model density as an additive function of depth and velocity

lm(density.no.m2~ Depth.m + Velocity.ms,data = fish)

We can create objects for all R statistical functions and conduct summaries on the objects. For example:

#create mod.out to contain the results of the linear regression

mod.out = lm(density.no.m2~ Depth.m + Velocity.ms,data = fish)

#summarize the output, notice the increased information in the output

summary(mod.out)

We also can fit the model and create the R object containing the output in a single step. Warning: this is one of the instances where = and <- are not treated the same.

## This will not work

summary(mod.out = lm(density.no.m2~ Depth.m + Velocity.ms,data = fish))

## this works

summary(mod.out <- lm(density.no.m2~ Depth.m + Velocity.ms,data = fish))

It is always good practice to query the objects created by R functions and see what is in them. The lm() function creates an lm object, which is basically a list.

# what is the object we created

class(mod.out)

## what is inside the object

str(mod.out)

The contents and format of the lm object should look familiar. It is very similar to an R list, and we can access its elements just as we did with lists. For example,

# to extract the model coefficients (parameter estimates)

mod.out$coefficients

## we can extract the coefficients using shorthand

mod.out$coef
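Since the lm object is list-like, the usual base R list tools also work on it. The lines below are just a quick sketch of standard list access, nothing specific to lm:

# list the names of the elements stored in the lm object

names(mod.out)

# double square brackets work just like with an ordinary list

mod.out[["coefficients"]]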

There also are built-in functions that extract the elements that we want. For example,

# extract model coefficients

coef(mod.out)

## create 95% confidence intervals of the coefficients (parameter estimates)

confint(mod.out, level = 0.95)

On occasion, it can be a pain to get standard errors from lm objects. One guaranteed way to do it is to extract the variance-covariance matrix of any linear model (lm, glm, and other functions) using vcov(), grab the diagonal of the matrix, and take the square root to get the standard errors.

# get standard errors for coefficients of model

sqrt(diag(vcov(mod.out)))

We can extract several other useful objects such as the residuals, the residual standard error, and predicted values.

# get residuals from mod.out

mod.out$residuals

# shorthand for residuals

mod.out$resid

# predicted (fitted) values of observations

mod.out$fitted.values

# shorthand for accessing predicted values

mod.out$fitted

We can use the data in the lm object to conduct some analyses. For example, we can conduct model diagnostics by examining residual plots.

# plot residual vs. predicted values

plot(mod.out$resid~mod.out$fitted)

# plot residuals vs. velocity; notice the two different data sources

plot(mod.out$resid~fish$Velocity.ms)

# plot residuals vs. depth; notice the two different data sources

plot(mod.out$resid~fish$Depth.m)

The last plot should have caught your attention. It showed a curved relationship between the residuals and depth. This suggests that we may need to include a quadratic (squared) term in the regression. There are two basic ways of including quadratic terms in R statistical functions. The first is to create a new variable in the data frame which is the variable squared. For example,

fish$Depth.m.sq <- fish$Depth.m^2

#and use the new variable in the regression model

summary(mod.out <- lm(density.no.m2~ Depth.m+ Depth.m.sq + Velocity.ms, data = fish))

A more efficient way is to use the I() function inside the lm() call. The I() function tells R that the operation inside the parentheses needs to be performed before the model is fit.

summary(mod.out <- lm(density.no.m2~ Depth.m+ I(Depth.m^2) + Velocity.ms, data = fish))

You can use I() in many other functions.

Now let’s examine our residuals to see if we’ve taken care of the problem.

plot(mod.out$resid~mod.out$fitted)

plot(mod.out$resid~fish$Depth.m)

# normal probability plot of residuals, should resemble a straight line if model fit is OK

qqnorm(mod.out$resid)

Many times we may want to examine interactions between variables in statistical analyses. There are two ways to specify these interactions: using an asterisk (*) or a colon (:).

# the asterisk indicates main effects and interactions between 2 or more variables

summary(mod.out <- lm(density.no.m2~ Depth.m*Velocity.ms, data = fish))

# a colon means include the interaction between 2 or more variables

summary(mod.out <- lm(density.no.m2~ Velocity.ms + Depth.m:Velocity.ms, data = fish))

Notice above that the model included the interaction between depth and velocity but not the main depth effect. This is because we used the colon to specify the interaction. We're not sure how to interpret the parameter in this instance, so make sure you know what you are doing—statistically speaking.

In the summary of the fish data frame, you should have noticed that season was a categorical predictor variable (a factor) that consisted of 3 levels: spring, summer, and fall. Most R statistical functions can handle factors (categorical variables) by automatically creating binary indicator variables in the design matrix.

WHAT the #$@%^&* did you just say there? If we have time we will explain; suffice it to say that R chooses a baseline category (factor level) by default, usually the first in alphabetical order. Run the following and examine the output.

summary(mod.out <- lm(density.no.m2~ Depth.m+ I(Depth.m^2) + Velocity.ms + Season, data = fish))

Fall was the first category alphabetically, so it was selected as the baseline in lm(). The parameter estimates for spring and summer indicate how much lamprey density differed between spring and fall and between summer and fall, respectively. There is a function for setting the order of the levels of a categorical variable, but we will not show it to you here. Rather, we suggest you create your own binary indicator variables using our friend the ifelse() function. Below we create a variable "Fall", which equals 1 if the season is fall and 0 if it is summer or spring.

# create binary indicator variable

fish$Fall = ifelse(fish$Season == "fall",1,0)

We can then use this variable in the model just like any other variable. Here the parameter estimate is an estimate of how much densities differed in the fall compared to spring and summer (model assumes spring and summer densities were not different).

summary(mod.out <- lm(density.no.m2~ Depth.m + I(Depth.m^2) + Velocity.ms + Fall, data = fish))
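If you are curious what R is doing behind the scenes with the Season factor, you can peek at the design matrix it builds. Below is a small illustrative sketch using the base R model.matrix() function (not something you need for the analysis): the baseline season is absorbed into the intercept, and the other seasons get 0/1 indicator columns.

# peek at the indicator (dummy) variables R creates for the Season factor

head(model.matrix(~ Season, data = fish))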

But what about other statistical tests? Do these tricks/functions work for those? Why yes. Let's try something different. t-tests are cherished, time-tested statistical tests for evaluating differences between groups. Here let's test whether fish density differs between fall and the other seasons. To invoke a t-test we use the function t.test():

## test for differences in fish density put results in t_tst.result

t_tst.result<-t.test(density.no.m2~Fall, data = fish)

t_tst.result

Welch Two Sample t-test

data:  density.no.m2 by Fall
t = -0.34238, df = 31.598, p-value = 0.7343
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.07015146  0.04997057
sample estimates:
mean in group 0 mean in group 1
      0.3571669       0.3672574

The default is a test that assumes unequal variances in the 2 groups. Here the difference is not significant and the confidence interval contains zero. How do we extract the values from the object t_tst.result? Maybe the str() function will work.

str(t_tst.result)

### use $ notation to access p value

t_tst.result$p.value

## use $ notation to get the confidence intervals

t_tst.result$conf.int

## kind of sloppy, let's just grab the first 2 elements

t_tst.result$conf.int[1:2]

There are krillions of functions and packages for conducting statistical analyses with R. For example, multiple functions can be used to fit linear models. glm() fits several types of generalized linear models, including the regular (normal) linear regression that we did using lm(). The syntax for glm() is similar to lm(). For example,

## now we try a different function glm that fits generalized linear models

summary(new.out <- glm(density.no.m2~ Depth.m + I(Depth.m^2) + Velocity.ms + Fall,data = fish))

## let’s see what's in the output list

str(new.out)

The glm() function creates a glm object that is very similar to the object created with lm(). Can you identify the differences? There are several. One difference is that glm() provides us with AIC (Akaike Information Criterion), which can be used to determine the best predicting model. The data we have are relatively few, so we should be using AIC with a small-sample bias adjustment, AICc. To calculate AICc, we need AIC, the number of parameters in the model (stats heads call this the model rank), and the number of observations in the data set. If we look at the contents of new.out, we should be able to find most of these items. For example,

## first save AIC

aic = new.out$aic

## AICc requires the number of observations in the data; we use length() to get it

n = length(new.out$fitted)

## we also need the number of parameters in the model

## this is termed the rank; we add 1 for the residual error

K = new.out$rank + 1

## here's how we calculate AICc

AIC.c <- aic + (2*K*(K+1))/(n-K-1)

## let’s put these in a vector

mod.sel = c(n,K,AIC.c)

# for grins, let's fit another model without Fall in it

new2.out <- glm(density.no.m2~ Depth.m + I(Depth.m^2) + Velocity.ms,data = fish)

## let's calculate AICc for this model too; notice that we repeat the same commands as above

aic = new2.out$aic

n = length(new2.out$fitted)

K = new2.out$rank + 1

AIC.c2 <- aic + (2*K*(K+1))/(n-K-1)

## combine the AICc statistics for both models

rbind(mod.sel,c(n,K,AIC.c2))

That was pretty neat (ok…we’re geeks). By now, you should be able to see that it would be possible to create a function that extracts the elements necessary and calculates AICc.
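For example, here is a minimal sketch of such a function. The name calc.AICc is our own invention, and it assumes the fitted object stores aic, fitted, and rank components the way the glm objects above do.

# sketch of a function that calculates AICc from a fitted glm object

calc.AICc <- function(model){
  aic <- model$aic              # AIC stored in the glm object
  n <- length(model$fitted)     # number of observations
  K <- model$rank + 1           # number of parameters plus 1 for the residual error
  aic + (2*K*(K+1))/(n-K-1)     # small-sample adjustment
}

# it should reproduce the values we calculated by hand

calc.AICc(new.out)

calc.AICc(new2.out)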

The glm() function also fits other types of generalized linear models, i.e., linear models with other types of statistical distributions. Regular linear regression, i.e., the type of regression performed by the lm() function, assumes a normal distribution, which is often referred to as gaussian after the Gaussian function (see background here). Here's how we would specify a normal distribution with glm(); it's unnecessary because gaussian is the default.

## specify distribution using family option in glm

summary(glm(density.no.m2~ Depth.m + I(Depth.m^2) + Velocity.ms,data = fish, family = gaussian))

This should produce results identical to those above. If we look back at the fish data frame, we see that the original response was No.caught, which consists of integers (whole numbers), e.g.,

head(fish)

We can model count (integer) data with GLMs using a Poisson statistical distribution. To do this we specify family = poisson in the glm() function.

## we specify this in glm using family = poisson

summary(pois.reg<-glm(No.caught ~ Depth.m + I(Depth.m^2) + Velocity.ms + Sample.area.m2,data = fish, family = poisson))

We can examine the object created to hold the modeling information using the str() function.

## what's in the output, nothing new

str(pois.reg)

There is nothing different in the object from the normal linear regression with the exception of family = poisson in the output. Just like normal linear regression, we can calculate AICc and examine the residuals using the $ syntax to access information in the object. For example, we can examine regression assumptions using a normal quantile-quantile (Q-Q) plot. If we meet model assumptions, the plot should look similar to a straight line.

## normal quantile plot of residuals

qqnorm(pois.reg$resid)
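As noted above, AICc works here too. Using the calc.AICc() sketch we wrote earlier (remember, that helper is our own invention, not a built-in function), we can get AICc for the Poisson model:

# AICc for the Poisson regression using the sketch function from above

calc.AICc(pois.reg)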

We can find out what other types of models glm() fits using the help() function.

help(glm)

The help file indicates that we can fit several different types of GLMs. Of the remaining types (not used yet), logistic regression (family = binomial) is among the most widely used in ecological applications.  To illustrate logistic regression, read in the westslope cutthroat trout presence/absence data.

# read in data create trout data frame

trout<-read.csv("Westslope.csv")

# examine contents of trout

head(trout)

The response variable (dependent variable) in the data is PRESENCE (1 = present, 0 = absent) which is a binary response variable. Therefore, we need to use logistic regression and specify family = binomial in the glm() function.

## fit logistic regression model and create object logist.reg

summary(logist.reg<-glm(PRESENCE ~ SOIL_PROD + GRADIENT + WIDTH,data = trout, family = binomial))

## what is in the object

str(logist.reg)

Similar to the objects created when conducting normal and Poisson linear regression, logist.reg contains familiar elements, such as parameter estimates, model rank, AIC, and residuals. We also access these using the $ syntax. For example, the westslope cutthroat trout data were collected from multiple streams within 56 watersheds. We have reason to believe that there may be dependence (autocorrelation) among streams within watersheds. To evaluate this, we could plot the residuals ordered by watershed number.

plot(logist.reg$resid~trout$WSHD)

The plot suggests that some watersheds have residuals that are much higher or lower than others. Let’s create a box plot to make this much clearer.

boxplot(logist.reg$resid~trout$WSHD)

These plots suggest strong spatial dependence among sites within watersheds. To account for the dependence, we could use hierarchical logistic regression. One useful package for fitting these models is lme4. Load the package.

require(lme4)

The lme4 package fits hierarchical (mixed-effects) generalized linear models using the glmer() function (there are other packages too). The syntax is similar to glm() with a notable addition for specifying randomly varying effects. The code below specifies a model similar to the one fit above with glm(), but (1|WSHD) indicates that the intercept varies among watersheds. This is often called a random effect. Copy and paste the code and submit it to R.

## fit logistic regression with random effect output to H.logit

summary(H.logit <-glmer(PRESENCE ~ SOIL_PROD + GRADIENT + WIDTH + (1|WSHD),data = trout, family = binomial))

This output differs from the glm() output. You should see new headings Random effects and Fixed effects and some familiar items, such as AIC. If time permits, we can discuss these outputs. If not, you can learn more in FW544 or in several very useful books. Now it’s time to see what is in H.logit. As above, we use the str() function to examine the contents.

str(H.logit)

Whoa, that was different! Did you notice the @? This is because glmer() creates an S4 class object; the objects above were S3 objects. There are lots of neat things that can be done with S4 objects, which is one of the reasons why many newer packages, such as marked and unmarked, create them. We won't be able to go into all of the features, so let's just figure out how to access the elements. We used $ to access elements of S3 objects; with S4 objects, slots are accessed with @ instead. Does $ still work? For example,

# to access the residuals??

H.logit$resid

Whoa, that doesn't work, but we can use familiar extractor functions to access these elements. For example,

# use fitted function to grab predicted values

fitted(H.logit)

#use resid function to grab residuals

resid(H.logit)
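If you want to poke at the S4 slots directly, the @ syntax does work, but you need the slot names, which you can list with slotNames(). Treat the sketch below as exploratory: the slot names depend on your version of lme4, so the extractor functions above are the safer route.

# list the S4 slots in the object

slotNames(H.logit)

# slots are accessed with @ instead of $ (e.g., the fixed-effect coefficients in recent lme4 versions)

H.logit@beta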

Whichever way you get at the contents, the extractor functions are handy inside other functions. Below we plot the residuals by watershed using a boxplot.

# boxplot of HLM residuals

boxplot(resid(H.logit)~trout$WSHD)

Did we take care of the spatial dependence? We can also create a Q-Q plot of the residuals.

#normal probability plot of residuals

qqnorm(resid(H.logit))

Similarly, we can extract the estimates of the fixed effects and their standard errors.

#extract fixed effect estimates

fixef(H.logit)

#extract fixed effect standard errors

sqrt(diag(vcov(H.logit)))

Hey, let’s create a data frame with fixed effect parameter estimates, standard errors, and 90% confidence intervals.

## grab estimates and SE and combine

parms<-cbind(fixef(H.logit),sqrt(diag(vcov(H.logit))))

## Give the columns the correct names

colnames(parms)<- c("Estimate","Std.Error")

# coerce the matrix into a data frame

parms <- as.data.frame(parms)

# calculate lower and upper 90% CLs using the critical value 1.64

parms$Lower = parms$Estimate - 1.64*parms$Std.Error

parms$Upper = parms$Estimate + 1.64*parms$Std.Error

#print it out

parms

Wasn't that fun?

This Week's Assignment

Due 1 week from today by 5pm Pacific. Download ground bird data.csv (description below). Using the data set, complete the following:

1) Run pairwise correlations between all independent variables (everything except Density)

2) Fit 2 linear regression models with at least 3 independent variables in each.

3) For each model (in #2):

a) plot residuals by predicted values

b) calculate AICc

c) create a data frame with parameter estimates and 90% confidence limits

d) Write each data frame (from c) to a comma delimited file

4) Create a new variable "Presence" in the ground bird data that has the value 0 if Density = 0 and 1 when Density > 0 (hint: think ifelse() function).

5) Fit 2 logistic regression models with the Presence data that have the same independent variables as in #2 above.

Please save all of the code that you used in a single script and submit the script in an email attachment.

The Data

The dataset ground bird data.csv contains data on the density of a federally threatened ground-nesting bird and landscape characteristics for a random selection of federal land management units in the Interior Columbia River Basin. We are interested in the relationship between landscape characteristics and ground bird density. The file contains the following variables:

Density - bird density (no. per ha)

Slope - the mean slope of the management unit

Elev.m – the mean elevation of the management unit in meters

Ave.temp - the average temperature of the management unit during the study period

Ave.ppt - the average precipitation of the management unit during the study period

Pct.graze - the percentage of the management unit that is grazed

Pct.grass - the percentage of the management unit that is comprised of grasslands

Road.dens - the density of roads in the management unit