Frequency tables and tests of independence
Analysis of frequency data starts with organizing the data into tables whose dimensions are levels (or combinations of levels) of the different factors (groups) under examination, and whose entries are the numbers of observations (frequencies) that occur in each group. Let's illustrate this with a simple example of feeding data for Red-headed Woodpeckers; the data are located here. The data consist of observations of birds over 2 years (Year = 2006 or 2007), categorized by the sex of the feeding bird (Female, Sex = "F", or Male, Sex = "M") and the type of food item consumed (Animal, Food = "A", or Plant, Food = "P").
Before building a table, we should convert Year from a numerical variable to a factor (the character variables Sex and Food are converted automatically when read.csv creates factors; note that in R 4.0 and later you may need to set stringsAsFactors=TRUE for this). The code to read and summarize the data is (also contained in a script file):
> birds<-read.csv("RHWO.csv")
> #make year a factor variable
> birds$Year<-factor(birds$Year,labels=c("2006","2007"))
> #summarize data
> summary(birds)
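With Year, Sex, and Food all treated as factors, the summary output should resemble the following (the counts here are derived by summing the frequency table shown below; exact column spacing may differ):

   Year         Sex     Food   
 2006:158   F:321   A:455  
 2007:478   M:315   P:181  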
Frequency tables are easy to build in R using the table function. To create a 3-d table of Year × Sex × Food we simply use the command
> tab1<-with(birds,table(Year,Sex,Food))
> tab1
, , Food = A
Sex
Year F M
2006 36 49
2007 213 157
, , Food = P
Sex
Year F M
2006 22 51
2007 50 58
The summary function reports the number of cases (observations) and the number of factors (3), and produces a chi-square test of mutual independence of the factors (which, frankly, is generally not too interesting):
> summary(tab1)
Number of cases in table: 636
Number of factors: 3
Test for independence of all factors:
Chisq = 62.46, df = 4, p-value = 8.822e-13
By the way, the number of "cases" can also be obtained by summing the frequencies of the table
> sum(tab1)
[1] 636
or by simply checking how many observations there were in the original data
> nrow(birds)
[1] 636
All 3 agree, as they should. Finally, it is often useful to coerce frequency tables into dataframes, for example
> t1<-as.data.frame(tab1)
> t1
Year Sex Food Freq
1 2006 F A 36
2 2007 F A 213
3 2006 M A 49
4 2007 M A 157
5 2006 F P 22
6 2007 F P 50
7 2006 M P 51
8 2007 M P 58
The frame t1 now contains all the essential information in summary form, namely the factor levels and the frequency for each factor combination, and may itself be used as input to an analysis.
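For instance, the original table can be reconstructed from the data frame with xtabs, which builds a contingency table from a frequency column (this sketch assumes t1 as created above):

> #rebuild the 3-d table from the frequency data frame
> xtabs(Freq~Year+Sex+Food,data=t1)

The result is identical to tab1, confirming that no information was lost in the conversion.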
It's easy to form reduced-dimension tables with the table command. For instance, we can ignore Year and construct a table of Sex × Food (pooled over years):
> tab2<-with(birds,table(Sex,Food))
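Equivalently, the reduced table can be obtained by collapsing the 3-d table already in hand with margin.table, which sums over any dimensions not listed (the name tab2b here is just illustrative):

> #collapse tab1 over Year (dimension 1), keeping Sex and Food
> tab2b<-margin.table(tab1,margin=c(2,3))

This produces the same table as building it from the raw data.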
We can then summarize the reduced table and perform a test of independence between Sex and Food:
> summary(tab2)
Number of cases in table: 636
Number of factors: 2
Test for independence of all factors:
Chisq = 11.572, df = 1, p-value = 0.0006696
An alternative (that works for 2-d tables) is the chisq.test function, which by default applies a continuity correction
> #continuity correction
> chisq.test(tab2)
Pearson's Chi-squared test with Yates' continuity correction
data: tab2
X-squared = 10.9815, df = 1, p-value = 0.0009203
Notice that in the absence of this correction the chi-square result is identical to that provided by the summary function.
> #no continuity correction
> chisq.test(tab2,correct=FALSE)
Pearson's Chi-squared test
data: tab2
X-squared = 11.5717, df = 1, p-value = 0.0006696
Similar commands can be used to form 2-d tables and corresponding tests for Year X Food and Year X Sex, as shown in the accompanying R script.
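For 2 × 2 tables, particularly those with small expected frequencies, Fisher's exact test is a worthwhile alternative that avoids the chi-square approximation altogether; it is available in base R:

> #exact test of independence for the 2 x 2 table
> fisher.test(tab2)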
Frequency tables can be converted to tables of proportions by the prop.table function. For example, the command
> prop.table(tab1)
returns cell proportions for each combination of Year, Sex, and Food:
, , Food = A
Sex
Year F M
2006 0.05660377 0.07704403
2007 0.33490566 0.24685535
, , Food = P
Sex
Year F M
2006 0.03459119 0.08018868
2007 0.07861635 0.09119497
These cell probabilities will sum to 1 over all the cells:
> sum(prop.table(tab1))
[1] 1
Often, it is of more interest to examine relationships that are conditional on some factor; this is done by creating probabilities that sum to 1 marginally. For example,
> t<-prop.table(tab1,margin=1)
provides proportions for Sex × Food conditional on Year.
> t
, , Food = A
Sex
Year F M
2006 0.2278481 0.3101266
2007 0.4456067 0.3284519
, , Food = P
Sex
Year F M
2006 0.1392405 0.3227848
2007 0.1046025 0.1213389
So if we take the first level of the conditioning margin (Year = 2006) we get
> t[1,,]
Food
Sex A P
F 0.2278481 0.1392405
M 0.3101266 0.3227848
and if we sum these proportions, we get 1.
> sum(t[1,,])
[1] 1
In other words, the proportions of Sex × Food each year are conditional on the year totals. The script shows how we can form similar marginal tables for various other combinations.
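As a quick check, we can sum each year's slice in one step with apply; every conditional table should sum to 1:

> #sum over Sex and Food within each level of Year
> apply(t,1,sum)
2006 2007 
   1    1 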
Binomial tests
One variant on the above is to treat certain outcomes as mutually exclusive and exhaustive; when there are 2 possible outcomes (e.g., Food = "A" or Food = "P"), one can arbitrarily be labeled a "success" (1) and the other a "failure" (0), giving rise to binomial tests. For the woodpecker example, if we momentarily ignore Year effects, we can create a table of Sex × Food using apply
> by.sex<-apply(tab1,c(2,3),FUN=sum)
> by.sex
Food
Sex A P
F 249 72
M 206 109
The number of trials (cases) for each sex is simply the number of observations for that sex.
> trials<-with(birds,table(Sex))
> trials
Sex
F M
321 315
Let's say we arbitrarily define "success" as use of animal material. Then,
> success<-by.sex[,1]
> success
F M
249 206
We can then compute the probability of success (i.e., use of animal material) for each sex, and compute a test of the equality of this probability between sexes.
> prop.test(success,trials)
2-sample test for equality of proportions with continuity correction
data: success out of trials
X-squared = 10.9815, df = 1, p-value = 0.0009203
alternative hypothesis: two.sided
95 percent confidence interval:
0.04900448 0.19446088
sample estimates:
prop 1 prop 2
0.7757009 0.6539683
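The prop.test output above relies on a normal approximation; an exact binomial confidence interval for either sex can be obtained from binom.test (here shown for females, 249 successes in 321 trials):

> #exact binomial test and confidence interval for females
> binom.test(249,321)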
Note that binomial proportions, standard errors, and tests can also be calculated directly. E.g.,
> #direct calculations
> p<-success/trials
> p
Sex
F M
0.7757009 0.6539683
> var.p<-p*(1-p)/trials
> se.p<-sqrt(var.p)
> #standard error
> se.p
Sex
F M
0.02328136 0.02680285
These quantities could then be used in the calculation of z-, chi-squared, or other tests.
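For example, a two-sample z-test using a pooled estimate of the proportion under the null hypothesis reproduces the uncorrected chi-square statistic (z² = X²); a sketch using the objects computed above:

> #pooled proportion under H0: p equal for both sexes
> p.pool<-sum(success)/sum(trials)
> #z statistic for the difference in proportions (no continuity correction)
> z<-(p["F"]-p["M"])/sqrt(p.pool*(1-p.pool)*(1/trials["F"]+1/trials["M"]))
> z^2   #should match the uncorrected chi-square, 11.5717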
Likelihood approach using generalized linear models
The above approach, while somewhat useful for descriptive purposes, has limitations. The most obvious is that there is not a unified framework for making inferences about the effects of factors, or, for that matter, a real distinction between grouping factors (like Sex or Year) and factors that are really more akin to responses (like "Animal" vs. "Plant"). Obviously, too, the approach requires ad hoc collapsing and testing of frequency tables. There are various alternatives for frequency tables, but the one I favor is based on likelihood, using generalized linear models.
The glm function fits "generalized linear models", in which the response follows a distribution from a specified family and is related to the linear predictor through a link function.
We will consider generalized linear models in more detail later; for our purposes here, the response is binary (Bernoulli, 1 or 0) or binomial (x successes in n trials), and the distribution family is binomial. Below are several models I created for the woodpecker example:
> #### binomial glms
Intercept only
> mod.0<-glm(Food~1,data=birds,family=binomial)
Sex effect (intercept included by default)
> mod.sex<-glm(Food~Sex,data=birds,family=binomial)
Year effect (intercept included by default)
> mod.year<-glm(Food~Year,data=birds,family=binomial)
Year + Sex additive effect (intercept included by default)
> mod.sex.p.year<-glm(Food~Year+Sex,data=birds,family=binomial)
Year + Sex + Year*Sex interaction effect (intercept included by default)
> mod.sex.year<-glm(Food~Year*Sex,data=birds,family=binomial)
>
The built-in AIC function computes AIC (along with the number of estimated parameters, df) for each model and places the results in a dataframe, which is then sorted by ascending AIC (lowest AIC indicating best support)
> aic<-AIC(mod.0,mod.sex,mod.year, mod.sex.p.year,mod.sex.year)
> aic[order(aic$AIC),]
df AIC
mod.sex.p.year 3 728.1623
mod.sex.year 4 730.1236
mod.year 2 732.9375
mod.sex 2 752.0557
mod.0 1 761.6860
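Because only differences in AIC matter for model comparison, it is often convenient to add a delta-AIC column; a small extension of the code above:

> #AIC differences relative to the best model
> aic<-aic[order(aic$AIC),]
> aic$dAIC<-aic$AIC-min(aic$AIC)
> aic

Here the top two models (Sex + Year and Sex * Year) are separated by only about 2 AIC units, so both have reasonable support.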
We can also construct predictions from a selected model, in which we specify arbitrary levels of the predictor variables (e.g., under a future study) using a dataframe as input. For example, the code below predicts the probability p that a food item is plant material, by sex for each year, under the Sex * Year model (with the default factor ordering, the second level of Food, "P", is treated as the "success"). Note that the linear model is in terms of the logit function of p, where logit(p) = ln(p/(1-p)); the function plogis back-transforms logit(p) to p.
> #predictions
> new.data<-data.frame(Sex=c("M","M","F","F"),Year=rep(c("2006","2007"),2))
> #predictions under interaction model (gives proportion table margins)
> pred.mod<-predict(mod.sex.year,newdata=new.data,se.fit=T)
> pred.mod$p.pred<-plogis(pred.mod$fit)
> pred.mod$lci<-plogis(pred.mod$fit-1.96*pred.mod$se.fit)
> pred.mod$uci<-plogis(pred.mod$fit+1.96*pred.mod$se.fit)
> predictions<-cbind(new.data,pred.mod)
> predictions
Sex Year fit se.fit residual.scale p.pred lci uci
1 M 2006 0.04000533 0.2000400 1 0.5100000 0.4128798 0.6063711
2 M 2007 -0.99580279 0.1536580 1 0.2697674 0.2146750 0.3330032
3 F 2006 -0.49247649 0.2706147 1 0.3793103 0.2644665 0.5094809
4 F 2007 -1.44926916 0.1571459 1 0.1901141 0.1471319 0.2420885
>
The attached script file calculates these predictions under the top AIC model (Sex + Year) and produces plots that include confidence intervals.
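For readers without the script, here is a minimal sketch of such predictions and a base-R confidence-interval plot (the variable names are illustrative, not taken from the script; new.data is the dataframe defined above):

> #predictions under the additive (Sex + Year) model
> pred.add<-predict(mod.sex.p.year,newdata=new.data,se.fit=TRUE)
> p.hat<-plogis(pred.add$fit)
> lci<-plogis(pred.add$fit-1.96*pred.add$se.fit)
> uci<-plogis(pred.add$fit+1.96*pred.add$se.fit)
> #plot point estimates with 95% CI bars
> plot(1:4,p.hat,ylim=c(0,1),pch=16,xaxt="n",xlab="",ylab="P(plant material)")
> axis(1,at=1:4,labels=paste(new.data$Sex,new.data$Year))
> arrows(1:4,lci,1:4,uci,angle=90,code=3,length=0.05)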