Frequency tables and tests of independence
Analysis of frequency data starts with organizing the data into tables whose dimensions are levels (or combinations of levels) of the different factors (groups) under examination, and whose entries are the numbers of observations (frequencies) that occur in each group. Let's illustrate this with a simple example of feeding data for Red-headed Woodpeckers; the data are located here. The data consist of observations of birds over 2 years (Year = 2006 or 2007), categorized by the sex of the feeding bird (Female, Sex = "F", or Male, Sex = "M") and the type of food item consumed (Animal, Food = "A", or Plant, Food = "P").
Before building a table, we should convert Year from a numerical variable to a factor (the character variables Sex and Food are converted automatically when read.csv creates factors; note that in R 4.0 and later you may need to set stringsAsFactors=TRUE for this). The code to read and summarize the data is (also contained in a script file):
> birds<-read.csv("RHWO.csv")
> #make year a factor variable
> birds$Year<-factor(birds$Year,labels=c("2006","2007"))
> #summarize data
> summary(birds)
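With Year, Sex, and Food all treated as factors, the summary output should resemble the following (the counts here are derived by summing the frequency table shown below; exact column spacing may differ):

   Year         Sex     Food   
 2006:158   F:321   A:455  
 2007:478   M:315   P:181  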
Frequency tables are easy to build in R using the table function. To create a 3-d table of Year × Sex × Food we simply use the command
> tab1<-with(birds,table(Year,Sex,Food))
> tab1
, , Food = A
Sex
Year F M
2006 36 49
2007 213 157
, , Food = P
Sex
Year F M
2006 22 51
2007 50 58
The summary function reports the number of cases (observations) and the number of factors (3), and produces a chi-square test of mutual independence of the factors (which, frankly, is generally not too interesting):
> summary(tab1)
Number of cases in table: 636
Number of factors: 3
Test for independence of all factors:
Chisq = 62.46, df = 4, p-value = 8.822e-13
By the way, the number of "cases" can also be obtained by summing the frequencies of the table
> sum(tab1)
[1] 636
or by simply checking how many observations there were in the original data
> nrow(birds)
[1] 636
All 3 agree, as they should. Finally, it is often useful to coerce frequency tables into dataframes, for example
> t1<-as.data.frame(tab1)
> t1
Year Sex Food Freq
1 2006 F A 36
2 2007 F A 213
3 2006 M A 49
4 2007 M A 157
5 2006 F P 22
6 2007 F P 50
7 2006 M P 51
8 2007 M P 58
The frame t1 now contains all the essential information in summary form, namely the factor levels and the frequency for each factor combination, and may itself be used as input to an analysis.
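For instance, the original table can be reconstructed from the data frame with xtabs, which builds a contingency table from a frequency column (this sketch assumes t1 as created above):

> #rebuild the 3-d table from the frequency data frame
> xtabs(Freq~Year+Sex+Food,data=t1)

The result is identical to tab1, confirming that no information was lost in the conversion.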
It's easy to form reduced-dimension tables with the table command. For instance, we can ignore Year and construct a table of Sex × Food (pooled over years):
> tab2<-with(birds,table(Sex,Food))
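Equivalently, the reduced table can be obtained by collapsing the 3-d table already in hand with margin.table, which sums over any dimensions not listed (the name tab2b here is just illustrative):

> #collapse tab1 over Year (dimension 1), keeping Sex and Food
> tab2b<-margin.table(tab1,margin=c(2,3))

This produces the same table as building it from the raw data.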
We can then summarize the reduced table and perform a test of independence between Sex and Food:
> summary(tab2)
Number of cases in table: 636
Number of factors: 2
Test for independence of all factors:
Chisq = 11.572, df = 1, p-value = 0.0006696
An alternative (that works for 2-d tables) is the chisq.test function, which by default applies a continuity correction
> #continuity correction
> chisq.test(tab2)
Pearson's Chi-squared test with Yates' continuity correction
data: tab2
X-squared = 10.9815, df = 1, p-value = 0.0009203
Notice that in the absence of this correction the chi-square result is identical to that provided by the summary function.
> #no continuity correction
> chisq.test(tab2,correct=FALSE)
Pearson's Chi-squared test
data: tab2
X-squared = 11.5717, df = 1, p-value = 0.0006696
Similar commands can be used to form 2-d tables and corresponding tests for Year X Food and Year X Sex, as shown in the accompanying R script.
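For 2 × 2 tables, particularly those with small expected frequencies, Fisher's exact test is a worthwhile alternative that avoids the chi-square approximation altogether; it is available in base R:

> #exact test of independence for the 2 x 2 table
> fisher.test(tab2)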
Frequency tables can be converted to tables of proportions by the prop.table function. For example, the command
> prop.table(tab1)
returns cell proportions for each combination of Year, Sex, and Food:
, , Food = A
Sex
Year F M
2006 0.05660377 0.07704403
2007 0.33490566 0.24685535
, , Food = P
Sex
Year F M
2006 0.03459119 0.08018868
2007 0.07861635 0.09119497
These cell probabilities will sum to 1 over all the cells:
> sum(prop.table(tab1))
[1] 1
Often, it is of more interest to examine relationships that are conditional on some factor; this is done by creating probabilities that sum to 1 marginally. For example,
> t<-prop.table(tab1,margin=1)
provides proportions for Sex × Food conditional on Year.
> t
, , Food = A
Sex
Year F M
2006 0.2278481 0.3101266
2007 0.4456067 0.3284519
, , Food = P
Sex
Year F M
2006 0.1392405 0.3227848
2007 0.1046025 0.1213389
So if we take the first level of the conditioning margin (Year = 2006) we get
> t[1,,]
Food
Sex A P
F 0.2278481 0.1392405
M 0.3101266 0.3227848
and if we sum these proportions, we get 1.
> sum(t[1,,])
[1] 1
In other words, the proportions of Sex × Food each year are conditional on the year totals. The script shows how we can form similar marginal tables for various other combinations.
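As a quick check, we can sum each year's slice in one step with apply; every conditional table should sum to 1:

> #sum over Sex and Food within each level of Year
> apply(t,1,sum)
2006 2007 
   1    1 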
Binomial tests
One variant on the above is to treat certain outcomes as mutually exclusive and exhaustive; when there are 2 possible outcomes (e.g., Food = "A" or Food = "P"), one can arbitrarily be labeled a "success" (1) and the other a "failure" (0), giving rise to binomial tests. For the woodpecker example, if we momentarily ignore Year effects, we can create a table of Sex × Food using apply
> by.sex<-apply(tab1,c(2,3),FUN=sum)
> by.sex
Food
Sex A P
F 249 72
M 206 109
The number of trials (cases) for each sex is simply the number of observations for that sex.
> trials<-with(birds,table(Sex))
> trials
Sex
F M
321 315
Let's say we arbitrarily define "success" as use of animal material. Then,
> success<-by.sex[,1]
> success
F M
249 206
We can then compute the probability of success (i.e., use of animal material) for each sex, and compute a test of the equality of this probability between sexes.
> prop.test(success,trials)
2-sample test for equality of proportions with continuity correction
data: success out of trials
X-squared = 10.9815, df = 1, p-value = 0.0009203
alternative hypothesis: two.sided
95 percent confidence interval:
0.04900448 0.19446088
sample estimates:
prop 1 prop 2
0.7757009 0.6539683
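The prop.test output above relies on a normal approximation; an exact binomial confidence interval for either sex can be obtained from binom.test (here shown for females, 249 successes in 321 trials):

> #exact binomial test and confidence interval for females
> binom.test(249,321)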
Note that binomial proportions, standard errors, and tests can also be calculated directly. E.g.,
> #direct calculations
> p<-success/trials
> p
Sex
F M
0.7757009 0.6539683
> var.p<-p*(1-p)/trials
> se.p<-sqrt(var.p)
> #standard error
> se.p
Sex
F M
0.02328136 0.02680285
These quantities could then be used in the calculation of z-, chi-squared, or other tests.
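For example, a two-sample z-test using a pooled estimate of the proportion under the null hypothesis reproduces the uncorrected chi-square statistic (z² = X²); a sketch using the objects computed above:

> #pooled proportion under H0: p equal for both sexes
> p.pool<-sum(success)/sum(trials)
> #z statistic for the difference in proportions (no continuity correction)
> z<-(p["F"]-p["M"])/sqrt(p.pool*(1-p.pool)*(1/trials["F"]+1/trials["M"]))
> z^2   #should match the uncorrected chi-square, 11.5717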
Likelihood approach using generalized linear models
The above approach, while somewhat useful for descriptive purposes, has limitations. The most obvious is that there is not a unified framework for making inferences about the effects of factors, or, for that matter, a real distinction between grouping factors (like Sex or Year) and factors that are really more akin to responses (like "Animal" vs. "Plant"). Obviously, too, the approach requires ad hoc collapsing and testing of frequency tables. There are various alternatives for frequency tables, but the one I favor is based on likelihood, using generalized linear models.
The glm function fits "generalized linear models", in which the response follows a distribution from a specified family and is related to the linear predictor through a link function.
We will consider generalized linear models in more detail later; for our purposes here, the response is binary (Bernoulli, 1 or 0) or binomial (x successes in n trials), and the distribution family is binomial. Below are several models I created for the woodpecker example:
> #### binomial glms
Intercept only
> mod.0<-glm(Food~1,data=birds,family=binomial)
Sex effect (intercept included by default)
> mod.sex<-glm(Food~Sex,data=birds,family=binomial)
Year effect (intercept included by default)
> mod.year<-glm(Food~Year,data=birds,family=binomial)
Year + Sex additive effect (intercept included by default)
> mod.sex.p.year<-glm(Food~Year+Sex,data=birds,family=binomial)
Year + Sex + Year*Sex interaction effect (intercept included by default)
> mod.sex.year<-glm(Food~Year*Sex,data=birds,family=binomial)
>
The built-in AIC function computes AIC (along with the number of estimated parameters, df) for each model and places the results in a dataframe, which is then sorted by ascending AIC (lowest AIC indicating best support)
> aic<-AIC(mod.0,mod.sex,mod.year, mod.sex.p.year,mod.sex.year)
> aic[order(aic$AIC),]
df AIC
mod.sex.p.year 3 728.1623
mod.sex.year 4 730.1236
mod.year 2 732.9375
mod.sex 2 752.0557
mod.0 1 761.6860
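Because only differences in AIC matter for model comparison, it is often convenient to add a delta-AIC column; a small extension of the code above:

> #AIC differences relative to the best model
> aic<-aic[order(aic$AIC),]
> aic$dAIC<-aic$AIC-min(aic$AIC)
> aic

Here the top two models (Sex + Year and Sex * Year) are separated by only about 2 AIC units, so both have reasonable support.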
We can also construct predictions from a selected model, in which we specify arbitrary levels of the predictor variables (e.g., under a future study) using a dataframe as input. For example, the code below predicts the probability p that a food item is plant material, by sex for each year, under the Sex * Year model (with the default factor ordering, the second level of Food, "P", is treated as the "success"). Note that the linear model is in terms of the logit function of p, where logit(p) = ln(p/(1-p)); the function plogis back-transforms logit(p) to p.
> #predictions
> new.data<-data.frame(Sex=c("M","M","F","F"),Year=rep(c("2006","2007"),2))
> #predictions under interaction model (gives proportion table margins)
> pred.mod<-predict(mod.sex.year,newdata=new.data,se.fit=T)
> pred.mod$p.pred<-plogis(pred.mod$fit)
> pred.mod$lci<-plogis(pred.mod$fit-1.96*pred.mod$se.fit)
> pred.mod$uci<-plogis(pred.mod$fit+1.96*pred.mod$se.fit)
> predictions<-cbind(new.data,pred.mod)
> predictions
Sex Year fit se.fit residual.scale p.pred lci uci
1 M 2006 0.04000533 0.2000400 1 0.5100000 0.4128798 0.6063711
2 M 2007 -0.99580279 0.1536580 1 0.2697674 0.2146750 0.3330032
3 F 2006 -0.49247649 0.2706147 1 0.3793103 0.2644665 0.5094809
4 F 2007 -1.44926916 0.1571459 1 0.1901141 0.1471319 0.2420885
>
The attached script file calculates these predictions under the top AIC model (Sex + Year) and produces plots that include confidence intervals.
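For readers without the script, here is a minimal sketch of such predictions and a base-R confidence-interval plot (the variable names are illustrative, not taken from the script; new.data is the dataframe defined above):

> #predictions under the additive (Sex + Year) model
> pred.add<-predict(mod.sex.p.year,newdata=new.data,se.fit=TRUE)
> p.hat<-plogis(pred.add$fit)
> lci<-plogis(pred.add$fit-1.96*pred.add$se.fit)
> uci<-plogis(pred.add$fit+1.96*pred.add$se.fit)
> #plot point estimates with 95% CI bars
> plot(1:4,p.hat,ylim=c(0,1),pch=16,xaxt="n",xlab="",ylab="P(plant material)")
> axis(1,at=1:4,labels=paste(new.data$Sex,new.data$Year))
> arrows(1:4,lci,1:4,uci,angle=90,code=3,length=0.05)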