Two-Way Analysis of Variance (ANOVA) is a technique for studying the relationship between a quantitative dependent variable and two qualitative independent variables. Usually we are interested in whether the level of the dependent variable differs for different values of the qualitative variables. We will use as an example data from a student project reported in Stats: Data and Models (2nd ed.), by De Veaux, Velleman and Bock, Addison-Wesley, 2008, Chapter 29, Exercise 11. The student was interested in her success at basketball free throws. This study investigated whether there was any relationship between the quantitative variable "number of shots made" (i.e., successfully completed out of 50 tries) and two qualitative variables, "Time of Day" and "Shoes Worn". ANOVA is commonly used with experimental studies, and that is the case here. You can find the data at our site as a plain text file and as an Excel spreadsheet. Download the text file now and save it to R's working directory (type getwd() at the R prompt to see which directory that is).
R can read data from a text file. The text file has to be in the form of a table with columns representing variables. All columns must be the same length. Missing data must be signified by "NA". Optionally, the first row of the file may contain names for the variables. To use the file you just downloaded in R you must define a variable to be equal to the contents of this file.
> baskball <- read.table("baskball.txt",header=TRUE)
The argument header=TRUE tells R that the first row of the file should be interpreted as variable names. (There must be a name for every variable, and the names must not contain spaces.) You can now list the objects you have created in R with
> objects()
This should return baskball along with any other variables you may have created. You will not see on this list any of the variables that are inside of baskball because they are hiding. To see them, type
> names(baskball)
[1] "Time"  "Shoes" "Made"
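The file format described earlier can be tried out without any file on disk. The sketch below passes a small made-up table (not the contents of baskball.txt) to read.table through its text argument, with one value deliberately recorded as NA.

```r
# Made-up rows for illustration only; these are NOT the values in baskball.txt.
demo_text <- "Time Shoes Made
Morning Favorite 30
Morning Others NA
Night Favorite 33
Night Others 29"

# text= lets read.table parse a character string instead of a file.
demo <- read.table(text = demo_text, header = TRUE)

names(demo)             # variable names taken from the first row
str(demo)               # one row per observation, one column per variable
sum(is.na(demo$Made))   # counts the entries read as missing
```

Each column has the same number of entries, and the NA entry is read as a missing value rather than as text.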
To bring them out of hiding, you must attach them to your R workspace.
> attach(baskball)
Then you can work with them, provided you remember that R is case-sensitive: Made and made are different names.
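Attaching is a convenience, not a requirement. As a sketch of the alternatives, the columns of a data frame can also be reached with $ or with(), which leaves the workspace uncluttered. The data frame demo below is made up for illustration and is not the free-throw data.

```r
# Made-up data frame for illustration; not the contents of baskball.txt.
demo <- data.frame(
  Time  = c("Morning", "Morning", "Night", "Night"),
  Shoes = c("Favorite", "Others", "Favorite", "Others"),
  Made  = c(30, 28, 33, 29)
)

demo$Made                 # one column, extracted with $
with(demo, mean(Made))    # evaluate an expression using the columns of demo

# Case matters: the column is called Made, so demo$made finds nothing.
is.null(demo$made)
```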
We can compare the two times or the two shoes by looking at summary statistics or at parallel boxplots. To get the means for each level of each factor, use R's tapply command. This takes three arguments: the data you wish to summarize, the factor that determines the groups, and the function you wish to apply to each of the groups.
> tapply(Made,Time,mean)
Morning   Night
 30.500  31.875
> tapply(Made,Shoes,mean)
Favorite   Others
  32.750   29.625
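Passing a list of two factors to tapply gives the mean for every combination of levels, which is a useful preview of any interaction. A minimal sketch with made-up numbers (not the free-throw data):

```r
# Made-up balanced 2 x 2 data, two observations per cell; illustration only.
demo <- data.frame(
  Time  = rep(c("Morning", "Night"), each = 4),
  Shoes = rep(c("Favorite", "Others"), times = 4),
  Made  = c(30, 28, 32, 26, 34, 29, 36, 31)
)

# One factor: a mean for each level of Time.
with(demo, tapply(Made, Time, mean))

# Two factors: a 2 x 2 table of cell means (rows = Time, columns = Shoes).
cell_means <- with(demo, tapply(Made, list(Time, Shoes), mean))
cell_means
```

If the difference between the columns is about the same in every row, there is little sign of interaction.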
Comparing the two sets of means, it looks like she does better at night and in her favorite shoes. But that could just be due to natural variability. We can check with ANOVA. We prefer to start with a model including interaction. R is a bit roundabout. We first run the ANOVA, store the results in a variable, and then generate a summary of those results.
> int <- aov(Made ~ Time*Shoes)
> summary(int)
            Df  Sum Sq Mean Sq F value Pr(>F)
Time         1   7.562   7.562  0.3441 0.5684
Shoes        1  39.062  39.062  1.7773 0.2072
Time:Shoes   1  18.062  18.062  0.8218 0.3825
Residuals   12 263.750  21.979
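Each F value in this table is the term's mean square divided by the residual mean square, and the p-value is the upper-tail area of an F distribution with the matching degrees of freedom. As a check on that arithmetic, the interaction row can be reproduced from the printed numbers (tiny discrepancies come from rounding in the table):

```r
# Numbers copied from the summary(int) table.
ms_interaction <- 18.062
ms_residual    <- 21.979
df_interaction <- 1
df_residual    <- 12

f_value <- ms_interaction / ms_residual
p_value <- pf(f_value, df_interaction, df_residual, lower.tail = FALSE)

round(f_value, 4)   # compare with the F value printed for Time:Shoes
round(p_value, 4)   # compare with Pr(>F) printed for Time:Shoes
```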
The p-value of 0.3825 for the interaction term gives no evidence of an interaction, so we repeat the process with a simple additive model.
> noint <- aov(Made~Time + Shoes)
> summary(noint)
            Df  Sum Sq Mean Sq F value Pr(>F)
Time         1   7.562   7.562  0.3489 0.5649
Shoes        1  39.062  39.062  1.8020 0.2024
Residuals   13 281.812  21.678
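Comparing the two tables, the sums of squares for Time and Shoes are unchanged; the interaction's sum of squares and its one degree of freedom have simply been pooled into the residual line. A quick check using the printed values:

```r
# Values copied from summary(int) and summary(noint).
ss_interaction <- 18.062
ss_resid_int   <- 263.750   # residual SS in the model with interaction
df_resid_int   <- 12

# Pool the interaction into the residual line.
ss_resid_noint <- ss_resid_int + ss_interaction
df_resid_noint <- df_resid_int + 1
ms_resid_noint <- ss_resid_noint / df_resid_noint

ss_resid_noint             # compare with the residual Sum Sq in the additive table
round(ms_resid_noint, 3)   # compare with the residual Mean Sq in the additive table

# F for Shoes in the additive model uses the pooled residual mean square.
f_shoes <- 39.062 / ms_resid_noint
round(f_shoes, 4)          # agrees with the printed F value up to rounding
```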
Unfortunately the p-values for both variables in both models are quite large, suggesting that any effect we saw could well have been due to chance. However, there is an alternative interpretation: with just 16 observations, we will only be able to detect a fairly large difference. It appeared the shoes made a difference of about 3 successes in 50 tries. If that is enough of a difference to matter in practice, we might repeat the experiment with more trials. Before we do that, though, we might make some displays to see whether data of this sort match the assumptions of ANOVA. No sense gathering more if they do not ;-)
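A rough sense of what "fairly large" means can be had from a power calculation. The sketch below simplifies the shoes comparison to a two-sample t-test with 8 observations per group, treats the observed difference as the true difference, and takes the standard deviation from the residual mean square of the additive model. All three are assumptions made for illustration and are not part of the original analysis.

```r
# Assumptions for this sketch: two-sample t-test approximation,
# true difference = observed difference, sd = sqrt(residual mean square).
delta <- 32.750 - 29.625   # Favorite minus Others, from the tapply output
sigma <- sqrt(21.678)      # residual mean square of the additive model

# Power with the 8 observations per shoe type available in this study.
current <- power.t.test(n = 8, delta = delta, sd = sigma, sig.level = 0.05)
current$power

# Observations per group needed for 80% power under the same assumptions.
needed <- power.t.test(delta = delta, sd = sigma, sig.level = 0.05, power = 0.80)
ceiling(needed$n)
```

A low value for the first number would support the alternative interpretation: the study may simply have been too small to detect a difference of this size.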
> boxplot(Made ~ Time)
> boxplot(Made ~ Shoes)
We can't expect perfection with only eight numbers in each group (How could eight numbers look bell-shaped?) but there are no signs here of serious skewness or outliers. All four groups (the two times and the two shoes) have similar variabilities.
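The visual impression from the boxplots can be backed up with numbers. The sketch below uses simulated data (made up for illustration, not the free-throw data) to show how group standard deviations and the residuals of a fitted model might be inspected.

```r
# Simulated balanced 2 x 2 data, 4 observations per cell; illustration only.
set.seed(1)
demo <- expand.grid(
  rep   = 1:4,
  Time  = c("Morning", "Night"),
  Shoes = c("Favorite", "Others")
)
demo$Made <- round(rnorm(nrow(demo), mean = 31, sd = 4.5))

# Spread within each level of each factor.
with(demo, tapply(Made, Time, sd))
with(demo, tapply(Made, Shoes, sd))

# Residuals from the additive model; data= makes attach() unnecessary.
fit <- aov(Made ~ Time + Shoes, data = demo)
res <- residuals(fit)
summary(res)   # look for rough symmetry about zero
# qqnorm(res); qqline(res)   # uncomment for a normal quantile plot
```

Standard deviations that differ by a large factor, or residuals that are strongly lopsided, would be a warning sign.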
If we see signs the assumptions are not met then the remedies are similar to what they are in the univariate case. For example, outliers or bimodality must be investigated as to their cause. A transformation of the dependent variable may help just as it can in the univariate case. However, it is most likely to be effective if all the groups are skewed, and in the same direction, or if there is a systematic change in variability as group means increase.
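As an illustration of the last point, the sketch below simulates a response whose spread grows with its mean (made-up, right-skewed data unrelated to the free-throw study) and compares the group standard deviations before and after a log transformation. The choice of log is an assumption for this example; other transformations may suit other data.

```r
# Simulated right-skewed data; illustration only.
set.seed(42)
group <- factor(rep(c("low", "high"), each = 50), levels = c("low", "high"))
y <- c(rlnorm(50, meanlog = 1, sdlog = 0.5),
       rlnorm(50, meanlog = 3, sdlog = 0.5))

# Raw scale: standard deviation of each group.
sd_raw <- tapply(y, group, sd)
sd_raw

# Log scale: standard deviation of each group after transforming.
sd_log <- tapply(log(y), group, sd)
sd_log

# Ratio of the two group standard deviations on each scale.
ratio_raw <- unname(sd_raw["high"] / sd_raw["low"])
ratio_log <- unname(sd_log["high"] / sd_log["low"])
c(raw = ratio_raw, log = ratio_log)
```

A ratio much nearer to 1 on the log scale indicates that the transformation has evened out the variability, which is what ANOVA assumes.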
Exercises
1. Conduct a two-way ANOVA with "sat" as the response variable and "region" and "party" as the explanatory variables. Compare your results to the one-way ANOVA tutorial. What do you discover?
2. How does the R syntax, and interpretation, of this result compare to a multiple regression model?