One-Way Analysis of Variance (ANOVA) is a technique for studying the relationship between a quantitative dependent variable and a single qualitative independent variable. Usually we are interested in whether the level of the dependent variable differs for different values of the qualitative variable.
We will use as an example real data from a study reported in 1935 by B. Lowe of the Iowa Agricultural Experiment Station.* Perhaps this originated at coffee break one morning. Donuts are traditionally a fried food and as such absorb some of the fat they are fried in. The amount and type of fat absorbed have implications for the healthfulness of the donuts. This study investigated whether there was any relationship between the quantitative variable "amount of fat absorbed" and the qualitative variable "type of fat". (Unfortunately we do not know just what the fats were. You could think of them as corn oil, soybean oil, lard, and Quaker State.)
You can find the data at our site as a plain text file and as an Excel spreadsheet. Import it from the RStudio tab using the Web URL, or download it and import it from a text file. Be careful to note the "Heading" checkbox, since the first line of the file holds the column names (Fat1 through Fat4) rather than data. Name the data frame "donuts".
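For readers who prefer typing a command to clicking through the dialog, the import can be sketched with read.table(). To keep the example runnable without the real file, it first writes the data to a temporary file; that temporary file and the names donut_file and wrong are stand-ins for illustration, not the actual file on the site. The header = TRUE argument does the job the "Heading" checkbox is assumed to do in the dialog.

```r
# Stand-in for the downloaded text file (an assumption for illustration;
# substitute the real path or URL of the data file).
donut_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Fat1 Fat2 Fat3 Fat4",
  "164 178 175 155",
  "172 191 193 166",
  "168 197 178 149",
  "177 182 171 164",
  "156 185 163 170",
  "195 177 176 168"
), donut_file)

# header = TRUE: the first line holds column names, not data.
donuts <- read.table(donut_file, header = TRUE)
str(donuts)

# With header = FALSE the names are read as a seventh row of data,
# and the columns are no longer numeric.
wrong <- read.table(donut_file, header = FALSE)
str(wrong)
```

If str(donuts) shows six observations of four numeric variables, the import worked.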
> attach(donuts)
Then you can work with the columns by name, provided you remember that R is case-sensitive.
> Fat1
[1] 164 172 168 177 156 195
ANOVA is commonly used with experimental studies, and that is the case here. The experiment consists of frying some donuts in each of four fats. Twenty-four batches of donuts were prepared, and six were randomly assigned to each of the four fats. The results, in grams of fat absorbed for each batch, as they might commonly be laid out on a page (and are laid out in the file), were:
Fat1  Fat2  Fat3  Fat4
 164   178   175   155
 172   191   193   166
 168   197   178   149
 177   182   171   164
 156   185   163   170
 195   177   176   168
While this is a reasonable arrangement for purposes of page layout, and very common for this type of data, it obscures the structure of the data and may confuse statistical software. The experimental units here are batches of donuts, and for each batch we write down two things: one value for the quantitative variable fat absorbed and one value for the qualitative variable type of fat. Here is how R can change the page layout format into a format that is more logical and easier for statistical software to deal with.
> sdonuts = stack(donuts)
> sdonuts
   values  ind
1     164 Fat1
2     172 Fat1
3     168 Fat1
4     177 Fat1
5     156 Fat1
6     195 Fat1
7     178 Fat2
8     191 Fat2
9     197 Fat2
10    182 Fat2
11    185 Fat2
12    177 Fat2
13    175 Fat3
14    193 Fat3
15    178 Fat3
16    171 Fat3
17    163 Fat3
18    176 Fat3
19    155 Fat4
20    166 Fat4
21    149 Fat4
22    164 Fat4
23    170 Fat4
24    168 Fat4
A slightly more cumbersome way of doing it, using only commands we have seen before, is
> Values = c(Fat1, Fat2, Fat3, Fat4)
> Ind = c(rep("Fat1", times=6), rep("Fat2", times=6), rep("Fat3", times=6), rep("Fat4", times=6))
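To confirm that the manual construction gives the same result as stack(), the two can be compared directly. This is a minimal sketch: the data are typed in by hand so that it runs without the data file, and the data frame name manual and the use of rep(..., each = 6) are additions for illustration.

```r
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)

# One-step version
sdonuts <- stack(donuts)

# Manual version: concatenate the columns, then label each value with its fat.
# rep(..., each = 6) is shorthand for the four separate rep() calls.
Values <- c(donuts$Fat1, donuts$Fat2, donuts$Fat3, donuts$Fat4)
Ind <- rep(c("Fat1", "Fat2", "Fat3", "Fat4"), each = 6)
manual <- data.frame(values = Values, ind = factor(Ind))

# Same numbers and same labels, in the same order
all(manual$values == sdonuts$values)
all(as.character(manual$ind) == as.character(sdonuts$ind))
```

Both comparisons should print TRUE. Either way, each row now describes one batch: how much fat it absorbed and which fat it was fried in.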
We can compare the four fats by looking at summary statistics or at multiple boxplots.
> summary(donuts)
> summary(sdonuts)
> attach(sdonuts)
> boxplot(values ~ ind)
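A compact table of group means and standard deviations can make the comparison easier to read than the full summaries. The sketch below uses tapply(), which may not have been introduced earlier, and passes the columns explicitly instead of relying on attach(); the names group_means and group_sds are chosen here for illustration.

```r
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)
sdonuts <- stack(donuts)

# Mean and standard deviation of fat absorbed, computed separately for each fat
group_means <- tapply(sdonuts$values, sdonuts$ind, mean)
group_sds <- tapply(sdonuts$values, sdonuts$ind, sd)

# The means come out as 172, 185, 176 and 162 grams for Fat1 to Fat4
round(cbind(mean = group_means, sd = group_sds), 2)
```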
It certainly looks like more of Fat 2 gets absorbed while Fat 4 seems least absorbed. But wait a minute! If we repeated the experiment we would most likely get different numbers. Could this change the rankings of the fats? Is it possible that all four fats are absorbed to about the same degree and we are just seeing random fluctuations from one assignment of batches to fats to another? To see if that is likely we do a hypothesis test. The null as usual is backwards: we hypothesize no difference among the fats. As always, the null provides a specific model with which we can play "what if". If the null were true, would such differences be ordinary or extraordinary?
> lm1 = lm(values ~ ind)
> summary(lm1)
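The p-value of 0.006876, reported on the last line of the summary output alongside the F-statistic, is for a test of the hypothesis that the mean amount of fat absorbed is the same for all four types of fat. Because it is so small, we reject the hypothesis of equal absorption: differences this large would be extraordinary if the fats were really absorbed equally.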
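It may help to see where that p-value comes from. The sketch below computes the F-statistic by hand as the ratio of the variation between the group means to the variation within groups, then checks it against anova(). The variable names (ss_between, ss_within and so on) are chosen here for illustration, and the data are typed in so that the block runs on its own.

```r
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)
sdonuts <- stack(donuts)

k <- nlevels(sdonuts$ind)   # number of fats
n <- nrow(sdonuts)          # number of batches

grand_mean <- mean(sdonuts$values)
group_means <- tapply(sdonuts$values, sdonuts$ind, mean)
group_sizes <- tapply(sdonuts$values, sdonuts$ind, length)

# Variation of the group means around the grand mean
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)

# Variation of each batch around its own group mean;
# ave() returns the group mean for every observation
ss_within <- sum((sdonuts$values - ave(sdonuts$values, sdonuts$ind))^2)

# F is the ratio of the two mean squares
f_stat <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_value <- pf(f_stat, df1 = k - 1, df2 = n - k, lower.tail = FALSE)
c(F = f_stat, p = p_value)

# The same test from the fitted model
anova_table <- anova(lm(values ~ ind, data = sdonuts))
anova_table
```

The hand-computed p-value should agree with the one in the ANOVA table and with the 0.006876 quoted above.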
Like any statistical test, this one is based on some assumptions. We will mention only the ones we can check with software. There are two: that the amounts absorbed for each fat are normally distributed, and that the four fats share a common variance. We can check these roughly from the boxplots, where we see roughly similar spreads and no serious departures from normality.
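The boxplots are the check used here, but R also offers formal tests for both assumptions. The sketch below goes beyond the text: Bartlett's test for equal variances and the Shapiro-Wilk test for normality of the residuals are swapped in as one possible choice, not a recommendation. With only six batches per fat, neither test has much power, so a large p-value is weak reassurance at best.

```r
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)
sdonuts <- stack(donuts)
lm1 <- lm(values ~ ind, data = sdonuts)

# Common variance: the group standard deviations should be of similar size
tapply(sdonuts$values, sdonuts$ind, sd)

# Bartlett's test of equal variances across the four fats
bt <- bartlett.test(values ~ ind, data = sdonuts)
bt

# Normality: Shapiro-Wilk test on the residuals, that is, each batch's
# deviation from the mean of its own fat
sw <- shapiro.test(residuals(lm1))
sw
```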
If we see signs the assumptions are not met then the remedies are similar to what they are in the univariate case. For example, outliers or bimodality must be investigated as to their cause. A transformation of the dependent variable may help just as it can in the univariate case. However, it is most likely to be effective if all the groups are skewed, and in the same direction, or if there is a systematic change in variability with amount of fat absorbed.
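Mechanically, a transformation just means refitting the model with the transformed response. The donut data show no need for one, so the sketch below uses the log purely to illustrate the steps; the choice of log and the name lm_log are assumptions made for the example.

```r
donuts <- data.frame(
  Fat1 = c(164, 172, 168, 177, 156, 195),
  Fat2 = c(178, 191, 197, 182, 185, 177),
  Fat3 = c(175, 193, 178, 171, 163, 176),
  Fat4 = c(155, 166, 149, 164, 170, 168)
)
sdonuts <- stack(donuts)

# Refit the one-way model with the response on the log scale
lm_log <- lm(log(values) ~ ind, data = sdonuts)
anova(lm_log)

# Compare the group spreads before and after transforming
tapply(sdonuts$values, sdonuts$ind, sd)
tapply(log(sdonuts$values), sdonuts$ind, sd)
```

After transforming, the assumptions should be rechecked on the new scale, and conclusions then refer to the mean of the transformed variable.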
Exercises
1. Conduct a one-way ANOVA with "sat" as the response variable and "region" as the explanatory variable. What do you discover?
2. Check that the assumptions of the one-way ANOVA are satisfied for question 1.
*Our source is Chapter 12 of Snedecor and Cochran, Statistical Methods (7th ed.), 1980, Iowa State University Press, Ames, IA.