Built-in statistical functions and basic statistical analysis
As we have seen, R has a number of built-in statistical functions for simple summary statistics (mean, standard deviation, median, mode, etc.). R is also capable of performing many (if not most) standard statistical tests.
Normality assumptions
The tests that follow are based on the assumption of a Normal error distribution, meaning that the data themselves are Normally and independently distributed or, more precisely, that the residual error left over after fitting the statistical model follows the Normal distribution. This assumption will only be approximately true for real data, and there is actually a fair amount of tolerance to deviations from Normality that will not seriously bias conclusions. A number of diagnostic tests for Normality exist in R; often, plots of residuals will reveal serious problems if they exist.
In some cases though the data will obviously not follow Normal assumptions (e.g., count or proportion data). One approach in these cases is to transform the data, but it will often be better to use a more appropriate error distribution, something we will discuss in Week 7.
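Residual checks like those just described can be sketched in a few lines of R; the simulated data and variable names below are illustrative, not from the course materials:

```r
## Minimal sketch of Normality diagnostics on model residuals
## (simulated data with Normal errors by construction; names are illustrative)
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
fit <- lm(y ~ x)

qqnorm(resid(fit))               # points near the reference line suggest Normality
qqline(resid(fit))
sw <- shapiro.test(resid(fit))   # a formal test of Normality
sw
```

A Q-Q plot that bends systematically away from the line, or a very small Shapiro-Wilk p-value, signals a departure from Normality serious enough to address.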
t-tests
We will illustrate 2-sample and paired t-tests with an example using the built-in data frame "sleep", in which a sleep-enhancement drug is administered to a treatment group (group 2) and withheld from a control group (group 1). The data are obtained by the command
> data(sleep)
I have added a couple of additional commands to label the groups more clearly as treatment and control (script file here). Here are the resulting data:
extra group ID trt
1 0.7 1 1 Control
2 -1.6 1 2 Control
3 -0.2 1 3 Control
4 -1.2 1 4 Control
5 -0.1 1 5 Control
6 3.4 1 6 Control
7 3.7 1 7 Control
8 0.8 1 8 Control
9 0.0 1 9 Control
10 2.0 1 10 Control
11 1.9 2 1 Treatment
12 0.8 2 2 Treatment
13 1.1 2 3 Treatment
14 0.1 2 4 Treatment
15 -0.1 2 5 Treatment
16 4.4 2 6 Treatment
17 5.5 2 7 Treatment
18 1.6 2 8 Treatment
19 4.6 2 9 Treatment
20 3.4 2 10 Treatment
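The extra labeling commands are not reproduced on this page; a minimal sketch that yields the trt column shown above (the course script may differ) is:

```r
## Load the built-in sleep data and add a descriptive group label
data(sleep)
sleep$trt <- factor(ifelse(sleep$group == 1, "Control", "Treatment"))
str(sleep)
```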
A quick summary by group suggests that there might be a treatment effect on the resulting number of extra hours of sleep:
> aggregate(extra~trt,data=sleep,FUN=mean)
trt extra
1 Control 0.75
2 Treatment 2.33
We begin by conducting a 2-sided t-test under the assumption that the 20 subjects are randomly assigned to the control and treatment groups. The general syntax for the t.test function is
t.test(response~group,data)
where response is the measurement variable recorded for each subject in each treatment group, group is the variable name describing which group (e.g., treatment or control) the subject is a member of, and data is a data frame containing the observations on the response by group and subject in group. Note that t.test can also be used in a with() function to associate the model with data, for example
with(data,t.test(response~group))
but this tends to be more verbose.
The attached script applies the t.test function to a 2-sided (naive) alternative for a 2-sample t-test (where subjects are assigned at random to treatment or control); a 1-sided, 2-sample t-test; and 1- and 2-sided alternatives for a paired t-test.
1-sided alternatives are justified when it is known (or thought) a priori that the treatment has a directional effect on the response. In the sleep example it is thought that the treatment increases sleep. Because the ordering in t.test is based on the group labels, the direction is Control < Treatment, so a 1-sided alternative of "less" is specified (by default, the test is 2-sided).
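As a sketch (rebuilding the trt label here so the snippet stands alone), the 2-sided and 1-sided versions of the 2-sample test are:

```r
data(sleep)
sleep$trt <- factor(ifelse(sleep$group == 1, "Control", "Treatment"))

t.test(extra ~ trt, data = sleep)    # 2-sided (default), Welch 2-sample test

## 1-sided: alternative is mean(Control) < mean(Treatment)
res <- t.test(extra ~ trt, data = sleep, alternative = "less")
res$p.value                          # about half the 2-sided p-value
```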
In a paired design, instead of having 20 subjects with 10 randomly assigned to each of treatment and control, we have just 10, each of which receives both the treatment and the control (this was actually the supposed design of the study; we aren't told whether the treatments were administered in random order!). This calls for a paired t-test, which can be implemented in t.test with a slight change: we assume that sleep$ID is the unique identifier for each individual, so we are pairing extra[ID==1 & trt=="Control"] with extra[ID==1 & trt=="Treatment"], and so on. As another paired example, subjects might be naturally paired (e.g., twins), with the members of each twin pair randomly assigned to treatment or control. In either case, there are effectively half as many subjects (individuals in the first case, twin pairs in the second). Paired t-tests are specified in t.test by adding "paired=TRUE" as an argument to the function.
When justified, a paired t-test will generally be more sensitive, because individual effects are effectively blocked (absorbed) in the analysis by the pairing process.
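A sketch of the paired version, indexing by group so it runs on the unmodified built-in data (the rows are already ordered by subject ID within each group):

```r
data(sleep)
## Pair each subject's control (group 1) and treatment (group 2) values
res_p <- with(sleep, t.test(extra[group == 1], extra[group == 2],
                            paired = TRUE, alternative = "less"))
res_p$p.value    # markedly smaller than the unpaired test's p-value
```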
ANOVA
Analysis of variance (ANOVA) performs tests based on more than 2 groups or combinations of groups (factors).
We can illustrate a simple (1-way) ANOVA by another built-in example involving the effects of different insect sprays on counts of insects. The data are obtained by
> data(InsectSprays)
> summary(InsectSprays)
count spray
Min. : 0.00 A:12
1st Qu.: 3.00 B:12
Median : 7.00 C:12
Mean : 9.50 D:12
3rd Qu.:14.25 E:12
Max. :26.00 F:12
Computation of means by spray type reveals apparent differences (interestingly, there does not seem to be a control [no spray] treatment. Hmm):
> aggregate(count~spray,data=InsectSprays,FUN=mean)
spray count
1 A 14.500000
2 B 15.333333
3 C 2.083333
4 D 4.916667
5 E 3.500000
6 F 16.666667
Various plots and other descriptive statistics can be computed and displayed.
> require(stats); require(graphics)
> boxplot(count ~ spray, data = InsectSprays,
+ xlab = "Type of spray", ylab = "Insect count",
+ main = "InsectSprays data", varwidth = TRUE, col = "lightgray")
Cutting to the chase, the ANOVA of spray type effects on count is formed by
> fm1 <- aov(count ~ spray, data = InsectSprays)
> summary(fm1)
Df Sum Sq Mean Sq F value Pr(>F)
spray 5 2669 533.8 34.7 <2e-16 ***
Residuals 66 1015 15.4
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> opar <- par(mfrow = c(2, 2), oma = c(0, 0, 1.1, 0))
> plot(fm1)
There is some evidence of non-Normality and other assumption violations, so we can trot out the usual trick of transformation. We'll see later, though, that maybe it's better to model these counts like the integer data they are....
> fm2 <- aov(sqrt(count) ~ spray, data = InsectSprays)
> summary(fm2)
Df Sum Sq Mean Sq F value Pr(>F)
spray 5 88.44 17.688 44.8 <2e-16 ***
Residuals 66 26.06 0.395
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> plot(fm2)
> par(opar)
Note that above we used aov to construct simple linear models of the effects of factors (e.g., spray) on a response (count or sqrt(count)). In fact, t-tests, ANOVA, regression, and analysis of covariance all fall under the linear models framework, which we will generally handle using the lm function and the more general glm function.
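As a quick sketch of that equivalence, the insect-spray ANOVA refit with lm gives the same F test as aov:

```r
data(InsectSprays)
fm_lm <- lm(count ~ spray, data = InsectSprays)
anova(fm_lm)    # reproduces the aov table: F = 34.7 on 5 and 66 df
coef(fm_lm)     # spray effects expressed relative to the baseline level (spray A)
```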
Next: Linear models