Stats text 11

Chapter 11 Two sample t-tests to calculate a p-value

Chapter eleven in this edition of the text is somewhat abbreviated from earlier editions to focus more closely on the current version of the course.

11.1 Testing for a pairwise difference between two samples: Dependent samples

Many studies investigate systems where measurements are taken before and after some event. Usually there is an experimental treatment or process between the two measurements. A typical such system would be a pre-test and a post-test. In between the pre-test and the post-test would typically be an educational or training event. One could examine each student's score on the pre-test and the post-test. Even if everyone did better on the post-test, one would have to show that the difference was statistically significant and not just a random event.

These studies are called "paired t-tests" or "inferences from matched pairs". Each element in the sample is considered as a pair of scores. The null hypothesis would be that the average difference for all the pairs is zero: there is no difference. For a confidence interval test, the confidence interval for the mean differences would include zero if there is no statistically significant difference.

When we say the sample mean before is "equal" to the sample mean after, we mean "statistically equal," not mathematically equal. We mean that there is no statistically significant difference between the before and after means at some level of confidence. Statistically speaking we say that the two samples could come from the same population.

In case II the difference in the sample means is too large for that difference to likely be zero. Statistically speaking we say that the two samples come from different populations.

If the difference for each data pair is referred to as d, then the mean difference could be written d̄. The hypothesis test is whether this mean difference d̄ could come from a population with a mean difference μd equal to zero (the null hypothesis). If the mean difference d̄ could not come from a population with a mean difference μd equal to zero, then the change is statistically significant. In the diagram above the mean difference μd is equal to μbefore − μafter.
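The arithmetic behind the difference column and the mean difference d̄ can be sketched in a few lines. The pre-test and post-test scores below are hypothetical numbers invented for illustration, not data from the text:

```python
from statistics import mean, stdev

# Hypothetical pre-test and post-test scores for five students
pre  = [68, 72, 75, 70, 66]
post = [74, 71, 80, 78, 70]

# The difference d for each data pair
d = [after - before for before, after in zip(pre, post)]

# The mean difference d-bar and the sample standard deviation of the differences
d_bar = mean(d)
s_d = stdev(d)
```

The hypothesis test then asks whether this d_bar could plausibly come from a population with a mean difference of zero.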

Paired two sample hypothesis test: Using the TTEST function to obtain a p-value directly from the two samples

Spreadsheets provide a function to calculate the p-value for paired two sample data using Student's t-distribution. This function is the TTEST function. If the p-value is less than your chosen risk of a type I error α then the difference is significant. The function does not require generating the difference column d as seen above; only the original data is used in this function.

The function takes as inputs the before data (data_range_pre), the after data (data_range_post), the number of tails, and a final variable that specifies the type of test. A paired t-test is test type number one.

=TTEST(data_range_pre,data_range_post,2,1)

To ensure that the spreadsheet calculates the p-value correctly, all of the data must be in pairs. If the data is paired data but there are unpaired values, those would have to be removed prior to calculating the p-value.
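For readers curious about what the spreadsheet is doing internally, the paired calculation can be sketched without any spreadsheet at all. This is only an illustrative sketch of the standard paired t-test, using hypothetical pre/post data and a numerical integration of the t-distribution density in place of a statistics library; it is not the spreadsheet's own code:

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Probability density of Student's t-distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=10000):
    """Two-tailed p-value: integrate the t density from 0 to |t| by Simpson's rule."""
    t = abs(t)
    h = t / steps
    area = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3           # P(0 < T < |t|)
    return 1 - 2 * area     # area in both tails

# Hypothetical paired data (pre-test and post-test scores)
pre  = [68, 72, 75, 70, 66]
post = [74, 71, 80, 78, 70]

d = [a - b for b, a in zip(pre, post)]   # difference for each pair
n = len(d)
d_bar = sum(d) / n
s_d = sqrt(sum((x - d_bar) ** 2 for x in d) / (n - 1))

t_stat = d_bar / (s_d / sqrt(n))         # paired t statistic
p = two_tailed_p(t_stat, n - 1)          # degrees of freedom = n - 1
```

For this invented data the p-value comes out just under 0.05, so at an alpha of 0.05 the null hypothesis of no difference would be rejected.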

Note too that while many paired t-tests for a difference of sample means involve "pre" and "post" data (before and after measurements), there are situations in which the paired data is not pre and post in terms of time.

The smallest alpha for which we could say the difference is statistically significant is the p-value itself; equivalently, 1 − p-value is the maximum level of confidence at which the difference is significant. That said, alpha should be chosen prior to running the hypothesis test. In this course we use an alpha of 0.05.

In the above example 14 students were given pairs of marbles. They held one marble in their left hand and one marble in their right hand. They were to decide which marble was heavier. Then the marbles were massed. The differences in the masses of the marble pairs were on the order of tenths of a gram, much less than a single small paperclip. Did the students pick the heavier marble at a rate that exceeds just random chance? Was their detection rate statistically significant? Can they really tell such small differences?

With the data in cells B2:B25 and C2:C25, the function to calculate the pairwise p-value is =TTEST(B2:B25,C2:C25,2,1), where the "2" refers to two tails and the "1" tells the spreadsheet to use a pairwise calculation for the p-value. An alpha of 0.05 is appropriate for this experiment.

The p-value was 0.04358. This is LESS than the alpha of 0.05. This is surprising. The students really did detect a difference in the marble masses. We reject a null hypothesis of no difference in the pairwise means. The result is statistically significant.

11.2 Testing for a difference of sample means between two independent samples

One of the more common situations is when one is seeking to compare two independent samples to determine if the means for each sample are statistically significantly different. In this case the samples may differ in sample size n, sample mean, and sample standard deviation.

In this text the two samples are referred to as the x1 data and the x2 data. The use of the same variable, x, reflects that a comparison of sample means is a comparison between two samples of the same variable. The test is to see whether the two samples could both come from the same population X. The sample size for the x1 data is nx1, the sample mean is x̄1, and the sample standard deviation is sx1. For the x2 data, the sample size is nx2, the sample mean is x̄2, and the sample standard deviation is sx2.

When we say the sample means are "equal" we mean "statistically equal," not mathematically equal. We mean that there is no statistically significant difference between the two sample means. Statistically speaking we say that the two samples could come from the same population.

In case II the difference in the sample means is too large for that difference to likely be zero. Statistically speaking we say that the two samples come from different populations.

Two possibilities exist. Either the two samples come from the same population and the population mean difference is statistically zero. Or the two samples come from different populations where the population mean difference is statistically not zero.

Note the sample mean tests are predicated on the two samples coming from populations X1 and X2 with population standard deviations σ1 = σ2 where the capital letters refer to the population from which the x1 and x2 samples were drawn respectively. "Fortunately it can usually be assumed in practice that since we most often wish to test the hypothesis that µ1 = µ2; it is rather unlikely that the two distributions should have the same means but different variances." (where the variance is the square of the standard deviation). [M. G. Bulmer, Principles of Statistics (Dover Books on Mathematics), Dover Publications (April 26, 2012)]. That said, knowledge of the system being studied and an understanding of population distribution would be important to a formal analysis. In this introductory text the focus is on basic tools and operations, providing a foundation on which to potentially build a more nuanced understanding of statistics.

11.21 Confidence Interval tests

When working with two independent samples, testing for a difference of means can also be explored using confidence intervals for each sample. Confidence intervals for each sample provide more information than a p-value, and the declaration of a significant difference is more conservative. Confidence intervals for each sample cannot, however, sort out the indeterminate case where the intervals overlap each other but not the other sample mean. The following diagrams show three different possible relationships between the confidence intervals and the means. There are more possibilities; these are meant only as samples for guidance. Sample one has a sample mean x̄1, sample two has a sample mean x̄2.

In the chart on the left above, the confidence intervals for the population mean each overlap the other sample mean: there is no statistically significant difference. In the middle diagram the confidence intervals do not include the other sample mean: the two means are statistically significantly different. In the diagram on the right the means may or may not be statistically significantly separated. In this case a p-value will have to be calculated using the TTEST function.
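The three diagrams can be summarized as a simple decision rule. The sketch below assumes each sample's confidence interval has already been computed, and the means and interval endpoints in the example call are hypothetical numbers; the handling of the mixed case (one interval covers the other mean, the other does not) is treated here as indeterminate, which is one reasonable reading of "there are more possibilities":

```python
def compare_intervals(mean1, low1, high1, mean2, low2, high2):
    """Classify two sample means using each sample's confidence interval,
    following the three diagrams: each interval covering the other mean,
    neither covering and no overlap, or an indeterminate overlap."""
    one_covers_two = low1 <= mean2 <= high1
    two_covers_one = low2 <= mean1 <= high2
    if one_covers_two and two_covers_one:
        return "no significant difference"
    if not one_covers_two and not two_covers_one:
        intervals_overlap = (low1 <= high2) and (low2 <= high1)
        if not intervals_overlap:
            return "statistically significant difference"
    # Overlapping but not covering the other mean, or the mixed case:
    # fall back to a p-value calculation
    return "indeterminate: calculate a p-value with TTEST"
```

For example, compare_intervals(10, 8, 12, 11, 9, 13) returns "no significant difference" because each interval covers the other sample mean.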

11.23 T-test for difference in independent sample means

As noted above, spreadsheets provide a function to calculate p-values. If the p-value is less than your chosen risk of a type I error α then the difference is significant.

The function takes as inputs the data for one of the two samples (data_range_x1), the data for the other sample (data_range_x2), the number of tails, and a final variable that specifies the type of test. A t-test for means from independent samples is test type number three.

=TTEST(data_range_x1,data_range_x2,number of tails,3)

The TTEST function does not use the smaller sample size to determine the degrees of freedom. The TTEST function uses a different formula that calculates a larger number of degrees of freedom, which has the effect of reducing the p-value. Thus the confidence interval result could produce a failure to reject the null hypothesis while the TTEST could produce a rejection of the null hypothesis. This only occurs when the p-value is close to your chosen α.
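The degrees-of-freedom formula alluded to above is commonly described as the Welch–Satterthwaite approximation, which allows the two samples to have unequal variances. As a hedged sketch of that calculation with invented data (the sample values below are hypothetical, and this is an illustration of the standard Welch formulas rather than the spreadsheet's internal code):

```python
from math import sqrt
from statistics import mean, stdev

def welch_t_and_df(x1, x2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two independent samples with possibly unequal variances."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = stdev(x1) ** 2, stdev(x2) ** 2   # sample variances
    se2 = v1 / n1 + v2 / n2                    # squared standard error of the difference
    t = (mean(x1) - mean(x2)) / sqrt(se2)
    # Welch-Satterthwaite degrees of freedom: generally larger than the
    # smaller sample size minus one, but at most n1 + n2 - 2
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Hypothetical independent samples of different sizes
x1 = [5.1, 4.9, 5.6, 5.3, 5.0, 5.4]
x2 = [4.2, 4.8, 4.5, 4.1, 4.6]
t, df = welch_t_and_df(x1, x2)
```

For this data the degrees of freedom come out between the smaller sample size minus one (4) and n1 + n2 − 2 (9), illustrating why this approach can yield a smaller p-value than using the smaller sample size alone.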

Guidelines for decision making with the p-value

When the p-value is MORE than our chosen risk of a type I error alpha (usually 0.05): fail to reject the null hypothesis. The difference is not statistically significant.

When the p-value is LESS than our chosen risk of a type I error alpha (usually 0.05): reject the null hypothesis. The difference is statistically significant.


11.3 Effect size

The effect size describes whether a difference between means is small, medium, or large. The effect size can only be calculated if there is a significant difference in the means. If the p-value is larger than your pre-selected alpha, then the effect size is not calculated: there is no significant difference in the means. In this course, where an alpha of 0.05 is used, if the p-value is larger than 0.05 then the effect size is not calculated.

If there is no significant difference in the means then there is no effect size. If the result was a failure to reject the null hypothesis, then the effect size is meaningless and should not be reported.

The p-value provides information on how "surprising" a result is. A significant difference is surprising. The p-value does not tell one whether the difference is meaningful. For large sample sizes small differences might be surprising but not meaningful.

Suppose a pharmaceutical company has a treatment that cures a head cold in seven and a quarter days. Then they develop a new treatment that cures a head cold in seven days. Based on the p-value, the company might find that the difference is significant. The quarter day difference, however, might not be that meaningful.

For two sample means, the effect size provides an estimate of the standardized mean difference between two sample means. The effect size is mathematically related to z-scores. The effect size for a difference of independent sample means is referred to as Cohen's effect size d. The effect size is always calculated as an absolute or positive value.

The effect size for two sample means can be calculated from:

d = |x̄1 − x̄2| / sp

where sp is the pooled standard deviation:

sp = √( ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2) )

Entering the pooled standard deviation in a spreadsheet requires triple nested parentheses:

=SQRT(((n1-1)*s1^2+(n2-1)*s2^2)/(n1+n2-2))

...where:

n1 is the sample size for sample one
s1 is the sample standard deviation for sample one
n2 is the sample size for sample two
s2 is the sample standard deviation for sample two
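The same calculation can be sketched outside the spreadsheet. The samples below are hypothetical numbers invented for illustration; the function mirrors the pooled standard deviation formula above and reports the effect size as a positive value:

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(x1, x2):
    """Cohen's effect size d for two independent samples,
    using the pooled standard deviation sp."""
    n1, n2 = len(x1), len(x2)
    s1, s2 = stdev(x1), stdev(x2)
    # Pooled standard deviation, matching the spreadsheet formula
    sp = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    # Effect size is always reported as an absolute (positive) value
    return abs(mean(x1) - mean(x2)) / sp

# Hypothetical independent samples
x1 = [5.1, 4.9, 5.6, 5.3, 5.0, 5.4]
x2 = [4.2, 4.8, 4.5, 4.1, 4.6]
effect = cohens_d(x1, x2)
```

For this invented data the effect size is well above 0.8, which by Cohen's guidelines below would be called large.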

Interpreting whether the difference in sample means has "meaning" in terms of the experiment is complex. Cohen provided some general guidelines. He also cautioned that these interpretations should be used with care. That said, in a beginning statistics course the guidelines provide a way to start thinking about effect size.

Cohen suggested that in the behavioral sciences an effect size of 0.2 is small, an effect size of 0.5 is medium, and an effect size of 0.8 is large. These values may not be correct for other fields of study. Educators in particular have noted that "small" effect sizes may still be important in education studies. The effect size is also affected by whether the data is normally distributed and is free of bias.

Think of effect size as a way to begin looking at whether the difference has real meaning, not just whether the difference is "surprising" from a p-value perspective.

Paper aircraft flight distance. On the left, the flight distance for planes thrown on October 10. On the right, throws from 26 October, after instruction in a design with good distance characteristics.

Cohen's effect size d calculation in a spreadsheet. The result was an effect size of 1.18, a very large effect size. The paper aircraft design taught to the class flies statistically significantly further, and the effect size is very large.

When calculating the effect size, report the positive value for the effect size. Remember, you can only calculate an effect size if you reject the null hypothesis, that is, if your p-value is LESS than your alpha.