
Chapter 10 Hypothesis Testing

10.1 Confidence Interval Testing

In this chapter we explore whether a sample with a sample mean x could have come from a population with a known population mean μ. There are two possibilities. In Case I, the sample mean x comes from the population with the known mean μ. In Case II, the sample mean x does not come from the population with the known mean μ. For our purposes the population mean μ could be a pre-existing mean, an expected mean, or a mean against which we intend to run the hypothesis test. In the next chapter we will consider how to compare two samples to each other to see if they come from the same population.

In Case I a sample taken from the population is likely to produce the sample mean seen for that particular sample. In Case II a sample taken from the population is unlikely to produce the sample mean seen for that particular sample. Put another way, in Case II the sample is not likely to have come from the population, based on a significant difference between the sample mean and the population mean.

Suppose we want to do a study of whether the female students at the national campus gain body fat with age during their years at COM-FSM. Suppose we already know that the population mean body fat percentage for the new freshmen females 18 and 19 years old is μ = 25.4.

We measure a sample size n = 12 female students at the national campus who are 21 years old and older and determine that their sample mean body fat percentage is x = 30.5 percent with a sample standard deviation of sx = 8.7.
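The sample statistics can be calculated with spreadsheet functions. As a minimal sketch, if the twelve body fat measurements (the raw values are not listed in this text) were entered in cells A1:A12, a hypothetical layout, then:

=COUNT(A1:A12) would return the sample size n = 12
=AVERAGE(A1:A12) would return the sample mean x = 30.5
=STDEV(A1:A12) would return the sample standard deviation sx = 8.7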

Can we conclude that the female students at the national campus gain body fat as they age during their years at the College?

Not necessarily. Samples taken from a population with a population mean of μ = 25.4 will not necessarily have a sample mean of 25.4. If we take many different samples from the population, the sample means will distribute normally about the population mean, but each individual mean is likely to be different than the population mean.

In other words, we have to consider the likelihood of drawing a sample with a mean that is 30.5 − 25.4 = 5.1 units away from the population mean when the sample size is 12. If we knew more about the population distribution we would be able to determine the likelihood of a 12 element sample being drawn from the population with a sample mean 5.1 units away from the actual population mean.

In this case we know more about our sample and the distribution of the sample mean. The distribution of the sample mean follows the student's t-distribution. So we shift from centering the distribution on the population mean to centering the distribution on the sample mean. Then we determine whether the confidence interval includes the population mean or not. We construct a confidence interval for the range of the population mean for the sample.

If this confidence interval includes the known population mean for the 18 to 19 years olds, then we cannot rule out the possibility that our 12 student sample is from that same population. In this instance we cannot conclude that the women gain body fat.

If the confidence interval does NOT include the known population mean for the 18 to 19 year old students then we can say that the older students come from a different population: a population with a higher population mean body fat. In this instance we can conclude that the older women have a different and probably higher body fat level.

One of the decisions we obviously have to make is the level of confidence we will use in the problem. Here we enter a contentious area. The level of confidence we choose, our level of bravery or temerity, will determine whether or not we conclude that the older females have a different body fat content. For a detailed, advanced discussion of issues with null hypothesis significance testing, see When Null Hypothesis Significance Testing Is Unsuitable for Research: A Reassessment.

In education and the social sciences there is a tradition of using a 95% confidence interval. In some fields three different confidence intervals are reported, typically a 90%, 95%, and 99% confidence interval. Why not use a 100% confidence interval? The normal and t-distributions are asymptotic to the x-axis. A 100% confidence interval would run to plus and minus infinity. We can never be 100% confident.

In the above example a 95% confidence interval would be calculated in the following way:

n = 12
x̄ = 30.53
sx = 8.67
c = 0.95
degrees of freedom = 12 − 1 = 11
tc =TINV(0.05,11) = 2.20
margin of error E = tc × sx/√n = 2.20 × 8.67/√12 = 5.51

x̄ − E ≤ μ ≤ x̄ + E: 30.53 − 5.51 ≤ μ ≤ 30.53 + 5.51

25.02 ≤ μ ≤ 36.04
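The same interval can be produced directly with spreadsheet formulas, typing the summary statistics into the formulas rather than referencing data cells:

=TINV(0.05,11) = 2.20
=30.53-2.20*8.67/SQRT(12) = 25.02
=30.53+2.20*8.67/SQRT(12) = 36.04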

The 95% confidence interval for our n = 12 sample includes the population mean 25.4. We CANNOT conclude at the 95% confidence level that this sample DID NOT come from a population with a population mean μ of 25.4.

Another way of thinking of this is to say that 30.5 is not sufficiently separated from 25.4 for the difference to be statistically significant at a confidence level of 95% in the above example.

In common language, we cannot conclude that the women are gaining body fat.

The above process is reduced to a formulaic structure in hypothesis testing. Hypothesis testing is the process of determining whether a confidence interval includes a previously known population mean value. If the population mean value is included, then we do not have a statistically significant result. If the mean is not encompassed by the confidence interval, then we have a statistically significant result to report.

10.2 Hypothesis Testing

In this section the language of hypothesis testing is introduced. A new statistic, the "t-statistic" is also introduced. In this text the choice is made to use two-tailed hypothesis tests. This retains the result found with a confidence interval hypothesis test found in the previous section. This also means that 1 - c = α and 1 - α = c. In hypothesis testing one sets up a binary choice between a hypothesis of no change and a hypothesis that there is a change.

The null hypothesis H₀

The null hypothesis is the supposition that there is no change in a value from some pre-existing, historical, or expected value. The null hypothesis literally supposes that the change is null, non-existent, that there is no change.

In the previous example the null hypothesis would have been H₀: μ = 25.4

The way to read that is to understand the μ as meaning "the sample could have a population mean of 25.4". This does not mean that the population mean IS 25.4, only that the sample could come from a population with a population mean of 25.4.

The alternate hypothesis H₁

The alternate hypothesis is the supposition that there is a change in the value from some pre-existing, historical, or expected value. Note that the alternate hypothesis does NOT say the "new" value is the correct value, just that whatever the mean μ might be, it is not that given by the null hypothesis.

H₁: μ ≠ 25.4

Statistical hypothesis testing

We run a hypothesis test to determine whether new data contradicts the null hypothesis.

If the new data falls within the confidence interval, then the new data does not contradict the null hypothesis. In this instance we say that "we fail to reject the null hypothesis." Note that we do not actually affirm the null hypothesis. This is really little more than semantic shenanigans that statisticians use to protect their derrieres. Although we run around saying we failed to reject the null hypothesis, in practice it means we left the null hypothesis standing: we de facto accepted the null hypothesis.

If the new data falls outside the confidence interval, then the new data would cause us to reject the null hypothesis. In this instance we say "we reject the null hypothesis." Note that we never say that we accept the alternate hypothesis. Accepting the alternate hypothesis would be asserting that the population mean is the sample mean value. The test does not prove this, it only shows that the sample could not have the population mean given in the null hypothesis.

For two-tailed tests, the results are identical to a confidence interval test. Note that a confidence interval never asserts the exact population mean, only the range of possible means. Hypothesis testing theory is built on confidence interval theory. The confidence interval does not prove a particular value for the population mean, so neither can hypothesis testing.

In our example above we failed to reject the null hypothesis H₀ that the population mean for the older students was 25.4, the same population mean as the younger students.

In the example above a 95% confidence interval was used. At this point in your statistical development and this course you can think of this as a 5% chance we have reached the wrong conclusion.

Imagine that the 18 to 19 year old students had a body fat percentage of 24 in the previous example. We would have rejected the null hypothesis and said that the older students have a different and probably larger body fat percentage.
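To check this against the interval already calculated in section 10.1: 24 falls below 25.02, the lower bound of the 95% confidence interval 25.02 ≤ μ ≤ 36.04. The interval would not include a population mean of 24, so we would reject the null hypothesis.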

There is, however, a small probability (less than 5%) that a 12 element sample with a mean of 30.5 and a standard deviation of 8.7 could come from a population with a population mean of 24. This risk of rejecting the null hypothesis when we should not reject it is called alpha α. Alpha is one minus the confidence level: α = 1 − c. In hypothesis testing we use α instead of the confidence level c.

Suppose the null hypothesis H₀ is true...
...and we fail to reject the null hypothesis. This is a correct decision. Over the long haul we will be correct at a rate 1 - α. If α is 5% then we will be correct 95% of the time over many repetitions.
...and we reject the null hypothesis. This is an incorrect decision. Over a long run of repetitions of the same experiment this will occur at a rate of α. This is called a Type I false positive error. In this course this will occur in 5% of the repetitions.

Suppose the null hypothesis H₀ is false...
...and we fail to reject the null hypothesis. This is an incorrect decision. This is a false negative error. In statistics the false negative rate is called "beta" and uses the Greek letter β. Note that there is no way to calculate β with the traditional approach to hypothesis testing used in this course. We know only that decreasing alpha increases beta: you can reduce your rate of false positives, but the rate of false negatives will increase as a result.
...and we reject the null hypothesis. This is a correct decision. Over the long run this rate will be 1 − β.

Hypothesis testing seeks to control alpha α. We cannot determine β (beta) with the statistical tools you learn in this course.

Alpha α is called the level of significance. 1 − β is called the "power" of the test.

The regions beyond the confidence interval are called the "tails" or critical regions of the test. In the above example there are two tails each with an area of 0.025. Alpha α = 0.05

A type I error, the risk of which is characterized by alpha α, is also known as a false positive. A type I error is finding that a change has happened, finding that a difference is significant, when it is not.

A type II error, the risk of which is characterized by beta β, is also known as a false negative. A type II error is the failure to find that a change has happened, finding that a difference is not significant, when it is.

If you increase the confidence level c, then alpha decreases and beta increases. High levels of confidence in a result, small alpha values, and small risks of a type I error lead to higher risks of committing a type II error. Thus in hypothesis testing there is a tendency to use an alpha of 0.05 or 0.01 as a way of keeping the risk of committing a type II error in check.

Another take on type I and type II errors:

[Image: a humorous illustration of type I (false positive) and type II (false negative) errors. Source: Jim Thornton via Flowing Data]

For hypothesis testing it is simply safest to always use the t-distribution. In the example further below we will run a two-tail test.

Steps

1. Write down the null hypothesis H₀: μ = the known or pre-existing population mean

2. Write down the alternate hypothesis H₁: μ ≠ the known or pre-existing population mean

3. Choose a level of significance alpha α (α = 1 − c)

4. Determine the degrees of freedom n − 1 and the t-critical value tc =TINV(α,n−1)

5. Calculate the t-statistic t (the spreadsheet formula is given below)

6. Make a sketch

7. If the t-statistic is "beyond" the t-critical values then reject the null hypothesis. By "beyond" we mean larger in absolute value. Otherwise we fail to reject the null hypothesis.

Put another way, if the absolute value of the t-statistic is larger than t-critical (tc), then the result is statistically significant and we reject the null hypothesis.

If |t| > tc then reject the null hypothesis

If |t| < tc then fail to reject the null hypothesis
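This comparison can even be automated in a spreadsheet. A minimal sketch, assuming hypothetical cells B1 and B2 hold the t-statistic t and the t-critical value tc respectively:

=IF(ABS(B1)>B2,"reject the null hypothesis","fail to reject the null hypothesis")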

Calculating the t-statistic in a spreadsheet:

=ABS(AVERAGE(data)-μ)/(STDEV(data)/SQRT(n))

where μ is the expected population mean and n is the sample size (n can be calculated with =COUNT(data)).
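Using the hypothetical A1:A12 layout sketched earlier in the chapter and the known population mean of 25.4, the formula would take the form:

=ABS(AVERAGE(A1:A12)-25.4)/(STDEV(A1:A12)/SQRT(COUNT(A1:A12)))

which would return the t-statistic of 2.05 seen in the example below.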

Example 10.2.1

Using the data from the first section of these notes:

1. H₀: μ = 25.4

2. H₁: μ ≠ 25.4

3. Alpha α = 0.05 (c = 0.95)

4. Degrees of freedom = 12 − 1 = 11; tc =TINV(0.05,11) = 2.20

5. t = (30.53-25.4)/(8.67/SQRT(12)) = 2.05

6. Make a sketch

7. The absolute value of the t-statistic t at 2.05 is NOT "beyond" the t-critical value of 2.20. In the sketch we can see that the t-statistic is inside of the "confidence interval" that runs from −2.20 to +2.20. Note that here the confidence interval is being expressed in terms of values from the student's t-distribution. We FAIL to reject the null hypothesis H₀. We cannot say the older female students came from a different population than the younger students with a population mean of 25.4. Why not now accept H₀: μ = 25.4 as the population mean for the 21 year old and older female students? We risk making a Type II error: failing to reject a false null hypothesis. We are not trying to prove H₀ as being correct, we are only in the business of trying to "knock it down."

More simply, the t-statistic is NOT bigger in absolute value than t-critical.

Note the changes in the above sketch from the confidence interval work. Now the distribution is centered on μ with the distribution curve described by a t-distribution with eleven degrees of freedom. In our confidence interval work we centered our t-distribution on the sample mean. The result is, however, the same due to the symmetry of the problems and the curve. If our distribution were not symmetric we could not perform this sleight of hand.

The hypothesis test process reduces decision making to the question, "Is the absolute value of the t-statistic t greater than the t-critical value tc?" If |t| > tc, then we reject the null hypothesis. If |t| < tc, then we fail to reject the null hypothesis. Note that t and tc are irrational numbers and thus unlikely to ever be exactly equal.

Decision making using the t-statistic

When the absolute value of the t-statistic is less than t-critical: the difference is not statistically significant and we fail to reject the null hypothesis.

When the absolute value of the t-statistic is more than t-critical: the difference is statistically significant and we reject the null hypothesis.

10.2.2 Another example

A population of marbles has a population mean mass μ of 5.20 grams. A sample of five marbles was randomly selected from the population: 5.2, 4.9, 5.2, 5.7, and 5.9 grams. The sample mean x is 5.380 grams with a sample standard deviation of 0.409. At an alpha α = 0.05, could this sample of marbles have a population mean of 5.20 grams? 

H₀: μ = 5.20

H₁: μ ≠ 5.20

Pay close attention to the above! We DO NOT write H₁: μ = 5.380. This is a common beginner's mistake. The null hypothesis asks whether the population mean for the five marble sample could be 5.20.

Alpha α = 0.05; degrees of freedom = 5 − 1 = 4; tc =TINV(0.05,4) = 2.78

t = (x̄ − μ)/(sx/√n) = (5.38-5.20)/(0.409/SQRT(5)) = 0.98

Note that in the above formula sx/√n is used in the denominator. This is the same as the standard error of the mean, thus an equivalent calculation is to use the standard error of the mean SE in the denominator: (5.38 − 5.20)/0.183 = 0.98.

7. The absolute value of the t-statistic t of 0.98 is NOT "beyond" the t-critical value of 2.78. We FAIL to reject the null hypothesis H₀.

Note that in my sketch I am centering my distribution on the population mean and looking at the distribution of sample means for sample sizes of 5 based on that population mean. Then I look at where my actual sample mean falls with respect to that distribution.

Note too that my t-statistic t does not fall "beyond" the critical values. I do not have enough separation from my population mean: I cannot reject H0. So I fail to reject H0. The five marbles could have come from the population.
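The whole marble analysis can be run with spreadsheet formulas. A minimal sketch, assuming the five masses were entered in cells A1:A5, a hypothetical layout:

=AVERAGE(A1:A5) = 5.38
=STDEV(A1:A5) = 0.409
=TINV(0.05,5-1) = 2.78
=ABS(AVERAGE(A1:A5)-5.20)/(STDEV(A1:A5)/SQRT(5)) = 0.98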

10.3 P-value

The p-value is a calculation of the area "beyond" the t-statistic. For two-tailed tests both the area beyond the positive t-statistic and the area below the negative value of the t-statistic are considered.

[Figure: t-distribution for the example in 10.2.2. Shaded area = 0.620. Unshaded area = p-value = 0.380.]

For the example in 10.2.2 the unshaded area under the curve and above the x-axis in the diagram is the p-value. If this unshaded area drops below alpha, which for us is 0.05, then we reject the null hypothesis. Thus the p-value is a third way to determine significance. Some functions that we will meet in the next chapter return only the p-value, and many studies cite only p-values. To some extent the p-value has been abused, and a definition of what exactly the p-value means is hard to put into words: not even scientists can easily explain p-values.

In this text the p-value is treated as only informing one whether a result is "surprising" or not. In this text "surprising" is any p-value less than 0.05. If a result is surprising that means that the distance of the sample mean from the proposed population mean is surprisingly large, as in large enough to be statistically significant. Surprising means we reject the null hypothesis. If the p-value is larger than 0.05, the result is not surprising and we fail to reject the null hypothesis.

The p-value is calculated using the formula:

=TDIST(ABS(t),degrees of freedom,number of tails)

For a single variable sample and a two-tailed distribution, the spreadsheet equation becomes:

=TDIST(ABS(t),n−1,2)

The degrees of freedom are n − 1 for comparison of a sample mean to a known or pre-existing population mean μ.

Note that TDIST can only handle positive values for the t-statistic, hence the absolute value function. If you already have a positive t-statistic, the ABS function can be omitted from the formula.
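For the two examples worked earlier in this chapter:

=TDIST(ABS(2.05),12-1,2) = 0.065 for the body fat example
=TDIST(ABS(0.98),5-1,2) = 0.38 for the marble example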

Guidelines for decision making with the p-value

When the p-value is "not surprising" (larger than our chosen alpha):

When the p-value is "surprising" (less than our chosen alpha):

For two-tailed hypothesis testing, 1 − p-value is the highest confidence level for which the confidence interval does not include the pre-existing population mean. Another way to say this is that 1 − p-value is the maximum confidence level c we can have that the difference (change) is significant. We usually look for a maximum confidence level c of 0.95 (95%) or higher. In the marble example, 1 − 0.38 = 0.62: at most a 62% confidence level, well short of 95%. Again, the confidence level does not indicate the probability that we are right; on any one test we cannot know if we are right. This means that the p-value is not the probability that you are wrong. Perhaps it is best to think of the p-value as how surprised one should be.

The p-value is often misunderstood and misinterpreted

The p-value should be thought of as a measure of whether one should be surprised by a result. If the p-value is less than a pre-chosen alpha, usually 0.05, that would be a surprising result. If the p-value is greater than the pre-chosen alpha, usually 0.05, then that would NOT be a surprising result.

The p-value is also a much abused concept. In March 2016 the American Statistical Association issued the following six principles, which address misconceptions and misuse of the p-value:

1. P-values can indicate how incompatible the data are with a specified statistical model.

2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

4. Proper inference requires full reporting and transparency.

5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

American Statistical Association (ASA) statement on statistical significance and P-Values. See also Statisticians Found One Thing They Can Agree On: It’s Time To Stop Misusing P-Values and The mismeasure of scientific significance.

The American Statistical Association settled on the following informal definition of the P-value, "Informally, a p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value."

Returning to our earlier example in this text: the body fat percentage of 12 female students 21 years old and older, with x̄ = 30.53 and a standard deviation sx = 8.67, was tested against a null hypothesis H₀ that the population mean body fat for 18 to 19 year old students was μ = 25.4. We failed to reject the null hypothesis at an alpha of 0.05. What if we are willing to take a larger risk? What if we are willing to risk a type I error rate of 10%? This would be an alpha of 0.10.

H₀: μ = 25.4

H₁: μ ≠ 25.4

Alpha α = 0.10 (α = 1 - c, c = 0.90)

Determine the t-critical values: degrees of freedom = n − 1 = 12 − 1 = 11; tc =TINV(α,df) =TINV(0.10,11) = 1.796

Determine the t-statistic:

=(30.53-25.4)/(8.67/SQRT(12)) = 2.05

The t-statistic is "beyond" the t-critical value. We reject the null hypothesis H₀. We can say the older female students came from a different population than the younger students with a population mean of 25.4. Why not now accept H₁: μ = 30.53 as the population mean for the 21 year old and older female students? We do not actually know the population mean for the 21+ year old female students unless we measure ALL of the 21+ year old students. We can only say what the value is not: it is not 25.4. We cannot say what the value is. This is why we "reject the null hypothesis" instead of "accepting the alternate hypothesis."

With an alpha of 0.10 (a confidence level of 0.90) our results are statistically significant. These same results were NOT statistically significant at an alpha α of 0.05. So which is correct: are the older female students gaining body fat or are they not?

Note how we would have said this in confidence interval language: the 90% confidence interval, 26.03 ≤ μ ≤ 35.03, does not include the population mean of 25.4, while the 95% confidence interval, 25.02 ≤ μ ≤ 36.04, does include 25.4.

The answer is that it depends on how much risk you are willing to take: a 5% chance of committing a Type I error (rejecting a null hypothesis that is true) or a larger 10% chance of committing a Type I error. The result depends on your own personal level of aversion to risk. That is a heck of a mathematical mess: the answer depends on your personal willingness to take a particular risk.

Consider what happens if someone decides they only want to be wrong 1 in 15 times: that corresponds to an alpha of α = 0.0667. They cannot use either of the above results to decide whether to reject the null hypothesis. We need a way to indicate the boundary at which alpha changes from failure to reject the null hypothesis to rejection of the null hypothesis.

Citing the p-value gives us a way to provide that option. The p-value is the smallest alpha for which we would still reject the null hypothesis. Suppose one is using alpha = 0.05. Then any p-value less than 0.05 leads to rejecting the null hypothesis. Suppose one chooses to use alpha = 0.10. Then any p-value less than this value leads to rejection. If the p-value is 0.08, then someone using an alpha of 0.05 does NOT reject the null hypothesis while someone using an alpha of 0.10 DOES reject the null hypothesis. For a p-value of 0.08, any alpha of 0.08 or larger leads to rejection of the null hypothesis while any alpha smaller than 0.08 leads to failure to reject the null hypothesis.

This sounds confusing and this can be confusing. The key point is that one has to choose one's alpha, one's willingness to risk a type I false positive error, before making any calculations. Another solution to this is to keep the same alpha that is consistently used in a particular field of study, often 0.05. With alpha at 0.05, then any p-value less than 0.05 is significant and leads to rejection of the null hypothesis.

For this body fat example the p-value =TDIST(ABS(2.05),11,2) = 0.06501

The p-value represents the SMALLEST alpha α for which the test is deemed "statistically significant" or, perhaps, "worthy of note."

The p-value is the SMALLEST alpha α for which we reject the null hypothesis.

Thus for all alpha greater than 0.065 we reject the null hypothesis. The "one in fifteen" person would reject the null hypothesis (0.0667 > 0.065). The alpha = 0.05 person would not reject the null hypothesis.

If the pre-chosen alpha is more than the p-value, then we reject the null hypothesis. If the pre-chosen alpha is less than the p-value, then we fail to reject the null hypothesis.
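As with the t-statistic comparison, this rule can be automated in a spreadsheet. A minimal sketch, assuming hypothetical cells B1 and B2 hold the pre-chosen alpha and the p-value respectively:

=IF(B1>B2,"reject the null hypothesis","fail to reject the null hypothesis")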

The p-value lets each person decide on their own level of risk and removes the arbitrariness of personal risk choices. This is also why alpha should be chosen before data is collected and analyzed. There is a risk of the statistical results influencing a decision on alpha if the choice is made after the analysis.

Because many studies in education and the social sciences are done at an alpha of 0.05, a p-value at or below 0.05 is used to reject the null hypothesis.

10.4 One Tailed Tests

All of the work above in confidence intervals and hypothesis testing has been with two-tailed confidence intervals and two-tailed hypothesis tests. There are statisticians who feel one should never leave the realm of two-tailed intervals and tests.

Unfortunately, the practice among scientists, businesses, educators, and many fields in the social sciences is to use one-tailed tests when one is fairly certain that the sample has changed in a particular direction. The effect of moving to a one-tailed test is to increase one's risk of committing a Type I error.

One tailed tests, however, are popular with researchers because they increase the probability of rejecting the null hypothesis (which is what most researchers are hoping to do).

The complication is that starting with a one-tailed test presumes that a change has occurred, and in a particular direction, before the data is examined. The proper way to use a one-tailed test is to first do a two-tailed test for change in any direction. If change has occurred, then one can do a one-tailed test in the direction of the change.

In this course one-tailed tests are not used. A one-tailed test has the effect of doubling the risk of a false positive over a two-tailed test. If the difference is not significant for a two-tailed test, shifting to a one-tailed test to attempt to achieve significance in a desired direction of change is inappropriate.

10.5 Hypothesis test for a proportion

For a sample proportion p and a known or pre-existing population proportion P, a hypothesis test can be done to determine whether the sample with sample proportion p could have come from a population with proportion P. Note that in this text, due to typesetting issues, a lower-case p is used for the sample proportion while an upper-case P is used for the population proportion.

In another departure from other texts, this text uses the student's t-distribution for tc, providing a more conservative determination of whether a change is significant at smaller sample sizes. Rather than label the test statistic as a z-statistic, to avoid confusing the students and to conform to usage in earlier sections, the test statistic is referred to as a t-statistic.

A survey of college students at a college found 18 of 32 had already had sexual intercourse. An April 2007 study of abstinence education programs in the United States reported that 51% of the youth, primarily students, surveyed had sexual intercourse. [That study was no longer available as of 2023.] Is the proportion of sexually active students in the college different from that reported in the abstinence education program study at a confidence level of 95%?

The null and alternate hypotheses are written using the population proportion, in this case the value reported in the study.

H₀: P = 0.51
H₁: P ≠ 0.51

sample proportion p = 18/32 = 0.5625
sample proportion q = 1 − p = 0.4375

Note that n×p must be > 5 and n×q must also be > 5 just as was the case in constructing a confidence interval.
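Checking both conditions for this sample:

=32*0.5625 = 18
=32*0.4375 = 14

Both results are greater than five.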

Confidence level c = 0.95

The t-critical value is still calculated using alpha α along with the degrees of freedom: =TINV(0.05,32-1) = 2.04

Note that the standard error for a proportion is:

SE = √(pq/n), calculated in a spreadsheet as =SQRT(p*q/n)

The only "new" calculation then is the t-statistic t:

Note that the form is still (sample statistic - population parameter)/standard error for the statistic.

=(0.5625-0.51)/SQRT(0.5625*0.4375/32)

=0.5987
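The same t-statistic can be calculated in a single cell without first writing out the rounded proportions; a sketch of that approach:

=(18/32-0.51)/SQRT((18/32)*(1-18/32)/32) = 0.5987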

The t-statistic t does not exceed the t-critical value, so the difference is not statistically significant. We fail to reject the null hypothesis of no change.

The p-value is calculated as above using the absolute value of the t-statistic.

=TDIST(ABS(0.5987),32-1,2)

=0.55

Larger than 0.05. Not surprising. No difference detected. Fail to reject the null hypothesis. The college is not seeing a significantly different rate than that seen in the United States.