5. Hypothesis Testing

Learning objectives (and summaries)

Set up a hypothesis test of a population parameter, run the test using simulation, and interpret the resulting p-value in context.

    • Understand the purpose of a hypothesis test

    • To test a claim about the population parameter using sample statistics

  • Understand the logic of a hypothesis test

    • State a null (default) hypothesis and attempt to disprove it by finding the probability of getting a result at least as extreme as the one you found. (Yeah, we'll talk about this one a lot.)

    • Properly write null and alternative hypotheses.

    • The null hypothesis sets the parameter equal to a value. For example, µ = 14.

    • The alternative hypothesis sets the direction of expected change: left, right, or two-sided (not equal). For example, µ > 14.

    • Use random simulation to estimate a p-value for proportions

    • Use the StatKey simulator with categorical sample data to create a sampling distribution around the null hypothesis. The p-value is the proportion of values as low/high/extreme as the sample proportion.

    • Use random simulation to estimate a p-value for means

    • Use the StatKey simulator with quantitative sample data to create a sampling distribution around the null hypothesis. The p-value is the proportion of values as low/high/extreme as the sample mean.

    • Explain in non-technical language the meaning of a P-value

    • If the average/proportion of _[your variable]_ were really _[your null]_, our sample mean/proportion of _[your sample mean]_ or _[less/more/more extreme]_ would occur _[100*p-value]_% of the time by chance.1

    • Example: imagine that you assume that the average height is more than 66" tall. You run a test with the null hypothesis µ = 66 and the alternative µ > 66. Your sample average of 69" resulted in a p-value of 0.04.

      • Interpretation: If the average height were really 66", our sample mean of 69" or more would occur 4% of the time by chance.

    • Example: Imagine that a friend claims that half of your school eats cheese weekly. You think she is wrong, so you run a test with the null hypothesis p=0.5 and the alternative p ≠ 0.5. Your sample proportion of 0.61 resulted in a p-value of 0.13.

      • Interpretation: If the proportion of your school that eats cheese weekly were really 0.5, our sample proportion of 0.61 or more extreme would occur 13% of the time by chance.

    • Understand that a test does not measure the size or importance of an effect – it only detects a difference. Explain why a small effect can be significant in a large sample and why a large effect can fail to be significant in a small sample.

    • The p-value gets "better" (smaller) in larger samples and/or when the sample data are further from the null hypothesis. A side effect of this is that a meaningless difference can become detectable and can reach statistical significance in a large sample.

    • Understand how to choose a level of significance, α, and when you should reject the null.

    • Reject the null hypothesis when the P-value is less than α. If the P-value is greater than α, you do not accept the null, you just “fail to reject” it. When no obvious choice exists for α, use α = .05.

    • Understand Type I and Type II errors and how their probabilities are labeled.

    • A Type I error is incorrectly rejecting the null hypothesis (incorrectly finding a difference). The probability of a Type I error is the level of significance, α (alpha).

    • A Type II error is failing to reject the null hypothesis when there was a true population difference. The probability of a Type II error is called β (beta).

1: adapted from samples in the REA AP Statistics book.
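The StatKey simulations described above can also be sketched in a few lines of code. Below is a minimal randomization test in plain Python for the cheese example (H0: p = 0.5, p-hat = 0.61). The notes do not give that example's sample size, so n = 50 is an assumed value here; it happens to put the simulated p-value near the 0.13 quoted above.

```python
import random

random.seed(0)

# Cheese example from the notes: H0: p = 0.5, Ha: p != 0.5, p-hat = 0.61.
# The notes don't give a sample size, so n = 50 is an assumed value.
p0, n, p_hat = 0.5, 50, 0.61
reps = 10_000

# Simulate `reps` samples drawn from a world where H0 is true:
# each sample is n "coin flips" that succeed with probability p0.
def sim_proportion():
    return sum(random.random() < p0 for _ in range(n)) / n

sims = [sim_proportion() for _ in range(reps)]

# Two-sided p-value: the share of simulated proportions at least as far
# from p0 as the observed p-hat.
p_value = sum(abs(s - p0) >= abs(p_hat - p0) for s in sims) / reps
print(round(p_value, 2))  # typically lands near 0.12
```

The same loop works for the one-sided examples: count only the low (or high) tail instead of both.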

Assessment (x core points)

    • Test (13 pts): 10 questions (3 MC, 3 circling, 2 fill-in-blank, 3 written); 1 of these free response questions (3pts):

      • Why does the p-value get small when there is a large sample size, even when the sample mean/prop and the null mean/prop are really close?

      • Define the p-value (general definition). Then explain what it means in this context: a man claims his average golf score was 72 at a course. You thought it was higher and conducted a study, and you obtained a p-value of .04.

      • ***See the subpage "What is a p-value" for some help on this one! It is located at the very bottom of this page, so keep scrolling!***

      • Compare and contrast a confidence interval and a hypothesis test.

    • Explanation video (8 pts): In groups of 1-3, create a 1-2 minute video that clearly explains either a "p-value" or "Type I/II errors" (it will be randomly assigned to your group). This video should be simple and clear enough that a parent or a student not in this class can understand (and I will ask them). The easiest way to do this is using Explain Everything or iMovie on the iPad, but you can use whatever tools you would like. I will hold larger groups to a higher quality standard. You will be graded on:

      • 1pt: video is 1-2 minutes, correctly uploaded to YouTube, and link is emailed to me with subject "statshw"

      • 3pts: creative example chosen to teach others your concept

      • 2pts: movie clearly explains the concept in a way that almost anyone can understand

      • 2pts: movie accurately teaches the concept
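The first free-response question above (why the p-value shrinks in a large sample) can be explored directly by simulation. This sketch, in plain Python, tests the same tiny difference — a sample proportion of 0.52 against a null of 0.50 — at two sample sizes; the sizes and repetition count are arbitrary choices for illustration.

```python
import random

random.seed(1)

# The same small difference (sample p-hat = 0.52 vs. null p0 = 0.50) tested
# at two sample sizes. The difference is identical, but only the big sample
# makes it "significant". Sample sizes and rep count are arbitrary choices.
p0, p_hat, reps = 0.50, 0.52, 2_000

results = {}
for n in (100, 2_500):
    extreme = 0
    for _ in range(reps):
        sim = sum(random.random() < p0 for _ in range(n)) / n
        if sim >= p_hat:              # right-tailed: as high or higher
            extreme += 1
    results[n] = extreme / reps
    print(n, results[n])
```

With n = 100 the p-value comes out around 0.4 (fail to reject); with n = 2,500 the identical two-point difference yields a p-value near 0.02 (reject). The difference didn't get bigger — the standard error got smaller.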

Instruction

Printable guided notes: version 1, version 2, PDF

After watching videos, see the student-created subpage for more information on p-values.

Vocabulary

Alternative Hypothesis - the claim you are gathering evidence for; what must be true if the null hypothesis is wrong

Null Hypothesis - the default assumption about the population parameter; the claim the test attempts to disprove

p-value - the probability of obtaining a statistic at least as extreme as the one that was observed, assuming that the null hypothesis is indeed true

Type I error - rejecting the null hypothesis when it is actually true

Type II error - failing to reject the null hypothesis when it is actually false

Practice

    1. Imagine that you rejected the null hypothesis that the population mean was 4.77. You were later informed that the true population had μ = 3.98. Did you commit an error? If so, what type?

    2. You conduct a study at the standard significance level. Your p-value is .023. A larger follow-up study showed that the null hypothesis was true. Did you commit an error? If so, what type?

    3. A person is brought to court and declared not guilty by the jury. This person later admitted to the crime. Was an error committed? If so, what type?

    4. You are developing a new disease test. In this test, the null hypothesis is that you do not have the disease, with the alternative that you do have the disease. You need to balance Type I and Type II errors (decreasing the probability of one type of error increases the probability of the other). In this scenario, which type of error should you increase and which should you minimize? Why?

    5. You are conducting a study of the local environment. After conducting your research, you found that a local waste plant does not make a huge increase in the amount of carbon monoxide in the air. However, you still want to publish “statistically significant” results. Your alpha value is fixed at .05 due to the requirements of the journal you want to submit your paper to. What will you need to do to get statistically significant results?

    6. List the null and alternative hypotheses in a court of law. Explain what Type I and Type II errors correspond to. Why do you think the founding fathers set up the judicial system this way?

    7. In court, when a person is not convicted, they are not declared “innocent” by the jury. They are just declared “not guilty”. Explain this subtle difference and how it connects to hypothesis testing.

    8. Medical experiments frequently use hypothesis testing to show that one treatment group did better than another. How might you set up a test so that the results are most likely to appear significant? Consider your alpha value, sample size, and direction of the test.

    9. A highly biased group wants to debunk the idea that the earth’s CO2 concentration has changed in a statistically significant way over the past 40 years. To do this, they run a series of tests and show results that do not reach statistical significance. How might they design their tests to not reach statistical significance?

For each of the following situations:

    • a) What represents the “evidence” (equivalent to testing positive/negative for a disease)?

    • b) What represents the “actual outcome” (such as having the disease or not)?

    • c) Write the null and alternate hypotheses for each statement.

    • d) Identify the Type I and Type II errors and state the consequences of each type of error.

    • e) Decide which type of error is most harmful/dangerous/bad.

10) Some states’ motor vehicle registration folks assume that a car is unsafe until they do an inspection and certify that it is safe.

11) A school instituted the “well behaved student” policy. Every student is assumed to be well behaved until a written warning proves otherwise.

For each of the scenarios in 12-15 below, answer all of the following questions:

    • a) What simulation test will you use (test of mean or test of proportion)?

    • b) Check your conditions for inference (make sure you use a random sampling method).

    • c) What is the null hypothesis?

    • d) What is the alternative hypothesis?

    • e) Simulate and find the p-value.

    • f) Explain in a clear, concise sentence what this p-value tells you.

    • g) Compare your p-value to the default alpha value of 0.05 (unless a different one is given to you). Based on this comparison, decide whether to “reject” the null hypothesis or to “fail to reject” the null hypothesis. If you failed to reject, is the p-value close enough to warrant a re-test with a larger sample?

    • h) If your test was statistically significant, did the detected difference matter?

12) Imagine that the posted number for the average number of hits per team per game in state high school baseball is 6.5 hits/game. You want to show that your team is truly above average, so you take a random sample of the number of hits your team had in 18 games. Here is the data: 5, 6, 8, 7, 11, 13, 0, 7, 8, 5, 7, 10, 8, 10, 9, 7, 2, 9. [Copy to StatKey from here.]

13) You read somewhere that 53% of people like M&M’s. You want to see if this is true in your school too, so you took an SRS. Your data found that 34 of 88 people liked the candy.

14) You go golfing with some friends. Jamie says that she shoots, on average, a 79 in an 18-hole round. You think she is actually better than that (you think her actual average is lower) and is trying to get an unfair handicap (essentially, she is making herself look bad so the other players give her free points). To prove this, you systematically record sample data from future outings. Here are the scores: 77, 75, 80, 76, 72, 76, 81, 78, 74. [Copy to StatKey from here.]

15) A regional softball league claimed that half of its pitchers could pitch a ball over 65mph. A competing league thought this value was too high, so they looked at a simple random sample of 28 pitchers and found only 10 who could throw this fast.
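If you want to double-check your StatKey work on these scenarios, the simulations are easy to reproduce in code. Here is a plain-Python sketch for problem 13, the two-tailed M&M's test (34 of 88, null p = 0.53); the repetition count is an arbitrary choice.

```python
import random

random.seed(2)

# Problem 13: H0: p = 0.53, Ha: p != 0.53; 34 of 88 people liked M&M's.
p0, n, successes = 0.53, 88, 34
p_hat = successes / n            # about 0.386
reps = 20_000

extreme = 0
for _ in range(reps):
    # one simulated sample of n people in a world where H0 is true
    sim = sum(random.random() < p0 for _ in range(n)) / n
    # two-sided: at least as far from p0 as the observed p-hat
    if abs(sim - p0) >= abs(p_hat - p0):
        extreme += 1

p_value = extreme / reps
print(round(p_value, 3))  # small -- near .008
```

Problems 12 and 14 are tests of a mean, so instead of flipping coins you would resample the given data values shifted to match the null mean, which is what StatKey's randomization distribution for a mean does for you.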

Practice solutions

    1. No error -- there was a difference in means and you correctly rejected the null hypothesis.

    2. Yes, Type I -- you rejected (p=.023 is less than α=.05) and the null turned out to be true.

    3. Yes, Type II -- jury did not reject, but the null was false.

    4. If your test says that a healthy person is sick, that person will have some unnecessary worry and undergo more testing. If your test says that a sick person is healthy, that person might die. Thus, you want to minimize the chance that someone who truly has the disease is left undiagnosed. In this case, the null hypothesis is "healthy", so you are okay rejecting it too often (a higher chance of Type I error) so that you can minimize the times you fail to reject when you should have rejected (a low chance of Type II error).

    5. Since alpha and your effect size are already fixed, you will need a larger sample size.

    6. Null: the defendant is innocent; Alt: the defendant is guilty. A Type I error is rejecting the null hypothesis by mistake (convicting an innocent person). A Type II error is failing to reject the null hypothesis by mistake (letting a guilty person go free). It was set up this way so that government leaders could not arbitrarily imprison people they did not like or force a defendant to provide evidence of their own innocence (among many other related reasons). Instead, the evidence must be brought forth by those who want to accuse someone else.

    7. When letting someone go free, the jury is saying that they do not have enough evidence to convict. They may not be convinced that the person is innocent, but they are also not convinced that the person is guilty. The same is true in any hypothesis test -- it is not a question of whether you *think* the null hypothesis (H0) is false, it is a question of whether you have enough evidence to show the null is so unlikely that the alternative must be true.

    8. A high alpha value lets you call results with a higher p-value "significant". However, most publications don't let you set your alpha value above .05. A large sample size is a good way to pick up a difference and reach statistical significance, even if the resulting change is not all that important.

    9. If you are doing a study just to make something look like it is not an issue, you may want to conduct many tests with a small sample size and a low alpha value. The low alpha makes it harder for small effect sizes to reach significance. The small sample size gives you a higher standard error and thus a lower z-score (and therefore a higher p-value). Unfortunately, things like this happen in the real world.

    10. a) The inspection (pass or fail) is the evidence.

    b) The car actually being safe or not is the real outcome.

    c) H0: the car is unsafe to drive; HA: the car is safe to drive (unsafe is assumed first because the problem said the state "assumes" a car is unsafe until the inspection).

    d) Type I error: incorrectly rejecting -- a car passing inspection when it is not actually safe. This means more unsafe vehicles on the road. Type II error: incorrectly failing to reject -- a safe car not passing inspection. This means that people with safe cars will be prevented from driving their cars.

    e) Probably Type I, but there would be a lot of angry people if the Type II error rate were too high and people couldn't drive their cars.

    11. a) The student being written up or not is the evidence.

    b) The student actually being well behaved or not is the real outcome.

    c) H0: the student is well behaved; HA: the student is not well behaved (behaved is assumed first because the problem said the school "assumes" all students are well behaved until they get write-ups).

    d) Type I error: incorrectly rejecting -- a well behaved student gets written up. This seems unfair to that student. Type II error: incorrectly failing to reject -- a poorly behaved student does not get written up. This means that some naughty kids didn't get caught.

    e) Type I is worse -- it is better for the well-behaved student to be treated fairly than to make sure to catch every naughty student.

    12. High school baseball

    a) Test for Single Mean

    b) A random sample usually refers to an SRS, which is a good sampling technique.

    c) μ = 6.5

    d) μ > 6.5 (right-tailed test)

    e) p-value = 0.12

    f) If the team were truly average, there is a 0.12 probability of getting a sample average as high as 7.33 (the average of our sample).

    g) Fail to reject. Though our sample average is above 6.5, there is not enough evidence beyond random chance that this team is better than the state average. A 12% chance is probably not even worth re-studying with a larger sample size.

    h) n/a

    13. M&M's

    a) Test for Single Proportion

    b) SRS -- yup

    c) p = .53

    d) p ≠ .53 (two-tailed test)

    e) p-value = .008

    f) If the true proportion of M&M lovers is 53%, there is a .008 probability of getting a sample proportion as extreme as 34 out of 88.

    g) Reject. This is a very convincing p-value (far below 0.05), so we can say that the true proportion of people who like M&M's at your school is different from the national statistic.

    h) A difference of 53% (null) to 39% (your p-hat, 34/88) is a pretty big margin.

    14. Golfing

    a) Test for Single Mean

    b) Systematic sampling is a decent method, but we aren't told how often the outings were sampled -- maybe it was every other outing and the rounds were too clumped together? We'll say it's okay for now.

    c) μ = 79

    d) μ < 79 (This is counter-intuitive, but remember that in golf you want a LOWER score, so being "better" means having a lower average score. This is a left-tailed test.)

    e) p-value = 0.003

    f) If Jamie's average score is really 79, there is a .003 probability that you would get a sample average as low as 76.56.

    g) Reject. This is a very small p-value, way smaller than .05. Your friend is almost certainly lying about her average.

    h) A difference of 2-3 strokes per game is practically significant. Jamie is really trying to take advantage of you.

    15. Softball pitchers

    a) Test for Single Proportion

    b) SRS -- yup

    c) p = 0.5 (half)

    d) p < 0.5 (left-tailed test)

    e) p-value = 0.09

    f) If half of the pitchers really could throw over 65 mph, there is a 0.09 probability of getting a sample proportion as low as 10/28.

    g) Fail to reject. 0.09 is not under 0.05, our alpha value. It is close enough to warrant a study with more data, though -- there may be a real difference that we just can't detect with our small sample.

    h) n/a
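As a sanity check on the softball solution above, the simulation is reproducible in a few lines of plain Python (the repetition count is an arbitrary choice):

```python
import random

random.seed(3)

# Softball check: H0: p = 0.5, Ha: p < 0.5; 10 of 28 threw over 65 mph.
p0, n, successes = 0.5, 28, 10
p_hat = successes / n
reps = 20_000

# Left-tailed: count simulated proportions as low as or lower than p-hat.
extreme = sum(
    (sum(random.random() < p0 for _ in range(n)) / n) <= p_hat
    for _ in range(reps)
)
p_value = extreme / reps
print(round(p_value, 2))  # around 0.09, matching the solution above
```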

Other Practice Problems and Solutions

1. Imagine that you rejected the null hypothesis that the population mean was 2.53. You were later informed that the true population had μ = 3.11. Did you commit an error? If yes, what type?

2. You conduct a study at the standard significance level. Your p-value is .092. A larger follow-up study showed that the null was true. Did you commit an error? If yes, what type?

3. Mr. Whitney is brought to court and declared not guilty by the jury. Out of guilt, Mr. Whitney later admitted to the crime. Was an error committed? If yes, what type?

4. You are developing a test on penalty kicks. In this test, the null hypothesis is that you score, with the alternative that you do not score. You need to balance Type I and Type II errors (decreasing the probability of one type of error increases the probability of the other). In this scenario, which type of error should you increase and which should you minimize? Why?

5. You are conducting a study on sharks. After conducting your research, you find that megalodons in the ocean do not cause a large decrease in the number of whales. However, you still want to publish your results. Your alpha value is fixed at .05 due to the requirements of the journal you want to submit your results to. What will you need to do to get statistically significant results?

6. List the null and alternative hypotheses in a court of law. Explain what Type I and Type II errors correspond to. Why do you think it was set up this way?

7. When a person is not convicted of a crime in court, they are declared “not guilty”, instead of “innocent”. Explain the difference.

8. A biased group wants to knock down the idea that the earth’s CO2 concentration has changed over the past 40 years. To do this, they run a series of tests and show results that do not reach statistical significance. How might they design their tests to not reach statistical significance?

9. Scientists frequently use hypothesis testing to show that one treatment group did better than another. How might you set up a test so that the results are most likely to appear significant? Consider your alpha value, sample size, and direction.

10. Some states’ car registration people assume that a car is unsafe until they do an inspection and certify that it is safe.

a. What represents the “evidence”?

b. What represents the “actual outcome”?

c. Write the alternative and null hypothesis.

d. Identify the Type I and II errors

e. Decide which type of error is most harmful.

11. A school made a new “every stupid student” policy. Every student is assumed to be stupid until a written sheet proves otherwise.

a. What represents the “evidence”?

b. What represents the “actual outcome”?

c. Write the alternative and null hypothesis.

d. Identify the Type I and II errors

e. Decide which type of error is most harmful.

12. Imagine that the posted average number of stolen bases per team per game in state high school baseball is 4 SB/game. You want to show that your team is truly above average, so you take a random sample of the number of SB your team had in 18 games. Here is the data: 3, 6, 4, 5, 2, 8, 5, 6, 7, 10, 7, 1, 8, 5, 5, 6, 4, 5 (Use StatKey).

a) What simulation test will you use?

b) Check your conditions for inference.

c) Null hypothesis?

d) Alternative hypothesis?

e) Simulate and find the p-value.

f) Explain in a sentence what this p-value tells you.

g) Compare your p-value to the default alpha value of 0.05 (unless a different one is given to you). Based on this comparison, decide whether to “reject” the null hypothesis or to “fail to reject” the null hypothesis. If you failed to reject, is the p-value close enough to warrant a re-test with a larger sample?

h) If your test was statistically significant, did the detected difference matter?

13. You read in a magazine that 62% of people like Skittles more than M&M's. You want to see if this is true, so you take an SRS of your school. You find that 67 of 78 liked Skittles better.

a) What simulation test will you use?

b) Check your conditions for inference.

c) Null hypothesis?

d) Alternative hypothesis?

e) Simulate and find the p-value.

f) Explain in a sentence what this p-value tells you.

g) Compare your p-value to the default alpha value of 0.05 (unless a different one is given to you). Based on this comparison, decide whether to “reject” the null hypothesis or to “fail to reject” the null hypothesis. If you failed to reject, is the p-value close enough to warrant a re-test with a larger sample?

h) If your test was statistically significant, did the detected difference matter?

14. You go golfing with some of your pals. Ian says that he shoots a 108 on average in an 18-hole round. You think he is actually worse than that and is trying to make himself sound better than he actually is. To prove this, you systematically record sample data from future outings. Here are the scores: 110, 118, 131, 81, 103, 123, 101, 98. (Use StatKey.)

a) What simulation test will you use?

b) Check your conditions for inference.

c) Null hypothesis?

d) Alternative hypothesis?

e) Simulate and find the p-value.

f) Explain in a sentence what this p-value tells you.

g) Compare your p-value to the default alpha value of 0.05 (unless a different one is given to you). Based on this comparison, decide whether to “reject” the null hypothesis or to “fail to reject” the null hypothesis. If you failed to reject, is the p-value close enough to warrant a re-test with a larger sample?

h) If your test was statistically significant, did the detected difference matter?

15. A state football league claimed that only half of its quarterbacks could throw over 50 yards. A competing league thought this value was too low, so they looked at a simple random sample of 57 QB’s and found 51 could throw over 50 yards.

a) What simulation test will you use?

b) Check your conditions for inference.

c) Null hypothesis?

d) Alternative hypothesis?

e) Simulate and find the p-value.

f) Explain in a sentence what this p-value tells you.

g) Compare your p-value to the default alpha value of 0.05 (unless a different one is given to you). Based on this comparison, decide whether to “reject” the null hypothesis or to “fail to reject” the null hypothesis. If you failed to reject, is the p-value close enough to warrant a re-test with a larger sample?

h) If your test was statistically significant, did the detected difference matter?
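For problem 15, a quick simulation (a plain-Python sketch; the repetition count is an arbitrary choice) shows just how extreme 51 of 57 is under a null of p = 0.5:

```python
import random

random.seed(4)

# Problem 15: H0: p = 0.5, Ha: p > 0.5; 51 of 57 QBs threw over 50 yards.
p0, n, successes = 0.5, 57, 51
p_hat = successes / n            # about 0.895
reps = 20_000

# Right-tailed: count simulated proportions as high as or higher than p-hat.
extreme = sum(
    (sum(random.random() < p0 for _ in range(n)) / n) >= p_hat
    for _ in range(reps)
)
print(extreme / reps)
```

Essentially none of the simulated samples reach 51/57, so the p-value is approximately 0 and the null should be rejected.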

Extra Practice Answers

1. No error, there was a difference in means and you correctly rejected the null hypothesis.

2. Yes, Type I, you rejected the null (p = .023 is under α = .05) and the null turned out to be true.

3. Yes, Type II, jury did not reject, but the null was false.

4. Answers vary. Decide which mistake is more harmful in this context, then accept a higher chance of one error type in order to lower the chance of the other.

5. Since alpha and your effect size are already fixed, you will need a larger sample size.

6. Null: the defendant is innocent, alt: the defendant is guilty. Type I error is rejecting the null by mistake. Type II error is failing to reject the null.

7. "Not guilty" means the jury did not have enough evidence to convict; it is not a declaration that the defendant is innocent. Likewise, "failing to reject" the null does not prove the null true.

8. A high alpha value lets you call results with a higher p-value significant. However, most publications don't let you set your alpha value above .05.

9. If you are doing a study just to make something look like it is not an issue, you may want to conduct many tests with a small sample size and a low alpha value.

10. a) The inspection is the evidence.

b) The car actually being safe or not is the real outcome.

c) Ho: the car is unsafe to drive; HA: the car is safe to drive.

d) Type I error: incorrectly rejecting

Type II error: incorrectly failing to reject.

e) Type I, but there would be a lot of angry people if the Type II error rate were too high.

11. a) The student being written up or not

b) The student actually being well behaved or not.

c) Ho: the student is well behaved, HA: the student is not well behaved

d) Type I error: well behaved student gets written-up, Type II error: a poorly behaved student does not get written up.

12. High school baseball

a) Test for single mean

b) SRS

c) μ = 4

d) right tailed

e) p-value ≈ 0.01 (simulated values will vary slightly)

f) If the team were truly average (4 SB/game), a sample average as high as 5.4 (this sample's mean) would occur only about 1% of the time by chance.

g) Reject.

13. Skittles

a) Test for single proportion

b) SRS

c) p = .62

d) two tailed

e) p-value ≈ 0 (simulated samples essentially never get this extreme)

f) If the true proportion is really 62%, a sample proportion as extreme as 67/78 (about 86%) would essentially never occur by chance.

g) Reject

h) A difference of 62% (null) to 86% (67/78) is a big margin.

14. Golfing

a) Test for single mean

b) Systematic

c) μ = 108

d) right tailed (you think Ian is worse, and in golf a worse player has a HIGHER average score)

e) p-value ≈ 0.5 (the sample mean, about 108.1, is barely above the null of 108; simulated values will vary)

f) If Ian's average really is 108, a sample average as high as 108.1 would occur about half the time by chance.

g) Fail to reject.

h) n/a

15. Football QB’s

a) Test for single proportion

b) SRS

c) 0.5

d) right tailed (the competing league thought half was too low)

e) p-value ≈ 0

f) If only half of the QBs could really throw over 50 yards, a sample proportion as high as 51/57 (about 89%) would essentially never occur by chance.

g) Reject.

Notes