Intro to Hypothesis Testing

The total length of the videos in this section is approximately 19 minutes, but you will also spend time answering short questions while completing this section.

You can also view all the videos in this section at the YouTube playlist linked here.

The Lady tasting tea

IntroToHypTest.1.Intro.mp4

The lady was Dr. B. Muriel Bristol, a colleague of Fisher's. Here is a photo of her in an article from Significance Magazine, along with more information about the history of this problem:

Imagine that your drink order at a party became the basis for an idea as foundational as the p-value!

Question 1: The lady might have correctly identified 0, 1, 2, 3, or 4 milk-first cups. Which of these numbers, if observed, would convince you that the lady is indeed able to distinguish between milk-first and tea-first cups?

Show answer

There is no correct answer here, but if you checked any numbers at all, they should be the higher ones, such as 4 or perhaps 3. As we move forward, we'll quantify how surprised we'd be if she chose all 4 cups correctly by chance, 3 cups correctly and 1 cup incorrectly by chance, and so on. Your answer to this question implies an opinion about how to interpret p-values, as we will see.

Question 2: Suppose that you know the lady is not able to distinguish milk-first and tea-first cups. If you had to bet on how many of the four cups she labels milk-first will be correct, what number would you bet on?

Show answer

2 cups. Any answer could be right, because you can bet however you want, but if you are trying to maximize your chances of winning the bet, note that the most likely outcome, if the lady is guessing, is that she chooses half of the cups correctly.

The Lady can't taste tea

IntroToHypTest.2.NullHyp.mp4

Question 3: Consider all the subsets consisting of 4 cups that could be chosen from the 8 cups. How many of these subsets contain 4 milk-first cups?

Show answer

1
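If you'd like to check this count by brute force, here is a minimal Python sketch. The cup numbering (cups 0-3 as milk-first, cups 4-7 as tea-first) is my own assumption for illustration; any labelling gives the same counts.

```python
from itertools import combinations
from math import comb

# Hypothetical labelling: cups 0-3 are milk-first, cups 4-7 are tea-first.
MILK_FIRST = {0, 1, 2, 3}
ALL_CUPS = range(8)

# Every way the lady could pick 4 of the 8 cups as "milk-first".
selections = list(combinations(ALL_CUPS, 4))

# Selections in which all 4 chosen cups really are milk-first.
all_correct = [s for s in selections if set(s) == MILK_FIRST]

print(len(selections))   # total number of possible selections: 70
print(len(all_correct))  # selections with 4 milk-first cups: 1
```

The total, 70, is the same number that `comb(8, 4)` returns, so listing the subsets and using the counting formula agree.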

Counting

IntroToHypTest.3.Counting.UsingShortened.mov

Question 4: What is the probability that the lady will correctly choose exactly half of the milk cups, assuming that she is just guessing randomly?

Show answer

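If she chooses 2 of the 4 milk-first cups correctly, she is half right. There are 6 ways to choose 2 of the 4 milk-first cups and 6 ways to choose 2 of the 4 tea-first cups, so 6 × 6 = 36 of the 70 possible selections are half right. The probability that this occurs is therefore 36/70, or about 0.51, assuming that she is guessing randomly.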

Optionally, you can refresh your memory of combinations/permutations here. If the videos and the problem above make sense to you, it's not important (for this course) for you to be able to calculate the number of ways to allocate 2 milk and 2 tea cups, etc.
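As an optional check on the counting, the sketch below computes the probability of each possible number of correct milk-first cups when the lady is guessing. The function name `null_probability` and its parameters are my own, chosen for illustration; the calculation assumes every set of 4 cups is equally likely to be chosen.

```python
from fractions import Fraction
from math import comb

def null_probability(k, milk=4, tea=4, chosen=4):
    """P(exactly k of the chosen cups are milk-first) under random guessing."""
    # Ways to pick k milk-first cups and (chosen - k) tea-first cups.
    ways = comb(milk, k) * comb(tea, chosen - k)
    return Fraction(ways, comb(milk + tea, chosen))

# Probability of each outcome, from 0 correct to 4 correct.
null_distribution = {k: null_probability(k) for k in range(5)}

for k, p in null_distribution.items():
    # Fraction reduces automatically, so 36/70 is displayed as 18/35.
    print(k, p, round(float(p), 3))
```

The five probabilities sum to 1, and the entry for 2 correct cups equals 36/70, matching the answer to Question 4.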

Comparing truth to distribution

IntroToHypTest.4.ComparingTruthToDistribution.mov

Question 5: If something happens that should only happen one out of every 70 times, are you surprised?

Show answer

There is no correct answer; it's up to you!
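To see where "one out of every 70 times" comes from, here is a small sketch that computes the chance of seeing at least a given number of correct milk-first cups when the lady is only guessing. The helper `p_value` is a hypothetical name of mine, and I am assuming a one-sided comparison in which more correct cups count as more extreme.

```python
from fractions import Fraction
from math import comb

def p_value(observed, milk=4, tea=4, chosen=4):
    """P(at least `observed` correct milk-first cups) if the lady is guessing."""
    total = comb(milk + tea, chosen)
    # Add up the ways to get `observed` correct, `observed + 1` correct, etc.
    ways = sum(comb(milk, k) * comb(tea, chosen - k)
               for k in range(observed, min(milk, chosen) + 1))
    return Fraction(ways, total)

print(p_value(4), round(float(p_value(4)), 4))  # all 4 correct: 1/70
print(p_value(3), round(float(p_value(3)), 4))  # 3 or more correct: 17/70
```

Getting all 4 correct by luck has probability 1/70 (about 0.014), while getting 3 or more correct by luck has probability 17/70 (about 0.24), which is one way to put numbers on how surprised you might be.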

Hypothesis Tests

IntroToHypTest.5.Null Hypothesis & Proof by ContradictionUsingShortened.mov

Question 6: Suppose a neuroscientist compares brain measurements from two types of mice. Based on five mice of each type, she calculates the difference in median measurements between the two groups. Can the difference she calculates be called the null hypothesis, the test statistic, or the p-value?

Show answer

Test statistic. A statistic is something you can calculate from the sample that you actually have. Any statistic can be a test statistic, though some statistics are more useful for tests than others. The null hypothesis is an assumption about the target population.
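As a small illustration of "something you can calculate from the sample that you actually have," the sketch below computes a difference in medians. The measurements are made up by me purely for illustration; they are not data from any real study.

```python
from statistics import median

# Made-up brain measurements for five mice of each type (illustrative only).
type_a = [4.1, 5.3, 4.8, 6.0, 5.1]
type_b = [3.9, 4.4, 4.0, 5.2, 4.6]

def difference_in_medians(group_1, group_2):
    """Test statistic: median of the first group minus median of the second."""
    return median(group_1) - median(group_2)

test_statistic = difference_in_medians(type_a, type_b)
print(round(test_statistic, 2))
```

Notice that nothing in this calculation refers to the null hypothesis or to a p-value: the test statistic depends only on the observed sample.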

Example: Harvard Legal Aid Bureau Evaluation

QAI 2.06_ Intro to HLABUsingShortened.mp4

Question 7: The difference in win rates between those who are offered help from HLAB and those who are not offered help from HLAB depends on which people are randomized to which group. Is this uncertainty due to sampling from the target population or assignment to treatment groups within a sample?

Show answer

Assignment to treatment groups within a sample. This is an example of a study where the treatment groups have been randomized, but the sample is not necessarily representative of the target population. The target population could be defined as all people who are facing unemployment hearings and cannot afford lawyers, but the sample contains only people who called HLAB for help. Alternatively, we could define the target population to be the set of people who call HLAB, and then our sample is the same as our target population. Either way, the people in this study are representative of those who call HLAB, not of all the people who need help.
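To make "uncertainty due to assignment" concrete, here is a rough sketch that keeps one sample fixed and only reshuffles who is offered help. The outcomes are made up by me (they are not HLAB data), and the sketch assumes each person's outcome would be the same whether or not help is offered.

```python
import random

# Made-up outcomes for 20 hypothetical callers: 1 = won the hearing, 0 = lost.
outcomes = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

def win_rate_difference(outcomes, offered):
    """Win rate among callers offered help minus win rate among the rest."""
    treated = [y for i, y in enumerate(outcomes) if i in offered]
    control = [y for i, y in enumerate(outcomes) if i not in offered]
    return sum(treated) / len(treated) - sum(control) / len(control)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
differences = []
for _ in range(1000):
    # Randomly offer help to half of the same 20 callers.
    offered = set(rng.sample(range(len(outcomes)), k=len(outcomes) // 2))
    differences.append(win_rate_difference(outcomes, offered))

print(min(differences), max(differences))
```

The same 20 people appear in every repetition, yet the difference in win rates changes from one random assignment to the next. That variation comes entirely from the randomization, not from sampling new people.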

In case you're interested in law or in this example, the (long) HLAB paper is here.

Ethical issues surrounding R. A. Fisher and current discussions about p-values

R. A. Fisher was a eugenicist. In 2020, the Committee of the Presidents of Statistical Societies "retired" their most prominent award, which was named after Fisher, for this reason. This article discusses Fisher's history in more detail. There are important ongoing conversations about how statistics and other fields should handle the remaining influences of scholars who were actively racist and whose scholarly contributions were sometimes separate from but often intertwined with their racism. Interestingly, the academic community is talking about de-emphasizing p-values at the same time that we are talking about de-emphasizing Fisher, who popularized p-values.

You may know that the most common convention is to use 0.05 as a cutoff for the p-value of a hypothesis test. It's helpful to have a convention, but there is no particular justification for any cutoff. In fact, there has been a lot of discussion in recent years about the drawbacks of hypothesis testing as the primary way to report study results. The main idea is that p-values do not include all of the information about study results: a p-value is exactly what its definition says, the probability that we'd see results at least as extreme as those observed if some null hypothesis is true. It's not very helpful to report a p-value without also visualizing the data, reporting the means/medians in each group, perhaps calculating a confidence interval, etc. Over time, some fields have become over-reliant on comparing each p-value to 0.05 as the primary way to identify important results, but that is not what p-values are meant to do. One psychology journal has even banned p-values, which is a bit of an over-correction! The journal The American Statistician recently published an issue with guidelines for researchers who want to avoid over-relying on p-values. The opening article in that issue is an interesting read, and I recommend it: "Moving to a world beyond p<0.05."

We will come back to the drawbacks of strict p-value cutoffs when we talk about practical vs. statistical significance and when we talk about multiple comparisons.

In this second-level statistics course, I think it is important for you to have a solid understanding of hypothesis testing and p-values not only because they are still the most popular approach to data analysis, but also because you can't fully engage in conversations about the appropriate role of p-values and how to handle Fisher's legacy without understanding the underlying concepts.

Question 9: Very briefly, what do you think is the best way for society to handle influential contributions by people with unacceptable, offensive, and harmful views?

Show answer

Obviously, I don't know the answer. You can see from this module that my choice is to present the ethical issues together with the ideas.

Now you are ready to move on to a discussion of non-parametric hypothesis tests. Please proceed!

During this tutorial you learned:


Terms and concepts:

Hypothesis test, reference distribution, p-value, null hypothesis, proof by contradiction, test statistic