The p-value

Criticisms of Null Hypothesis Testing and p-value

Let H0 and H1 denote the null and alternative hypotheses, respectively, and let D denote the observed data. Hypothesis testing is based on calculating the probability of observing such a data set D, or more extreme data, given that the null hypothesis is true:

P(D|H0) = p

which is often called the p-value. In Fisher's formulation, a low p-value means either that the null hypothesis is true and a highly improbable event has occurred, or that the null hypothesis is false. If the probability p is smaller than a chosen threshold, typically 0.05, one argues that there is only a very small chance of observing the data set D (or more extreme data) under the null hypothesis. Thus, the null hypothesis is rejected. Otherwise, one fails to reject the null hypothesis.
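As a concrete sketch, the p-value and decision rule can be computed in Python with SciPy. The sample below is simulated; the true mean of 0.5, the null value of 0, and the sample size of 25 are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated sample of n = 25 observations; the true mean (0.5) and the
# null value (popmean = 0) are arbitrary choices for illustration.
x = rng.normal(loc=0.5, scale=1.0, size=25)

# Two-sided one-sample t-test: p = P(|T| >= |t_obs| | H0)
t_obs, p_value = stats.ttest_1samp(x, popmean=0.0)

print(f"t = {t_obs:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0 at the 0.05 level")
else:
    print("Fail to reject H0 at the 0.05 level")
```

Note that the reported p is a probability computed under H0, not a probability that H0 is true; the later criticisms hinge on this distinction.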

Criticisms of null hypothesis significance testing (NHST) have been raised ever since its introduction. We summarize the main ones below.

The focus of null hypothesis

In NHST, the null hypothesis is a statement of no effect, no difference, or no relation. However, researchers often believe the null hypothesis is false and are interested in detecting a certain effect. If the null hypothesis is always false, what is the point of rejecting it? As Cohen (1994) put it:

The null hypothesis, taken literally (and that's the only way you can take it in formal hypothesis testing), is always false in the real world. It can only be true in the bowels of a computer processor running a Monte Carlo study (and even then a stray electron may make it false). If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null is always false, what’s the big deal about rejecting it?

Similarly, Meehl (1990) stated that everything is related to everything else. He argued that the "pairwise correlations of even arbitrarily chosen variables in most soft domains tend to run large enough to yield frequent pseudoconfirmations of unrelated substantive theories, given conventional levels of the statistical power function based on pilot studies" (p. 237). Tukey (1991) wrote: "It is foolish to ask 'Are the effects of A and B different?' They are always different - for some decimal place."

The test is done given that the null is true

In general, researchers choose to conduct a study because they believe a significant effect exists; that is, they firmly believe that the null hypothesis is false and the alternative hypothesis is true. However, the test itself is conducted under the null hypothesis. The whole rationale is: "Assuming that the null hypothesis is true, what is the probability of observing a value of the test statistic at least as extreme as the one actually observed?"

The logic of NHST is flawed

In logic, contraposition means that a conditional statement is logically equivalent to its contrapositive. For example, given the statement that "if A is true, then B is true", its equivalence is "if B is not true, then A is not true."

In using NHST, one may be tempted to reason as follows: if the null hypothesis were correct, the data could not be observed; we have observed the data; therefore, the null hypothesis is false. This is contraposition and would be logically correct.

However, the logic of NHST is not like this. Its logic is as follows: if the null hypothesis is correct, then the data are highly unlikely to be observed; these data have been observed; therefore, the null hypothesis is highly unlikely to be correct. If this sounds right to you, consider the following example (Cohen, 1994):

If a person is an American, then he is probably not a member of Congress. This person is a member of Congress. Therefore, he is probably not an American.

Clearly, the conclusion of the example could not be more wrong: the person is certainly an American. Inserting "probably" into modus tollens does not preserve its validity, and therefore the logic of NHST is flawed.
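The fallacy can be made concrete with rough numbers; the population and Congress figures below are approximate assumptions used only for illustration.

```python
# Approximate figures (assumptions for illustration only)
americans = 330_000_000   # rough U.S. population
congress = 535            # members of Congress, all of whom are American

# "If a person is an American, he is probably not a member of Congress":
p_not_congress_given_american = 1 - congress / americans  # extremely close to 1

# Yet observing membership in Congress does not make the person
# "probably not an American"; in fact, every member is American:
p_american_given_congress = 1.0

print(p_not_congress_given_american, p_american_given_congress)
```

A conditional probability near 1 in one direction says nothing about the reversed conditional, which is exactly the gap between P(D|H0) and P(H0|D).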

NHST tells nothing about the probability of either the null or the alternative hypothesis

Ultimately, a researcher is interested in P(H1|D), or at least P(H0|D). However, NHST only provides P(D|H0). In general, P(H0|D) does not equal P(D|H0), so rejecting the null hypothesis says nothing, or very little, about how likely the null hypothesis actually is. Furthermore, P(H1|D) does not equal P(D|H0), nor does it equal 1 − P(D|H0). Therefore, rejecting the null hypothesis by no means implies support for the alternative hypothesis. Too often, however, people incorrectly believe that it does, and doing so can be extremely dangerous.

Consider the following example. Suppose a crime has been committed in a city of 800,000 residents, and blood directly related to the crime is found at the scene. Statistics show that this blood type is present in 1% of the population. Given that a person is found to have this blood type, one wants to infer whether the person is innocent (H0) or guilty (H1). Using the idea of NHST, we get


P(blood|H0) = 0.01

Since this is smaller than 0.05, one may reject the null hypothesis that the person is innocent. However, this says nothing about whether the person is actually innocent or guilty. Note that among 800,000 residents, 8,000 have this blood type. If we assume everyone had an equal chance of committing the crime, a person with this blood type has only a 1/8,000 probability of being guilty, that is, a probability close to 100% of being innocent.
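The counting argument above can be written out directly, assuming, as the text does, exactly one perpetrator and an equal prior chance for every resident:

```python
population = 800_000
blood_rate = 0.01                      # blood type present in 1% of residents

# NHST-style quantity: P(blood match | innocent) = 0.01 < 0.05,
# so one would "reject" the hypothesis of innocence.
p_blood_given_innocent = blood_rate

# What we actually want: P(guilty | blood match). With one perpetrator
# and equal prior chances, only 1 of the matching residents is guilty.
matches = int(population * blood_rate)          # 8,000 residents match
p_guilty_given_blood = 1 / matches              # 1/8000 = 0.000125
p_innocent_given_blood = 1 - p_guilty_given_blood

print(matches, p_guilty_given_blood, p_innocent_given_blood)
```

The small P(D|H0) coexists with an overwhelming P(H0|D), which is the prosecutor's fallacy in miniature.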

NHST does not tell whether a result can be replicated

Too often, researchers wrongly believe that a significant result implies the null hypothesis would again be rejected in a replication study. The probability of replicating a significant result is governed by the power π = P(reject H0 | H1), and a p-value by itself says little, if anything, about replication. For example, for a one-tailed one-sample t-test, the power is

π = 1 − Φ(c1−α − δ√n)

where

  • Φ is the standard normal cumulative distribution function

  • δ is the effect size

  • n is the sample size

  • c1−α is the 1 − α quantile of the standard normal distribution (the critical value)


Suppose in a study we observe a p-value of 0.01, which is then used as α in the above formula. If the effect size is δ = 0.036, the power for a sample of size 100 is only about 0.025, since Φ(2.326 − 0.36) ≈ 0.975. This means that among 40 replication studies, only about one would be expected to show a significant result. A significant result with p = 0.01 therefore by no means implies a 99% (1 − 0.01) probability of obtaining a significant result again.
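The power formula above can be checked numerically; the sketch below uses the normal approximation stated in the formula, with the text's numbers plugged in.

```python
from math import sqrt
from scipy.stats import norm

def power_one_tailed(delta: float, n: int, alpha: float) -> float:
    """Power of a one-tailed one-sample test (normal approximation):
    pi = 1 - Phi(c_{1-alpha} - delta * sqrt(n))."""
    c = norm.ppf(1 - alpha)            # critical value c_{1-alpha}
    return 1 - norm.cdf(c - delta * sqrt(n))

# Numbers from the text: the observed p-value 0.01 reused as alpha,
# effect size delta = 0.036, sample size n = 100.
pi = power_one_tailed(delta=0.036, n=100, alpha=0.01)
print(round(pi, 3))   # far below the 0.99 one might naively expect
```

Whatever the exact value, the power is nowhere near 1 − p, which is the point of the example.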


p-value, effect size and sample size

The p-value in NHST depends on both the effect size and the sample size. A small p-value does not necessarily indicate a large effect size, and therefore a smaller p-value certainly does not mean a more important finding. With a large enough sample size, one will always reject the null hypothesis, no matter how small the observed difference is in practical terms. Consider a two-sample t-test, whose test statistic is

t = (x̄1 − x̄2) / (s√(1/n1 + 1/n2))

where x̄1 and x̄2 are the sample means, n1 and n2 the sample sizes, and s the pooled standard deviation. For any fixed nonzero difference x̄1 − x̄2, the statistic grows as the sample sizes increase, so the p-value can be driven below any threshold simply by collecting more data.
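A quick simulation illustrates the point; the tiny true difference of 0.01 standard deviations and the sample size of two million per group are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Two populations whose means differ by a practically negligible
# 0.01 standard deviations, sampled with very large n.
n = 2_000_000
x = rng.normal(loc=0.00, scale=1.0, size=n)
y = rng.normal(loc=0.01, scale=1.0, size=n)

# Two-sample t-test: with n this large, even a trivial difference
# yields an extremely small p-value.
t_obs, p_value = stats.ttest_ind(x, y)
print(f"t = {t_obs:.2f}, p = {p_value:.2e}")
```

The result is "highly significant" despite an effect too small to matter in practice, which is why a p-value should never be read as a measure of effect size.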

The choice of significance level at 0.05

The choice of a significance level of 0.05, or of any other value, has no theoretical foundation and is almost completely arbitrary; it is merely a convention.

To summarize, in using NHST, one should be very cautious in interpreting the p-value. Bear in mind that:

  • The p-value is not the probability that the null hypothesis is true, nor is it the probability that the alternative hypothesis is false; it is not connected to either of these. In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses.

  • The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called prosecutor's fallacy.

  • The p-value is not the probability that a replicating experiment would not yield the same conclusion. Quantifying the replicability of an experiment was attempted through the concept of p-rep.

  • The significance level of the test, such as 0.05, is not determined by the p-value.

  • The p-value does not indicate the size or importance of the observed effect.


References:

  1. Zhang, Z., & Wang, L. (2017). Advanced statistics using R. Granger, IN: ISDSA Press. ISBN: 978-1-946728-01-2. https://advstats.psychstat.org/book/hypothesis/pvalue.php

  2. Levine, T., Weber, R., Hullett, C., Park, H. S., & Massi Lindsey, L. L. (2008). A critical assessment of null hypothesis significance testing in quantitative communication research. Human Communication Research, 171–187. https://msu.edu/~levinet/NHST1.pdf