GGR Newsletter
Sam Blechman, M.S.
August 2025
A set of fundamental concepts in statistics and machine learning can be traced back to a deceptively simple 2x2 table: the confusion matrix. Despite its simplicity, the confusion matrix is a powerful tool that appears in two key domains of statistics: binary classification and hypothesis testing. In both cases, it serves a similar function: it measures how often and in what direction we are incorrect.
However, there is a key distinction in how the confusion matrix is used in these two contexts. Recently, I spent some time carefully thinking about this distinction and, in the process, stumbled upon an intuitive understanding of Bayesian statistics. I want to share my thought process here and, in doing so, attempt to provide an appreciation for the power of Bayesian thought.
Binary classification (machine learning and diagnostic testing)
Let’s define the confusion matrix and learn why it’s useful in two statistical settings: binary classification, where we classify samples as positive or negative, and hypothesis testing, where we decide whether there is a statistically significant difference or not. A “classifier” can take many forms: a diagnostic test (e.g., to detect SARS-CoV-2 infection) or a predictive model (e.g., a convolutional neural network that determines whether an image contains a dog from image pixels).
In classification, the classifier makes a binary prediction, and that prediction is either right or wrong.
It can be correct in two ways:
True positive (TP): Correctly predicting that a sample is positive (e.g., the model labels an image as containing a dog, and a dog is indeed present).
True negative (TN): Correctly predicting that a sample is negative (e.g., the model labels an image as not containing a dog, and no dog is present).
Similarly, it can be wrong in two ways:
False positive (FP): Incorrectly predicting that a sample is positive (e.g., the model labels an image as containing a dog, but there is no dog).
False negative (FN): Incorrectly predicting that a sample is negative (e.g., the model labels an image as dog-free, when a dog is actually present).
From the number of TP, TN, FP, and FN, we construct the confusion matrix:
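As a rough sketch (not code from the article), the four counts can be tallied like this in Python; the toy labels and predictions below are made up purely for illustration:

```python
# Tally TP, TN, FP, FN for a toy set of dog / no-dog predictions.
# 1 = dog present, 0 = no dog; these labels are invented for the example.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # ground truth
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]  # the model's predictions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

print("                 Actual dog   Actual no dog")
print(f"Predicted dog      TP = {tp}        FP = {fp}")
print(f"Predicted no dog   FN = {fn}        TN = {tn}")
```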
From this confusion matrix, we can derive a set of metrics that quantify the classifier’s accuracy:
Sensitivity (also called recall or true positive rate) measures how well the classifier detects positives. “Given a dog-containing image, what is the probability that the model correctly detects a dog?”
Specificity (aka the true negative rate) measures how well it confirms negatives. “Given a dog-free image, what is the probability that the model correctly concludes there is no dog?”
These two metrics are intrinsic properties of the classifier. For example, if two image datasets both contain dog and non-dog images, the classifier will have the same sensitivity and specificity, even if the proportion of dog images differs. In contrast, precision (also called positive predictive value) measures: “Given that the model predicted an image contained a dog, what is the probability the image truly contains a dog?” Unlike sensitivity and specificity, precision depends on how common the positive class is (i.e., prevalence).
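As a hedged sketch of how these definitions translate into code, the counts below are arbitrary illustrative numbers, not results from a real model:

```python
# Illustrative confusion-matrix counts (made up for the example).
tp, tn, fp, fn = 90, 850, 50, 10

sensitivity = tp / (tp + fn)  # P(predict dog | dog present), a.k.a. recall / true positive rate
specificity = tn / (tn + fp)  # P(predict no dog | no dog present), a.k.a. true negative rate
precision = tp / (tp + fp)    # P(dog present | predicted dog), a.k.a. positive predictive value

print(f"Sensitivity: {sensitivity:.3f}")  # 0.900
print(f"Specificity: {specificity:.3f}")  # 0.944
print(f"Precision:   {precision:.3f}")    # 0.643
```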
This matters in low-prevalence settings such as diagnostic testing for rare diseases like cancer. Even a highly accurate cancer diagnostic test (i.e., high sensitivity and high specificity) can produce many false positives if cancer is rare in the population being screened:
The cancer diagnostic test (the classifier) has 95% sensitivity and 95% specificity in both populations. When the cancer prevalence is 50%, the precision of the test is 95%. However, in a population where only 1% have cancer, the precision drops to 16.1%. This means that roughly five out of six people with a positive test in that population do not have cancer!
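To make that arithmetic explicit, here is a small sketch of the calculation; the 95% sensitivity/specificity and the 50% and 1% prevalences are the numbers from the example above, and the formula simply follows from the definitions:

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Expected precision (positive predictive value) at a given prevalence."""
    expected_tp = sensitivity * prevalence
    expected_fp = (1 - specificity) * (1 - prevalence)
    return expected_tp / (expected_tp + expected_fp)

for prevalence in (0.50, 0.01):
    ppv = precision_at_prevalence(0.95, 0.95, prevalence)
    print(f"Prevalence {prevalence:.0%}: precision = {ppv:.1%}")
# Prevalence 50%: precision = 95.0%
# Prevalence 1%: precision = 16.1%
```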
Statistical Hypothesis Testing
The confusion matrix also appears in statistical hypothesis testing, though the language and framing are different. Here, we use a statistical test to decide whether to reject or “fail to reject” a null hypothesis—an assumption about the world, like “this new drug is no better than placebo.” The alternative hypothesis says the drug is better (or worse).
Statistical tests help us decide whether to reject or fail to reject the null hypothesis given data. Just like classification, there are four possible outcomes:
True positive: correctly rejecting the null hypothesis when a real effect exists.
True negative: correctly failing to reject the null hypothesis when there is no effect.
False positive (Type I error): rejecting the null hypothesis when there is no effect.
False negative (Type II error): failing to reject the null hypothesis when a real effect exists.
To me, the question naturally arises: what are the analogs of sensitivity, specificity, and precision in relation to the above confusion matrix?
For specificity, let’s think about 1 – specificity (the false positive rate). In this case, the false positive rate answers the question: “Given there is no effect (e.g., the new drug is no better than placebo), what is the probability we falsely conclude there is an effect?” This is the Type I error rate (α).
Unlike specificity for classifiers, the Type I error rate is not an intrinsic property of the test. We actually choose the Type I error rate ahead of time based on our tolerance for false positives. This is α, the significance threshold, and it is commonly set to 0.05, meaning we’re willing to accept a 5% chance of incorrectly rejecting a true null hypothesis (aka “being wrong by making up an effect that doesn’t exist”).
The analog of sensitivity is statistical power, or 1 – the Type II error rate. Power1 measures our ability to correctly reject an incorrect null, meaning: “Given there is an effect (e.g., the new drug is better than placebo), what is the probability we correctly reject the null?”
Let’s collect data: we give 50 patients a new blood pressure drug and 50 patients a placebo. Patients take the pill every day for 60 days, and their blood pressure is measured on day 0 and day 60. We measure the difference in blood pressure change between the two groups and find that the new drug group experienced a 10-point greater reduction in blood pressure than the placebo group. Given these data, we perform a statistical test (e.g., a Student’s t-test or Wilcoxon rank-sum test) and calculate a p-value of 0.02. In this case, the p-value means: “if the new drug actually has no effect on blood pressure, there is a 2% probability that we would see an average difference of 10 points or more.”
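As a minimal sketch of how such a p-value might be computed, the code below runs a two-sample t-test on synthetic blood pressure reductions; the simulated numbers are invented for illustration and will not exactly reproduce the 0.02 above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated day-0-to-day-60 reductions in blood pressure (points); purely illustrative.
drug_group = rng.normal(loc=15, scale=20, size=50)     # 50 patients on the new drug
placebo_group = rng.normal(loc=5, scale=20, size=50)   # 50 patients on placebo

# Two-sample t-test comparing the mean reduction in the two groups.
t_stat, p_value = stats.ttest_ind(drug_group, placebo_group)
print(f"Observed difference: {drug_group.mean() - placebo_group.mean():.1f} points")
print(f"p-value: {p_value:.3f}")
```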
Unlike in classification where model performance is fixed by the data and model, hypothesis testing lets us design studies to manage the trade-off between Type I error rate and power through more accurate measurements or larger samples.
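To illustrate that trade-off, the simulation below estimates power at a fixed α = 0.05 for a few sample sizes; the assumed effect size, variability, and number of simulated trials are arbitrary choices for this sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05        # chosen Type I error rate
true_effect = 10    # assumed true drug-vs-placebo difference (points)
noise_sd = 25       # assumed patient-to-patient variability
n_sims = 2000       # simulated trials per sample size

for n in (20, 50, 100):
    rejections = 0
    for _ in range(n_sims):
        drug = rng.normal(true_effect, noise_sd, size=n)
        placebo = rng.normal(0, noise_sd, size=n)
        _, p = stats.ttest_ind(drug, placebo)
        rejections += p < alpha
    print(f"n = {n:>3} per group: estimated power = {rejections / n_sims:.2f}")
```

With these assumptions, power climbs steadily as the per-group sample size grows, while the Type I error rate stays pinned at the chosen 5%.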
Precision’s Analog: Introducing Bayesian Thinking
So where is the analog of precision (“Given a positive result, what is the probability that it’s actually true?”)? This is where Bayesian reasoning comes in. In hypothesis testing, the equivalent question is: “Given that we rejected the null hypothesis, what is the probability that there is truly an effect?”
If you take one thing from this article, it should be that classical hypothesis testing is unable to answer this question. Many think it does. It is one of the most widespread misunderstandings in statistics: the p-value does not tell you the probability that the null hypothesis is false. Relatedly, a 95% confidence interval does not mean there is a 95% chance the true value lies in the interval–it either lies in the interval or does not.
However, in Bayesian statistics, we can answer this question by combining:
Prior: the probability of seeing a real effect (i.e., the prevalence of true effects)
Likelihood: how probable the observed data (the sample we collected) are under each hypothesis
The result is the posterior probability–the analog of precision: it tells us how much to believe a positive result, given the data and our prior knowledge. What does “prior knowledge” mean in this context? Let’s learn by example.
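One simplified way to write that down (a sketch, assuming the only “data” we keep is whether the result was significant) is as a function of the prior, the test’s power, and α:

```python
def posterior_given_significant(prior, power, alpha):
    """P(real effect | significant result): the hypothesis-testing analog of precision.

    prior: probability a real effect exists before the experiment
    power: P(significant result | real effect)
    alpha: P(significant result | no effect), the Type I error rate
    """
    true_hits = power * prior
    false_hits = alpha * (1 - prior)
    return true_hits / (true_hits + false_hits)
```

The drug-screening example below plugs concrete numbers into exactly this calculation.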
Imagine you're a pharmaceutical company screening 1,000 new compounds for their ability to reduce blood pressure in a preclinical mouse model.
You run a standardized experiment for each drug:
Null hypothesis: The drug has no effect on blood pressure.
Alternative: The drug significantly lowers blood pressure.
You set α = 0.05 (so we accept a 5% false positive rate).
You use enough mice such that power = 0.80 (so the Type II error rate = 0.20, meaning we accept a 20% false negative rate).
However, from prior knowledge, we know that only roughly 1% of drugs (i.e., 10 out of 1,000) will truly have a real effect on blood pressure. This prior is our expected prevalence of true positives. It’s kind of a guess–it doesn’t need to be perfect.
Now let’s run the 1,000 experiments, each time calculating a p-value and either rejecting or failing to reject the null hypothesis about the drug’s effect on blood pressure. Our expected outcomes:
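Here is a rough sketch of those expected counts, using the 1% prior, 80% power, and α = 0.05 from the setup above (the expected number of false positives, 49.5, is rounded to 50, as in the text below):

```python
n_drugs = 1000
prior = 0.01   # ~1% of the drugs truly lower blood pressure
power = 0.80   # P(significant result | drug truly works)
alpha = 0.05   # P(significant result | drug does not work)

true_effects = n_drugs * prior        # 10 drugs that really work
no_effects = n_drugs * (1 - prior)    # 990 drugs that do not

true_positives = power * true_effects             # 8 expected true hits
false_negatives = (1 - power) * true_effects      # 2 real drugs we expect to miss
false_positives = round(alpha * no_effects)       # 49.5, rounded to 50
true_negatives = no_effects - false_positives     # 940 correctly non-significant

hits = true_positives + false_positives
print(f"Expected hits: {hits:.0f}")                            # 58
print(f"Posterior (precision): {true_positives / hits:.1%}")   # 13.8%
```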
The total number of “hits” our experiment yielded was 58. Without a prior, we have no sense of how many of those 58 seemingly effective drugs truly do have an effect on blood pressure.
However, by plugging in a plausible prior (e.g., 1%), we come to a shocking realization: we would expect only 8 of those 58 hits to truly have an effect, and we have no idea which 8. In this case, our precision (or posterior probability) = 8 / 58 = 13.8%.
Even though the test is “statistically significant” for 58 drugs, roughly 50 of those results are expected to be false positives, simply because the prior probability that any given drug actually reduces blood pressure is so small. This is not poor drug candidate choice; it’s just math.
There’s so much more to say about these topics, but I’ll leave it at this take-home message: precision matters. A positive prediction or statistically significant result alone is not enough. We must consider prior probabilities and possibly use Bayesian thinking to improve our scientific decision-making.
1Power, unlike the Type I error rate, is not set directly by the user but is a consequence of multiple factors: the effect size (e.g., how MUCH the new drug is better [or worse] than placebo), the sample size (e.g., the number of patients participating in the trial), the variability in the data (e.g., how well we can measure the effect of the drug), and the choice of the Type I error rate.