# HOW DO DATA COMPARE WITH HYPOTHESES?

### Comparisons require statistics.

Comparison is central to hypothesis testing. Specifically, the predictions made by Measurable Hypotheses must be compared to experimentally collected data.

Typically, comparisons involve reasoning like:

"Our General Hypothesis is: the average value for [some measurement] will be higher for individuals who experience Condition X than the average value for [some measurement] in individuals who experience Condition Y.

Question: Is the average value of [some measurement] actually higher for experimental Group X (who experienced Condition X) than for Group Y (who experienced Condition Y)?"

Answering the question posed by the General Hypothesis could potentially involve several additional questions, including:

1) How much higher must an average value of [some measurement] *BE* for Group X relative to Group Y in order to have *confidence* that Condition X actually leads to higher average values than Condition Y?

2) Even if we can be confident that Condition X leads to higher average values than Condition Y, do the higher values *matter* in some meaningful sense?

BOTH Question (1) and Question (2) should be addressed in a scientific paper! Testing hypotheses (Question 1) is necessary, but arguing that research findings are important (Question 2) is also necessary.

**Statistics** allows Question (1) to be answered in a (relatively) objective way that does not require subjective judgment. Therefore, statistics makes question (1) appropriate for the Results section of a scientific paper.

However, statistics cannot help to answer Question (2). Addressing Question (2) most often requires interpretation of data with reference to other studies or points of reference. Therefore, Question (2) is completely separate from Question (1). Putting data into perspective in the Results can provide information useful to address the importance of any differences (i.e. Question 2) in the Discussion.

**Finding statistical differences between or among groups does not in and of itself indicate whether the differences *matter*.**

Statistically significant differences are not necessarily *substantial* or *meaningful* differences. For example, some studies have suggested that eating meat significantly decreases life expectancy relative to diets low in meat (Singh et al., 2003). Should people stop eating meat to increase their life expectancy?

If the decrease in life expectancy averages approximately 3 years, different people might make very different decisions about how much 3 years of life matters relative to the quality of life that eating meat provides. The statistical question of whether eating meat significantly lowers life expectancy is entirely separate from the question of the relative value of eating meat versus lifespan.

**There is no such thing as a "non-significant difference" or "trend" in the Results section.**

There are many statistical frameworks that involve different approaches to analyzing and interpreting data (Goodman, 2016). Different statistical frameworks can result in fundamentally different ways of thinking about and conducting science. Statistics could potentially help to formalize the process of making decisions based on probabilities that account for sources and consequences of uncertainty (Goodman, 2016).

However, experimental research most often uses relatively simple, criterion-based, “frequentist” statistical tests to evaluate hypotheses. Therefore, our discussion will be limited to the most common statistical procedures: parametric tests and evaluations of P values (e.g. t-tests, ANOVA, etc.).
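As a minimal sketch of what such a criterion-based test does (the data are simulated, the group sizes and means are invented, and a normal approximation stands in for a true t-test; the helper name is hypothetical):

```python
import math
import random
import statistics

def welch_z_test(x, y):
    """Two-sided large-sample test for a difference in means.

    A normal approximation to Welch's t-test -- a sketch for illustration,
    not a replacement for a proper statistics library.
    """
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    z = (statistics.fmean(x) - statistics.fmean(y)) / se
    p = 2 * (1 - statistics.NormalDist().cdf(abs(z)))  # two-sided P value
    return z, p

# Simulated data: Condition X truly raises the mean of [some measurement].
rng = random.Random(7)
group_x = [rng.gauss(11.5, 2.0) for _ in range(100)]  # experienced Condition X
group_y = [rng.gauss(10.0, 2.0) for _ in range(100)]  # experienced Condition Y
z, p = welch_z_test(group_x, group_y)
print(f"z = {z:.2f}, P = {p:.4g}")
```

Because these simulated groups are drawn from distributions that genuinely differ, the test yields P < 0.05 and rejects the null hypothesis of no difference; with real data, of course, the outcome is not known in advance.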

Correctly performing even standard, parametric statistical tests is notoriously tricky. There are many variables to consider: properties of the data, a variety of possible statistical tests (all with different assumptions), different ways to normalize and transform data, etc. Moreover, there are many opportunities for confirmation bias to affect statistical tests and interpretation (sometimes called "P hacking"). For example, researchers may collect data until comparisons achieve statistical significance and then stop, or perform many statistical or experimental tests and only report significant outcomes. "P hacking" can contribute to the larger problem of "publication bias," where only significant results are successfully published. Clearly, statistical tests are not the simple, unambiguous criteria for making comparisons that we would like them to be.
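The "collect data until significance" strategy can be made concrete with a small simulation (all parameters are made up, and a simple stdlib permutation test stands in for a parametric one): both groups are drawn from the *same* distribution, so any "significant" difference is a Type I error, yet peeking after every new pair of observations inflates the false-positive rate well above the nominal 5%.

```python
import random
import statistics

def perm_p_value(x, y, n_perm=200, rng=None):
    """Two-sided permutation-test P value for a difference in group means."""
    rng = rng or random.Random()
    observed = abs(statistics.fmean(x) - statistics.fmean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # re-assign group labels at random
        diff = abs(statistics.fmean(pooled[:len(x)]) - statistics.fmean(pooled[len(x):]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def collect_until_significant(rng, start_n=5, max_n=20, alpha=0.05):
    """One simulated study that 'peeks': test after every new pair of
    observations and stop as soon as P < alpha. Both groups come from the
    SAME distribution, so any 'significant' result is a Type I error."""
    x = [rng.gauss(0, 1) for _ in range(start_n)]
    y = [rng.gauss(0, 1) for _ in range(start_n)]
    while len(x) <= max_n:
        if perm_p_value(x, y, rng=rng) < alpha:
            return True  # false positive under a true null
        x.append(rng.gauss(0, 1))
        y.append(rng.gauss(0, 1))
    return False

rng = random.Random(42)
trials = 100
rate = sum(collect_until_significant(rng) for _ in range(trials)) / trials
print(f"False-positive rate with optional stopping: {rate:.2f} (nominal: 0.05)")
```

Even though every individual test uses the conventional 5% criterion, repeatedly testing and stopping at the first "significant" result produces false positives far more than 5% of the time.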

Ethical scientists try to perform the most appropriate statistical tests possible (often working with statisticians to do so). The study design and the data determine the appropriate type of statistical tests. Moreover, there is a common convention that the probability of Type I error (finding a significant difference when none actually exists) should be at most 5% (P < 0.05). In some situations, 5% is too high of a potential for error, and lower thresholds (e.g. 1% or 0.1%) are more appropriate.

Performing statistical tests that result in P values slightly above the conventional threshold of 0.05 (for example, 0.06) is *frustrating*. P values close to 0.05 are so frustrating that some scientists try to treat them as if they were significant, referring to "non-significant differences" or "trends" in the data.

A "non-significant difference" is simply an oxymoron. Determining what constitutes a "trend" involves interpreting the results of the statistical tests. Therefore discussing "non-significant differences" or "trends" is NOT appropriate for the Results section, and "trends" can NOT be used to test Measurable Hypotheses in the Results. The Discussion section provides opportunities for more complex arguments that can weigh the probability that the statistical tests resulted in Type II errors based on assumptions or limitations of the study. The Discussion also provides opportunities for interpreting statistical probabilities in a more nuanced way than simply criterion-based rejection (Goodman, 2016).

When using criterion-based statistical tests, claims of differences among groups must be supported by statistical tests that demonstrate significant differences at a pre-determined significance level (conventionally P < 0.05). Comparisons without statistical support cannot be objectively described as "different." Comparisons that do not reach the pre-determined significance level also cannot be objectively described as "different."

Moreover, the terms "different," "higher," "lower," "increased," "decreased," "greater than," "less than," and any other comparisons are **only** acceptable when referring to statistically significant differences.
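As a toy illustration of this wording rule (the function, its name, and the canned phrases are hypothetical, not taken from any journal's style guide):

```python
def results_wording(p_value, alpha=0.05):
    """Suggest Results-section wording for a two-group comparison.

    Criterion-based rule: comparative terms ("different", "higher", "lower")
    are acceptable only when the pre-determined threshold is met; above the
    threshold, no difference may be claimed -- and "trend" is off-limits.
    """
    if p_value < alpha:
        return f"significantly different (P = {p_value:.3f})"
    return f"no significant difference detected (P = {p_value:.3f})"

print(results_wording(0.030))  # comparative language is acceptable
print(results_wording(0.060))  # NOT a "trend": no difference may be claimed
```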

**Scientists often use statistics to *support* Measurable Hypotheses.**

An extensive discussion of statistics is clearly outside our scope (please refer to the companion modules on Statistics and Research Methods for more information). Designing studies and selecting appropriate statistical tests are important tasks that require considerable training and thought (and often consultation with a statistician). However, one *limitation* of statistics is directly relevant to our discussion:

Statistical tests alone can provide evidence for differences between or among groups to a particular level of confidence (often indicated by the P value). However, the failure of criterion-based statistical tests (alone) is NOT strong evidence for the absence of differences between or among groups (without additional analyses such as interval or power analysis; Amrhein et al., 2019). The asymmetry of statistical tests is captured by the aphorism "absence of evidence is not evidence of absence." Therefore, statistics are formally used to test null hypotheses: hypotheses of NO difference between or among groups.
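This asymmetry can be illustrated with a sketch (the data are invented, and a simple percentile bootstrap stands in for the interval analyses that Amrhein et al. recommend): a small study can fail to reach P < 0.05 while its confidence interval still spans differences large enough to matter.

```python
import random
import statistics

def bootstrap_diff_ci(x, y, n_boot=2000, alpha=0.05, rng=None):
    """Percentile-bootstrap confidence interval for mean(x) - mean(y)."""
    rng = rng or random.Random()
    diffs = []
    for _ in range(n_boot):
        bx = [rng.choice(x) for _ in x]  # resample each group with replacement
        by = [rng.choice(y) for _ in y]
        diffs.append(statistics.fmean(bx) - statistics.fmean(by))
    diffs.sort()
    return diffs[int(alpha / 2 * n_boot)], diffs[int((1 - alpha / 2) * n_boot) - 1]

# Invented small samples (n = 6 per group): the means differ by ~0.5 units,
# but the study is far too small to be confident either way.
group_x = [5.1, 6.3, 4.8, 7.0, 5.9, 6.4]
group_y = [4.9, 5.2, 6.1, 4.4, 5.6, 6.3]
lo, hi = bootstrap_diff_ci(group_x, group_y, rng=random.Random(1))
print(f"95% CI for the difference in means: [{lo:.2f}, {hi:.2f}]")
```

The interval includes zero (so no difference can be claimed in the Results), but it also includes differences large enough to be meaningful: exactly the situation in which a failed test must not be read as evidence of absence.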

However, null hypotheses are cumbersome. For example, consider our General Hypothesis "the average value for [some measurement] will be higher for individuals who experience Condition X than the average value for [some measurement] in individuals who experience Condition Y." To pose our General Hypothesis as a Null hypothesis, we could negate the hypothesis: "the average value for [some measurement] will NOT be higher for individuals who experience Condition X than the average value for [some measurement] in individuals who experience Condition Y." We could then construct a deductive argument using *modus tollens*:

PREMISE: If our General NULL Hypothesis is true then we would NOT expect a significant difference in [some measurement] between experimental groups X and Y.

PREMISE: We DO find a significant difference in [some measurement] between Group X and Group Y.

CONCLUSION: Therefore, our experiment rejects our Measurable (and General) Null Hypothesis.

The syllogism is reasonable, but involves a lot of negatives (rejecting a null hypothesis). Instead of sticking with the formality of Null hypotheses, scientists often take some shortcuts. Scientists may take an "inverse" (of sorts) of the above syllogism, to create the more positive deductive argument:

PREMISE: If our General Hypothesis is true then we would expect to observe a significant difference in [some measurement] between experimental groups X and Y.

PREMISE: We DO find a significant difference in [some measurement] between Group X and Group Y.

CONCLUSION: Therefore, our experiment supports our Measurable (and General) Hypothesis.

Does the second syllogism seem like a valid argument?

If you object that the second syllogism seems an awful lot like affirming the consequent... your concerns are warranted! The argument IS structured similarly to affirming the consequent, and is therefore in danger of being a logical fallacy.

Why would scientists routinely use arguments that could be fallacies?

The shortcut that scientists are actually using is combining *modus tollens* and Strong Inference. Scientists are considering the Null hypotheses to be alternatives to the General and Measurable Hypotheses. Rejecting the Null Hypotheses (i.e. in the first syllogism) DOES reject an alternative hypothesis, and can therefore be considered to "support" the Measurable and General Hypotheses through Strong Inference (second syllogism).

**Hypotheses *cannot* be experimentally "accepted" or "proven."**

Terminology becomes (unfortunately) important when discussing hypotheses. The conclusion to "support" a hypothesis is acceptable if we consider the word "support" to mean rejecting at least one alternative (like the Null Hypothesis). However, "supporting" a hypothesis does NOT imply a claim that a hypothesis is true -- simply that the hypothesis has not been rejected *YET*.

Stronger terminology like "accepting" or "proving" hypotheses is NOT appropriate, because "accepting" or "proving" implies that the hypothesis has been found to be true. Hypotheses cannot be declared unquestionably true using either deductive or inductive reasoning. Strong Inference cannot reject all possible alternatives, and inductive reasoning cannot lead to proof or truth. Therefore, although "proof" is available in closed systems like mathematics, and "accepting" hypotheses is terminology sometimes used for statistical hypotheses, "proving" or "accepting" hypotheses is not possible in the messy world of experimental research.

**The Results can include comparisons to test Measurable Hypotheses, but *not* General Hypotheses.**

For specificity, both General and Measurable hypotheses have been part of our present discussion of hypothesis testing. However, the Results section of a scientific paper need only address the Measurable Hypotheses. One task of the Introduction (or potentially Methods) is to explain how each General Hypothesis leads to each measurable prediction (Measurable Hypothesis). Because Measurable Hypotheses can be tested using objective, statistical comparisons that do not require interpretation, testing Measurable Hypotheses is appropriate for the Results. However, testing General Hypotheses most often requires judgment, and therefore must be left to the Discussion section.