# MODUS TOLLENS

The logical framework modus tollens can be used to reject hypotheses, but using modus tollens with statistical tests can present a problem.

Imagine that we were interested in determining whether non-repetitive practice (such as alternating between studying subjects instead of studying each subject in one long “block”) leads to more learning than blocked practice. We could imagine making a specific prediction based on a General Hypothesis. For example, we could predict:

PREMISE 1: IF non-repetitive study results in more learning than blocked study of mathematics skills,

THEN serial study (one type of non-repetitive study) will result in significantly higher scores on algebra, geometry, and word problem tests than blocked study during retention tests.

The first part of the premise is very general, implying that all types of non-repetitive study result in more learning than blocked study of mathematics skills. Therefore, the first part of the premise could be thought of as a General Hypothesis. The second part of the premise is a specific prediction (out of MANY possible). The second part of the premise is therefore one possible Measurable Hypothesis.

We could then perform an experiment, collect data, perform statistical tests, and find:

PREMISE 2: Average retention test scores on algebra, geometry, and word problem tests using serial study were significantly higher than average scores using blocked study during retention tests (P < 0.05).

We would like to make the conclusion:

CONCLUSION: Serial study results in more learning than blocked study of mathematics skills. Non-repetitive study therefore results in more learning than blocked study of mathematics skills.

Would our conclusion be justified?

The conclusion seems reasonable at first. However, you might also notice that the conclusion is actually NOT justified, because the argument is structured to affirm the consequent. Just because serial study resulted in higher scores than blocked study in the experiment we conducted does NOT mean that the difference arose because the general hypothesis (that non-repetitive study results in more learning than blocked study) is true. There could be other reasons for the differences that we observed in our experiment.

We need to use a valid logical structure to test our hypotheses.

The syllogism modus tollens can be used to reject hypotheses.

We can remember that modus tollens is a valid deductive syllogism. Modus tollens takes the form:

PREMISE 1: If A then B.

PREMISE 2: B is NOT true.

CONCLUSION: Therefore, A is NOT true.
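The difference between modus tollens and affirming the consequent can be checked mechanically with a truth table. The following is a minimal sketch in Python; the helper names `implies` and `is_valid` are our own illustrations, not part of any standard logic library. An argument form is treated as valid when no combination of truth values makes every premise true while the conclusion is false.

```python
from itertools import product


def implies(a, b):
    """Material conditional: 'If A then B' is false only when A is true and B is false."""
    return (not a) or b


def is_valid(premises, conclusion):
    """Return True if no truth assignment makes all premises true and the conclusion false."""
    for a, b in product([True, False], repeat=2):
        if all(premise(a, b) for premise in premises) and not conclusion(a, b):
            return False  # found a counterexample
    return True


# Modus tollens: If A then B; B is NOT true; therefore A is NOT true.
modus_tollens = is_valid(
    premises=[lambda a, b: implies(a, b), lambda a, b: not b],
    conclusion=lambda a, b: not a,
)

# Affirming the consequent: If A then B; B is true; therefore A is true.
affirming_the_consequent = is_valid(
    premises=[lambda a, b: implies(a, b), lambda a, b: b],
    conclusion=lambda a, b: a,
)

print("modus tollens valid:", modus_tollens)
print("affirming the consequent valid:", affirming_the_consequent)
```

The check reports modus tollens as valid and affirming the consequent as invalid. The counterexample for affirming the consequent is the case where A is false and B is true: both premises hold, yet the conclusion does not.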

How can we use modus tollens to test hypotheses?

Our first premise can be the same: we seek to test a general hypothesis with a specific experiment that tests a measurable hypothesis:

PREMISE 1: IF non-repetitive study results in more learning than blocked study of mathematics skills,

THEN serial study (one type of non-repetitive study) will result in significantly higher scores on algebra, geometry, and word problem tests than blocked study during retention tests.

However, we use modus tollens to reject hypotheses. Imagine that our experiment turned out the other way, and there were NO significant differences between study strategies. We would therefore have the second premise:

PREMISE 2: Average retention test scores on algebra, geometry, and word problem tests using serial study were NOT significantly higher than average scores using blocked study during retention tests (P > 0.05).

Using modus tollens, we could come to the conclusion:

CONCLUSION: Serial study does NOT result in more learning than blocked study of mathematics skills. Non-repetitive study does not result in more learning than blocked study of mathematics skills.

Is the argument a valid deductive argument and a form of modus tollens?

Logically, the argument is valid because it does have the form of modus tollens. If our second premise is also true and leads to the conclusion that serial study does not always result in more learning of mathematics skills than blocked study, then modus tollens seems to provide the opportunity to reject general hypotheses even based on a single experiment.

Is the argument a sound deductive argument?

The argument will be sound if the second premise is true. You might argue: "how could we question its truthfulness without actually seeing the data?" You would have a legitimate point. HOWEVER, there is one problem with Premise 2 that doesn't depend on the data.

The problem with Premise 2 is that in common practice, statistical tests (like t-tests, ANOVA, etc.) are asymmetrical. Statistical tests CAN test for differences among groups to a specified level of confidence (e.g. 95%). A “significant difference” means that the observed differences between groups are unlikely to have happened by chance, and could therefore be meaningful.

However, if a statistical test fails to find significant differences among groups, then the statistical test has simply failed. A failed statistical test is NOT strong evidence of the absence of differences among groups. Typically, a “failed” statistical test simply means that apparent differences between groups could be due to chance alone – but could still also be due to some more meaningful distinction between groups. Therefore, a failed statistical test is commonly interpreted as: we still don't know if there is a significant difference between groups or not.

For example, solely because our statistical test failed to find a significant difference between the serial study and blocked study groups (Premise 2), we cannot conclude (within our agreed-upon 95% confidence) that serial study and blocked study are NOT different. All we can conclude is that our statistical test failed to find a significant difference between groups: we still do not know if there is a difference between serial and blocked practice or not! Therefore, Premise 2 is a non-sequitur. The failure of a statistical test does NOT reasonably lead to the conclusion that serial study results in the same amount of learning as blocked study of mathematics skills.

WHY can we NOT conclude that two groups are the same if a statistical test fails to find a significant difference?

The reason that we cannot come to firm conclusions based solely on the absence of significant differences is that there are many ways for statistical tests to fail. A true lack of statistical differences between groups is only one potential reason that a statistical test can fail. Other common reasons for a "false negative" are:

* Sample sizes too small to detect differences between groups (lack of statistical "power").

* Violating one of the assumptions of parametric statistical tests (e.g. non-normal distribution).

* Outliers in the dataset that substantially increase the variance of one or more groups.

* Co-variation among variables that increases variance.
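The first reason in the list above (small sample sizes) can be illustrated with a short simulation. The sketch below is only an illustration under stated assumptions: the test scores are simulated, the group means (70 and 75), the standard deviation (10), and the function names are all hypothetical, and a one-sided two-sample z-test with a known standard deviation stands in for the t-tests discussed in this module so that only the Python standard library is needed. A real 5-point difference between groups is built into the simulated data, and we count how often the test detects it.

```python
import random
from statistics import NormalDist, mean


def one_sided_p_value(serial, blocked, sigma):
    """P value for a two-sample z-test of 'serial mean is higher', assuming sigma is known."""
    standard_error = sigma * (1 / len(serial) + 1 / len(blocked)) ** 0.5
    z = (mean(serial) - mean(blocked)) / standard_error
    return 1 - NormalDist().cdf(z)


def estimated_power(n_per_group, true_difference, sigma=10.0, alpha=0.05,
                    n_experiments=2000, seed=1):
    """Fraction of simulated experiments in which the test finds a significant difference."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_experiments):
        # Hypothetical scores: blocked study centered on 70, serial study shifted upward.
        blocked = [rng.gauss(70.0, sigma) for _ in range(n_per_group)]
        serial = [rng.gauss(70.0 + true_difference, sigma) for _ in range(n_per_group)]
        if one_sided_p_value(serial, blocked, sigma) < alpha:
            significant += 1
    return significant / n_experiments


for n in (5, 20, 80):
    print(f"n = {n:2d} per group: difference detected in "
          f"{estimated_power(n, true_difference=5.0):.0%} of experiments")
```

With only a handful of students per group, the test misses the real difference in most simulated experiments, while larger groups detect it far more often. A non-significant result from a small experiment therefore says little about whether a difference exists.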

Additional analysis such as interval identification or power analysis can determine the probability of Type II error (Giere, 2006). Completely different statistical frameworks such as Bayesian statistics can provide less categorical statistical comparisons (Höfler et al., 2018). However, a more extensive or nuanced approach to statistics is outside the scope of the current module.

We seem stuck! Both our attempts to test hypotheses encountered serious flaws:

* If the data are consistent with our predictions (measurable hypotheses), then logic is a problem. “Accepting” a hypothesis if the data are consistent with predictions is a logical fallacy: affirming the consequent.

* If the data are NOT consistent with our predictions, then statistical tests are a problem. The parametric statistical tests that we commonly use cannot allow us to be confident that there are NO significant differences between or among groups.

How, then, can we use modus tollens and statistical tests together to test Measurable and General Hypotheses? "Null" Hypotheses allow us to reject hypotheses based on the statistical finding of significant differences.

If a statistical test fails to find a statistically significant difference between groups, without additional analysis we cannot be confident that there is no actual difference between groups. However, if statistical tests are performed correctly, we can be confident (to a specified confidence level) that finding a statistically significant difference between groups indicates that an actual difference exists between groups. The level of our confidence is related to the "P value," which is tied to the potential for a "false positive." A false positive means that even though there is NO difference between two groups, our statistical test finds one. P < 0.05 means that, if there were truly NO difference between groups, a difference at least as large as the one we observed would arise by chance alone less than 5% of the time.
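One way to make the idea of "due to chance alone" concrete is a permutation test. The sketch below is an illustration, not the t-test procedure used elsewhere in this module: the scores are invented, and the function name `permutation_p_value` is our own. The test repeatedly shuffles the group labels, which simulates a world in which study strategy makes no difference, and counts how often chance alone produces a difference at least as large as the one observed.

```python
import random
from statistics import mean


def permutation_p_value(serial, blocked, n_shuffles=5000, seed=0):
    """One-sided P value: how often do shuffled labels give a difference >= the observed one?"""
    rng = random.Random(seed)
    observed = mean(serial) - mean(blocked)
    pooled = list(serial) + list(blocked)
    n_serial = len(serial)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # reassign scores to groups at random
        shuffled_difference = mean(pooled[:n_serial]) - mean(pooled[n_serial:])
        if shuffled_difference >= observed:
            at_least_as_large += 1
    # Count the observed arrangement as one of the possibilities, so P is never exactly 0.
    return (at_least_as_large + 1) / (n_shuffles + 1)


# Hypothetical retention test scores (invented for illustration only).
serial_scores = [82, 75, 90, 88, 79, 85, 91, 77]
blocked_scores = [70, 68, 80, 74, 65, 72, 78, 69]

p_value = permutation_p_value(serial_scores, blocked_scores)
print(f"P = {p_value:.4f}")
```

For these invented scores, very few shuffles reproduce a difference as large as the observed one, so the P value falls below 0.05. The P value describes how surprising the data would be if there were no difference between groups.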

Therefore, to use statistics with modus tollens, we must select a reasoning structure that allows us to use significant differences to reject hypotheses. So-called "null" hypotheses allow us to use modus tollens to reject statistical hypotheses.

DEFINITION: A Null Hypothesis is the proposition that there is NO meaningful relationship within or among variables, in one or more populations, that explains apparent patterns among samples. In our situation where we are comparing different groups, a "null" hypothesis is a prediction that there are NO differences between or among groups.

Null Hypotheses may seem awkward because we are predicting the absence of differences instead of the presence of differences (even though the presence of differences is commonly why we create the hypotheses in the first place). However, null hypotheses can help to clarify arguments. For example, we could frame our first premise as a null hypothesis:

PREMISE 1: IF non-repetitive study does NOT result in more learning than blocked study of mathematics skills,

THEN serial study will result in scores that are NOT significantly higher than blocked study on algebra, geometry, and word problems during retention tests.

If we conduct an experiment and find:

PREMISE 2: Retention test scores on algebra, geometry, and word problem tests after serial study were significantly higher than scores after blocked study during retention tests (t-tests; P < 0.05),

we can use modus tollens to come to the conclusion:

CONCLUSION: We reject our null hypothesis. Serial study results in more learning than blocked study of mathematics skills.

Our argument is both valid and sound because we use the valid syllogism modus tollens to reject a hypothesis based on a statistically significant difference.

A reasonable question might be: what if we performed our experiment and still didn't find a significant difference between groups? In the case of the lack of a significant difference, the argument becomes:

PREMISE 1: IF non-repetitive study does NOT result in more learning than blocked study of mathematics skills,

THEN serial study will result in scores that are NOT significantly higher than blocked study on algebra, geometry, and word problems during retention tests.

PREMISE 2: Retention test scores on algebra, geometry, and word problem tests after serial study were NOT significantly higher than scores after blocked study during retention tests (t-tests; P > 0.05).

CONCLUSION: We support our null hypothesis that serial study does NOT result in more learning than blocked study of mathematics skills.

Is there a problem with the final argument?

The problem with the final argument is that, again, the argument is in the form of a logical fallacy: affirming the consequent. We do not even need to think about the limitations of statistical tests to know that the argument is invalid and cannot be sound. Therefore, null hypotheses can help to clarify reasoning.

In summary, given the limits of logic and statistics, scientists often have only ONE logically sound way to test hypotheses. Scientists can construct a specific kind of statistical hypothesis, a “null” hypothesis, that predicts that observed patterns in data (such as apparent differences between or among groups) are NOT meaningful and are due to chance alone. Statistical tests can potentially reject null hypotheses (to specified levels of confidence), which in turn supports a research hypothesis that patterns in data are meaningful.
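The asymmetry summarized above can be written as a small decision rule. The sketch below is a simplification for illustration: the function name `interpret_null_test` and the wording of the returned messages are our own, and the threshold of 0.05 is the conventional level used in the examples in this module.

```python
def interpret_null_test(p_value, alpha=0.05):
    """Map a P value onto the conclusions that the reasoning in this module permits."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value < alpha:
        # Modus tollens: the null hypothesis predicted no significant difference,
        # but a significant difference was found.
        return "reject the null hypothesis"
    # Concluding 'no difference' here would be affirming the consequent.
    return "inconclusive: the null hypothesis is neither rejected nor supported"


print(interpret_null_test(0.01))
print(interpret_null_test(0.40))
```

Note that the function has no branch that returns "accept the null hypothesis": a non-significant result leaves the question open.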

Although modus tollens can help us construct valid and sound tests of statistical and measurable hypotheses, using tests of measurable hypotheses to evaluate general hypotheses typically involves much more thought. Two approaches to using measurable hypotheses to evaluate general hypotheses are “strong inference” and inductive reasoning.

Because “strong inference” is a method to generate knowledge through deduction, we will review it first. Later, we will explore how inductive reasoning can be used to evaluate general hypotheses and improve scientific models. 