Multiple testing

The main idea...

The problem of multiple testing (also known as multiple comparisons) occurs when the same statistical test is used to assess numerous hypotheses simultaneously, without taking into account the increased probability of detecting a significant result by chance alone.

To illustrate, suppose an acceptable Type I error rate (the probability of rejecting the null hypothesis when it is true; an incorrect result) was deemed to be 0.05 or 5% for a given test. This means that in 95% of cases where the null hypothesis is true, it would correctly not be rejected by that test. However, if two such tests were conducted simultaneously (and both null hypotheses were true), there would be only a 0.95 × 0.95, or 90.25%, chance of obtaining correct results in both. If 100 such tests were conducted simultaneously, there would be only a roughly 0.6% (0.95^100) chance of obtaining correct results in all cases, i.e. a roughly 99.4% chance of committing at least one Type I error. This risk is far from the 5% threshold set for a single test and casts doubt on any result drawn from the set of 100 tests conducted.
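
As a quick check of this arithmetic, the probability of at least one Type I error across m independent tests, each run at an alpha of 0.05, is 1 - (1 - 0.05)^m. The short Python sketch below reproduces the figures quoted above; the choice of 1, 2 and 100 tests is purely illustrative.

    alpha = 0.05

    for m in (1, 2, 100):
        p_all_correct = (1 - alpha) ** m     # all m true nulls correctly retained
        p_any_type_i = 1 - p_all_correct     # at least one false positive
        print(f"{m:>3} tests: P(no Type I error) = {p_all_correct:.4f}, "
              f"P(at least one) = {p_any_type_i:.4f}")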

Corrections for multiple testing

The family-wise error rate

The choice of correction should reflect the cost of committing either a Type I or Type II error in your individual experiment. If the cost of a Type I error (i.e. declaring a false positive) is higher than the cost of a Type II error (a false negative), a more conservative correction may be a good choice. If, however, the nature of the experiment is more speculative and exploratory, a higher Type I error rate is often more acceptable and a more sensitive correction may be used. Below, a few popular family-wise error rate (FWER) correction measures are described. These methods are of particular interest when one wants to control the risk of committing any Type I errors in the entire family of tests.
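
As an illustration rather than a prescription, the sketch below applies two widely used FWER corrections, the Bonferroni and Holm procedures, to a hypothetical vector of raw p-values; the p-values and the alpha level of 0.05 are placeholders.

    import numpy as np

    def bonferroni(pvals, alpha=0.05):
        """Reject H0 wherever p <= alpha / m; controls the FWER at alpha."""
        m = len(pvals)
        return np.asarray(pvals) <= alpha / m

    def holm(pvals, alpha=0.05):
        """Holm's step-down procedure; also controls the FWER at alpha."""
        p = np.asarray(pvals)
        order = np.argsort(p)                      # test p-values from smallest up
        reject = np.zeros(len(p), dtype=bool)
        for rank, idx in enumerate(order):
            if p[idx] <= alpha / (len(p) - rank):  # threshold relaxes at each step
                reject[idx] = True
            else:
                break                              # stop at the first non-rejection
        return reject

    pvals = [0.001, 0.012, 0.035, 0.040, 0.200]
    print(bonferroni(pvals))   # only the smallest p-value survives
    print(holm(pvals))         # the two smallest survive

Holm's procedure rejects at least as many hypotheses as the Bonferroni correction while still controlling the FWER, which is why it is often preferred when FWER control is required.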

Large data sets and the false discovery rate

Methods other than FWER corrections may be more appropriate for large data sets, as applying FWER corrections to large numbers of tests is likely to produce adjusted significance thresholds that are far too conservative, resulting in many false negatives (Type II errors). Large sampling campaigns in microbial ecology and the use of technologies such as microarrays and genomic sequencing require techniques that provide reasonable significance corrections when thousands or millions of comparisons are performed. In this scenario, one is generally willing to accept a certain proportion of "false positives" or "false discoveries".

False discovery rate (FDR)

The false discovery rate (FDR; Benjamini & Hochberg, 1995) is the expected proportion of "false discoveries" (i.e. Type I errors) among all "discoveries" (instances where the null hypothesis has been rejected, whether it is true or false), multiplied by the probability of making at least one discovery. This proportion (Q) may be set to an acceptable level, for example 5%, or the number of false discoveries may be directly estimated. The FDR is scalable to any number of tests: a Q of 5% may correspond to 5 false discoveries among 100 discoveries or 50 among 1000. Further, the FDR is adaptive in the sense that the number of false discoveries is only meaningful when compared to the total number of discoveries made.
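
For concreteness, the following is a minimal sketch of the Benjamini & Hochberg (1995) step-up procedure, which controls the FDR at a chosen level Q; the p-values used here are hypothetical.

    import numpy as np

    def benjamini_hochberg(pvals, Q=0.05):
        """Return a boolean mask of hypotheses rejected at FDR level Q."""
        p = np.asarray(pvals)
        m = len(p)
        order = np.argsort(p)                        # p-values in ascending order
        ranked = p[order]
        # find the largest k such that p_(k) <= (k / m) * Q
        below = ranked <= (np.arange(1, m + 1) / m) * Q
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()
            reject[order[:k + 1]] = True             # reject the k smallest p-values
        return reject

    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(pvals, Q=0.05))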

Positive false discovery rate (pFDR)

The positive false discovery rate (pFDR) is very similar to the FDR; however, it has some different properties that allow the estimation of an FDR, and hence a measure of significance, for each hypothesis tested (Storey, 2002; Storey, 2003). This measure is called the q-value and is a function of the p-value of an individual test and the distribution of the p-values for all the tests performed. As most statistical tests report non-adjusted p-values in their output, it is quite straightforward to estimate the corresponding q-values from these p-values (Storey and Tibshirani, 2003; see implementations). Storey (2002, 2003) argues that the pFDR offers more power than the FDR. A q-value reflects the percentage of test results (discoveries) that are likely to be false positives (false discoveries) for a given p-value in a given set of tests. For example, if a hypothesis test result, Hi, has a q-value of 0.02, 2% of hypothesis tests with p-values less than or equal to that of Hi are likely to be false discoveries.
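
The sketch below estimates q-values from a vector of raw p-values in the spirit of Storey and Tibshirani (2003). Note that the proportion of true null hypotheses (pi0) is estimated here with a single tuning value (lambda = 0.5), a simplification of the published smoothing approach, and the simulated p-values are purely illustrative.

    import numpy as np

    def qvalues(pvals, lam=0.5):
        """Rough q-value estimates from raw p-values (simplified pi0 estimate)."""
        p = np.asarray(pvals, dtype=float)
        m = len(p)
        pi0 = min(1.0, np.mean(p > lam) / (1.0 - lam))   # estimated share of true nulls
        order = np.argsort(p)
        ranked = p[order]
        q = pi0 * m * ranked / np.arange(1, m + 1)       # FDR estimate at each p-value
        q = np.minimum.accumulate(q[::-1])[::-1]         # enforce monotonicity
        out = np.empty(m)
        out[order] = np.minimum(q, 1.0)
        return out

    rng = np.random.default_rng(0)
    pvals = np.concatenate([rng.uniform(size=900),             # "null" tests
                            rng.uniform(0, 0.001, size=100)])  # tests with real effects
    print((qvalues(pvals) <= 0.05).sum())  # number of tests called significant at q <= 0.05

Dedicated implementations (see below) estimate pi0 more carefully and should be preferred in practice.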

Many of these approaches assume that the distribution of p-values obtained from a set of tests is "true", i.e. that each p-value accurately reflects the distribution of its test statistic. This can be problematic if p-values have been estimated from theoretical distributions that do not describe the distribution of the test statistic for a specific data set. Estimating p-values through resampling methods, such as permutation, may improve this situation.
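
As an example of such a resampling approach, the sketch below computes a permutation-based p-value for the correlation between two hypothetical variables; the simulated data and the number of permutations (999) are illustrative only.

    import numpy as np

    def permutation_pvalue(x, y, n_perm=999, seed=0):
        """Permutation p-value for the magnitude of the Pearson correlation of x and y."""
        rng = np.random.default_rng(seed)
        x, y = np.asarray(x, float), np.asarray(y, float)
        observed = abs(np.corrcoef(x, y)[0, 1])
        count = 0
        for _ in range(n_perm):
            permuted = abs(np.corrcoef(x, rng.permutation(y))[0, 1])
            if permuted >= observed:
                count += 1
        # adding 1 to numerator and denominator keeps the p-value away from exactly 0
        return (count + 1) / (n_perm + 1)

    rng = np.random.default_rng(1)
    x = rng.normal(size=30)
    y = 0.5 * x + rng.normal(size=30)
    print(permutation_pvalue(x, y))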

Examples

Consider a data matrix which records the abundances of 100 species (variables) across 30 sites (objects). Pair-wise correlations may be calculated (between species 1 and 2, 1 and 3, 2 and 3, etc.) and tested for significance by, e.g., permutation. An alpha level of 0.05 is deemed acceptable for a single test. If each species is to be tested against every other species, a total of 4950 (100 × 99 ÷ 2) unique tests would be conducted. The Bonferroni correction would require the alpha level to be adjusted to approximately 1.01 × 10^-5, or 0.05 ÷ 4950.
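
The same calculation, expressed as a short Python snippet:

    from math import comb

    n_species = 100
    alpha = 0.05

    n_tests = comb(n_species, 2)    # 4950 unique species pairs
    threshold = alpha / n_tests

    print(n_tests)                  # 4950
    print(f"{threshold:.2e}")       # about 1.01e-05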

Implementations

The following implementations often use different approaches to calculate FDRs and may require different input (e.g. p-values or z-scores). Please read their documentation and ensure that the FDR calculated is appropriate to your data.

References