Multiple comparisons app


This app compares the Tukey HSD and Bonferroni adjustments to unadjusted t-tests, using simulated data.

The user can set the sample size per group and the number of groups from which data are generated. Critical values under the Tukey HSD or Bonferroni adjustment are shown on the upper plot; unadjusted critical values under Student's t are shown on the lower plot. Significant and non-significant differences are distinguished by color. By default, data are generated under the null hypothesis that all group population means are equal, so any significant difference is a Type I error. The user can instead set the mean of group A to differ from the rest by half a standard deviation, so that any non-significant difference between A and another group is a Type II error.

The app only generates new differences when "Generate data" is clicked. After changing any setting, click "Generate data" right away; until you do, the displayed differences still reflect the previous settings.

Comparing Type I and Type II errors

This app does not calculate Type I and Type II error rates. Rather, for each new set of simulated data, it displays test statistics for all pairwise comparisons. Test statistics falling in the rejection region are colored purple.

Type I errors: By default, all population means are equal, so any test statistics in the rejection region (colored purple) are Type I errors. The Tukey and Bonferroni adjustments hold the family-wise Type I error rate at or below 5%, so you should see a Type I error on the top plot in at most about 1 in 20 generated data sets. You should frequently see Type I errors on the bottom plot, particularly when the number of groups is larger.
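To get a rough feel for how often each plot should show at least one Type I error, the default setting can be imitated outside the app. The sketch below is not the app's code. It assumes a known population standard deviation of 1 and uses z-tests in place of Student's t (so only the standard library is needed), and the function name, repetition count, and seed are illustrative choices.

```python
import random
from itertools import combinations
from math import sqrt
from statistics import NormalDist


def familywise_error_rate(k, n, crit, reps=2000, seed=1):
    """Share of simulated data sets with at least one |z| beyond crit.

    All k population means are equal, so every rejection is a Type I error.
    Assumes sigma = 1 is known (z-test), which is a simplification.
    """
    rng = random.Random(seed)
    se_mean = 1 / sqrt(n)   # SD of one group mean when sigma = 1
    se_diff = sqrt(2 / n)   # SD of the difference between two group means
    hits = 0
    for _ in range(reps):
        means = [rng.gauss(0.0, se_mean) for _ in range(k)]
        if any(abs(a - b) / se_diff > crit for a, b in combinations(means, 2)):
            hits += 1
    return hits / reps


k, n, alpha = 8, 35, 0.05
m = k * (k - 1) // 2  # number of pairwise comparisons
z = NormalDist()
unadjusted = familywise_error_rate(k, n, z.inv_cdf(1 - alpha / 2))
bonferroni = familywise_error_rate(k, n, z.inv_cdf(1 - alpha / (2 * m)))
print(f"{m} comparisons: unadjusted {unadjusted:.3f}, Bonferroni {bonferroni:.3f}")
```

Both calls use the same seed, so the two critical values are applied to the same simulated group means, much like switching plots in the app without regenerating data.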

Type II errors: Selecting "Make μₐ ≠ others" makes the null hypothesis false for any comparison between Group A and another group (μₐ is half a standard deviation larger than the others, i.e. d = 0.5). The test statistics for these comparisons are shifted slightly in the vertical direction to distinguish them from the test statistics for null comparisons. Any of these test statistics not in the rejection region (i.e. not colored purple) are Type II errors.

  • You should see Type II errors more frequently in the top plot, since the Tukey and Bonferroni adjustments reduce statistical power.

  • You should see Type II errors more frequently when sample size is smaller, and less frequently when sample size is larger.
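Both patterns can be approximated with a short calculation. The sketch below uses a normal approximation to the two-sample t-test, which is an assumption made here for simplicity and will be somewhat optimistic for small n. It covers only the Bonferroni adjustment, and `approx_power` is a hypothetical helper, not something the app exposes.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def approx_power(n, k, d=0.5, alpha=0.05, bonferroni=False):
    """Approximate power of one two-sided test of group A against another group.

    Normal approximation: the test statistic is treated as N(d * sqrt(n / 2), 1).
    """
    m = k * (k - 1) // 2 if bonferroni else 1  # comparisons sharing alpha
    crit = Z.inv_cdf(1 - alpha / (2 * m))
    shift = d * sqrt(n / 2)  # expected test statistic when means differ by d SDs
    return Z.cdf(shift - crit) + Z.cdf(-shift - crit)


for n in (10, 35, 100):
    plain = approx_power(n, k=8)
    adjusted = approx_power(n, k=8, bonferroni=True)
    print(f"n={n:3d}: Type II rate unadjusted {1 - plain:.2f}, "
          f"Bonferroni {1 - adjusted:.2f}")
```

The Type II error rate is one minus power, so the printed values fall as n grows and are larger under the adjustment at every n.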

Comparing Tukey HSD and Bonferroni adjustments

  • The Bonferroni adjustment shifts critical values outward by dividing 𝛼 by the number of pairwise comparisons, which is k(k − 1)/2 for k groups.

  • The Tukey HSD ("Honestly Significant Difference") adjustment uses the "Studentized range distribution", which is the distribution of the maximum mean difference across all pairwise comparisons, divided by standard error.

  • The test statistics under Bonferroni and Tukey differ: the Tukey test statistics pool variance across all groups to calculate standard error, whereas the "Bonferroni" test statistics are just Student's t statistics, pooling variance only across the two groups being compared. The difference between the two will be most pronounced when n is small.

  • For a larger number of comparisons, the Bonferroni adjustment becomes more conservative than the Tukey HSD adjustment. To best see this:

    1. Set n per group to 35

    2. Set number of groups to 8

    3. Select "Make μₐ ≠ others"

    4. Click "Generate data" repeatedly, stopping when you see purple differences that are just beyond the critical value under Tukey. Switch from Tukey to Bonferroni. You should often see at least one of the purple differences change color, showing that it is inside the critical value under Bonferroni.

    5. This occurs more often when unadjusted power is low-to-moderate, which is why a relatively small sample size works best.

    6. Caution: do not draw any general conclusions from this about what a "good" sample size is. Power depends on both sample size and effect size, and this app fixes the effect size at d = 0.5 when the null is false.

  • A note on the Tukey HSD plot: Tukey's Studentized range distribution is formally defined for positive values only, where the maximum mean difference is defined as the maximum minus the minimum among all group means. I wanted this app to allow for a clear visual comparison of the adjustment methods and unadjusted Student's t, so I set it to plot both a positive and negative version of the Studentized range distribution, and to allow positive and negative test statistics. All values under the formally defined (positive-only) Studentized range distribution are divided by sqrt(2), making it equivalent to Student's t when there are only two groups, and giving it the correct 5% family-wise Type I error rate when all population means are equal.
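The sqrt(2) rescaling and the gap between the two adjustments can both be checked numerically. The sketch below is an illustration under a simplifying assumption, not the app's method: it works in the large-sample limit, where the variance is treated as known, so the Studentized range reduces to the range of k standard normal variables and Student's t reduces to z. The function names, integration grid, and bisection settings are illustrative choices.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def range_cdf(r, k, lo=-8.0, hi=8.0, steps=800):
    """P(max - min <= r) for k independent standard normal variables.

    Trapezoid rule for k * integral of pdf(x) * (cdf(x + r) - cdf(x))**(k - 1),
    where x plays the role of the smallest of the k values.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * h
        inner = STD_NORMAL.cdf(x + r) - STD_NORMAL.cdf(x)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * STD_NORMAL.pdf(x) * inner ** (k - 1)
    return k * h * total


def tukey_critical(k, alpha=0.05):
    """Upper-alpha point of the range, divided by sqrt(2) to sit on the z scale."""
    lo, hi = 0.0, 10.0
    for _ in range(40):  # bisection on the range CDF
        mid = (lo + hi) / 2
        if range_cdf(mid, k) < 1 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2 / sqrt(2)


def bonferroni_critical(k, alpha=0.05):
    """Two-sided z critical value with alpha split over all pairwise comparisons."""
    m = k * (k - 1) // 2
    return STD_NORMAL.inv_cdf(1 - alpha / (2 * m))


results = {k: (tukey_critical(k), bonferroni_critical(k)) for k in (2, 3, 5, 8)}
for k, (tukey, bonf) in results.items():
    print(f"k={k}: Tukey {tukey:.3f}, Bonferroni {bonf:.3f}")
```

With two groups the two critical values coincide with the unadjusted two-sided value, which is the equivalence the division by sqrt(2) is meant to produce. With more groups the Bonferroni value sits further out than the Tukey value, which is why some differences that are significant under Tukey change color under Bonferroni.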