The Multiple Comparisons Problem

Many tests inflate false positives

When many hypothesis tests are conducted in a single study, the probability of obtaining at least one false positive by chance rises far above the nominal α level. This phenomenon is captured by the concept of the family-wise error rate (FWER). The Bonferroni correction controls FWER by dividing α by the number of tests; the Benjamini–Hochberg procedure instead controls the false discovery rate (FDR), offering greater statistical power in large-scale testing scenarios.

Core Idea and Definition

When a single hypothesis test is conducted at α = 0.05 and the null hypothesis is true, there is a 5% chance of a false positive (Type I error). When k independent tests are run and every null hypothesis is true, the probability of at least one false positive is FWER = 1 − (1 − α)^k. For example, with 20 independent tests at α = 0.05 this is 1 − 0.95^20 ≈ 0.64, or roughly 64%. The family-wise error rate describes the cumulative Type I error risk across a 'family' of related tests and lies at the heart of the multiple comparisons problem.
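The formula can be checked numerically. The following is a minimal sketch, assuming independent tests with every null hypothesis true; the function name fwer_independent is chosen here for illustration only.

```python
def fwer_independent(alpha: float, k: int) -> float:
    """Probability of at least one false positive across k independent
    tests, each run at level alpha, when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** k


# Show how quickly the family-wise error rate grows with the number of tests.
for k in (1, 5, 20, 100):
    print(f"k = {k:>3}: FWER = {fwer_independent(0.05, k):.3f}")
```

With α = 0.05 the value for k = 20 comes out at about 0.64, matching the figure quoted above, and by k = 100 at least one false positive is close to certain.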

How Correction Methods Work

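The Bonferroni correction sets the significance threshold for each individual test to α/k, where k is the total number of tests. This guarantees FWER ≤ α whatever the dependence between tests, but it can be overly conservative, especially when k is large or the tests are positively correlated, and it then substantially reduces statistical power. The Benjamini–Hochberg (1995) procedure takes a different approach: it sorts the k p-values in ascending order, finds the largest rank i for which the i-th smallest p-value is at most (i/k)·q, and rejects the hypotheses with the i smallest p-values, where q is the desired false discovery rate (FDR). The FDR is the expected proportion of false positives among all rejected hypotheses. The bound holds for independent tests and under certain forms of positive dependence. Because it limits the proportion of false discoveries rather than the chance of any false positive at all, the method is typically far more powerful than Bonferroni in large-scale genomic or neuroscience studies.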
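The two procedures can be sketched in a few lines. This is an illustrative implementation using only the standard library, not a replacement for a vetted statistics package; the function names and the example p-values are made up for demonstration.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i when p_i <= alpha / k (controls the FWER)."""
    k = len(p_values)
    if k == 0:
        return []
    threshold = alpha / k
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (controls the FDR at level q)."""
    k = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(k), key=lambda i: p_values[i])
    # Largest rank whose p-value lies at or below its threshold (rank / k) * q.
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / k * q:
            max_rank = rank
    # Reject every hypothesis ranked at or below max_rank, even those
    # whose own p-value exceeded its own rank threshold.
    reject = [False] * k
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject


# Hypothetical p-values from ten tests.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni rejections:", sum(bonferroni(p)))
print("Benjamini-Hochberg rejections:", sum(benjamini_hochberg(p)))
```

On these made-up values Bonferroni rejects only the smallest p-value (threshold 0.05/10 = 0.005), while Benjamini–Hochberg also rejects the second, since 0.008 ≤ (2/10)·0.05 = 0.01. In practice an established library routine is preferable to hand-written code.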

Common Misuses and Misconceptions

A common misconception is that correction is only necessary for post-hoc comparisons; in fact the problem applies equally when many independent tests are conducted across a research design. Another error is misdefining the 'family': all tests performed on the same dataset generally belong to the same family. It is also wrong to assume Bonferroni is always 'safe'; with strongly positively correlated tests it over-controls the FWER, so genuine effects can go undetected (Type II errors). FDR control is not 'tolerating errors'; it is a deliberate trade-off to preserve power in exploratory research.
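How conservative Bonferroni becomes depends on the dependence between tests. The simulation below is a sketch under stated assumptions: every null hypothesis is true, so each p-value is uniform on (0, 1), and the tests are either fully independent or share one identical p-value, an extreme stand-in for strong positive correlation. The function name, seed and simulation size are arbitrary choices.

```python
import random


def simulate_bonferroni_fwer(k=20, alpha=0.05, n_sims=20000, seed=0):
    """Estimate the realised FWER of Bonferroni when all nulls are true,
    for independent tests and for perfectly correlated tests."""
    rng = random.Random(seed)
    threshold = alpha / k
    hits_independent = 0
    hits_identical = 0
    for _ in range(n_sims):
        # Independent tests: each null p-value is its own Uniform(0, 1) draw.
        if any(rng.random() <= threshold for _ in range(k)):
            hits_independent += 1
        # Perfectly correlated tests: one draw is shared by all k tests.
        if rng.random() <= threshold:
            hits_identical += 1
    return hits_independent / n_sims, hits_identical / n_sims


fwer_ind, fwer_same = simulate_bonferroni_fwer()
print(f"independent tests:          FWER ~ {fwer_ind:.4f}")
print(f"perfectly correlated tests: FWER ~ {fwer_same:.4f}")
```

In theory the independent case gives 1 − (1 − 0.0025)^20 ≈ 0.049, close to the nominal 0.05, while the perfectly correlated case gives only 0.0025. The realised error rate there is twenty times lower than intended, and that margin is paid for in lost power.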

Why It Matters in Research Practice

The multiple comparisons problem is widely cited as one of the principal drivers of the replication crisis in science, and p-hacking and selective reporting compound it. Researchers can reduce false-positive risk by pre-registering which tests they will conduct before data collection and by transparently reporting all tests performed. The choice of correction method depends on whether the study is confirmatory, where FWER control is critical, or exploratory, where FDR control is often sufficient. In either case, the decision to apply or forgo correction should be explicitly justified and reported.

Sources

Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). SAGE. ISBN: 978-1-5264-1951-4