Type I and Type II Errors

False positives, false negatives and power

In statistical hypothesis testing, two fundamental types of error can occur. A Type I error occurs when a true null hypothesis is rejected (a false positive); its probability is denoted α. A Type II error occurs when a false null hypothesis is not rejected (a false negative); its probability is denoted β. Statistical power, 1 − β, is the probability of detecting an effect that truly exists. The two errors trade off against each other: at a fixed sample size, reducing α increases β and lowers power, while larger samples and larger effect sizes raise power.

Core Concepts and Definitions

Within the Neyman–Pearson framework, a hypothesis test can commit two kinds of error. A Type I error occurs when the null hypothesis (H₀) is rejected even though it is true. Its probability is denoted α and is conventionally set at 0.05 or 0.01. A Type II error occurs when a false null hypothesis is not rejected; its probability is denoted β. Statistical power is the probability of correctly detecting a true effect, defined as 1 − β. The four possible outcomes (correct rejection, correct retention, Type I error, and Type II error) can be arranged in a 2 × 2 table that crosses the decision (reject or retain H₀) with the true state of affairs (H₀ true or false).
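The outcomes in that table can be made concrete with a small Monte Carlo sketch. It assumes a two-sided one-sample z-test with known σ, a simplification chosen for illustration rather than a test the text prescribes, and the function names z_test_rejects and rejection_rate are hypothetical. When the true mean equals the null value, every rejection is a Type I error, so the rejection rate should land near α; when the true mean differs, the rejection rate estimates power and its complement estimates β.

```python
import math
import random
from statistics import NormalDist

def z_test_rejects(sample, mu0, sigma, alpha):
    """Two-sided one-sample z-test with known sigma: True if H0 (mean == mu0) is rejected."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    return abs(z) > z_crit

def rejection_rate(true_mean, n=30, sigma=1.0, alpha=0.05, trials=10_000, seed=1):
    """Fraction of simulated studies (each with n normal draws) that reject H0: mean == 0."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        if z_test_rejects(sample, mu0=0.0, sigma=sigma, alpha=alpha):
            rejections += 1
    return rejections / trials

# H0 true: every rejection is a Type I error, so this estimates alpha
type_i_rate = rejection_rate(true_mean=0.0)
# H0 false (true shift of 0.5 sigma): rejections are correct, so this estimates power
power = rejection_rate(true_mean=0.5)

print(f"estimated Type I error rate: {type_i_rate:.3f}")
print(f"estimated power: {power:.3f}, estimated beta: {1 - power:.3f}")
```

The sample size, shift, and seed are arbitrary illustrative values; changing them shows how the estimated power moves while the Type I error rate stays near α.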

The Trade-off Among α, β, and Power

There is an inescapable trade-off between α and β: at a fixed sample size and effect size, lowering the α threshold raises β and reduces power. Reducing both errors at once requires more information, most directly a larger sample (less noisy measurement, which lowers σ, has a similar effect). Power depends on sample size (n), effect size (δ), the α level, and the variability of the data (σ); increasing n, δ, or α raises power, while greater variability lowers it. The mechanism runs through the standard error, SE = σ/√n: as n grows, SE shrinks, so a true effect of a given size lies more standard errors away from the null value and is easier to detect. A priori power analysis uses these relationships to determine, before data collection begins, the minimum sample size needed to reach a desired power (commonly 0.80) for a specified effect size and α.
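Under the same simplifying assumption (a two-sided one-sample z-test with known σ), power has a closed form, and the a priori sample size follows from it: n = ((z₁₋α/₂ + z₁₋β) · σ / δ)², rounded up. The sketch below uses hypothetical helper names z_test_power and required_n and illustrates each claim in the paragraph: a stricter α lowers power at fixed n, a larger n raises it, and with δ = 0 the "power" collapses to α. Studies that estimate σ usually rely on a t-test, for which dedicated power software gives slightly larger sample sizes, so treat these figures as a normal approximation.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_test_power(delta, n, sigma=1.0, alpha=0.05):
    """Power of a two-sided one-sample z-test (known sigma) against a true shift delta."""
    se = sigma / math.sqrt(n)                      # SE = sigma / sqrt(n)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = delta / se                             # true effect measured in standard errors
    # Probability that the test statistic falls in either rejection region
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)

def required_n(delta, sigma=1.0, alpha=0.05, power=0.80):
    """Smallest n reaching the target power (normal approximation; ignores the far tail)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / delta) ** 2)

# Trade-off at fixed n: a stricter alpha lowers power
for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: power = {z_test_power(delta=0.5, n=30, alpha=alpha):.3f}")

# Larger n raises power at fixed alpha
for n in (10, 30, 100):
    print(f"n = {n}: power = {z_test_power(delta=0.5, n=n):.3f}")

# A priori analysis: n for 80% power to detect delta = 0.5 sigma at alpha = 0.05
print(f"required n: {required_n(delta=0.5)}")
```

For δ = 0.5σ at α = 0.05 and 80% power, the formula gives n = 32 under these assumptions.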

Common Misunderstandings

Misinterpretation of the p-value is a common source of confusion about Type I and Type II errors. The p-value is the probability, computed assuming H₀ is true, of observing a result at least as extreme as the one obtained; it is not the probability that H₀ is true, and it does not measure effect size, practical importance, or replicability. Failing to reject H₀ does not prove it true; it only means the data do not provide sufficient evidence against it. Likewise, it is a mistake to treat every result that misses the α = 0.05 threshold as evidence of no effect: in an underpowered study the Type II error rate can be very high, so a real effect may easily be missed. Finally, 'statistically significant' and 'practically important' are distinct concepts: with a large enough sample, even a trivially small effect can produce a small p-value.
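Two of these pitfalls can be checked numerically with the same z-test approximation (known σ; the helper names two_sided_p and z_test_power are hypothetical, and the effect and sample sizes are illustrative choices, not values from any study). The first part models an underpowered design, δ = 0.3σ with n = 20, in which a real effect goes undetected roughly three times out of four. The second shows that an identical, negligible observed difference of 0.01σ is far from significant with n = 100 but highly significant with n = 100,000, so the p-value alone says little about effect size or practical importance.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def z_test_power(delta, n, sigma=1.0, alpha=0.05):
    """Power of a two-sided one-sample z-test (known sigma) against a true shift delta."""
    shift = delta / (sigma / math.sqrt(n))
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)

def two_sided_p(mean_diff, n, sigma=1.0):
    """Two-sided p-value of a one-sample z-test for an observed mean difference."""
    z = mean_diff / (sigma / math.sqrt(n))
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))

# Underpowered study: a real effect of 0.3 sigma, only 20 observations
beta = 1 - z_test_power(delta=0.3, n=20)
print(f"Type II error rate: {beta:.2f}")   # most such studies miss the real effect

# Same tiny observed difference, very different p-values depending on n
p_small_n = two_sided_p(mean_diff=0.01, n=100)
p_large_n = two_sided_p(mean_diff=0.01, n=100_000)
print(f"p with n = 100: {p_small_n:.3f}; p with n = 100,000: {p_large_n:.4f}")
```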

Importance in Research Practice

The asymmetric costs of the two error types should inform how α and power targets are set in a given research domain. In medical diagnostics, a Type II error (missing a disease) can be severe, so high power is prioritized. In drug regulation, by contrast, keeping Type I errors rare is critical, because a false positive can lead to the approval of an ineffective drug. In the social sciences, the prevalence of underpowered studies is widely regarded as a key contributor to the replication crisis. Current reporting standards call for effect sizes and confidence intervals alongside p-values, and pre-registered study protocols typically address both Type I and Type II error risks, for example by stating α in advance and justifying the sample size with a power analysis, as part of transparent research design.
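One standard way to see why low power undermines replicability is to ask what share of significant results reflect real effects. Assuming that a fraction π of the hypotheses a field tests are actually true (π = 0.1 below is an illustrative assumption, not an estimate for any discipline), the expected share is power · π / (power · π + α · (1 − π)). The hypothetical function positive_predictive_value evaluates this for several power levels at α = 0.05; as power falls, Type I errors make up a growing share of significant findings, so more of them can be expected to fail replication.

```python
def positive_predictive_value(power, alpha=0.05, prior_true=0.1):
    """Expected share of significant results that reflect real effects.

    prior_true is the assumed fraction of tested hypotheses that are actually true.
    """
    true_positives = power * prior_true          # real effects that reach significance
    false_positives = alpha * (1 - prior_true)   # Type I errors among true null hypotheses
    return true_positives / (true_positives + false_positives)

for power in (0.8, 0.5, 0.2):
    ppv = positive_predictive_value(power)
    print(f"power {power:.1f}: {ppv:.0%} of significant results are true effects")
```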

Key thinkers

Jerzy Neyman (1894–1981): Polish mathematician and statistician who, together with Egon Pearson, developed the Neyman–Pearson framework for hypothesis testing.
Egon Pearson (1895–1980): British statistician who built on his father Karl Pearson's legacy and, through his collaboration with Neyman, laid the foundations of modern statistical decision theory.

Sources

Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A, 231, 289–337. DOI: 10.1098/rsta.1933.0009