Statistical Significance and the p-value

What the p-value does and does not tell you

The p-value expresses the probability of obtaining data at least as extreme as observed, given that the null hypothesis is true. Researchers compare it to a pre-chosen significance level α (commonly 0.05) to decide whether to reject H₀. However, statistical significance does not guarantee practical importance; a p-value can be small even for a trivially small effect. The American Statistical Association's 2016 statement emphasizes that p-values should not be used as the sole basis for scientific or policy decisions.

Core Definition: What Is the p-value?

The p-value is the probability that data as extreme as, or more extreme than, those observed would arise if the null hypothesis (H₀) were true. When comparing two group means, a test statistic (e.g., t) is computed; for a two-tailed test, the p-value is the probability under H₀ of obtaining a t whose absolute value is at least as large as the one observed. The significance level α is set before data collection, and H₀ is rejected when p < α. Formally, for a two-tailed test: p = P(|T| ≥ |t_observed| | H₀ true). The p-value measures how compatible the data are with the null model; it is not the probability that H₀ is true.
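As a minimal sketch of the formal definition, the Python snippet below computes a two-sided p-value for a difference between two group means. It makes assumptions that are not in the text above: it uses a large-sample normal (z) approximation in place of the t distribution so that it runs on the standard library alone, it uses unpooled variances, and the function names and sample values are made up for illustration. For samples as small as the ones shown, a t distribution would be the usual reference.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

STD_NORMAL = NormalDist()  # reference distribution of the statistic under H0 (assumed)


def p_from_z(z):
    """Two-sided p-value: P(|Z| >= |z_observed|) when Z ~ N(0, 1) under H0."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))


def two_sided_p_value(group_a, group_b):
    """Return (z, p) for the difference in means of two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    # Standard error of the difference in means, using unpooled sample variances.
    se = sqrt(variance(group_a) / n_a + variance(group_b) / n_b)
    z = (mean(group_a) - mean(group_b)) / se
    return z, p_from_z(z)


# Hypothetical measurements, for illustration only.
group_a = [5.1, 4.9, 5.6, 5.8, 5.0, 5.4, 5.3, 5.7]
group_b = [4.8, 4.6, 5.0, 5.1, 4.7, 4.9, 5.2, 4.5]

z, p = two_sided_p_value(group_a, group_b)
alpha = 0.05  # chosen before looking at the data
print(f"z = {z:.3f}, p = {p:.4f}, reject H0: {p < alpha}")
```

The returned p describes how surprising the observed statistic would be if H₀ held; nothing in the calculation yields the probability that H₀ itself is true.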

How It Works: Key Distinctions

Standard error (SE = σ / √n) shrinks as the sample size n grows, so even very small differences can become statistically significant in large samples. This is why p-values must be reported alongside effect size measures (d, η², r, etc.). The α = 0.05 threshold is a historical convention; Fisher himself did not intend it as a rigid rule. One-tailed and two-tailed tests yield different p-values for the same data, so the choice must be justified by the hypothesis and made before the data are examined. Confidence intervals go beyond the p-value by conveying both the direction and magnitude of an effect along with its uncertainty.
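The dependence on sample size can be made concrete with a small sketch. The snippet below holds a trivially small observed difference fixed (0.02 standard deviations, so d = 0.02) and varies n. It assumes a one-sample setting with known σ and a normal reference distribution; the function name `summarize` and all numbers are illustrative assumptions, not values from any study.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def summarize(observed_diff, sigma, n, confidence=0.95):
    """z statistic, p-values, effect size and CI for a mean with known sigma (assumed)."""
    se = sigma / sqrt(n)                      # SE = sigma / sqrt(n)
    z = observed_diff / se
    p_two = 2.0 * (1.0 - STD_NORMAL.cdf(abs(z)))   # H1: difference != 0
    p_one = 1.0 - STD_NORMAL.cdf(z)                # H1: difference > 0
    z_crit = STD_NORMAL.inv_cdf(0.5 + confidence / 2.0)
    ci = (observed_diff - z_crit * se, observed_diff + z_crit * se)
    return {
        "n": n,
        "d": observed_diff / sigma,           # standardized effect size
        "se": se,
        "p_two_sided": p_two,
        "p_one_sided": p_one,
        "ci": ci,
    }


# Same tiny effect, increasingly large samples.
for n in (100, 10_000, 1_000_000):
    r = summarize(observed_diff=0.02, sigma=1.0, n=n)
    low, high = r["ci"]
    print(
        f"n={r['n']:>9,}  d={r['d']:.2f}  SE={r['se']:.5f}  "
        f"p(two)={r['p_two_sided']:.4g}  p(one)={r['p_one_sided']:.4g}  "
        f"95% CI=({low:.4f}, {high:.4f})"
    )
```

The effect size stays at d = 0.02 in every row while the p-value falls as n grows. The confidence interval shows the same pattern from the other side: it narrows around a value that remains practically negligible.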

Common Misinterpretations

The following points correct the most pervasive misconceptions about p-values: (1) the p-value is not the probability that H₀ is true; (2) p > 0.05 does not prove H₀; it only means the data do not strongly contradict it; (3) statistical significance does not imply practical or clinical importance; (4) the p-value does not measure the probability that a finding will replicate. When multiple comparisons are made without a correction such as the Bonferroni adjustment, the family-wise Type I error rate inflates. The ASA 2016 statement addresses several of these misconceptions as a warning to researchers.
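The inflation of the Type I error rate under multiple comparisons, and the Bonferroni adjustment that counters it, can be sketched as follows. The family-wise error formula used here, 1 − (1 − α)^m, assumes m independent tests with every null hypothesis true; the function names and the example p-values are hypothetical.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) over m independent tests when every H0 is true."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni adjustment: reject H0_i only when p_i < alpha / m."""
    m = len(p_values)
    threshold = alpha / m
    return [p < threshold for p in p_values]


alpha = 0.05
for m in (1, 5, 20):
    print(f"m={m:>2}: uncorrected family-wise error rate = "
          f"{familywise_error_rate(alpha, m):.3f}")

# Hypothetical p-values from five comparisons in one study.
p_values = [0.001, 0.012, 0.03, 0.04, 0.2]
uncorrected = [p < alpha for p in p_values]
corrected = bonferroni_reject(p_values, alpha)
print("significant without correction:", uncorrected)
print("significant with Bonferroni:   ", corrected)
```

With five comparisons the per-test threshold becomes 0.05 / 5 = 0.01, so only the smallest p-value in this made-up set remains significant. Bonferroni is conservative; it is shown here because the text names it, not because it is the only available correction.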

Why It Matters in Research Practice

Misuse of p-values has substantially contributed to the reproducibility crisis in science. p-hacking — cycling through analytic choices until a significant result emerges — and HARKing (Hypothesizing After Results are Known) are primary drivers of this crisis. Good research practice requires pre-registration of hypotheses, sharing raw data, and jointly reporting effect sizes with confidence intervals. Several journals now encourage reporting effect sizes and Bayes factors rather than relying solely on p-value thresholds. The concept of significance must be evaluated across its methodological, statistical, and practical dimensions together.

Key thinkers

Ronald A. Fisher (1890–1962): British statistician and geneticist who authored the seminal works establishing the p-value concept and the foundational logic of significance testing.

Sources

Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129–133. DOI: 10.1080/00031305.2016.1154108