Statistical Significance and the p-value

What the p-value does and does not tell you

The p-value expresses the probability of obtaining data at least as extreme as observed, given that the null hypothesis is true. Researchers compare it to a pre-chosen significance level α (commonly 0.05) to decide whether to reject H₀. However, statistical significance does not guarantee practical importance; a p-value can be small even for a trivially small effect. The American Statistical Association's 2016 statement emphasizes that p-values should not be used as the sole basis for scientific or policy decisions.

Core Definition: What Is the p-value?

The p-value is the probability that data as extreme as, or more extreme than, those observed would arise if the null hypothesis (H₀) were true. When comparing two group means, a test statistic (e.g., t) is computed; for a two-tailed test, the p-value is then the probability of obtaining a |t| at least as large as the observed one under H₀. The significance level α is set before data collection, and H₀ is rejected when p < α. Formally, for a two-tailed test: p = P(|T| ≥ |t_observed| | H₀ true). This is a measure of data–model compatibility, not the probability that H₀ is true.
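The definition can be made concrete with a small simulation. The sketch below does not use the t-test mentioned above; it swaps in a permutation test, which estimates the same kind of tail probability by shuffling group labels (what H₀ implies when the groups do not differ) and counting how often the shuffled difference in means is at least as extreme as the observed one. The function name, the scores, and the number of resamples are illustrative assumptions, not part of the source.

```python
import random
from statistics import fmean

def permutation_p_value(group_a, group_b, n_resamples=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Hypothetical helper for illustration: it estimates
    P(|difference| >= |observed difference| given H0) by relabelling.
    """
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(fmean(group_a) - fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # under H0 the group labels are exchangeable
        diff = abs(fmean(pooled[:n_a]) - fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # The +1 counts the observed labelling itself, so p is never exactly 0.
    return (at_least_as_extreme + 1) / (n_resamples + 1)

# Made-up scores, for illustration only.
treatment = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 6.3, 5.9]
control = [4.8, 5.0, 4.6, 5.2, 4.9, 5.1, 4.7, 5.3]

alpha = 0.05  # chosen before looking at the data
p = permutation_p_value(treatment, control)
print(f"p = {p:.4f}; reject H0 at alpha = {alpha}: {p < alpha}")
```

The returned value is a statement about how unusual the data are if H₀ holds. It is not the probability that H₀ is true.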

How It Works: Key Distinctions

The standard error (SE = σ / √n) shrinks as the sample size n grows, so even very small differences can become statistically significant in large samples. This is why p-values should be reported alongside effect size measures (d, η², r, etc.). The α = 0.05 threshold is a historical convention; Fisher himself did not intend it as a rigid rule. One-tailed and two-tailed tests yield different p-values (for a symmetric test statistic, the one-tailed p-value in the predicted direction is half the two-tailed one), and the choice must be justified by the hypothesis. Confidence intervals go beyond the p-value by conveying both the direction and magnitude of an effect along with its uncertainty.
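The dependence on sample size can be sketched numerically. The example below assumes a one-sample z-test with known σ, chosen only because it uses SE = σ / √n directly; the mean difference, σ, and the sample sizes are hypothetical. The standardized effect stays fixed at 0.05 while the p-value changes by orders of magnitude as n grows.

```python
from math import sqrt
from statistics import NormalDist

def z_test(mean_diff, sigma, n, confidence=0.95):
    """One-sample z-test of H0: true mean difference = 0 (sigma assumed known).

    Hypothetical helper returning the standard error, one- and two-tailed
    p-values, and a confidence interval for the mean difference.
    """
    std_normal = NormalDist()
    se = sigma / sqrt(n)                    # SE = sigma / sqrt(n)
    z = mean_diff / se
    p_one_tailed = 1 - std_normal.cdf(z)    # H1: difference > 0
    p_two_tailed = 2 * (1 - std_normal.cdf(abs(z)))
    z_crit = std_normal.inv_cdf(1 - (1 - confidence) / 2)
    ci = (mean_diff - z_crit * se, mean_diff + z_crit * se)
    return se, p_one_tailed, p_two_tailed, ci

mean_diff, sigma = 0.5, 10.0       # hypothetical values
effect_size = mean_diff / sigma    # d-style standardized effect: 0.05 at every n

for n in (100, 1_000, 10_000):
    se, p1, p2, ci = z_test(mean_diff, sigma, n)
    print(f"n={n:>6}  SE={se:.3f}  d={effect_size:.2f}  "
          f"one-tailed p={p1:.2g}  two-tailed p={p2:.2g}  "
          f"95% CI=({ci[0]:.2f}, {ci[1]:.2f})")
```

The confidence interval narrows around the same small difference as n increases, which is why it conveys magnitude and uncertainty in a way that the p-value alone does not.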

Common Misinterpretations

The most pervasive misconceptions about p-values are corrected by four points: (1) the p-value is not the probability that H₀ is true; (2) p > 0.05 does not prove H₀; it means only that the data are not strongly incompatible with it; (3) statistical significance does not imply practical or clinical importance; (4) the p-value does not measure the probability that a finding will replicate. When multiple comparisons are made without a correction such as the Bonferroni adjustment, the Type I error rate across the family of tests inflates. The ASA 2016 statement addresses several of these misconceptions as a warning to researchers.
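The inflation from multiple comparisons is simple arithmetic. As a minimal sketch, assuming m independent tests each run at level α with every null hypothesis true, the chance of at least one false positive is 1 − (1 − α)^m. The Bonferroni adjustment compares each p-value to α / m instead. The function names and the p-values below are made up for illustration.

```python
def familywise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests when all H0 are true."""
    return 1 - (1 - alpha) ** m

def bonferroni_reject(p_values, alpha=0.05):
    """Bonferroni adjustment: reject H0 only where p < alpha / (number of tests)."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

for m in (1, 5, 20):
    rate = familywise_error_rate(0.05, m)
    print(f"{m:>2} uncorrected tests: family-wise error rate = {rate:.3f}")

# Hypothetical p-values from five comparisons.
p_values = [0.001, 0.012, 0.030, 0.049, 0.200]
uncorrected = [p < 0.05 for p in p_values]
corrected = bonferroni_reject(p_values, alpha=0.05)
print("uncorrected rejections:", uncorrected)
print("Bonferroni rejections: ", corrected)
```

With these made-up values, four comparisons fall below 0.05 on their own, but only one falls below the adjusted threshold of 0.01.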

Why It Matters in Research Practice

Misuse of p-values has contributed substantially to the reproducibility crisis in science. p-hacking (cycling through analytic choices until a significant result emerges) and HARKing (Hypothesizing After the Results are Known) are frequently cited drivers of this crisis. Good research practice requires pre-registration of hypotheses, sharing raw data, and jointly reporting effect sizes with confidence intervals. Several journals now encourage reporting effect sizes and Bayes factors rather than relying solely on p-value thresholds. The concept of significance must be evaluated across its methodological, statistical, and practical dimensions together.
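One way to see why flexible analysis undermines the α guarantee is a simulation of optional stopping, used here as an assumed example of an analytic choice made after seeing the data. The sketch below draws data for which H₀ is true, then compares a single planned test at n = 100 with a strategy that tests after every 10 observations and stops at the first p < 0.05. The function name, the z-test with known σ = 1, and all settings are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist

def false_positive_rate(peek, n_max=100, step=10, alpha=0.05,
                        n_sims=2_000, seed=1):
    """Share of simulated studies that reject a true H0.

    peek=False: one planned two-tailed z-test at n_max.
    peek=True:  test after every `step` observations, stop at the first rejection.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)   # H0 is true: mean 0, sigma 1
            at_look = (n % step == 0) if peek else (n == n_max)
            # z = mean / (sigma / sqrt(n)) = total / sqrt(n) when sigma = 1
            if at_look and abs(total / sqrt(n)) > z_crit:
                false_positives += 1
                break
    return false_positives / n_sims

planned = false_positive_rate(peek=False)
peeking = false_positive_rate(peek=True)
print(f"single planned test:     {planned:.3f}")
print(f"peek every 10 and stop:  {peeking:.3f}")
```

The planned test should reject at a rate close to α, while the peeking strategy rejects noticeably more often. Pre-registration is meant to rule out exactly this kind of undisclosed flexibility.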

Key thinkers

  • Ronald A. Fisher (1890–1962): British statistician and geneticist who authored the seminal works establishing the p-value concept and the foundational logic of significance testing.

Sources

  1. Wasserstein, R. L., & Lazar, N. A. (2016). The ASA statement on p-values: context, process, and purpose. The American Statistician, 70(2), 129–133. DOI: 10.1080/00031305.2016.1154108