Statistical Power and Sample Size

The chance of detecting a real effect

Statistical power (1 − β) is the probability of detecting an effect that truly exists. Sample size, effect size, significance level, and variance are the four key determinants of power. Underpowered studies miss real effects and produce unreliable, exaggerated estimates. An a priori power analysis calculates the sample size needed for adequate power before data collection begins, with 0.80 commonly adopted as the target threshold.

Core Concept: What Is Statistical Power?

Statistical power is the probability that a hypothesis test correctly detects a true effect, expressed as 1 − β, where β is the Type II error (false negative) probability. It measures the capacity to reject the null hypothesis when it is actually false. Power depends on four factors: effect size (expressed via standard metrics such as d, f, or r), sample size (n), significance level (α), and variance. Holding the others fixed, changing any one of these alters power: power rises with a larger effect, a larger sample, or a more lenient α, and falls as variance grows. There is a trade-off between power and α: lowering α (a stricter threshold) reduces power, and a larger sample is required to compensate.
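
To make the dependence on these factors concrete, the following Python sketch approximates the power of a two-sided comparison of two group means. It is a simplification under stated assumptions: a normal (z-test) approximation, equal group sizes, and a known common SD. The function name approx_power and its parameters are illustrative, not part of any library. Variance enters through the standardized effect size d, which is the raw mean difference divided by the SD.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    d           -- standardized mean difference (hypothetical input, e.g. Cohen's d)
    n_per_group -- number of observations in each of the two groups
    alpha       -- significance level
    """
    z = NormalDist()  # standard normal distribution
    z_crit = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    shift = d * sqrt(n_per_group / 2)  # expected test statistic when the effect is real
    # Probability that the statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


print(approx_power(0.5, 64))              # baseline
print(approx_power(0.5, 64, alpha=0.01))  # stricter alpha, lower power
print(approx_power(0.5, 128))             # larger sample, higher power
```

Under this approximation, d = 0.5 with 64 observations per group gives a power of about 0.81. Tightening α to 0.01 lowers it to about 0.60, and doubling the sample to 128 per group raises it to about 0.98.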

How Is Power Computed?

At the heart of power calculation is the standard error of the sample mean: SE = SD / √n. As sample size grows, SE shrinks, so a true effect of a given size yields a test statistic that lies further from the center of the null distribution, which increases power. In an a priori power analysis, the researcher specifies the target power (typically 0.80), the α level (typically 0.05), and the expected effect size, then solves for the minimum required sample size. This can be done with software such as G*Power or via closed-form formulas. Observed (post hoc) power, computed after data collection from the observed effect size, is difficult to interpret: for a given test it is a direct function of the obtained p value, so it adds no information beyond the p value and does not legitimately estimate the study's true power.
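
As an illustration of the closed-form route, the sketch below solves for the per-group sample size of a two-sided, two-group comparison of means using the common normal-approximation formula n = 2 × ((z for 1 − α/2) + (z for the target power))² / d². The function name n_per_group is hypothetical, and the formula assumes equal group sizes and a standardized effect size d. Calculations based on the exact t distribution, as in dedicated software, typically return slightly larger values.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample test.

    Uses the normal approximation, so it slightly underestimates
    what an exact t-test calculation would require.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the chosen alpha
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)  # round up to whole participants


for d in (0.2, 0.5, 0.8):
    print(d, n_per_group(d))
```

For d = 0.5 at α = 0.05 and power 0.80, this approximation returns 63 per group, close to the 64 per group usually quoted for the exact t-test. Smaller expected effects drive the requirement up sharply: d = 0.2 needs 393 per group under the same settings.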

Common Misconceptions

One of the most common misconceptions is treating a statistically non-significant result as evidence that no effect exists. In an underpowered study, p > 0.05 is a likely outcome even when a genuine effect is present, so it does not show that the effect is absent. The 0.80 power threshold is itself a convention, not a law; higher power (0.90 or 0.95) may be warranted depending on context. Another error is setting sample size by intuition: conventional thresholds such as n ≥ 30 provide no power guarantee. Finally, high power does not guarantee statistical significance; it only lowers the probability of a Type II error. At a power of 0.80, a real effect of the assumed size is still missed in about one study out of five.
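
A small simulation can show how often a real effect goes undetected at a sample size chosen by rule of thumb. The sketch below is illustrative only: it assumes normally distributed data with a known SD of 1 in both groups, uses a z-test, and the function name miss_rate, the seed, and the number of simulated studies are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def miss_rate(d, n_per_group, alpha=0.05, n_studies=2000, seed=1):
    """Share of simulated studies that fail to reach significance
    even though the true standardized effect d is not zero."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sqrt(2 / n_per_group)  # SE of the mean difference when SD = 1 in both groups
    misses = 0
    for _ in range(n_studies):
        control = [rng.gauss(0, 1) for _ in range(n_per_group)]
        treated = [rng.gauss(d, 1) for _ in range(n_per_group)]
        z_stat = (fmean(treated) - fmean(control)) / se
        if abs(z_stat) < z_crit:  # non-significant despite a real effect
            misses += 1
    return misses / n_studies


print(miss_rate(0.5, 30))   # n = 30 per group
print(miss_rate(0.5, 100))  # n = 100 per group
```

With a true effect of d = 0.5 and 30 observations per group, roughly half of the simulated studies end with p > 0.05, although the effect is real in every one of them. At 100 per group the miss rate falls to a small fraction.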

Why It Matters in Research Practice

Underpowered studies cause harm in two directions: they miss real effects (Type II error) and, combined with publication bias, produce inflated effect size estimates, because results reaching significance in small samples tend to be those with exaggerated magnitudes. This fuels non-replicable findings in the literature. Power analysis is also critical for resource planning: unnecessarily large samples waste time and funding, while samples that are too small render a study scientifically worthless. From an ethical standpoint, enrolling participants in a study too small to yield a meaningful result is indefensible. For all these reasons, power analysis should be an integral part of study design, not an afterthought.
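
The inflation of effect size estimates can be reproduced with a short simulation. The sketch below rests on deliberately simple assumptions: each study's estimated standardized difference is drawn directly from a normal distribution centered on the true effect with SE = √(2/n), and publication bias is modeled crudely as "only significant results are reported". The function name, the seed, and the number of simulated studies are hypothetical choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def mean_significant_estimate(true_d, n_per_group, alpha=0.05,
                              n_studies=20000, seed=7):
    """Average effect estimate among simulated studies that reached significance."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    se = sqrt(2 / n_per_group)  # SE of a standardized mean difference (SD = 1)
    cutoff = NormalDist().inv_cdf(1 - alpha / 2) * se  # smallest significant |estimate|
    # Each study's estimate scatters around the true effect with spread se.
    estimates = [rng.gauss(true_d, se) for _ in range(n_studies)]
    # Crude publication filter: keep only the significant results.
    significant = [e for e in estimates if abs(e) > cutoff]
    return fmean(significant)


print(mean_significant_estimate(0.3, 20))   # underpowered design
print(mean_significant_estimate(0.3, 200))  # well-powered design
```

With a true effect of 0.3 and 20 observations per group, the significant studies report an average effect well above twice the true value, because only estimates larger than about 0.62 in magnitude can clear the significance cutoff. With 200 per group the average among significant studies stays close to 0.3.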

Sources

Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates. ISBN: 978-0-8058-0283-2