Correlation vs Causation

Association does not imply causation

A correlation between two variables does not establish that one causes the other. Confounding variables, reverse causation, selection effects, and chance can all produce statistical association without any genuine causal link. Establishing causation requires randomized controlled experiments or causal-inference designs such as instrumental variables, difference-in-differences, or regression discontinuity, along with explicit and testable assumptions about the underlying data-generating process.

Core Concept: What Is Correlation?

Correlation is a statistical index of how strongly two variables co-vary. The Pearson correlation coefficient r measures the strength and direction of a linear relationship and ranges from −1 to +1; r = 0 indicates no linear relationship (a nonlinear dependence may still be present), while r = 1 or r = −1 indicates a perfect linear association. Correlation tells us that two variables move together but conveys nothing about what drives that movement. For example, ice cream sales and drowning incidents may show a high positive correlation without one causing the other: hot weather is the common cause that raises both simultaneously.
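As a rough illustration of the ice cream example, the sketch below simulates a common cause and then checks what happens to the correlation once that cause is held fixed. Everything in it is an assumption made for illustration: the variable names, the coefficients, the noise levels, and the sample size are hypothetical, and the partial correlation is only one simple way of adjusting for a single measured confounder.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_r(xs, ys, zs):
    """Correlation of xs and ys after linearly adjusting both for zs."""
    r_xy = pearson_r(xs, ys)
    r_xz = pearson_r(xs, zs)
    r_yz = pearson_r(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical data: temperature drives both series; neither drives the other.
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream = [5.0 * t + rng.gauss(0, 15) for t in temperature]
drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temperature]

r_raw = pearson_r(ice_cream, drownings)
r_partial = partial_r(ice_cream, drownings, temperature)

print(f"raw correlation:              {r_raw:.2f}")
print(f"adjusted for temperature:     {r_partial:.2f}")
```

Under these assumptions the raw correlation should come out strongly positive, while the temperature-adjusted value should sit near zero, which is what a purely common-cause explanation predicts.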

Mechanisms That Confuse Association with Causation

Four principal mechanisms can generate correlation without causation. (1) Confounding: a third variable Z affects both X and Y, creating a spurious association between them. (2) Reverse causation: the true direction is Y → X, but the researcher mistakenly infers X → Y. (3) Selection effects: the sampling process itself induces an association that does not hold in the population. (4) Chance: when many variables or hypotheses are tested, some associations reach statistical significance by chance alone (the multiple-comparisons problem). Causal interpretations drawn without ruling out these mechanisms are not scientifically warranted.
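Mechanisms (3) and (4) are easy to reproduce with pure noise. The sketch below is a toy simulation, not a model of any real study: the selection rule (keep a unit only when x + y > 1), the sample sizes, the number of tests, and the approximate critical value of |r| are all assumptions chosen for illustration.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(1)  # fixed seed so the run is repeatable

# --- Selection effect -------------------------------------------------
# x and y are generated independently, so they are unrelated in the
# full population. Keeping only units with x + y > 1 (a hypothetical
# sampling rule) induces a negative association in the retained sample.
xs = [rng.gauss(0, 1) for _ in range(5000)]
ys = [rng.gauss(0, 1) for _ in range(5000)]
selected = [(x, y) for x, y in zip(xs, ys) if x + y > 1.0]

r_population = pearson_r(xs, ys)
r_selected = pearson_r([x for x, _ in selected], [y for _, y in selected])

# --- Chance and multiple comparisons ----------------------------------
# Correlate one noise outcome with many unrelated noise predictors and
# count how many clear a nominal two-sided 5% threshold.
N_OBS = 50
N_TESTS = 200
R_CRITICAL = 0.279  # approximate 5% critical |r| for n = 50 (assumed)

outcome = [rng.gauss(0, 1) for _ in range(N_OBS)]
false_positives = 0
for _ in range(N_TESTS):
    predictor = [rng.gauss(0, 1) for _ in range(N_OBS)]
    if abs(pearson_r(predictor, outcome)) > R_CRITICAL:
        false_positives += 1

print(f"r in full population: {r_population:.2f}")
print(f"r in selected sample: {r_selected:.2f}")
print(f"'significant' noise predictors: {false_positives} of {N_TESTS}")
```

With a 5% threshold, roughly one test in twenty is expected to pass even though no predictor is related to the outcome, so a handful of "significant" results among 200 tests is unremarkable.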

How to Establish Causation

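The strongest evidence for causation comes from randomized controlled trials (RCTs): random assignment makes treatment independent of both observed and unobserved confounders, so the groups are balanced in expectation. When experiments are infeasible, observational causal-inference methods are used. Instrumental variable (IV) designs exploit an external variable that shifts the treatment, is unrelated to the confounders, and affects the outcome only through the treatment. Difference-in-differences (DiD) compares the before-and-after change in a treated group with the change in an untreated comparison group; under the assumption that both groups would have followed parallel trends without treatment, this removes time-invariant differences between groups and shocks common to both. Regression discontinuity (RD) uses a sharp assignment threshold as a natural experiment, comparing units just above and just below the cutoff. Pearl's structural causal model (SCM) formalizes the logic underlying all these designs through the do(·) operator and counterfactual reasoning.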
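A small simulation can make the contrast between these designs concrete. The sketch below assumes a hypothetical data-generating process: a constant treatment effect of 2.0, an unobserved confounder u with weight 3.0, and a binary instrument z that is randomly assigned and affects the outcome only through treatment. None of these numbers come from a real study. It compares a naive observational contrast, a randomized contrast, and the Wald form of the IV estimator, and ends with the DiD calculation on four hypothetical group means.

```python
import random
from statistics import mean

TRUE_EFFECT = 2.0  # assumed causal effect of treatment on the outcome


def outcome(treated, u, rng):
    # The outcome depends on treatment and on an unobserved confounder u.
    return TRUE_EFFECT * treated + 3.0 * u + rng.gauss(0, 1)


def diff_in_means(rows):
    # rows: (group, value) pairs with group in {0, 1}.
    ones = [v for g, v in rows if g == 1]
    zeros = [v for g, v in rows if g == 0]
    return mean(ones) - mean(zeros)


def did(treated_pre, treated_post, control_pre, control_post):
    # Change in the treated group minus change in the comparison group.
    return (treated_post - treated_pre) - (control_post - control_pre)


rng = random.Random(42)  # fixed seed so the run is repeatable
observational, randomized, iv_rows = [], [], []

for _ in range(50_000):
    # Observational: units with high u tend to select into treatment.
    u = rng.gauss(0, 1)
    t = 1 if u + rng.gauss(0, 1) > 0 else 0
    observational.append((t, outcome(t, u, rng)))

    # Randomized trial: a coin flip assigns treatment, independent of u.
    u = rng.gauss(0, 1)
    t = rng.randint(0, 1)
    randomized.append((t, outcome(t, u, rng)))

    # IV setting: z is random and shifts treatment; u still confounds.
    u = rng.gauss(0, 1)
    z = rng.randint(0, 1)
    t = 1 if 3.0 * z + u + rng.gauss(0, 1) > 1.5 else 0
    iv_rows.append((z, t, outcome(t, u, rng)))

naive_estimate = diff_in_means(observational)  # biased by u
rct_estimate = diff_in_means(randomized)

# Wald estimator: effect of z on the outcome divided by effect of z on treatment.
reduced_form = diff_in_means([(z, y) for z, _, y in iv_rows])
first_stage = diff_in_means([(z, t) for z, t, _ in iv_rows])
iv_estimate = reduced_form / first_stage

print(f"naive observational: {naive_estimate:.2f}")
print(f"randomized trial:    {rct_estimate:.2f}")
print(f"instrumental var.:   {iv_estimate:.2f}")
print(f"DiD on group means:  {did(10.0, 15.0, 8.0, 10.0):.1f}")
```

In this setup the naive contrast should overstate the effect because treated units also have higher u, while the randomized and IV estimates should land near the assumed value of 2.0. The IV and DiD results are only as credible as the exclusion restriction and the parallel-trends assumption, neither of which can be checked from the data alone.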

Why It Matters in Research Practice

Conflating correlation with causation leads to serious errors in health, education, and economic policy. Claiming that an intervention works requires demonstrating that the observed association between the treatment and the outcome is genuinely causal. Researchers are therefore obliged to prospectively identify potential confounders, control for them through design, and report them transparently. Data visualizations and machine-learning models detect patterns only; establishing causal validity requires additional design and explicit assumptions. In research reports, the phrase 'X causes Y' should be used only when supported by an appropriate causal-inference design.

Key thinkers

  • Judea Pearl (1936–): Computer scientist and philosopher who developed structural causal models and the do-calculus, establishing the mathematical foundations of causal inference.

Sources

  1. Pearl, J. (2009). Causality: Models, Reasoning, and Inference (2nd ed.). Cambridge University Press. ISBN: 978-0-521-89560-6