ScholarGate

Regression and Correlation

Regression and correlation are the core biostatistical tools for quantifying how variables relate to one another. Correlation measures the strength and direction of association between two quantities, while regression models how an outcome changes as one or more explanatory variables change, supporting both explanation and prediction. Together they underpin most of the multivariable analysis reported in the health sciences.

Definition

Regression and correlation comprise the statistical methods that summarise the association between variables (correlation and covariance) and that estimate a function relating an outcome to one or more explanatory variables (regression), so that the outcome can be explained, adjusted for confounders, or predicted.

Scope

This area orients the reader across the family of methods used to describe association and to model outcomes from predictors: correlation and covariance, simple and multiple linear regression for continuous outcomes, logistic regression for binary outcomes, and the cross-cutting concerns of model selection and diagnostics. It is a methodological map rather than clinical guidance, and it links to the individual topic entries where each method is developed in detail.

Core questions

  • How strongly, and in what direction, are two variables associated?
  • How does an outcome change as an explanatory variable changes, holding other variables constant?
  • Which model form (linear, logistic, or other) matches the type of outcome being analysed?
  • How are regression coefficients interpreted as effects or as predictions?
  • How is a fitted model checked, selected, and kept from overfitting?

Key concepts

  • Covariance and the correlation coefficient
  • Least-squares estimation
  • Regression coefficient (slope) and intercept
  • Adjustment and confounding control through multiple regression
  • Link function and the generalized linear model framework
  • Prediction versus explanation
  • Overfitting and model validation
  • Residuals and model diagnostics

Mechanisms

The (Pearson) correlation coefficient reduces the joint variation of two variables (their covariance) to a scale-free value between -1 and +1 by dividing the covariance by the product of the two standard deviations. Regression goes further by fitting a function, most often a line or a weighted sum of predictors, that describes the expected value of an outcome given the predictors. Linear regression estimates this function for continuous outcomes by least squares; logistic and other generalized linear models extend the same idea to binary, count, and other outcome types through a link function that connects the linear predictor to the expected outcome. Across all of these, the coefficients carry the substantive interpretation, and diagnostics check whether the assumptions that justify that interpretation hold.
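
The mechanisms described above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the data values and the helper name `inverse_logit` are hypothetical and chosen only for demonstration. It shows the correlation coefficient as a rescaled covariance, the least-squares slope and intercept for a single predictor, and the inverse of the logit link used in logistic regression.

```python
import numpy as np

# Hypothetical illustrative data (not from any study): x is a predictor, y an outcome.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Sample covariance and standard deviations (ddof=1 gives the n-1 denominator).
cov_xy = np.cov(x, y, ddof=1)[0, 1]
sd_x = np.std(x, ddof=1)
sd_y = np.std(y, ddof=1)

# Correlation: covariance divided by the product of the standard deviations,
# which removes the units and bounds the value between -1 and +1.
r = cov_xy / (sd_x * sd_y)

# Simple linear regression by least squares:
# slope = cov(x, y) / var(x), and the fitted line passes through the means.
slope = cov_xy / sd_x**2
intercept = y.mean() - slope * x.mean()


def inverse_logit(eta):
    """Map a linear predictor on the log-odds scale to a probability."""
    return 1.0 / (1.0 + np.exp(-eta))


print(f"r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
print(f"probability at log-odds 0: {inverse_logit(0.0):.2f}")
```

The hand-computed values should agree with NumPy's own `np.corrcoef` and `np.polyfit`, which is a convenient check when working through the formulas.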

Clinical relevance

Most quantitative findings in clinical and public-health research — adjusted associations, risk factors, dose-response relationships, and prediction models — are produced by regression. Understanding how these models are built and interpreted is part of critically appraising the literature. This area describes how such evidence is generated and is not a basis for individual diagnostic or treatment decisions.

Evidence & guidelines

Reporting guidance for regression-based studies includes the STROBE statement for observational studies and the TRIPOD statement for prediction-model studies. Standard textbook treatments, such as those by Harrell and by Vittinghoff and colleagues, set out recommended modelling strategy. Methodological commentary cautions against avoidable practices such as dichotomising continuous predictors, which discards information and can distort estimated effects.
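
The information loss from dichotomising a continuous predictor can be illustrated with a small simulation. The sketch below assumes NumPy, a normally distributed predictor, and a linear effect of 0.5 with unit-variance noise; all of these are illustrative assumptions rather than values from any study. It compares the correlation between the outcome and the predictor before and after a median split.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 5000

# Simulated data (an assumption for illustration):
# the outcome depends linearly on the predictor, plus random noise.
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)

# Dichotomise the predictor at its median ("high" versus "low").
x_high = (x > np.median(x)).astype(float)

# Correlation of the outcome with the original and the dichotomised predictor.
r_continuous = np.corrcoef(x, y)[0, 1]
r_dichotomised = np.corrcoef(x_high, y)[0, 1]

print(f"correlation with continuous x:   {r_continuous:.3f}")
print(f"correlation with dichotomised x: {r_dichotomised:.3f}")
```

Under these assumptions the dichotomised predictor should show a visibly weaker association with the outcome, because all variation within each half of the predictor has been discarded.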

History

Correlation and regression originate in Francis Galton's late-nineteenth-century studies of heredity, where he described 'regression toward the mean,' and were placed on a formal footing by Karl Pearson. The twentieth century extended the linear model to multiple predictors, and the generalized linear model framework later unified linear, logistic, and related models. In biostatistics these methods became the standard apparatus for adjusted analysis and risk prediction.

Key figures

  • Francis Galton
  • Karl Pearson
  • David Cox
  • Frank Harrell
  • Douglas Altman

Seminal works

  • altman-bland-2005
  • harrell-2015

Frequently asked questions

What is the difference between correlation and regression?
Correlation summarises the strength and direction of association between two variables in a single symmetric coefficient, whereas regression models how an outcome depends on one or more predictors and yields coefficients that can be used for adjustment or prediction. Correlation does not distinguish outcome from predictor; regression does.

Which regression model should be used?
The choice follows the outcome type: linear regression for a continuous outcome, logistic regression for a binary outcome, and other generalized linear or survival models for counts or time-to-event data. The individual topic entries describe each in detail.
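
The symmetry point in the first answer can be checked numerically. The following is a minimal sketch, assuming NumPy and using hypothetical measurements `a` and `b` invented for illustration. Swapping the two variables leaves the correlation unchanged but gives a different regression slope, and the product of the two slopes equals the squared correlation.

```python
import numpy as np

# Hypothetical illustrative measurements, not real data.
a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
b = np.array([1.2, 1.9, 3.4, 3.8, 5.3, 5.7])

# Correlation is symmetric: the order of the variables does not matter.
r_ab = np.corrcoef(a, b)[0, 1]
r_ba = np.corrcoef(b, a)[0, 1]

# Regression is directional: np.polyfit(x, y, 1) returns [slope, intercept]
# for the least-squares line predicting y from x.
slope_b_on_a = np.polyfit(a, b, 1)[0]
slope_a_on_b = np.polyfit(b, a, 1)[0]

print(f"r(a, b) = {r_ab:.4f}, r(b, a) = {r_ba:.4f}")
print(f"slope of b on a = {slope_b_on_a:.4f}")
print(f"slope of a on b = {slope_a_on_b:.4f}")
print(f"product of slopes = {slope_b_on_a * slope_a_on_b:.4f}, r squared = {r_ab**2:.4f}")
```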
