Multiple Linear Regression

Multiple linear regression extends the straight-line model to several explanatory variables at once, modelling a continuous outcome as a weighted sum of predictors plus an intercept. Each coefficient estimates the effect of its predictor while holding the others constant, which makes the model the standard tool for adjusting an association for confounders and for building multivariable prediction.

Definition

Multiple linear regression fits E(Y) = b0 + b1X1 + b2X2 + ... + bkXk for a continuous outcome Y, estimating the coefficients by least squares so that each bj quantifies the average change in Y per one-unit increase in Xj while the other predictors are held constant.
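
As a worked statement of "estimating the coefficients by least squares", the criterion and its closed-form solution can be written as below. The notation is an assumption added for illustration and is not from the entry itself: n observations indexed by i, x_{ij} for the value of predictor Xj in observation i, X for the n-by-(k+1) design matrix whose first column is all ones, and y for the vector of outcomes. The closed form also assumes that X^T X is invertible, which fails when one predictor is an exact linear combination of the others.

```latex
\hat{\mathbf{b}}
  = \arg\min_{\mathbf{b}} \sum_{i=1}^{n}
    \left( y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \dots - b_k x_{ik} \right)^2
  = \left( X^{\top} X \right)^{-1} X^{\top} \mathbf{y}
```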

Scope

This entry covers the multivariable linear model: the interpretation of each coefficient as an adjusted effect, handling of categorical predictors and interactions, the additional concerns introduced by several predictors (collinearity, overfitting, and how predictors are chosen), and the same residual-based assumptions as the simple model. It is a methodological topic, not clinical guidance.

Core questions

What does it mean to interpret a coefficient 'holding the other variables constant'?
How does multiple regression adjust an association for confounders?
How are categorical predictors and interactions represented in the model?
What problems do collinearity and too many predictors cause?
How is the number of predictors balanced against the sample size to avoid overfitting?

Key concepts

Adjusted (partial) regression coefficient
Confounding control through adjustment
Dummy coding of categorical predictors
Interaction (effect modification) terms
Multicollinearity
Overfitting and observations per predictor
Model R-squared and adjusted R-squared
Linearity, independence, constant variance, normal errors
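
Two of the concepts above, dummy coding and interaction terms, come down to how the columns of the design matrix are built. The following is a minimal sketch in plain Python; the helper `dummy_code`, the variable names, and the four data values are all hypothetical and chosen only for illustration.

```python
def dummy_code(values, reference):
    """Return the non-reference levels and one 0/1 indicator column per level."""
    levels = sorted(set(values) - {reference})
    rows = [[1.0 if v == level else 0.0 for level in levels] for v in values]
    return levels, rows


# Hypothetical data: one categorical and one continuous predictor.
smoking = ["never", "former", "current", "never"]
age = [40.0, 55.0, 62.0, 70.0]

# "never" is the reference level, so it gets no indicator of its own.
levels, indicators = dummy_code(smoking, reference="never")

# Each design row: intercept, age, smoking indicators, age-by-smoking products.
design = []
for a, ind in zip(age, indicators):
    interactions = [a * d for d in ind]
    design.append([1.0, a] + ind + interactions)

for row in design:
    print(row)
```

With this coding, each indicator coefficient compares its level with the reference level, and each product coefficient is the difference in the age slope between that level and the reference level.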

Mechanisms

The model expresses the mean outcome as an intercept plus a weighted sum of predictors, with the weights (coefficients) estimated by least squares. Each coefficient is a partial effect: the expected change in the outcome per unit change in that predictor with the others fixed. This is the mechanism by which regression adjusts for confounding, and it applies only to confounders that are measured and included in the model. Categorical predictors enter as indicator (dummy) variables, one for each level other than a reference level, and interaction terms (products of predictors) allow the effect of one predictor to depend on another. When predictors are strongly correlated (multicollinearity), individual coefficients become unstable and hard to interpret even though overall prediction may be unaffected. Including too many predictors relative to the sample size leads to overfitting, where the model captures noise and performs poorly on new data; this motivates limiting the number of predictors in relation to sample size and validating the model on data not used to fit it.
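
The adjustment mechanism can be illustrated with a small simulation. This is a sketch under stated assumptions, not a recommended analysis workflow: the data are simulated, the names `age`, `exposure`, and `outcome` are hypothetical, the true exposure effect is set to 1.0 by construction, and the helper `ols` solves the normal equations in plain Python only to keep the example dependency-free (in practice a statistics library would be used).

```python
import random


def ols(rows, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Augmented system [X'X | X'y].
    M = [
        [sum(xi[a] * xi[b] for xi in X) for b in range(p)]
        + [sum(xi[a] * yi for xi, yi in zip(X, y))]
        for a in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]


random.seed(0)  # fixed seed so the sketch is reproducible
n = 2000
age = [random.gauss(0, 1) for _ in range(n)]             # confounder
exposure = [0.8 * a + random.gauss(0, 1) for a in age]   # depends on age
outcome = [1.0 * e + 2.0 * a + random.gauss(0, 1)        # true exposure effect: 1.0
           for e, a in zip(exposure, age)]

crude = ols([[e] for e in exposure], outcome)
adjusted = ols([[e, a] for e, a in zip(exposure, age)], outcome)

print("crude exposure coefficient:   ", round(crude[1], 2))
print("adjusted exposure coefficient:", round(adjusted[1], 2))
```

Because age drives both exposure and outcome in this simulation, the crude coefficient absorbs part of the age effect and lands well above 1.0, while the model that includes age recovers a value close to the true effect.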

Clinical relevance

Multiple linear regression produces most of the adjusted associations reported for continuous outcomes in clinical and public-health research and is a building block of risk-prediction work. Knowing how its coefficients are interpreted and how confounding is controlled is central to appraising such studies. This entry describes the method and is not a basis for individual diagnostic or treatment decisions.

Evidence & guidelines

Standard texts such as Kutner and colleagues and Harrell set out recommended modelling strategy, and methodological work warns against avoidable practices, notably dichotomising continuous predictors, which discards information and can bias estimates. Prediction-model reporting is covered by the TRIPOD statement.

History

The multivariable extension of the linear model developed through the early-twentieth-century work of Pearson, Fisher, and others, who built the theory of estimation and inference for several predictors on the much older method of least squares. In biostatistics the model became the standard method for adjusting associations for confounders, and later methodological literature focused on how predictors should be selected and how overfitting and dichotomisation distort results.

Debates

Should continuous predictors be dichotomised in a regression model?: Splitting a continuous predictor at a cut-point discards information, reduces power, and can distort the estimated relationship; methodologists argue continuous predictors should usually be kept continuous, with nonlinearity modelled flexibly rather than removed by categorisation.
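
The information loss from dichotomisation can be seen in a small simulation. This is a sketch only, assuming a truly linear relationship with normally distributed predictor and noise; the helper `r_squared`, the seed, and the sample size are illustrative choices, and results under other data-generating assumptions will differ in size.

```python
import random
import statistics


def r_squared(x, y):
    """Squared Pearson correlation, the R-squared of a one-predictor linear fit."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


random.seed(1)  # fixed seed so the sketch is reproducible
n = 1000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]  # assumed linear relationship

# Dichotomise the predictor at its median (a common but lossy choice).
cut = statistics.median(x)
x_split = [1.0 if xi > cut else 0.0 for xi in x]

r2_continuous = r_squared(x, y)
r2_dichotomised = r_squared(x_split, y)
print(round(r2_continuous, 2), round(r2_dichotomised, 2))
```

In this setting the median-split predictor explains noticeably less of the outcome variance than the same predictor kept continuous, which is the loss of information the debate refers to.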

Key figures

Karl Pearson
Ronald A. Fisher
Frank Harrell
Douglas Altman
Patrick Royston

Seminal works

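Altman and Royston (2006), The cost of dichotomising continuous variables, BMJ
Harrell (2015), Regression Modeling Strategies, 2nd edition, Springer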

Frequently asked questions

Why is multiple regression used to control for confounding?: Because each coefficient estimates the effect of its predictor while the other predictors in the model are held constant, including a confounder as a predictor adjusts the estimated effect of the exposure of interest for that confounder.
What is multicollinearity and why does it matter?: Multicollinearity is strong correlation among predictors. It makes individual coefficient estimates unstable and difficult to interpret, with inflated standard errors, even though the model's overall predictive accuracy may be unaffected.
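
One common way to quantify the inflation of standard errors is the variance inflation factor (VIF), which for predictor j equals 1 / (1 - Rj^2), where Rj^2 is the R-squared from regressing that predictor on the others. The sketch below covers only the two-predictor case, where Rj^2 reduces to the squared correlation between the two predictors; the function names and all data values are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a model with exactly two predictors."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r ** 2)


# Hypothetical, nearly collinear pair of predictors.
weight_kg = [60.0, 72.0, 81.0, 90.0, 68.0, 77.0]
bmi = [21.0, 24.5, 27.0, 30.5, 23.0, 26.0]

# Hypothetical, exactly uncorrelated pair of predictors.
factor_a = [-1.0, -1.0, 1.0, 1.0]
factor_b = [-1.0, 1.0, -1.0, 1.0]

vif_collinear = vif_two_predictors(weight_kg, bmi)
vif_uncorrelated = vif_two_predictors(factor_a, factor_b)
print(round(vif_collinear, 1), vif_uncorrelated)
```

A VIF of 1 means a coefficient's variance is not inflated by the other predictor, while a large VIF signals that the coefficient's sampling variance is many times what it would be with uncorrelated predictors.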