ScholarGate

Logistic Regression

Logistic regression models the probability of a binary (yes/no) outcome as a function of one or more predictors. Because probabilities are bounded between 0 and 1, the model works on the log-odds scale, so that each coefficient corresponds to a change in the log-odds and exponentiates to an odds ratio. It is the standard regression method for binary outcomes in the health sciences.

Definition

Logistic regression models the log-odds (logit) of a binary outcome as a linear function of predictors, logit(P) = b0 + b1X1 + ... + bkXk, estimating the coefficients by maximum likelihood so that the exponentiated coefficient exp(bj) is the adjusted odds ratio for predictor Xj.
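Written out, the definition implies the relations below. The symbols P, b_j and X_j follow the definition above; the shorthand \eta for the linear predictor is introduced here only for convenience.

```latex
\begin{align*}
\operatorname{logit}(P) &= \ln\frac{P}{1-P} = b_0 + b_1 X_1 + \dots + b_k X_k = \eta \\
P &= \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}} \\
\frac{\operatorname{odds}(X_j + 1)}{\operatorname{odds}(X_j)} &= \frac{e^{\eta + b_j}}{e^{\eta}} = e^{b_j}
\end{align*}
```

The last line is why exp(bj) reads as an odds ratio: raising Xj by one unit, with the other predictors fixed, multiplies the odds by exp(bj).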

Scope

This entry covers the binary logistic model: the logit link and why it is used, the interpretation of coefficients as odds ratios, maximum-likelihood estimation, adjustment for confounders, and the practical concerns of sample size (events per variable), separation, and goodness of fit. It also notes the distinction between odds ratios and risk ratios. It is a methodological topic, not clinical guidance.

Core questions

  • Why is a binary outcome modelled on the log-odds scale rather than directly as a probability?
  • How is a logistic-regression coefficient interpreted as an odds ratio?
  • How are coefficients estimated, and how does the model adjust for confounders?
  • How many outcome events are needed per predictor for stable estimates?
  • When does the odds ratio differ importantly from the risk ratio?

Key concepts

  • Logit (log-odds) link function
  • Odds ratio as exp(coefficient)
  • Maximum-likelihood estimation
  • Adjusted versus crude odds ratio
  • Events per variable
  • Separation and quasi-complete separation
  • Goodness of fit and calibration
  • Odds ratio versus risk ratio
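Of the concepts listed, separation is perhaps the easiest to see numerically. The sketch below uses a small made-up dataset in which every negative x has outcome 0 and every positive x has outcome 1, with the intercept fixed at zero for simplicity (both are assumptions for illustration). It evaluates the log-likelihood at increasingly large slopes.

```python
import numpy as np

# Hypothetical, perfectly separated data: the sign of x determines y.
x = np.array([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def log_likelihood(slope, x, y):
    """Bernoulli log-likelihood of a logistic model with intercept fixed at 0."""
    eta = slope * x
    # log(p) = -log(1 + exp(-eta)) and log(1 - p) = -log(1 + exp(eta)),
    # computed with logaddexp for numerical stability.
    return -np.sum(y * np.logaddexp(0, -eta) + (1 - y) * np.logaddexp(0, eta))

slopes = [1, 5, 10, 20]
values = [log_likelihood(b, x, y) for b in slopes]
for b, v in zip(slopes, values):
    print(f"slope={b:>2}  log-likelihood={v:.6f}")
```

The log-likelihood keeps rising toward zero as the slope grows, so no finite slope maximises it. That is the sense in which the maximum-likelihood estimate does not exist under separation.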

Mechanisms

Modelling a probability directly with a linear predictor is problematic because predictions can fall outside the range 0 to 1; the logit link solves this by transforming the probability to its log-odds, which is unbounded and can be modelled linearly. The coefficients are estimated by maximum likelihood rather than least squares, and each exponentiated coefficient is the odds ratio comparing the odds of the outcome across a one-unit difference in that predictor with the other predictors held constant. Stable estimation requires enough outcome events relative to the number of predictors; the traditional guidance of about ten events per variable has been re-examined and partly relaxed in later work. When a predictor, or a combination of predictors, perfectly separates the outcome classes (separation), the maximum-likelihood estimate does not exist because the likelihood keeps increasing as the coefficient grows without bound; penalised approaches such as Firth's penalised likelihood address this. Because the model estimates odds ratios, and an odds ratio lies further from 1 than the corresponding risk ratio when the outcome is common, reading it as a risk ratio overstates the association; this has motivated alternative approaches such as modified Poisson regression for estimating risk ratios directly.
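As a minimal sketch of the estimation step, the code below fits a logistic model by Newton-Raphson iterations on the log-likelihood and exponentiates the slopes to obtain odds ratios. The function name `fit_logistic`, the simulated data, and the "true" coefficients are all assumptions made for illustration; in practice an established statistics package would be used.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, tol=1e-10):
    """Maximum-likelihood fit by Newton-Raphson (illustrative helper, not a library API)."""
    X = np.column_stack([np.ones(len(y)), X])   # prepend an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # inverse logit of the linear predictor
        gradient = X.T @ (y - p)                # score vector of the log-likelihood
        hessian = X.T @ (X * (p * (1 - p))[:, None])  # information matrix
        step = np.linalg.solve(hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:          # stop once the update is negligible
            break
    return beta

# Simulated data (assumed values): a binary exposure and a standardised covariate.
rng = np.random.default_rng(0)
n = 5000
exposure = rng.integers(0, 2, size=n).astype(float)
age = rng.normal(0.0, 1.0, size=n)
true_beta = np.array([-1.0, 0.7, 0.5])          # intercept, exposure, age
eta = true_beta[0] + true_beta[1] * exposure + true_beta[2] * age
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

beta_hat = fit_logistic(np.column_stack([exposure, age]), y)
odds_ratios = np.exp(beta_hat[1:])              # adjusted odds ratios for exposure, age
print("coefficients:", beta_hat)
print("odds ratios: ", odds_ratios)
```

With enough data the estimates should land near the simulated values. The exposure odds ratio is adjusted for age because both predictors enter the same linear predictor.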

Clinical relevance

Logistic regression underlies a large share of the adjusted odds ratios and diagnostic and prognostic models reported in clinical and epidemiological research. Understanding that its coefficients are odds ratios, and when those diverge from risk ratios, is central to interpreting such studies. This entry describes the method and is not a basis for individual diagnostic or treatment decisions.

Epidemiology

Logistic regression is the natural analysis for case-control studies, where the odds ratio is the estimable measure of association, and is widely used in cohort and cross-sectional studies to obtain adjusted effect estimates for binary outcomes. When the outcome is common in a cohort, the odds ratio departs from the risk ratio, and analysts may prefer methods that estimate the risk ratio directly.
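A small arithmetic sketch may help show why the odds ratio is the estimable measure under case-control sampling. The counts below are hypothetical, and the sampling scheme (all cases, one in fifty non-cases) is an assumption chosen for illustration.

```python
def odds_ratio(a, b, c, d):
    """a, b: exposed cases / non-cases; c, d: unexposed cases / non-cases."""
    return (a * d) / (b * c)

def naive_risk_ratio(a, b, c, d):
    """Risk ratio computed as if the table were a full cohort."""
    return (a / (a + b)) / (c / (c + d))

# Hypothetical source population (full cohort).
a, b, c, d = 200, 9800, 100, 9900

# Case-control sample: keep every case, sample 1 in 50 non-cases.
b_s, d_s = b // 50, d // 50          # 196 and 198 controls

or_population = odds_ratio(a, b, c, d)
or_case_control = odds_ratio(a, b_s, c, d_s)
rr_population = naive_risk_ratio(a, b, c, d)
rr_case_control = naive_risk_ratio(a, b_s, c, d_s)

print(or_population, or_case_control)   # identical odds ratios
print(rr_population, rr_case_control)   # the "risk ratio" changes with sampling
```

Sampling non-cases at a common fraction rescales b and d by the same factor, which cancels in the odds ratio but not in the risk ratio.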

Evidence & guidelines

Hosmer, Lemeshow, and Sturdivant's text is a standard reference for fitting and assessing logistic models. Reporting of prediction models built with logistic regression is covered by the TRIPOD statement, and methodological studies inform sample-size guidance such as events per variable.

History

The logistic function has nineteenth-century roots in population growth, and its use for binary regression was developed in the mid-twentieth century, with David Cox's work consolidating the method for the analysis of binary data. It became a workhorse of epidemiology, especially for case-control analysis where the odds ratio is the natural measure. Subsequent methodological literature addressed practical issues including sample size, separation, and the divergence of odds ratios from risk ratios.

Debates

How many outcome events are needed per predictor?
A long-standing rule of about ten events per variable was supported by simulation work, but later studies argued that the rule is conservative and context-dependent: fewer events may suffice in some settings, while more are needed in others.
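The quantity under debate is simple to compute. The helper below is a sketch that assumes the common convention of counting the less frequent outcome class as the "events" and dividing by the number of candidate model parameters; the function name and the example counts are hypothetical.

```python
def events_per_variable(n_events, n_nonevents, n_parameters):
    """Events per variable, using the rarer outcome class as the event count."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return min(n_events, n_nonevents) / n_parameters

# Hypothetical study: 80 events among 1000 participants, 12 candidate parameters.
epv = events_per_variable(n_events=80, n_nonevents=920, n_parameters=12)
print(round(epv, 2))  # 6.67, below the traditional threshold of about ten
```

Whether a value like this is adequate depends on context, as the debate above indicates; the number is a screening aid rather than a guarantee of stable estimates.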
Should the odds ratio be used when the outcome is common?
When an outcome is common, the odds ratio overstates the risk ratio and can be misinterpreted as a relative risk; alternatives such as modified Poisson regression estimate the risk ratio directly and have been proposed for prospective studies with binary outcomes.
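The size of the gap can be worked out from first principles. The sketch below converts an odds ratio into the implied risk ratio for a given baseline risk in the unexposed group; the function name and the example odds ratio of 2.0 are assumptions for illustration.

```python
def risk_ratio_from_odds_ratio(odds_ratio, baseline_risk):
    """Risk ratio implied by an odds ratio at a given risk in the unexposed group."""
    odds_unexposed = baseline_risk / (1 - baseline_risk)
    odds_exposed = odds_ratio * odds_unexposed
    risk_exposed = odds_exposed / (1 + odds_exposed)
    return risk_exposed / baseline_risk

# The same odds ratio of 2.0 at increasingly common outcomes.
for p0 in (0.01, 0.10, 0.30, 0.50):
    print(p0, round(risk_ratio_from_odds_ratio(2.0, p0), 2))
# 0.01 -> 1.98, 0.10 -> 1.82, 0.30 -> 1.54, 0.50 -> 1.33
```

When the outcome is rare the two measures nearly coincide. At a baseline risk of 50 percent, an odds ratio of 2.0 corresponds to a risk ratio of only about 1.33.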

Key figures

  • David Cox
  • David Hosmer
  • Stanley Lemeshow
  • Peter Peduzzi
  • Eric Vittinghoff

Seminal works

  • hosmer-2013
  • peduzzi-1996

Frequently asked questions

Why does logistic regression report odds ratios?
Because the model is linear on the log-odds scale, each coefficient represents a change in log-odds, and exponentiating it gives an odds ratio. The odds ratio is therefore the natural effect measure that the model produces for a binary outcome.
When is an odds ratio a poor approximation to the risk ratio?
When the outcome is common, the odds ratio lies further from 1 than the risk ratio, so it overstates the association if read as a relative risk. In that situation, methods that estimate the risk ratio directly may be preferable.
