Regression Diagnostics
Checking the assumptions of a regression
Regression diagnostics systematically examine whether the assumptions of a fitted model actually hold. Residual plots expose nonlinearity and heteroscedasticity; Q-Q plots assess whether residuals follow a normal distribution; the Durbin–Watson statistic detects first-order autocorrelation in the residuals; the variance inflation factor (VIF) quantifies multicollinearity; and leverage together with Cook's distance identifies influential observations. Violations can bias coefficient estimates, distort standard errors, or both.
Concept and Core Logic
Regression models rest on several critical assumptions: the relationship between predictors and response is correctly specified as linear in the parameters, errors have zero mean and constant variance, errors are independent of one another, no perfect linear relationship exists among predictors, and extreme observations do not unduly dominate the estimates. Diagnostics check each assumption empirically. No single diagnostic is conclusive on its own; findings should be evaluated as a whole. Confidence intervals and p-values reported without checking assumptions can be seriously misleading.
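The assumptions on the errors and on the predictors are commonly written in matrix form as follows. The notation is assumed here rather than taken from the text: y is the response vector, X the n-by-p design matrix, beta the coefficient vector, and epsilon the error vector. The block uses the amsmath `align*` environment.

```latex
\begin{align*}
y &= X\beta + \varepsilon
  && \text{(model is linear in the parameters)} \\
\mathbb{E}[\varepsilon \mid X] &= 0
  && \text{(errors have zero mean)} \\
\operatorname{Var}(\varepsilon \mid X) &= \sigma^2 I_n
  && \text{(constant variance, uncorrelated errors)} \\
\operatorname{rank}(X) &= p
  && \text{(no perfect linear relationship among predictors)}
\end{align*}
```

The requirement that extreme observations not dominate the estimates has no equally compact formula; it is assessed through leverage and Cook's distance.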
Key Tools and How to Read Them
Residuals-versus-fitted plots: a horizontal band of points scattered evenly around zero is desirable; a curved pattern signals nonlinearity, and a funnel shape signals heteroscedasticity. Q-Q plots: points that depart systematically from the reference line, especially in the tails, suggest the normality assumption is questionable. The Durbin–Watson statistic ranges from 0 to 4; values near 2 indicate no first-order autocorrelation, values near 0 indicate strong positive autocorrelation, and values near 4 indicate strong negative autocorrelation. For VIF, a common threshold is 10, but values above 5 also warrant attention. For Cook's distance, a threshold of 1 is widely cited, though a cutoff that accounts for sample size, such as 4/n, is often more reliable.
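Two of these statistics can be computed by hand from their definitions. The sketch below is illustrative only: the function names and the toy data are made up for this example, and the VIF shortcut 1 / (1 - r^2) applies only to a model with exactly two predictors. With more predictors, each VIF comes from regressing that predictor on all the others.

```python
from math import fsum


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    num = fsum((residuals[t] - residuals[t - 1]) ** 2
               for t in range(1, len(residuals)))
    den = fsum(e * e for e in residuals)
    return num / den


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = fsum((a - mx) ** 2 for a in x)
    syy = fsum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical residual series, in time order.
alternating = [1, -1, 1, -1, 1, -1]   # sign flips at every step
trending = [3, 2, 1, -1, -2, -3]      # neighbours resemble each other

dw_alternating = durbin_watson(alternating)  # about 3.33: negative autocorrelation
dw_trending = durbin_watson(trending)        # about 0.29: positive autocorrelation

# Hypothetical predictors with correlation 0.8.
vif = vif_two_predictors([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])  # about 2.78
```

The alternating series lands near 4 and the trending series near 0, matching the reading rule above; the VIF of about 2.78 is below both the 5 and the 10 thresholds.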
Common Misuses and Misconceptions
The most frequent error is skipping diagnostics entirely or relying on a single test. Normality is often over-emphasized; in moderate to large samples the Central Limit Theorem keeps coefficient inference approximately valid even when the errors are not normal. Heteroscedasticity and autocorrelation, however, genuinely distort standard errors and must not be ignored. A high VIF is not always a serious problem: if the correlation among predictors is theoretically expected, careful interpretation may suffice. Observations with high Cook's distance may be legitimate data points; reporting their influence rather than deleting them is the more honest approach.
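One way to report influence rather than delete points is to list each observation's leverage and Cook's distance. The sketch below does this for a simple regression with one predictor. The function name and the data are hypothetical, and the 4/n cutoff is a common rule of thumb assumed here, not a definitive threshold.

```python
from math import fsum


def leverage_and_cooks(x, y):
    """Leverage and Cook's distance for simple linear regression (p = 2)."""
    n, p = len(x), 2  # p counts the intercept and the slope
    mx, my = fsum(x) / n, fsum(y) / n
    sxx = fsum((a - mx) ** 2 for a in x)
    slope = fsum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    s2 = fsum(e * e for e in resid) / (n - p)  # residual variance estimate
    # Hat-matrix diagonal for one predictor plus an intercept.
    leverage = [1 / n + (a - mx) ** 2 / sxx for a in x]
    cooks = [(e * e / (p * s2)) * h / (1 - h) ** 2
             for e, h in zip(resid, leverage)]
    return leverage, cooks


# Hypothetical data: nine points near y = 2x, plus one distant point that
# falls well below that line.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 17.8, 25.0]

leverage, cooks = leverage_and_cooks(x, y)
cutoff = 4 / len(x)  # assumed rule-of-thumb cutoff
flagged = [i for i, d in enumerate(cooks) if d > cutoff]

for i in flagged:
    print(f"obs {i}: leverage={leverage[i]:.3f}, Cook's D={cooks[i]:.2f}")
```

Flagged observations are listed, not dropped. A natural follow-up is to refit without them and report both sets of estimates.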
Why It Matters and How to Report It
Diagnostic findings should be reported explicitly in the methods section. State that residual plots were inspected, report the obtained VIF values, and note whether autocorrelation was tested. Detecting a violation does not by itself invalidate the model; corrective steps such as a transformation, robust standard errors, or a different model family should be applied and documented. Studies that omit diagnostics entirely are increasingly criticized by reviewers. Transparent reporting improves both reproducibility and the credibility of conclusions.
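As an illustration of one corrective step, the sketch below computes the slope of a simple regression with both the classical standard error and a heteroscedasticity-robust one. The robust version uses White's HC0 estimator, which is only one of several variants and is chosen here as an assumption; the function name and data are hypothetical.

```python
from math import fsum, sqrt


def slope_standard_errors(x, y):
    """Return (slope, classical SE, HC0 robust SE) for simple regression."""
    n = len(x)
    mx, my = fsum(x) / n, fsum(y) / n
    dx = [a - mx for a in x]
    sxx = fsum(d * d for d in dx)
    slope = fsum(d * (b - my) for d, b in zip(dx, y)) / sxx
    intercept = my - slope * mx
    resid = [b - (intercept + slope * a) for a, b in zip(x, y)]
    # Classical: one common error variance s^2 for every observation.
    s2 = fsum(e * e for e in resid) / (n - 2)
    se_classical = sqrt(s2 / sxx)
    # HC0: each observation contributes its own squared residual.
    se_hc0 = sqrt(fsum((d * e) ** 2 for d, e in zip(dx, resid))) / sxx
    return slope, se_classical, se_hc0


# Hypothetical data whose scatter grows with x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 2.1, 2.9, 4.3, 4.5, 6.9, 5.8, 9.6]

slope, se_classical, se_hc0 = slope_standard_errors(x, y)
print(f"slope={slope:.3f}  classical SE={se_classical:.3f}  HC0 SE={se_hc0:.3f}")
```

Reporting both standard errors side by side documents the correction and shows how much the conclusion depends on the constant-variance assumption. HC0 tends to understate uncertainty in small samples, so small-sample variants may be preferable in practice.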
Sources
- Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons. ISBN: 978-0-471-05856-4