Outliers and Influential Observations
Detecting and handling them sensibly
Outliers are observations that lie far from the bulk of a distribution. Influential observations are points that disproportionately alter a model's parameter estimates; not every outlier is influential, and not every influential point appears extreme. Detection relies on z-scores, the IQR rule, and, in regression, leverage, standardized residuals, and Cook's distance. Outliers should never be deleted reflexively; each case must be investigated to determine whether it reflects a data error or a genuinely important signal.
Core Concepts: Outlier and Influential Observation
An outlier is an observation that lies unusually far from the center of a distribution relative to its spread. An influential observation is a data point whose removal substantially changes model parameters, predictions, or conclusions. The two concepts overlap but are not identical. In regression, a point with an extreme predictor value can pull the fitted line toward itself, so it leaves only a small residual while being highly influential; conversely, an extreme response value near the center of the predictor space often has little effect on the estimated slope. Understanding this distinction is critical for avoiding hasty mistakes during data cleaning.
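The distinction can be illustrated with a minimal sketch. The data are hypothetical and chosen only for illustration: ten points lying exactly on a line, to which one extra point is added in two different ways. The helper name `fit_line` is an assumption of this example, not part of any cited method.

```python
import numpy as np

def fit_line(x, y):
    """Return (intercept, slope) of an ordinary least-squares line."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return intercept, slope

# Hypothetical baseline: ten points lying exactly on y = 2 + 0.5x.
x = np.arange(1.0, 11.0)
y = 2.0 + 0.5 * x

# Case A: an extreme response value at the centre of the predictor range.
x_a, y_a = np.append(x, 5.5), np.append(y, 20.0)

# Case B: a point far out in x that does not follow the trend.
x_b, y_b = np.append(x, 30.0), np.append(y, 7.0)

intercept_a, slope_a = fit_line(x_a, y_a)
intercept_b, slope_b = fit_line(x_b, y_b)

# Residual of the added point under the fit that includes it.
resid_a = y_a[-1] - (intercept_a + slope_a * x_a[-1])
resid_b = y_b[-1] - (intercept_b + slope_b * x_b[-1])

print(f"Case A: slope={slope_a:.3f}, residual of added point={resid_a:.2f}")
print(f"Case B: slope={slope_b:.3f}, residual of added point={resid_b:.2f}")
```

In Case A the added point has a large residual but leaves the slope at 0.5 and shifts only the intercept. In Case B the added point has a much smaller residual, yet the slope changes substantially because the line has been pulled toward it.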
Detection Methods and Formulas
In univariate settings two common rules apply. The z-score method: z = (x − x̄) / s, where s is the sample standard deviation; |z| > 3 is often flagged as a potential outlier. Because x̄ and s are themselves pulled by extreme values, the rule can miss outliers in small samples. The IQR rule: lower fence = Q1 − 1.5 × IQR, upper fence = Q3 + 1.5 × IQR, where IQR = Q3 − Q1; observations outside these fences are considered possible outliers. In regression, additional statistics are needed: leverage h_ii, the i-th diagonal element of the hat matrix, measures how extreme a point is in predictor space; standardized residuals identify observations that the model fits poorly; Cook's distance D combines leverage and residual size to summarize overall influence. D > 1 is commonly treated as concerning, though no fixed threshold is universal.
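The rules above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not a reference one: the function names and the example data are hypothetical, quartiles use NumPy's default linear interpolation (other software may give slightly different fences), and the residuals are internally standardized.

```python
import numpy as np

def zscore_flags(x, threshold=3.0):
    """Flag values whose |z| exceeds the threshold (sample SD, ddof=1)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)
    return np.abs(z) > threshold

def iqr_flags(x, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def regression_diagnostics(X, y):
    """Leverage, standardized residuals, and Cook's distance for OLS.

    X is an (n, p) design matrix that already contains the intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    # h_ii: diagonal of the hat matrix X (X'X)^-1 X'.
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    mse = resid @ resid / (n - p)
    std_resid = resid / np.sqrt(mse * (1.0 - leverage))
    # Cook's distance combines residual size and leverage.
    cooks_d = std_resid**2 * leverage / (p * (1.0 - leverage))
    return leverage, std_resid, cooks_d

# Hypothetical univariate sample with one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 12, 40]
print("z-score flags:", np.flatnonzero(zscore_flags(sample)))
print("IQR flags:    ", np.flatnonzero(iqr_flags(sample)))

# Hypothetical regression data: a noisy linear trend plus one distant point.
noise = np.array([0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, -0.05, 0.1, -0.1])
x = np.append(np.arange(1.0, 11.0), 30.0)
y = np.append(2.0 + 0.5 * np.arange(1.0, 11.0) + noise, 7.0)
X = np.column_stack([np.ones_like(x), x])

leverage, std_resid, cooks_d = regression_diagnostics(X, y)
print("most influential index:", int(np.argmax(cooks_d)))
```

The explicit matrix inverse keeps the sketch close to the textbook formulas; for larger or ill-conditioned problems a QR-based routine or a statistics package would be a safer choice.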
Common Misconceptions
The most common error is assuming every outlier must be deleted. Reflexive deletion can bias results and conceal genuine effects. A second misconception is treating outlier and influential observation as synonyms; an observation can be one, both, or neither. A third error is assuming outliers always reflect measurement mistakes; in practice they are sometimes the most informative data points. Finally, applying the IQR rule mechanically to skewed or heavy-tailed distributions can flag many genuine observations as suspicious, so contextual judgment is always required.
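The last point can be checked with a small simulation. The sketch below is illustrative only: the seed, the sample size, and the choice of a lognormal distribution as the skewed example are assumptions of this example. Under a normal distribution the 1.5 × IQR fences flag roughly 0.7% of values; a right-skewed distribution with no errors in it is flagged far more often.

```python
import numpy as np

def iqr_flag_rate(x, k=1.5):
    """Fraction of values falling outside the k*IQR fences."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    outside = (x < q1 - k * iqr) | (x > q3 + k * iqr)
    return float(np.mean(outside))

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only

# Both samples are clean draws from their distributions: no data errors.
normal_sample = rng.normal(size=100_000)
skewed_sample = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)

normal_rate = iqr_flag_rate(normal_sample)
skewed_rate = iqr_flag_rate(skewed_sample)

print(f"flagged under normal:    {normal_rate:.2%}")
print(f"flagged under lognormal: {skewed_rate:.2%}")
```

Every flagged value in the skewed sample is a legitimate draw from the population, which is why the fences are a screening device rather than a deletion rule.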
Importance in Research Practice and the Sensible Approach
Ignoring outliers and influential observations can lead to biased parameter estimates, inflated standard errors, and irreproducible findings. The recommended approach is: first investigate the observation for recording errors, coding mistakes, or out-of-population membership; then run the analysis both with and without the observation and report how much the results change; if the observation is removed, document the decision and its justification transparently in the methods section. In some fields, outlying observations are not exceptions that prove the rule but findings that call for theoretical revision.
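The with-and-without comparison can be sketched as follows. This is a minimal example assuming a simple linear regression; the function name `with_without`, the data, and the choice of which observation is under suspicion are all hypothetical.

```python
import numpy as np

def with_without(x, y, suspect):
    """Fit y = a + b*x with and without one observation and report the change."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.ones(len(x), dtype=bool)
    keep[suspect] = False

    slope_all, intercept_all = np.polyfit(x, y, deg=1)
    slope_wo, intercept_wo = np.polyfit(x[keep], y[keep], deg=1)

    return {
        "with": {"intercept": intercept_all, "slope": slope_all},
        "without": {"intercept": intercept_wo, "slope": slope_wo},
        "slope_change": slope_all - slope_wo,
    }

# Hypothetical data; the last observation is the one under investigation.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 30.0]

report = with_without(x, y, suspect=7)
for label in ("with", "without"):
    fit = report[label]
    print(f"{label:>7}: intercept={fit['intercept']:.2f}, slope={fit['slope']:.2f}")
print(f"change in slope: {report['slope_change']:.2f}")
```

Reporting both fits side by side lets readers judge how much the conclusion depends on a single observation, whichever version is ultimately treated as primary.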
Sources
- Tabachnick, B. G., & Fidell, L. S. (2019). Using Multivariate Statistics (7th ed.). Pearson. ISBN: 978-0-13-479054-1