Why can accuracy be misleading?

If one class is far more common than another, a model that always predicts the majority class can have high accuracy while being useless for the rare class. Metrics like precision, recall, and the area under the receiver operating characteristic curve reveal this kind of failure.
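A minimal sketch of this failure, assuming a made-up dataset with 990 negatives and 10 positives (the data and helper names are illustrative, not taken from any particular library):

```python
# Hypothetical imbalanced data: 990 negatives (0) and 10 positives (1).
y_true = [0] * 990 + [1] * 10

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    actual_pos = sum(t == positive for t in y_true)
    true_pos = sum(t == positive and p == positive
                   for t, p in zip(y_true, y_pred))
    return true_pos / actual_pos if actual_pos else 0.0

print("accuracy:", accuracy(y_true, y_pred))  # 99% of labels match
print("recall:  ", recall(y_true, y_pred))    # no positive is ever found
```

The accuracy looks excellent only because the negatives dominate the count; the recall of zero shows that the rare class is never detected.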

What is the difference between precision and recall?

Precision is the fraction of predicted positives that are truly positive, measuring how trustworthy positive predictions are. Recall is the fraction of actual positives that the model finds, measuring how many real cases it catches. Improving one often comes at the expense of the other.
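The trade-off can be seen by moving the decision threshold on a scored example. The sketch below assumes hypothetical labels and scores and a rule that predicts positive when the score is at least the threshold; none of these values come from the source.

```python
def confusion_counts(y_true, scores, threshold):
    """Count true/false positives and negatives at a given threshold."""
    tp = fp = fn = tn = 0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and t == 1:
            tp += 1
        elif pred == 1 and t == 0:
            fp += 1
        elif pred == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(y_true, scores, threshold):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels and model scores, for illustration only.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

for threshold in (0.7, 0.5, 0.25):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.7 to 0.25 raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.67.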

Performance Metrics

Performance metrics quantify how well a model predicts, with different measures capturing different aspects of predictive quality and different costs of error.

Definition

A performance metric is a numerical summary of how closely a model's predictions match the true targets on evaluation data; the appropriate metric depends on the task type, the balance of classes, and the relative costs of different kinds of error.

Scope

This topic covers the measures used to judge model performance: for classification, accuracy, precision, recall, the F-measure, and the receiver operating characteristic (ROC) curve and the area under it; for regression, mean squared error, mean absolute error, and the coefficient of determination; and probabilistic scoring rules such as log loss. It also addresses the choice of metric under class imbalance and asymmetric error costs.
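For the regression measures named above, a minimal sketch using their standard textbook definitions follows. The function names and the four data points are hypothetical, chosen only to keep the arithmetic easy to check.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences; penalizes large errors heavily."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences; less sensitive to outliers."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are equal")
    return 1 - ss_res / ss_tot

# Hypothetical targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R^2:", round(r_squared(y_true, y_pred), 4))
```

Here two of the four predictions are off by one unit, so both the mean squared and the mean absolute error are 0.5, and the coefficient of determination is about 0.9 because the residual sum of squares (2) is small next to the total sum of squares (20).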

Core questions

Which metric reflects the true objective of a given task?
How do precision and recall trade off, and when does each matter?
Why can accuracy mislead on imbalanced data?
How are probabilistic predictions scored?

Key theories

Confusion-matrix metrics: From the counts of true and false positives and negatives come accuracy, precision, recall, and the F-measure, which expose trade-offs that a single accuracy figure can hide.
Threshold-independent evaluation: The receiver operating characteristic curve and its area summarize a classifier's performance across all decision thresholds, useful when the operating point is not fixed in advance.
Proper scoring rules: Scoring rules such as log loss are minimized in expectation when the predicted probabilities equal the true probabilities, so they reward well-calibrated estimates and encourage honest probabilistic prediction.
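The last two ideas can be sketched in a few lines. The code below assumes the pairwise-ranking interpretation of the area under the ROC curve (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half) and the usual binary log loss. The data, the function names, and the 0.7 event rate are hypothetical.

```python
import math

def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))

def expected_log_loss(true_rate, forecast):
    """Expected log loss of always forecasting `forecast` when the
    event really occurs with probability `true_rate`."""
    return -(true_rate * math.log(forecast)
             + (1 - true_rate) * math.log(1 - forecast))

# Hypothetical labels and scores.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
print("AUC of scored model:   ", roc_auc(y_true, scores))
print("AUC of constant scorer:", roc_auc(y_true, [0.5] * len(y_true)))

# Search a grid of forecasts for the one with the lowest expected loss.
grid = [i / 100 for i in range(1, 100)]
best = min(grid, key=lambda q: expected_log_loss(0.7, q))
print("forecast minimizing expected log loss:", best)
```

No threshold appears in the AUC computation, which is why it suits cases where the operating point is not yet fixed; a constant scorer gets 0.5, the chance level. The grid search illustrates the proper-scoring property: when the event occurs 70% of the time, the forecast with the lowest expected log loss on this grid is 0.7 itself.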

Clinical relevance

Choosing the right performance metric is critical because a model optimized or judged by the wrong measure can perform poorly on what actually matters; in imbalanced or cost-sensitive settings such as fraud or disease detection, naive accuracy is especially misleading, and metrics must reflect the real consequences of errors.
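As a rough illustration of why consequences matter, the sketch below scores three candidate thresholds by both accuracy and total error cost. It assumes, purely for illustration, that a missed positive costs ten times as much as a false alarm; the data and cost values are hypothetical.

```python
# Hypothetical labels and model scores, for illustration only.
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]

# Assumed costs: a missed positive is ten times worse than a false alarm.
COST_FALSE_NEGATIVE = 10.0
COST_FALSE_POSITIVE = 1.0

def evaluate(y_true, scores, threshold):
    """Return (accuracy, total_cost), predicting positive if score >= threshold."""
    correct = 0
    cost = 0.0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == t:
            correct += 1
        elif t == 1:
            cost += COST_FALSE_NEGATIVE  # positive case was missed
        else:
            cost += COST_FALSE_POSITIVE  # negative case was flagged
    return correct / len(y_true), cost

candidates = (0.7, 0.5, 0.25)
for threshold in candidates:
    acc, cost = evaluate(y_true, scores, threshold)
    print(f"threshold={threshold}: accuracy={acc:.2f}, cost={cost:.0f}")

best = min(candidates, key=lambda th: evaluate(y_true, scores, th)[1])
print("lowest-cost threshold:", best)
```

On this toy data all three thresholds give the same accuracy (0.75), yet their total costs are 20, 11, and 2. Accuracy alone cannot distinguish between them, while a cost-aware measure clearly prefers the lowest threshold.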

History

Many metrics originate outside machine learning, with precision and recall from information retrieval and the receiver operating characteristic from signal detection theory. As machine learning was applied to imbalanced and high-stakes problems, careful metric choice and reporting of multiple complementary metrics became standard methodology.

Key figures

Trevor Hastie
Tom Fawcett
Christopher Bishop
