Classification Metrics: Accuracy, Precision, Recall, F1

Evaluating a classifier correctly

Summarising a classification model with a single number is often misleading. Accuracy hides true performance when classes are imbalanced. Precision measures how many predicted positives are truly positive; recall measures how many real positives were actually caught. The F1 score is their harmonic mean, balancing the two. Which metric to use depends on whether false positives or false negatives carry the higher cost in a given problem.

Core Concepts and Formulas

The outputs of a binary classifier fall into four cells: true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The key metrics derived from these cells are:

    Accuracy  = (TP + TN) / (TP + FP + TN + FN)
    Precision = TP / (TP + FP)
    Recall    = TP / (TP + FN)
    F1        = 2 × Precision × Recall / (Precision + Recall)

Precision asks how many of the observations labelled positive by the model are truly positive. Recall measures how many of all actual positives the model managed to detect. Because F1 is a harmonic mean, it is pulled towards the smaller of the two values: if either precision or recall is very low, F1 is low as well, however high the other component is.
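The formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the helper names `safe_div` and `binary_metrics` are hypothetical, the counts are invented for the example, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (assumed convention)."""
    return num / den if den else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from the four confusion-matrix cells."""
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only for illustration.
balanced = binary_metrics(tp=8, fp=2, tn=85, fn=5)

# A lopsided case: perfect precision but very low recall.
lopsided = binary_metrics(tp=1, fp=0, tn=90, fn=9)
arithmetic_mean = (lopsided["precision"] + lopsided["recall"]) / 2

print(balanced)
print(lopsided["f1"], arithmetic_mean)
```

In the lopsided case precision is 1.0 and recall is 0.1; their arithmetic mean is 0.55, while F1 is roughly 0.18, which shows how the harmonic mean follows the weaker component.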

Computing and Reading the Metrics

In practice these metrics are computed from a confusion matrix. Consider 100 samples of which 90 are negative and 10 are positive, and a model that predicts every sample as negative: accuracy reaches 90 percent, yet recall is zero, and precision is undefined (0/0, since nothing is predicted positive) and is conventionally reported as zero. This illustrates why a single metric should not be trusted on imbalanced datasets. For multi-class problems the metrics are computed per class and then summarised as a macro-average (treats all classes equally) or a weighted average (weights each class by its number of true samples). The averaging strategy must always be stated explicitly in any report.
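The 100-sample example can be reproduced directly, and the same data also shows how the two averaging strategies diverge. The sketch below uses only the standard library; the function names are hypothetical, and F1 is computed as 2TP / (2TP + FP + FN), which is algebraically the same as the harmonic-mean form and is set to 0.0 when the denominator is zero (an assumed convention).

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, tn, fn


def f1_for_class(y_true, y_pred, positive):
    """Per-class F1 as 2TP / (2TP + FP + FN); 0.0 if the denominator is zero."""
    tp, fp, _, fn = confusion_counts(y_true, y_pred, positive)
    den = 2 * tp + fp + fn
    return 2 * tp / den if den else 0.0


# 90 negatives (0), 10 positives (1); the model predicts negative for everything.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive=1)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)
# Precision would be TP / (TP + FP) = 0 / 0 here; by convention it is reported as 0.
precision = tp / (tp + fp) if (tp + fp) else 0.0

# Treat each label in turn as the "positive" class, then average.
classes = (0, 1)
per_class_f1 = {c: f1_for_class(y_true, y_pred, c) for c in classes}
support = {c: y_true.count(c) for c in classes}
macro_f1 = sum(per_class_f1.values()) / len(classes)
weighted_f1 = sum(per_class_f1[c] * support[c] for c in classes) / len(y_true)

print(accuracy, precision, recall)
print(per_class_f1, macro_f1, weighted_f1)
```

On this data the macro-averaged F1 is about 0.47 while the weighted average is about 0.85, because the weighted form lets the majority class dominate. The gap is a concrete reason to name the averaging strategy in a report.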

Common Misuses and Misconceptions

The most frequent error is reporting accuracy as the sole criterion on imbalanced datasets. A model that detects a rare disease by labelling everyone as negative achieves high accuracy while catching zero patients. A second common mistake is ignoring the precision-recall trade-off: lowering the decision threshold raises recall but typically lowers precision, and vice versa. F1 summarises this balance in one number and is often preferred to accuracy for that reason, but it describes only the single threshold at which it is computed. It is therefore still insufficient alone and should be presented alongside an ROC curve or AUC, which describe behaviour across all thresholds. Finally, F1 implicitly treats both error types as equally costly, an assumption that rarely holds in real applications.
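The threshold trade-off can be made concrete with a small sweep. The labels and scores below are invented for illustration, and `precision_recall_at` is a hypothetical helper; on real data precision does not always fall monotonically as the threshold drops, although recall can never decrease.

```python
def precision_recall_at(y_true, scores, threshold):
    """Predict positive when score >= threshold, then return (precision, recall)."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # 0.0 by assumed convention
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: true labels and the model's scores for the positive class.
y_true = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]

results = {}
for threshold in (0.8, 0.5, 0.2):
    results[threshold] = precision_recall_at(y_true, scores, threshold)
    precision, recall = results[threshold]
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  recall={recall:.2f}")
```

With this toy data, moving the threshold from 0.8 to 0.2 raises recall from 0.50 to 1.00 and lowers precision from 1.00 to 0.50. An F1 value reported at any one of these thresholds says nothing about the other two.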

Best Practices for Reporting

When reporting a classification study, good practice includes explicitly stating the class distribution, presenting the full confusion matrix, reporting precision and recall separately, and providing both F1 and AUC together. If the cost structure of the problem is known, the analysis should explain which type of error matters more. In medical diagnosis false negatives are typically very costly, making recall the primary metric. In retrieval systems where precision matters most, that metric should lead. Crucially, metric selection should be anchored to the problem definition before data collection begins, not chosen retrospectively to favour the model.
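When the cost structure is known, one simple way to show which error type matters more is to price each error and compare candidate operating points. The sketch below is only an illustration under stated assumptions: the linear cost model, the cost values, the error counts and the names `total_error_cost` and `cheapest` are all hypothetical.

```python
def total_error_cost(fp, fn, cost_fp, cost_fn):
    """Linear cost model (an assumption): each FP and each FN has a fixed price."""
    return fp * cost_fp + fn * cost_fn


def cheapest(points, cost_fp, cost_fn):
    """Return the name of the operating point with the lowest total error cost."""
    costs = {
        name: total_error_cost(c["fp"], c["fn"], cost_fp, cost_fn)
        for name, c in points.items()
    }
    return min(costs, key=costs.get), costs


# Hypothetical error counts for two thresholds of the same classifier.
operating_points = {
    "high_precision": {"fp": 0, "fn": 2},
    "high_recall": {"fp": 4, "fn": 0},
}

# Diagnosis-like setting: a missed positive is assumed to cost 10x a false alarm.
best_diagnosis, costs_diagnosis = cheapest(operating_points, cost_fp=1, cost_fn=10)

# Retrieval-like setting: a false alarm is assumed to cost 10x a miss.
best_retrieval, costs_retrieval = cheapest(operating_points, cost_fp=10, cost_fn=1)

print(best_diagnosis, costs_diagnosis)
print(best_retrieval, costs_retrieval)
```

The same two operating points swap rank when the costs are reversed, which is why the cost assumptions belong in the problem definition rather than being chosen after the results are in.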

Sources

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. ISBN: 978-0-387-84857-0