Evaluation and Annotation

The methodology of measuring language-processing systems: building annotated corpora, quantifying agreement among annotators, and scoring system output with metrics that allow fair comparison.

Definition

Evaluation and annotation is the set of practices for producing reliable labeled data and for measuring how well computational systems reproduce or predict those labels.

Scope

Covers the empirical infrastructure of computational linguistics — manual annotation schemes and guidelines, inter-annotator agreement statistics such as kappa, train/development/test partitioning, and evaluation metrics including precision, recall, F-measure, accuracy, and task-specific scores like BLEU. It addresses validity and reproducibility concerns but not the design of individual downstream systems.

Core questions

How do we measure whether annotators agree, and why does chance-corrected agreement matter?
Which metrics are appropriate for classification, sequence labeling, and generation tasks?
How do train/development/test splits guard against overfitting and inflated results?
What makes an evaluation reproducible and comparable across studies?
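The question about data splits can be made concrete with a small sketch. The function name split_corpus, the 80/10/10 proportions, and the fixed seed are illustrative assumptions, not a prescribed standard. The point is that the partition is made once and reproducibly, so the test portion stays untouched while a system is tuned on the development portion.

```python
import random


def split_corpus(items, dev_fraction=0.1, test_fraction=0.1, seed=13):
    """Shuffle once with a fixed seed and cut into train/dev/test lists.

    The fractions and the seed are illustrative defaults (assumptions).
    """
    if not 0 <= dev_fraction + test_fraction < 1:
        raise ValueError("dev and test fractions must leave room for training data")
    shuffled = list(items)
    # A private Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    n_dev = int(len(shuffled) * dev_fraction)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test


train, dev, test = split_corpus(range(100))
print(len(train), len(dev), len(test))
```

Real corpora often need more care than a random shuffle, for example keeping all sentences from one document in the same partition so that near-duplicates do not leak from training into test.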

Key concepts

inter-annotator agreement
kappa statistic
precision and recall
F-measure
train/development/test split
BLEU
annotation guidelines
gold standard
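To illustrate the first two concepts, here is a minimal sketch of Cohen's kappa for two annotators, computed as (observed agreement - expected agreement) / (1 - expected agreement). The function name cohen_kappa and the toy labels are assumptions made for illustration. The example is chosen so that raw agreement looks high while kappa does not.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: probability that both annotators pick the same
    # label by chance, given each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one and the same label throughout;
        # kappa is undefined in this case.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical): a rare "ENT" label on which the annotators never agree.
annotator_a = ["O"] * 16 + ["ENT"] * 2 + ["O"] * 2
annotator_b = ["O"] * 16 + ["O"] * 2 + ["ENT"] * 2
raw = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
print(f"raw agreement = {raw:.2f}, kappa = {cohen_kappa(annotator_a, annotator_b):.2f}")
```

On this toy data the annotators agree on 80% of items, yet kappa is slightly negative (about -0.11), because nearly all of that agreement comes from the frequent "O" label and would be expected by chance.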

Key theories

Chance-corrected agreement: Reliability of annotation should be measured with coefficients such as Cohen's or Fleiss' kappa that subtract the agreement expected by chance, not raw percentage agreement.
Automatic n-gram-overlap evaluation: Generation quality can be approximated cheaply by comparing system output to references via n-gram overlap, as in BLEU, enabling rapid iteration despite known limitations.
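The n-gram-overlap idea can be sketched as follows. This is a simplified, sentence-level illustration with a single reference and no smoothing, not the official BLEU implementation. The function names and the default of max_n=2 are assumptions; BLEU as usually reported is computed over a whole corpus with n-grams up to length 4.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def simple_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions times a brevity penalty."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_avg)


reference = "the cat is on the mat".split()
print(modified_precision("the the the the".split(), reference, 1))
print(simple_bleu("the cat is on the mat".split(), reference))
```

The clipping step is what keeps a degenerate output such as "the the the the" from earning full unigram credit: only two of its four tokens are matched, because "the" occurs twice in the reference.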

History

As corpus-based methods spread in the 1990s, the field needed shared standards for labeling data and scoring systems. Agreement statistics borrowed from content analysis were adapted to linguistic annotation and later surveyed by Artstein and Poesio (2008). Metrics like BLEU (2002) made automatic evaluation of generation tractable and shaped shared-task culture.

Debates

Do automatic metrics measure quality?: Metrics such as BLEU correlate only loosely with human judgments, especially for fluent generation, fueling ongoing debate about when automatic scores are trustworthy versus when human evaluation is required.

Key figures

Ron Artstein
Massimo Poesio
Kishore Papineni

Seminal works

artstein2008
papineni2002

Frequently asked questions

Why not just report accuracy?: Accuracy can be misleading when classes are imbalanced or when false positives and false negatives carry different costs. Precision, recall, and F-measure separate the two error types and so give a more informative picture for most language tasks.
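A small sketch makes the answer concrete. The function names, the "spam"/"ham" labels, and the 5-in-100 class ratio are hypothetical, and returning 0.0 when precision or recall is undefined is one common convention, assumed here for simplicity.

```python
def accuracy(gold, predicted):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def precision_recall_f1(gold, predicted, positive):
    """Precision, recall, and balanced F-measure for one class of interest."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical imbalanced data: 5 "spam" items among 100,
# and a system that always predicts the majority class.
gold = ["spam"] * 5 + ["ham"] * 95
predicted = ["ham"] * 100
print(accuracy(gold, predicted))
print(precision_recall_f1(gold, predicted, positive="spam"))
```

The majority-class system reaches 0.95 accuracy while its precision, recall, and F-measure for "spam" are all 0.0, since it never finds a single instance of the class of interest.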