Evaluation and Annotation
The methodology of measuring language-processing systems: building annotated corpora, quantifying agreement among annotators, and scoring system output with metrics that allow fair comparison.
Definition
Evaluation and annotation is the set of practices for producing reliable labeled data and for measuring how well computational systems reproduce or predict those labels.
Scope
Covers the empirical infrastructure of computational linguistics: manual annotation schemes and guidelines, inter-annotator agreement statistics such as kappa, train/development/test partitioning, and evaluation metrics including precision, recall, F-measure, accuracy, and task-specific scores such as BLEU. It addresses validity and reproducibility concerns but not the design of individual downstream systems.
Core questions
- How do we measure whether annotators agree, and why does chance-corrected agreement matter?
- Which metrics are appropriate for classification, sequence labeling, and generation tasks?
- How do train/development/test splits guard against overfitting and inflated results?
- What makes an evaluation reproducible and comparable across studies?
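The question about data splits can be made concrete with a small sketch. It assumes a flat list of independent examples and a simple random split; the function name `split_dataset`, the 80/10/10 proportions, and the seed are illustrative choices, not a prescribed standard. The fixed seed is what makes the partition repeatable, which also bears on the reproducibility question.

```python
import random


def split_dataset(items, dev_frac=0.1, test_frac=0.1, seed=13):
    """Shuffle once with a fixed seed, then cut into train/dev/test.

    Hypothetical helper for illustration; real corpora often need
    document-level or speaker-level splits to avoid leakage.
    """
    items = list(items)
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    rng.shuffle(items)

    n_test = round(len(items) * test_frac)
    n_dev = round(len(items) * dev_frac)

    test = items[:n_test]                 # touched only for the final report
    dev = items[n_test:n_test + n_dev]    # used for tuning and model selection
    train = items[n_test + n_dev:]        # used for fitting parameters
    return train, dev, test


train, dev, test = split_dataset(range(100))
print(len(train), len(dev), len(test))
```

Tuning against the development portion and scoring the test portion only once is what keeps the reported number from being inflated by repeated peeking.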
Key concepts
- inter-annotator agreement
- kappa statistic
- precision and recall
- F-measure
- train/development/test split
- BLEU
- annotation guidelines
- gold standard
Key theories
- Chance-corrected agreement
- Reliability of annotation should be measured with coefficients that subtract the agreement expected by chance, such as Cohen's kappa (two annotators) or Fleiss' kappa (more than two), rather than with raw percentage agreement.
- Automatic n-gram-overlap evaluation
- Generation quality can be approximated cheaply by comparing system output to references via n-gram overlap, as in BLEU, enabling rapid iteration despite known limitations.
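To illustrate the chance correction, here is a minimal sketch of Cohen's kappa for two annotators, computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected from each annotator's label distribution. The function name `cohens_kappa` and the toy part-of-speech labels are assumptions made for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: chance that both pick the same label if each
    # labels independently according to their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:
        return float("nan")  # undefined: both used a single identical label
    return (observed - expected) / (1 - expected)


# Toy data (illustrative only): a skewed label distribution.
ann_a = ["N", "N", "N", "N", "N", "N", "N", "N", "V", "V"]
ann_b = ["N", "N", "N", "N", "N", "N", "N", "V", "N", "V"]
print(round(cohens_kappa(ann_a, ann_b), 3))  # about 0.375
```

In this toy case the annotators agree on 8 of 10 items, yet because both use "N" most of the time, 0.68 agreement is already expected by chance, so kappa is far lower than the raw 0.80 suggests.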
History
As corpus-based methods spread in the 1990s, the field needed shared standards for labeling data and scoring systems. Agreement statistics borrowed from content analysis were adapted to linguistic annotation, a development surveyed authoritatively by Artstein and Poesio (2008). Metrics like BLEU (2002) made automatic evaluation of generation tractable and shaped shared-task culture.
Debates
- Do automatic metrics measure quality?
- Metrics such as BLEU correlate only loosely with human judgments, especially for fluent generation, fueling ongoing debate about when automatic scores are trustworthy versus when human evaluation is required.
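One way to see where the debate comes from is to look at the overlap computation itself. The sketch below implements only clipped n-gram precision against a single reference, which is one ingredient of BLEU; it is not full BLEU (there is no brevity penalty, no combination across n-gram orders, and no multi-reference support). The function names and the example sentences are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams found in the reference, where each
    n-gram is credited at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # candidate shorter than n tokens
    matched = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return matched / total


reference = "the cat sat on the mat".split()
paraphrase = "a cat was sitting on the rug".split()
degenerate = "the the the the the the".split()

print(clipped_precision(paraphrase, reference, 1))  # 3 of 7 unigrams match
print(clipped_precision(paraphrase, reference, 2))  # 1 of 6 bigrams match
print(clipped_precision(degenerate, reference, 1))  # clipped to 2 of 6
```

The acceptable paraphrase scores low because it shares few surface n-grams with the reference, which is the kind of mismatch with human judgment the debate turns on. Clipping at least stops the degenerate output from being rewarded for repeating a frequent word.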
Key figures
- Ron Artstein
- Massimo Poesio
- Kishore Papineni
Related topics
Seminal works
- artstein2008
- papineni2002
Frequently asked questions
- Why not just report accuracy?
- Accuracy can be misleading when classes are imbalanced or when false positives and false negatives carry different costs. Precision, recall, and F-measure give a more informative picture for most language tasks.
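A small worked example makes the imbalance problem visible. The sketch below assumes a binary view of a labeling task with a rare positive class; the helper `binary_scores`, the labels "ENT" and "O", and the two toy systems are invented for illustration.

```python
def binary_scores(gold, predicted, positive):
    """Accuracy, precision, recall, and F1 for one positive class."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    correct = sum(g == p for g, p in pairs)

    # Zero denominators are mapped to 0.0 here; this is a convention.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": correct / len(gold), "precision": precision,
            "recall": recall, "f1": f1}


# Toy data: 5 positive items out of 100.
gold = ["ENT"] * 5 + ["O"] * 95
always_o = ["O"] * 100                                    # never predicts ENT
tagger = ["ENT"] * 4 + ["O"] + ["ENT"] * 6 + ["O"] * 89   # 4 hits, 6 false alarms

print(binary_scores(gold, always_o, "ENT"))
print(binary_scores(gold, tagger, "ENT"))
```

The system that never predicts the positive class reaches 0.95 accuracy with zero recall, while the tagger that finds four of the five positives scores a lower 0.93 accuracy but an F1 of about 0.53. Accuracy alone would rank the useless system first.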