Treebanks and Annotated Corpora

Corpora hand-annotated with linguistic structure (syntactic trees, dependencies, word senses, and named entities) that serve as training data and gold standards for computational linguistics.

Definition

A treebank is a corpus in which each sentence is annotated with its syntactic structure. More broadly, an annotated corpus carries explicit linguistic labels that human annotators have added or verified.
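To make the definition concrete, the sketch below reads one sentence written in the bracketed notation associated with Penn Treebank-style constituency annotation. The sentence, the function names parse_bracketed and leaves, and the nested-tuple representation are illustrative assumptions, not part of any treebank's official tooling.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]          # category label, e.g. S, NP, VBD
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                     # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(node):
    """Return the words of a tree in left-to-right order."""
    _, children = node
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Hypothetical example sentence, not drawn from an actual treebank.
example = "(S (NP (DT The) (NN cat)) (VP (VBD sat)))"
tree = parse_bracketed(example)
print(tree[0], leaves(tree))
```

The tree stores the annotation itself: the words are recoverable from the leaves, while the labels and nesting are what the annotators contributed.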

Scope

Covers the design and construction of annotated corpora, especially treebanks with constituency or dependency syntax, along with the annotation pipelines, guidelines, and quality control behind them. It includes the Penn Treebank tradition, the cross-lingual Universal Dependencies effort, and the role of inter-annotator agreement. General corpus design and lexical resources are covered in sibling topics.

Core questions

How are treebanks designed and what annotation schemes do they use?
Why are annotated corpora indispensable for supervised learning?
How is annotation quality assured and measured?
How does cross-lingual annotation like Universal Dependencies achieve consistency?
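On the question of how annotation quality is measured, one common chance-corrected agreement statistic is Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between two annotators and p_e is the agreement expected by chance. The sketch below is a minimal illustration with made-up part-of-speech labels. Kappa is chosen here only as an example; projects may report other measures, such as agreement on attachments.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)

    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label if each
    # annotator drew labels independently from their own distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for the same four tokens.
annotator_a = ["NOUN", "VERB", "NOUN", "ADJ"]
annotator_b = ["NOUN", "VERB", "ADJ", "ADJ"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.636
```

Here the annotators agree on 3 of 4 tokens (p_o = 0.75) and chance agreement is 5/16, so kappa is 7/11. Items on which annotators disagree are the ones that typically go to adjudication.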

Key concepts

treebank
annotation scheme
annotation guidelines
gold standard
inter-annotator agreement
Penn Treebank
Universal Dependencies
adjudication

Key theories

Treebank-driven supervised learning: Hand-annotated syntactic corpora provide the supervision signal that made statistical parsing, tagging, and many other NLP tasks possible.
Cross-lingual harmonized annotation: Universal Dependencies applies a single annotation scheme across many languages, enabling comparable treebanks and transfer of models.
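Universal Dependencies treebanks are distributed in the CoNLL-U format, in which each token occupies one line of ten tab-separated fields (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). Because every language uses the same layout and label inventory, one reader works across treebanks. The sketch below is a minimal reader under simplifying assumptions: the sample sentence is invented, the helper name read_conllu_sentence is hypothetical, and multiword-token ranges and empty nodes are skipped rather than modeled.

```python
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu_sentence(block):
    """Read one CoNLL-U sentence block into a list of token dicts."""
    tokens = []
    for line in block.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # blank line or sentence-level comment
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError("expected 10 tab-separated fields: %r" % line)
        if "-" in cols[0] or "." in cols[0]:
            continue                      # multiword-token range or empty node
        token = dict(zip(FIELDS, cols))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])  # 0 marks the root of the sentence
        tokens.append(token)
    return tokens


# Invented sample sentence; tabs are written explicitly to keep them visible.
rows = [
    ["1", "The", "the", "DET", "DT", "Definite=Def|PronType=Art", "2", "det", "_", "_"],
    ["2", "cat", "cat", "NOUN", "NN", "Number=Sing", "3", "nsubj", "_", "_"],
    ["3", "sat", "sit", "VERB", "VBD", "Mood=Ind|Tense=Past|VerbForm=Fin", "0", "root", "_", "_"],
]
sample = "# text = The cat sat\n" + "\n".join("\t".join(r) for r in rows)

tokens = read_conllu_sentence(sample)
by_id = {t["id"]: t for t in tokens}

# Each arc is (dependent, relation, head); the root's head is shown as ROOT.
arcs = [(t["form"], t["deprel"],
         by_id[t["head"]]["form"] if t["head"] else "ROOT")
        for t in tokens]
print(arcs)
```

The same arcs could be extracted from a treebank in any other language without changing the reader, which is what makes models and evaluations comparable across languages.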

History

The Penn Treebank (1993) was the first large syntactically annotated corpus to see wide use, and it catalyzed statistical parsing. Subsequent resources added semantic and discourse layers, and the Universal Dependencies project standardized annotation across languages, becoming the de facto multilingual treebank resource.

Debates

Annotation depth versus consistency: Richer annotation captures more linguistic detail but is harder to apply consistently; projects must balance theoretical sophistication against reliable, scalable annotation.

Key figures

Mitchell Marcus
Beatrice Santorini
Marie-Catherine de Marneffe
Joakim Nivre

Seminal works

marcus1993
demarneffe2021

Frequently asked questions

Why build treebanks by hand if parsers exist?: Parsers are trained and evaluated against human-annotated treebanks, which serve as the gold standard. Without reliable hand annotation there would be nothing to learn from or to measure accuracy against.