Treebanks and Annotated Corpora
Corpora hand-annotated with linguistic structure (syntactic trees, dependencies, word senses, and named entities) that serve as training data and gold standards for computational linguistics.
Definition
A treebank is a corpus in which each sentence is annotated with its syntactic structure, typically a constituency tree or a dependency tree; more broadly, an annotated corpus is any corpus carrying explicit linguistic labels added or corrected by human annotators.
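To make the definition concrete, the sketch below reads one sentence written in the bracketed notation commonly associated with Penn-Treebank-style constituency annotation. The function names `parse_bracketed` and `leaves` and the toy sentence are illustrative assumptions, not part of any treebank's tooling, and real treebank files contain details (empty elements, function tags) that this sketch ignores.

```python
def parse_bracketed(text):
    """Parse a bracketed tree string into nested (label, children) tuples.

    Hypothetical helper for illustration; children are either sub-trees
    or plain strings (the words at the leaves).
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word at a leaf
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(tree):
    """Return the words of the sentence in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


# Toy sentence, invented for this example.
example = "(S (NP (DT The) (NN cat)) (VP (VBD sat)))"
tree = parse_bracketed(example)
print(tree[0], leaves(tree))
```

The nested tuples keep the annotation (labels such as `NP` and `VBD`) separate from the raw text, which is recoverable from the leaves alone.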
Scope
Covers the design and construction of annotated corpora, especially treebanks carrying constituency or dependency syntax, together with the annotation pipelines, guidelines, and quality control behind them. It includes the Penn Treebank tradition, the cross-lingual Universal Dependencies effort, and the role of inter-annotator agreement. General corpus design and lexical resources are covered in sibling topics.
Core questions
- How are treebanks designed and what annotation schemes do they use?
- Why are annotated corpora indispensable for supervised learning?
- How is annotation quality assured and measured?
- How do cross-lingual annotation efforts such as Universal Dependencies achieve consistency?
Key concepts
- treebank
- annotation scheme
- annotation guidelines
- gold standard
- inter-annotator agreement
- Penn Treebank
- Universal Dependencies
- adjudication
Key theories
- Treebank-driven supervised learning
- Hand-annotated syntactic corpora provide the supervision signal that made statistical parsing, tagging, and many other NLP tasks possible.
- Cross-lingual harmonized annotation
- Universal Dependencies applies a single annotation scheme across many languages, enabling comparable treebanks and transfer of models.
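Universal Dependencies treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), and lines beginning with `#` as comments. The following is a minimal reading sketch; the `Token` class, the `read_conllu_sentence` function, and the three-word sample are assumptions made for illustration, and the sketch deliberately skips multiword-token ranges and empty nodes rather than handling them.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """A reduced view of one CoNLL-U token line (illustrative, not a full model)."""
    id: int
    form: str
    upos: str
    head: int      # 0 means the token is the root of the sentence
    deprel: str


def read_conllu_sentence(block):
    """Read the token lines of a single CoNLL-U sentence block."""
    tokens = []
    for line in block.strip().splitlines():
        if not line or line.startswith("#"):
            continue  # blank line or sentence-level comment
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}")
        if not cols[0].isdigit():
            continue  # skip multiword ranges such as 1-2 and empty nodes such as 1.1
        tokens.append(Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    return tokens


# Invented sample sentence; columns are joined with tabs explicitly.
SAMPLE = "\n".join([
    "# text = The cat sat",
    "\t".join(["1", "The", "the", "DET", "_", "_", "2", "det", "_", "_"]),
    "\t".join(["2", "cat", "cat", "NOUN", "_", "_", "3", "nsubj", "_", "_"]),
    "\t".join(["3", "sat", "sit", "VERB", "_", "_", "0", "root", "_", "_"]),
])

tokens = read_conllu_sentence(SAMPLE)
for tok in tokens:
    print(tok.id, tok.form, tok.upos, tok.head, tok.deprel)
```

Because every language uses the same columns and the same inventory of part-of-speech tags and relation labels, the same reader works unchanged on any treebank in the collection, which is what makes cross-lingual comparison and model transfer practical.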
History
The Penn Treebank (1993) was among the first large syntactically annotated corpora and, as the most widely used, catalyzed statistical parsing. Subsequent treebanks added semantic and discourse layers, and the Universal Dependencies project standardized annotation across languages, becoming the de facto multilingual treebank resource.
Debates
- Annotation depth versus consistency
- Richer annotation captures more linguistic detail but is harder to apply consistently; projects must balance theoretical sophistication against reliable, scalable annotation.
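Consistency is usually quantified as inter-annotator agreement. One common chance-corrected measure for categorical labels is Cohen's kappa, $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed proportion of items on which two annotators agree and $p_e$ is the agreement expected by chance from each annotator's label distribution. The sketch below assumes two annotators labeling the same tokens with one label each; the function name `cohen_kappa` and the toy labels are illustrative, and agreement on full trees is often measured with other metrics.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each annotator's label proportions,
    # summed over labels (labels unused by annotator A contribute zero).
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy part-of-speech labels from two hypothetical annotators.
annotator_a = ["NOUN", "NOUN", "VERB", "VERB"]
annotator_b = ["NOUN", "NOUN", "VERB", "NOUN"]
kappa = cohen_kappa(annotator_a, annotator_b)
print(kappa)  # 0.5 for this toy data: observed 0.75, expected 0.5
```

A richer label set tends to lower both observed and chance agreement, so kappa values from schemes of different depth are not directly comparable, which is part of why the trade-off is hard to settle empirically.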
Key figures
- Mitchell Marcus
- Beatrice Santorini
- Marie-Catherine de Marneffe
- Joakim Nivre
Related topics
Seminal works
- marcus1993
- demarneffe2021
Frequently asked questions
- Why build treebanks by hand if parsers exist?
- Parsers are trained and evaluated against human-annotated treebanks, which serve as the gold standard. Without reliable hand annotation there would be nothing to learn from or to measure accuracy against.
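Measuring a dependency parser against the gold standard is typically done with attachment scores: the unlabeled attachment score (UAS) is the fraction of tokens assigned the correct head, and the labeled attachment score (LAS) additionally requires the correct relation label. The sketch below assumes gold and predicted analyses share the same tokenization; the function name `attachment_scores` and the toy data are illustrative, and real evaluation scripts apply further conventions that are not modeled here.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two aligned lists of (head, deprel) pairs."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("need two non-empty, equally long analyses")
    n = len(gold)
    # UAS: the head index matches, regardless of the relation label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS: both the head index and the relation label match.
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return uas_hits / n, las_hits / n


# Toy three-token sentence: the parser gets every head right
# but mislabels one relation.
gold = [(2, "det"), (3, "nsubj"), (0, "root")]
pred = [(2, "det"), (3, "obj"), (0, "root")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Both numbers are only as trustworthy as the gold annotation itself: an error in the treebank counts a correct parser output as wrong.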