Lexical and Corpus Resources

The data and knowledge bases that empirical computational linguistics depends on: text corpora, lexical databases and ontologies, computational treatments of word structure, and richly annotated treebanks.

Definition

Lexical and corpus resources are structured collections of language data — texts, lexicons, and annotations — built to support empirical analysis and the training of language-processing systems.

Scope

This area covers the construction, curation, and use of language resources: balanced and web corpora, lexical-semantic databases such as WordNet, computational morphology and lexicons, and annotated treebanks. It addresses corpus design, representativeness, annotation standards, and the role of resources in training and evaluating systems. Algorithmic modeling that consumes these resources is covered in other areas.

Core questions

How are corpora designed to be representative and balanced?
How can word meanings be organized into machine-readable lexical databases?
How is word structure represented computationally across morphologically rich languages?
Why are annotated treebanks central to data-driven linguistics?

Key concepts

corpus
representativeness
lexical database
WordNet
synset
morphological lexicon
treebank
annotation standard

Key theories

Corpus-based empiricism: The methodological stance that linguistic generalizations and system parameters should be grounded in large samples of attested usage rather than introspection alone.
Lexical-semantic networks: Organizing the lexicon as a graph of senses linked by relations such as synonymy and hypernymy, as in WordNet, supporting tasks from disambiguation to semantic similarity.
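
As a rough illustration of the lexical-semantic network idea, the sketch below stores a handful of synsets as a small graph and derives a path-based similarity from shared hypernyms. The synset identifiers, the contents of `SYNSETS`, and the function names are all invented for this example; it does not use WordNet data or any WordNet API, and the similarity formula is only one simple choice among many.

```python
from collections import deque

# Toy synsets: each groups synonymous lemmas and points to its hypernyms.
# Identifiers and contents are hypothetical, not taken from WordNet.
SYNSETS = {
    "entity.1": {"lemmas": ["entity"], "hypernyms": []},
    "animal.1": {"lemmas": ["animal", "beast"], "hypernyms": ["entity.1"]},
    "canine.1": {"lemmas": ["canine"], "hypernyms": ["animal.1"]},
    "feline.1": {"lemmas": ["feline"], "hypernyms": ["animal.1"]},
    "dog.1": {"lemmas": ["dog", "domestic dog"], "hypernyms": ["canine.1"]},
    "cat.1": {"lemmas": ["cat", "house cat"], "hypernyms": ["feline.1"]},
}


def synonyms(synset_id):
    """Lemmas that share one sense (the synonymy relation)."""
    return list(SYNSETS[synset_id]["lemmas"])


def hypernym_distances(synset_id):
    """Map each ancestor (and the synset itself) to its distance in edges."""
    distances = {synset_id: 0}
    queue = deque([synset_id])
    while queue:
        current = queue.popleft()
        for parent in SYNSETS[current]["hypernyms"]:
            if parent not in distances:
                distances[parent] = distances[current] + 1
                queue.append(parent)
    return distances


def path_similarity(a, b):
    """1 / (1 + shortest path via a shared hypernym); 0.0 if none is shared."""
    dist_a, dist_b = hypernym_distances(a), hypernym_distances(b)
    shared = set(dist_a) & set(dist_b)
    if not shared:
        return 0.0
    shortest = min(dist_a[s] + dist_b[s] for s in shared)
    return 1.0 / (1 + shortest)
```

In this toy graph, "dog.1" and "canine.1" are one edge apart, so their similarity is 0.5, while "dog.1" and "cat.1" meet at "animal.1" after four edges, giving 0.2. A real lexical database has many more relation types and multiple hypernym paths per sense, which this sketch only gestures at.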

History

The shift to empirical methods in the 1990s made corpora and lexical resources foundational. WordNet provided a reusable lexical-semantic database, balanced corpora like the British National Corpus set design standards, and Kilgarriff and Grefenstette's work legitimized the Web itself as a vast corpus for linguistic study.

Debates

Balanced corpora versus the Web as corpus: Whether carefully balanced corpora or the messy but enormous Web better serve linguistic inquiry; the field increasingly uses both, weighing representativeness against scale.
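
The trade-off can be made concrete with a minimal sketch of stratified sampling, one common way to impose a balanced design on a skewed document pool. The genre labels, target proportions, pool sizes, and the `sample_balanced` function are assumptions made up for illustration, not the design of any particular corpus.

```python
import random
from collections import Counter, defaultdict


def sample_balanced(documents, targets, total, seed=0):
    """Draw about `total` documents so each genre matches its target share.

    documents: list of (doc_id, genre) pairs
    targets:   dict mapping genre -> desired proportion (should sum to 1.0)
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_genre = defaultdict(list)
    for doc_id, genre in documents:
        by_genre[genre].append(doc_id)

    sample = []
    for genre, share in targets.items():
        quota = round(total * share)
        pool = by_genre.get(genre, [])
        if len(pool) < quota:
            raise ValueError(
                f"not enough {genre!r} documents: need {quota}, have {len(pool)}"
            )
        sample.extend((doc_id, genre) for doc_id in rng.sample(pool, quota))
    return sample


# Hypothetical pool dominated by web text, as a crawl typically would be.
pool = (
    [(f"web-{i}", "web") for i in range(900)]
    + [(f"fic-{i}", "fiction") for i in range(60)]
    + [(f"news-{i}", "news") for i in range(40)]
)
targets = {"web": 0.5, "fiction": 0.25, "news": 0.25}
balanced = sample_balanced(pool, targets, total=40)
genre_counts = Counter(genre for _, genre in balanced)
```

The sample follows the target shares rather than the 90 percent web share of the pool, but it can be no larger than the scarcest genre allows. That ceiling is the scale cost of balance that the debate turns on.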

Key figures

Christiane Fellbaum
Adam Kilgarriff
Christopher Manning
George Miller

Seminal works

fellbaum1998
kilgarriff2003
manning1999

Frequently asked questions

What makes a good corpus? A good corpus is large enough for reliable statistics and representative of the language variety being studied, with clear documentation of its sources, sampling, and any annotation so results can be interpreted and reproduced.
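
To show what "large enough for reliable statistics" can mean in practice, the sketch below computes a word's frequency per million tokens and a Wilson score interval around its relative frequency. Treating tokens as independent draws is a simplifying assumption that real text violates (words cluster within documents), so the intervals are optimistic. The function names and the counts in the example are hypothetical.

```python
import math


def per_million(count, corpus_size):
    """Normalized frequency: occurrences per million tokens."""
    return 1_000_000 * count / corpus_size


def wilson_interval(count, corpus_size, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    p = count / corpus_size
    denom = 1 + z * z / corpus_size
    center = (p + z * z / (2 * corpus_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / corpus_size + z * z / (4 * corpus_size ** 2)
    ) / denom
    return center - half, center + half


# Same relative frequency (50 per million) in two hypothetical corpora.
small_low, small_high = wilson_interval(5, 100_000)
large_low, large_high = wilson_interval(500, 10_000_000)
```

Both corpora report the same 50 occurrences per million, yet the interval from the smaller corpus is far wider, so an estimate for a rare word is only as trustworthy as the corpus is large.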