Lexical and Corpus Resources
The data and knowledge bases that empirical computational linguistics depends on: text corpora, lexical databases and ontologies, computational treatments of word structure, and richly annotated treebanks.
Definition
Lexical and corpus resources are structured collections of language data — texts, lexicons, and annotations — built to support empirical analysis and the training of language-processing systems.
Scope
Covers the construction, curation, and use of language resources — balanced and web corpora, lexical-semantic databases such as WordNet, computational morphology and lexicons, and annotated treebanks. It addresses corpus design, representativeness, annotation standards, and the role of resources in training and evaluating systems. Algorithmic modeling that consumes these resources is covered in other areas.
Core questions
- How are corpora designed to be representative and balanced?
- How can word meanings be organized into machine-readable lexical databases?
- How is word structure represented computationally across morphologically rich languages?
- Why are annotated treebanks central to data-driven linguistics?
Key concepts
- corpus
- representativeness
- lexical database
- WordNet
- synset
- morphological lexicon
- treebank
- annotation standard
Key theories
- Corpus-based empiricism
- The methodological stance that linguistic generalizations and system parameters should be grounded in large samples of attested usage rather than introspection alone.
- Lexical-semantic networks
- Organizing the lexicon as a graph in which synonymous word senses are grouped into synsets and the synsets are linked by relations such as hypernymy, as in WordNet, supporting tasks from word sense disambiguation to semantic similarity.
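To make the network idea concrete, here is a minimal sketch in plain Python. It is not WordNet's actual data format or API: the synset identifiers, the tiny inventory, and the helper names (synsets_for, hypernym_distances, path_similarity) are all assumptions chosen for illustration. Synonymy is modeled as membership in the same synset, hypernymy as edges between synsets, and similarity as the inverse of the shortest path through a shared hypernym.

```python
from collections import deque

# Toy data (hypothetical): synset id -> member lemmas.
# Synonymy is represented as membership in the same synset.
SYNSETS = {
    "dog.n": {"dog", "domestic_dog"},
    "canine.n": {"canine", "canid"},
    "cat.n": {"cat", "true_cat"},
    "feline.n": {"feline", "felid"},
    "carnivore.n": {"carnivore"},
    "animal.n": {"animal", "beast"},
}

# Hypernymy edges: synset id -> its more general synsets.
HYPERNYMS = {
    "dog.n": ["canine.n"],
    "canine.n": ["carnivore.n"],
    "cat.n": ["feline.n"],
    "feline.n": ["carnivore.n"],
    "carnivore.n": ["animal.n"],
    "animal.n": [],
}


def synsets_for(lemma):
    """Return the ids of all synsets that contain the lemma."""
    return sorted(sid for sid, lemmas in SYNSETS.items() if lemma in lemmas)


def hypernym_distances(synset_id):
    """Breadth-first walk up the hypernym edges.

    Returns a dict mapping each reachable ancestor (and the synset
    itself, at distance 0) to its distance in edges.
    """
    dist = {synset_id: 0}
    queue = deque([synset_id])
    while queue:
        current = queue.popleft()
        for parent in HYPERNYMS.get(current, []):
            if parent not in dist:
                dist[parent] = dist[current] + 1
                queue.append(parent)
    return dist


def path_similarity(a, b):
    """1 / (1 + shortest path length through a shared hypernym).

    Returns None when the two synsets share no ancestor.
    """
    da, db = hypernym_distances(a), hypernym_distances(b)
    shared = set(da) & set(db)
    if not shared:
        return None
    shortest = min(da[s] + db[s] for s in shared)
    return 1.0 / (1.0 + shortest)


if __name__ == "__main__":
    print(synsets_for("canid"))
    print(path_similarity("dog.n", "cat.n"))
```

In this toy graph, dog.n and cat.n meet at carnivore.n, so their score is lower than that of dog.n and its direct hypernym canine.n. Real lexical databases hold many more relation types and senses per lemma, which this sketch leaves out.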
History
The shift to empirical methods in the 1990s made corpora and lexical resources foundational. WordNet, begun at Princeton under George Miller and documented in Fellbaum's 1998 volume, provided a reusable lexical-semantic database. Balanced corpora such as the British National Corpus set standards for corpus design. Kilgarriff and Grefenstette's 2003 work then legitimized the Web itself as a vast corpus for linguistic study.
Debates
- Balanced corpora versus the Web as corpus
- Whether carefully balanced corpora or the messy but enormous Web better serve linguistic inquiry; the field increasingly uses both, weighing representativeness against scale.
Key figures
- Christiane Fellbaum
- Adam Kilgarriff
- Christopher Manning
- George Miller
Seminal works
- fellbaum1998
- kilgarriff2003
- manning1999
Frequently asked questions
- What makes a good corpus?
- A good corpus is large enough for reliable statistics and representative of the language variety being studied, with clear documentation of its sources, sampling, and any annotation so results can be interpreted and reproduced.
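As a rough illustration of the documentation this answer calls for, the sketch below computes a small corpus profile: total tokens, distinct types, and each genre's share of the tokens. The four-document corpus, the genre labels, the whitespace tokenization, and the function name profile are assumptions made for the example, not a standard tool or a recommended sampling scheme.

```python
from collections import Counter

# Hypothetical mini-corpus: (genre label, text) pairs.
DOCS = [
    ("news", "The council approved the budget ."),
    ("news", "The budget passed ."),
    ("fiction", "She opened the door and smiled ."),
    ("spoken", "well I think the the budget is fine"),
]


def profile(docs):
    """Summarize size and genre composition of a labeled corpus.

    Tokenization is naive (lowercase, split on whitespace); a real
    project would document and use a proper tokenizer.
    """
    tokens_by_genre = Counter()
    vocabulary = Counter()
    for genre, text in docs:
        tokens = text.lower().split()
        tokens_by_genre[genre] += len(tokens)
        vocabulary.update(tokens)
    total = sum(tokens_by_genre.values())
    return {
        "tokens": total,
        "types": len(vocabulary),
        "genre_share": {g: n / total for g, n in tokens_by_genre.items()},
    }


if __name__ == "__main__":
    summary = profile(DOCS)
    print(summary["tokens"], summary["types"])
    for genre, share in sorted(summary["genre_share"].items()):
        print(f"{genre}: {share:.2f}")
```

Reporting figures like these alongside the sources and sampling procedure lets readers judge whether a corpus is large and varied enough for the claims built on it.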