Corpus Linguistics and Web Corpora
Studying language through large samples of authentic text: building and querying corpora, measuring collocations and frequencies, and harnessing the Web as a vast linguistic resource.
Definition
Corpus linguistics is the empirical study of language based on systematic collections of naturally occurring text, analyzed with frequency, concordance, and association measures.
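As a rough illustration of two of the analyses named above, the following sketch builds a frequency list and a simple keyword-in-context concordance over a toy text. The sample sentence, the regex tokenizer, and the helper names `tokenize` and `concordance` are assumptions made for this example, not part of any standard corpus toolkit.

```python
import re
from collections import Counter

def tokenize(text):
    # Crude lowercase word tokenizer (an assumption; real corpora need more care).
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, window=3):
    # Collect (left context, node word, right context) for every hit of `node`.
    rows = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows

# Invented toy text standing in for a corpus.
sample = ("She drinks strong tea every morning. "
          "He prefers strong coffee, but strong tea is fine too.")
tokens = tokenize(sample)

# Frequency list: how often each word form occurs.
freq = Counter(tokens)
print(freq.most_common(3))

# Concordance lines, with the node word aligned in the middle column.
for left, node, right in concordance(tokens, "strong"):
    print(f"{left:>25} | {node} | {right}")
```

Aligning the node word in a fixed column is what lets an analyst scan the contexts on either side for recurring patterns.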
Scope
Covers the design, compilation, and analysis of text corpora: sampling and balance, concordancing and keyword analysis, frequency and collocation statistics such as mutual information, and the use of the Web as a corpus. It addresses both descriptive corpus linguistics and the supply of data for computational systems. Annotation schemes and treebanks are covered in a sibling topic.
Core questions
- How are corpora sampled to represent a language variety fairly?
- How do association measures like mutual information reveal collocations?
- What are the benefits and pitfalls of using the Web as a corpus?
- How do concordances support linguistic and lexicographic analysis?
Key concepts
- corpus design
- concordance
- collocation
- pointwise mutual information
- frequency distribution
- keyword analysis
- Web as corpus
- balanced corpus
Key theories
- Association measures for collocation
- Using statistics such as pointwise mutual information to detect word pairs that co-occur more often than chance would predict, revealing collocations and supporting lexicography.
- Web as corpus
- Treating the Web as an enormous, if uncontrolled, corpus, enabling study of rare phenomena and low-resource varieties while raising questions of representativeness.
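The association-measure idea can be sketched in a few lines. The example below scores adjacent word pairs by pointwise mutual information, estimating probabilities from raw relative frequencies. The toy text, the `bigram_pmi` helper, and its `min_count` parameter are assumptions for illustration; the restriction to adjacent pairs is a simplification, since collocation studies often count co-occurrence within a wider window.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Crude lowercase word tokenizer (an assumption for this sketch).
    return re.findall(r"[a-z']+", text.lower())

def bigram_pmi(tokens, min_count=1):
    # PMI in bits for adjacent word pairs, using relative-frequency estimates.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs too rare to estimate reliably
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Invented toy text; a real study would use a corpus of many thousands of words.
sample = ("strong tea is strong tea and the engine is powerful "
          "and the tea is hot")
tokens = tokenize(sample)
scores = bigram_pmi(tokens)

# Rank pairs from most to least strongly associated.
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 2))
```

On a sample this small the ranking is dominated by words that occur only once, which illustrates a known weakness of the measure: it tends to overrate rare pairs, so a frequency floor such as `min_count` is commonly applied on real data.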
History
Corpus linguistics grew from Sinclair's lexicographic projects and the construction of balanced corpora, while Church and Hanks's 1989 work on mutual information brought statistical association measures into the mainstream. In 2003, Kilgarriff and Grefenstette established the Web as a legitimate, if noisy, corpus of unprecedented scale.
Debates
- Representativeness of Web data
- Web corpora are huge but unbalanced and hard to characterize, prompting debate over how far conclusions drawn from them generalize to a language as a whole.
Key figures
- Adam Kilgarriff
- Kenneth Church
- Patrick Hanks
- John Sinclair
Related topics
Seminal works
- church1989
- kilgarriff2003
Frequently asked questions
- What is a collocation?
- A collocation is a pair or group of words that habitually occur together more often than chance would predict, such as 'strong tea' rather than 'powerful tea'. Association measures help detect them automatically.
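One common way to make "more often than chance would predict" precise is pointwise mutual information, written here in its usual form. The symbols are assumptions of this sketch: P(x) and P(y) are the probabilities of the two words and P(x, y) is the probability of their co-occurring, all estimated from corpus counts.

```latex
\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
```

The score is zero when the two words occur independently, positive when they co-occur more often than independence predicts, and negative when they co-occur less often. A habitual pairing such as 'strong tea' would be expected to score higher than 'powerful tea'.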