Corpus Linguistics and Web Corpora

Studying language through large samples of authentic text: building and querying corpora, measuring collocations and frequencies, and harnessing the Web as a vast linguistic resource.

Definition

Corpus linguistics is the empirical study of language based on systematic collections of naturally occurring text, analyzed with frequency, concordance, and association measures.

Scope

Covers the design, compilation, and analysis of text corpora — sampling and balance, concordancing and keyword analysis, frequency and collocation statistics such as mutual information, and the use of the Web as a corpus. It addresses both descriptive corpus linguistics and the supply of data for computational systems. Annotation schemes and treebanks are covered in a sibling topic.
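
To make the concordancing mentioned above concrete, here is a minimal keyword-in-context (KWIC) sketch in Python. The function name `kwic`, the `width` parameter, and the toy sentence are illustrative assumptions, not part of any particular concordancer. Real tools add tokenization, sorting of contexts, and corpus indexing.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, hit, right context) for each match of `keyword`.

    Matching is case-insensitive; `width` is the number of tokens kept
    on each side of the hit.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical toy text, for illustration only.
tokens = "She drank strong tea and he drank strong coffee".split()
for left, hit, right in kwic(tokens, "strong", width=2):
    # Right-align the left context so the hits line up in a column.
    print(f"{left:>20} [{hit}] {right}")
```

Aligning the hits in a single column is what lets an analyst scan the contexts on either side and spot recurring patterns.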

Core questions

  • How are corpora sampled to represent a language variety fairly?
  • How do association measures like mutual information reveal collocations?
  • What are the benefits and pitfalls of using the Web as a corpus?
  • How do concordances support linguistic and lexicographic analysis?

Key concepts

  • corpus design
  • concordance
  • collocation
  • pointwise mutual information
  • frequency distribution
  • keyword analysis
  • Web as corpus
  • balanced corpus

Key theories

Association measures for collocation
Using statistics such as pointwise mutual information to detect word pairs that co-occur more often than chance would predict, revealing collocations and supporting lexicography.
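
For reference, pointwise mutual information for a word pair is commonly defined as below. The notation is assumed here rather than taken from the text: N is the corpus size in tokens, f(x) and f(y) are the word frequencies, and f(x, y) is the co-occurrence frequency. The approximation uses simple relative-frequency estimates of the probabilities.

```latex
\[
\mathrm{PMI}(x, y) \;=\; \log_2 \frac{P(x, y)}{P(x)\,P(y)}
\;\approx\; \log_2 \frac{f(x, y)\, N}{f(x)\, f(y)}
\]
```

A score above zero means the pair co-occurs more often than independence would predict, and a score of zero corresponds to independence. Because the estimate is unstable for low counts, a minimum frequency threshold is usually applied in practice.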
Web as corpus
Treating the Web as an enormous, if uncontrolled, corpus, enabling study of rare phenomena and low-resource varieties while raising questions of representativeness.

History

Corpus linguistics grew from Sinclair's lexicographic projects and the construction of balanced corpora, while Church and Hanks's 1989 work on mutual information brought statistical association measures into the mainstream. Kilgarriff and Grefenstette later established the Web as a legitimate, if noisy, corpus of unprecedented scale.

Debates

Representativeness of Web data
Web corpora are huge but unbalanced and hard to characterize, prompting debate over how far conclusions drawn from them generalize to a language as a whole.

Key figures

  • Adam Kilgarriff
  • Kenneth Church
  • Patrick Hanks
  • John Sinclair

Seminal works

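  • Church, K. W., and Hanks, P. (1989). Word association norms, mutual information, and lexicography. [church1989]
  • Kilgarriff, A., and Grefenstette, G. (2003). Introduction to the special issue on the Web as corpus. Computational Linguistics, 29(3). [kilgarriff2003]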

Frequently asked questions

What is a collocation?
A collocation is a pair or group of words that habitually co-occur, more often than chance would predict. English speakers say 'strong tea', not 'powerful tea', although the two adjectives are near-synonyms. Association measures such as pointwise mutual information help detect collocations automatically.
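
As a rough sketch of how such automatic detection might look, the Python below scores adjacent word pairs by pointwise mutual information, log2(P(x, y) / (P(x) P(y))), with probabilities estimated from raw counts. The function name `pmi_bigrams`, the `min_count` threshold, and the toy token list are assumptions made for illustration. Real studies use far larger corpora and often wider co-occurrence windows.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=1):
    """Return {(w1, w2): PMI in bits} for adjacent word pairs.

    Probabilities are relative-frequency estimates: unigram counts over
    the number of tokens, bigram counts over the number of adjacent pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_pairs = max(n_tokens - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI estimates are unreliable
        p_xy = count / n_pairs
        p_x = unigrams[w1] / n_tokens
        p_y = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy corpus, for illustration only.
tokens = "she drank strong tea and he drank strong coffee with strong tea".split()
scores = pmi_bigrams(tokens, min_count=2)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

The `min_count` filter reflects a common design choice: PMI rewards pairs made of rare words, so pairs seen only once or twice are usually discarded before ranking.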
