Corpus Linguistics and Web Corpora

Studying language through large samples of authentic text: building and querying corpora, measuring collocations and frequencies, and harnessing the Web as a vast linguistic resource.

Definition

Corpus linguistics is the empirical study of language based on systematic collections of naturally occurring text, analyzed with frequency counts, concordances, and association measures.

Scope

Covers the design, compilation, and analysis of text corpora: sampling and balance, concordancing and keyword analysis, frequency and collocation statistics such as mutual information, and the use of the Web as a corpus. It addresses both descriptive corpus linguistics and the supply of data to computational systems. Annotation schemes and treebanks are covered in a sibling topic.
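One way to picture "sampling and balance" is quota sampling: fix how many documents each text category should contribute, then draw that many at random from each category. The sketch below is only an illustration under stated assumptions. The function name `sample_balanced`, the `(category, text)` input format, and the quota values are all hypothetical, and real corpus design involves many more decisions than category counts.

```python
import random
from collections import defaultdict


def sample_balanced(documents, quotas, seed=0):
    """Draw a fixed number of documents per category.

    documents: list of (category, text) pairs (hypothetical input format).
    quotas: dict mapping category -> number of documents wanted.
    seed: fixed seed so the draw is reproducible.
    """
    rng = random.Random(seed)

    # Group the available texts by category.
    by_category = defaultdict(list)
    for category, text in documents:
        by_category[category].append(text)

    sample = []
    for category, wanted in quotas.items():
        pool = by_category.get(category, [])
        if len(pool) < wanted:
            raise ValueError(
                f"only {len(pool)} documents available for {category!r}"
            )
        # Sample without replacement within the category.
        sample.extend((category, text) for text in rng.sample(pool, wanted))
    return sample


# Toy collection: ten news texts and five fiction texts.
docs = [("news", f"news text {i}") for i in range(10)]
docs += [("fiction", f"fiction text {i}") for i in range(5)]

balanced = sample_balanced(docs, {"news": 3, "fiction": 3})
```

Here both categories contribute three documents even though news is twice as common in the source collection, which is the sense in which the sample is balanced rather than proportional.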

Core questions

How are corpora sampled to represent a language variety fairly?
How do association measures like mutual information reveal collocations?
What are the benefits and pitfalls of using the Web as a corpus?
How do concordances support linguistic and lexicographic analysis?
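To make the concordance question concrete, the following is a minimal keyword-in-context (KWIC) sketch: it lists every occurrence of a search word with a few words of context on either side. The helper names `tokenize` and `kwic`, the crude regular-expression tokenizer, and the sample sentence are assumptions made for illustration, not a description of any particular concordancer.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a crude tokenizer)."""
    return re.findall(r"\w+", text.lower())


def kwic(tokens, node, width=3):
    """Return (left context, node word, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


text = "She drank strong tea. He prefers strong coffee in the morning."
for left, node, right in kwic(tokenize(text), "strong", width=2):
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>12}  {node}  {right}")
```

Aligning the node word in a single column is what lets an analyst or lexicographer scan the contexts quickly and spot recurring neighbours.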

Key concepts

corpus design
concordance
collocation
pointwise mutual information
frequency distribution
keyword analysis
Web as corpus
balanced corpus
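Two of the concepts above, frequency distribution and keyword analysis, can be sketched together. The code below counts word frequencies in a target and a reference text, then ranks words that are relatively more frequent in the target by a log-likelihood score, one statistic commonly used for keyness. The function names, the toy token lists, and the decision to keep only overused words are assumptions of this sketch, and both token lists are assumed to be non-empty.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood score for one word.

    a: frequency in the target corpus, b: frequency in the reference corpus,
    c: size of the target corpus, d: size of the reference corpus (in tokens).
    """
    # Expected frequencies if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(target_tokens, reference_tokens, top=5):
    """Rank words that are relatively more frequent in the target corpus."""
    target = Counter(target_tokens)        # frequency distribution of target
    reference = Counter(reference_tokens)  # frequency distribution of reference
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference[word]  # Counter returns 0 for unseen words
        if a / c > b / d:    # keep only words overused in the target
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


target = "tea tea tea cup the the".split()
reference = "the the the the cat dog".split()
print(keywords(target, reference))
```

On corpora this small the scores mean little; the point is only the shape of the computation, which compares observed frequencies against what equal relative frequency would predict.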

Key theories

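Association measures for collocation: Statistics such as pointwise mutual information compare how often two words actually co-occur with how often they would co-occur if they were independent. Pairs that score well above chance are candidate collocations, which makes these measures useful for lexicography.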
Web as corpus: Treating the Web as an enormous, if uncontrolled, corpus, enabling study of rare phenomena and low-resource varieties while raising questions of representativeness.
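The association-measure idea can be sketched in a few lines. The code below estimates pointwise mutual information (base 2) for adjacent word pairs from raw counts. It is a minimal illustration rather than a reference implementation: the function name `pmi_scores`, the restriction to adjacent pairs, and the `min_count` threshold are assumptions of this sketch. The threshold is included because PMI estimated from very small counts tends to be unreliable.

```python
import math
from collections import Counter


def pmi_scores(tokens, min_count=2):
    """Pointwise mutual information (base 2) for adjacent word pairs."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = n_tokens - 1

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # skip pairs too rare to estimate reliably
        p_xy = count / n_bigrams        # observed probability of the pair
        p_x = unigrams[w1] / n_tokens   # probability of the first word
        p_y = unigrams[w2] / n_tokens   # probability of the second word
        # Ratio of observed co-occurrence to what independence would predict.
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


tokens = (
    "strong tea and strong tea or powerful computer and powerful computer"
).split()
for pair, score in sorted(pmi_scores(tokens).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

A positive score means the pair occurs more often than independence would predict, and a score near zero means the words co-occur about as often as chance. Practical tools usually count co-occurrence within a window rather than strict adjacency.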

History

Corpus linguistics grew from Sinclair's lexicographic projects and the construction of balanced corpora, while Church and Hanks's 1989 work on mutual information brought statistical association measures into the mainstream. Kilgarriff and Grefenstette later established the Web as a legitimate, if noisy, corpus of unprecedented scale.

Debates

Representativeness of Web data: Web corpora are huge but unbalanced and hard to characterize, prompting debate over how far conclusions drawn from them generalize to a language as a whole.

Key figures

Adam Kilgarriff
Kenneth Church
Patrick Hanks
John Sinclair

Seminal works

church1989
kilgarriff2003

Frequently asked questions

What is a collocation?: A collocation is a pair or group of words that occur together more often than chance would predict, such as 'strong tea', which speakers prefer over the near-synonymous 'powerful tea'. Association measures such as pointwise mutual information help detect collocations automatically.
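As a worked illustration of "more often than chance would predict", pointwise mutual information compares the observed probability of a pair with the probability expected if the two words were independent. The counts below are invented purely for the arithmetic and do not come from any real corpus: a corpus of 1,000,000 tokens in which 'strong' occurs 1,000 times, 'tea' occurs 500 times, and 'strong tea' occurs 50 times.

```latex
\[
\mathrm{PMI}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
\]
\[
P(\text{strong}, \text{tea}) = \frac{50}{10^{6}} = 5 \times 10^{-5}, \qquad
P(\text{strong})\,P(\text{tea}) = \frac{1000}{10^{6}} \cdot \frac{500}{10^{6}} = 5 \times 10^{-7}
\]
\[
\mathrm{PMI}(\text{strong}, \text{tea}) = \log_2 \frac{5 \times 10^{-5}}{5 \times 10^{-7}} = \log_2 100 \approx 6.64
\]
```

Under these assumed counts the pair occurs about 100 times more often than independence would predict, which is the kind of signal that flags a candidate collocation.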