Corpus Linguistics

Corpus Linguistics Analysis Method · Also known as: Corpus Analysis, Corpora Studies

Corpus Linguistics is the study of language based on large, representative collections of texts (corpora) processed by computer. Pioneered by John Sinclair and others, the method uses statistical analysis, concordancing, and computational tools to examine patterns of actual language use. Corpus linguistics has transformed our understanding of English and other languages, revealing frequency patterns, collocation preferences, and register variation that were previously hidden. It serves theoretical linguistics, applied language teaching, and natural language processing.

When to use it

Use corpus linguistics to investigate actual language use patterns, to test theoretical hypotheses with empirical data, to understand variation across registers and dialects, or to develop language resources for NLP applications. Corpora are essential for studying high-frequency phenomena, dialect variation, language change, and lexical semantics. They complement experimental and introspective methods.

Strengths & limitations

Strengths

Provides empirical evidence for language patterns based on actual use, reducing reliance on introspection and invented sentences.
Enables quantitative analysis of frequency, collocation, and variation across large, diverse datasets.
Facilitates cross-linguistic and cross-register comparison by using standardized, comparable corpora.
Supports discovery of patterns (e.g., semantic preferences, register markers) that would be difficult to predict without data.
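
Comparing frequencies across corpora of different sizes only works if raw counts are normalized first. The following is a minimal sketch of per-million-word normalization; the tokenizer is deliberately crude, and the two toy texts, along with the names `tokenize` and `per_million`, are illustrative assumptions rather than part of any particular corpus tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into runs of letters/apostrophes (a crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def per_million(count, corpus_size):
    """Normalize a raw count to occurrences per million tokens."""
    return count / corpus_size * 1_000_000

# Two toy "corpora" of different sizes (made-up text, for illustration only).
spoken = "well you know I mean you know it was fine"
written = ("the results indicate that the method was reliable "
           "and the sample was adequate")

spoken_tokens = tokenize(spoken)
written_tokens = tokenize(written)
spoken_counts = Counter(spoken_tokens)
written_counts = Counter(written_tokens)

for word in ("was", "you"):
    s = per_million(spoken_counts[word], len(spoken_tokens))
    w = per_million(written_counts[word], len(written_tokens))
    print(f"{word}: spoken {s:.0f} per million, written {w:.0f} per million")
```

With real corpora the same normalization lets a count from a 1-million-word spoken sample be set beside a count from a 100-million-word written one.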

Limitations

Corpora are finite samples; low-frequency phenomena may not be represented adequately, and absence from a corpus does not prove a form is impossible.
Annotation (POS tagging, parsing) introduces errors that propagate downstream. Manual annotation is expensive; automatic annotation is imperfect.
Results reflect corpus composition (e.g., written corpora may not represent spoken language equally); biases in selection affect findings.
Frequency patterns do not directly reveal competence; high frequency may reflect performance factors like discourse conventions rather than grammatical structure.

Frequently asked

How large does a corpus need to be?

It depends on your research question. For high-frequency phenomena (common words, basic sentence types), tens of thousands of words suffice. For rare phenomena or detailed variation, millions of words may be needed. A good rule of thumb: you should have at least 30-50 occurrences of your target phenomenon for basic statistical analysis. However, even small specialized corpora (thousands of words) are valuable for endangered language documentation.
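
The rule of thumb above can be turned into simple arithmetic: if a phenomenon occurs at a known or estimated rate per million words, the corpus size needed to expect a given number of hits follows directly. This is a rough sketch; the example rates are hypothetical, and the function names are introduced here for illustration.

```python
import math

def expected_hits(rate_per_million, corpus_size):
    """Expected number of occurrences in a corpus of the given size."""
    return rate_per_million * corpus_size / 1_000_000

def min_corpus_size(rate_per_million, target_hits=30):
    """Smallest corpus size (in tokens) expected to yield target_hits occurrences."""
    return math.ceil(target_hits * 1_000_000 / rate_per_million)

# Hypothetical rates: a common word, a mid-frequency word, a rare construction.
for rate in (5000, 50, 0.5):
    size = min_corpus_size(rate, target_hits=30)
    print(f"{rate} per million -> about {size:,} tokens for 30 expected hits")
```

These are expected values only. Actual counts vary from sample to sample, so a margin above the minimum is prudent.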

What is a concordance?

A concordance is a list of all occurrences of a target word or phrase, each shown with its surrounding context. In the common key-word-in-context (KWIC) layout, each occurrence takes one line, with a fixed window of words or characters on either side and the target centered for easy scanning. Concordances reveal collocations (words that regularly appear near the target), register variation, and syntactic patterns. Modern corpus software generates concordances with frequency counts and sorting options.
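
A key-word-in-context style concordance can be sketched in a few lines. This is a minimal illustration, not how any particular concordancer is implemented; the function names `kwic` and `format_kwic`, the word-based window, and the sample text are all assumptions made for the example.

```python
import re

def kwic(text, target, width=4):
    """Return (left context, target, right context) for each hit of target."""
    tokens = re.findall(r"\w+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

def format_kwic(hits):
    """Right-align the left context so the target lines up in one column."""
    pad = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left.rjust(pad)}  {tok}  {right}" for left, tok, right in hits]

sample = ("She made a strong case for reform. "
          "The tea was strong and the coffee was stronger. "
          "A strong wind blew all night.")

hits = kwic(sample, "strong", width=3)
for line in format_kwic(hits):
    print(line)

# Sorting by the right-hand context groups similar continuations together.
by_right = sorted(hits, key=lambda hit: hit[2])
```

Exact matching on the token means "stronger" is not returned. A real study would decide up front whether to search word forms or lemmas.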

How do I measure collocation strength?

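Several metrics quantify collocation strength (association). Mutual Information (MI) compares the observed co-occurrence frequency with the frequency expected if the two words were independent, and it tends to favor rare pairs. The Log-Likelihood Ratio (LLR) tests whether the observed co-occurrence differs significantly from that expectation. The T-score weights the difference by the observed frequency and so tends to favor frequent pairs. Delta-P is directional: it measures how much the presence of one word raises the probability of the other. No single metric is universally best; different metrics suit different research questions. Inspect raw frequencies alongside metrics to avoid artifacts where low-frequency collocations show high association by chance.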
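
To make the contrast between metrics concrete, the sketch below computes four measures from a 2x2 contingency table using their standard textbook formulas. The counts are hypothetical, the function name `association` is introduced here for illustration, and corpus tools may differ in details such as window size or how N is counted.

```python
import math

def association(o11, f1, f2, n):
    """Association scores for a word pair.

    o11: co-occurrence count (must be > 0), f1: frequency of the node word,
    f2: frequency of the collocate, n: corpus size.
    """
    # Observed cells of the 2x2 table.
    o12 = f1 - o11            # node without collocate
    o21 = f2 - o11            # collocate without node
    o22 = n - f1 - f2 + o11   # neither
    observed = [o11, o12, o21, o22]
    # Expected cells under independence.
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    e11 = expected[0]
    mi = math.log2(o11 / e11)
    t_score = (o11 - e11) / math.sqrt(o11)
    log_likelihood = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )
    # Delta-P(collocate | node): P(collocate | node) - P(collocate | no node).
    delta_p = o11 / f1 - o21 / (n - f1)
    return {"mi": mi, "t_score": t_score,
            "log_likelihood": log_likelihood, "delta_p": delta_p}

# Hypothetical counts in a one-million-token corpus.
frequent = association(o11=200, f1=2000, f2=3000, n=1_000_000)
rare = association(o11=2, f1=2, f2=2, n=1_000_000)

for name, scores in (("frequent pair", frequent), ("rare pair", rare)):
    print(name, {k: round(v, 2) for k, v in scores.items()})
```

In this toy comparison the pair seen only twice receives the higher MI, while the pair seen 200 times receives the higher T-score and log-likelihood, which is why raw frequencies should be read alongside the scores.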

Can corpus evidence prove a sentence is ungrammatical?

No. Absence from a corpus shows it is infrequent or absent in that particular collection, not that it is impossible in the language. Sentences may be rare due to pragmatic or discourse constraints, not grammaticality. Combine corpus evidence with acceptability judgments, experimental data, and theoretical reasoning to draw conclusions about grammaticality.
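
A back-of-the-envelope calculation shows why absence is weak evidence. Assuming, as a simplification, that occurrences are independent and follow a Poisson distribution (real texts are burstier than this), the chance of finding no instance of a perfectly possible form can be estimated as follows. The rate used is hypothetical, and `prob_unattested` is a name introduced for this sketch.

```python
import math

def prob_unattested(rate_per_million, corpus_size):
    """Poisson probability of zero occurrences in a corpus of the given size."""
    expected = rate_per_million * corpus_size / 1_000_000
    return math.exp(-expected)

# A form occurring on average once every two million words (hypothetical).
for size in (100_000, 1_000_000, 10_000_000):
    p = prob_unattested(0.5, size)
    print(f"{size:>10,} tokens: P(no occurrences) = {p:.3f}")
```

Under these assumptions a one-million-word corpus misses such a form more often than not, so a zero count there says little about whether the form is possible in the language.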

Sources

Sinclair, J. M. (1991). Corpus, Concordance, Collocation. Oxford: Oxford University Press.
McEnery, T., & Hardie, A. (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge: Cambridge University Press. DOI: 10.1017/CBO9780511981395
Biber, D., Conrad, S., & Reppen, R. (2006). Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press.

How to cite this page

ScholarGate. (2026, June 3). Corpus Linguistics Analysis Method. ScholarGate. https://scholargate.app/en/linguistics/corpus-linguistics

Referenced by

Acoustic Phonetics Comparative Method Corpus Concordance Analysis Dialectometry Linguistic Ethnography

Related reference concepts

Corpus Linguistics and Web Corpora Lexical and Corpus Resources Computational Linguistics Historical Corpora and Attested Records Evaluation and Annotation Computational Text Analysis
