Corpus Building and Curation

Every computational reading depends on a corpus, and no corpus is neutral. Choices about what to include, how to clean and structure the texts, and which metadata to attach shape every result that follows, which makes corpus construction a scholarly act in its own right.

Definition

The principled assembly, processing, documentation, and maintenance of text collections used for computational analysis, together with critical attention to how those collections are selected and shaped.

Scope

Covers the construction and stewardship of text corpora for computational analysis: selection and sampling, cleaning and normalization, optical character recognition and transcription, metadata, and documentation. Includes critical reflection on representativeness, bias, and the constructed nature of humanities datasets. Treated here from a digital-humanities perspective rather than as corpus linguistics.
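
To make the cleaning and normalization step concrete, here is a minimal Python sketch. The function name normalize_text, its two flags, and the sample string are assumptions made for illustration; they are not taken from any particular project or library. Each flag stands for an editorial decision that changes the text and should therefore be documented.

```python
import re
import unicodedata


def normalize_text(raw: str, *, dehyphenate: bool = True,
                   modernize_long_s: bool = True) -> str:
    """Apply a few common, lossy cleaning steps to one text.

    Hypothetical helper for illustration only.
    """
    # Canonical Unicode composition, so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", raw)
    if modernize_long_s:
        # Replace the historical long s (U+017F) with a modern "s".
        text = text.replace("\u017f", "s")
    if dehyphenate:
        # Join words split across a line break: "exam-\nple" -> "example".
        # This also joins true compounds ("well-\nknown" -> "wellknown").
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse every run of whitespace (including newlines) to one space.
    return re.sub(r"\s+", " ", text).strip()


sample = "The Po\u017f\u017fe\u017fsion of a well-\nknown exam-\nple   text."
print(normalize_text(sample))
print(normalize_text(sample, dehyphenate=False))
```

With dehyphenation on, the sketch repairs "exam-ple" but also fuses the genuine compound "well-known"; with it off, both words stay broken. Neither setting is neutral, which is why the choice belongs in the corpus documentation.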

Core questions

What does it mean for a corpus to represent a body of literature or history?
How do cleaning, OCR, and normalization decisions affect downstream analysis?
What metadata and documentation does a reusable corpus need?
Whose texts are missing from available digital collections, and why?

Key concepts

Sampling
Representativeness
OCR
Normalization
Provenance
Documentation
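
Provenance and documentation can be recorded per text in a small structured record. The sketch below is one possible shape, not a standard: the TextRecord class, its field names, the describe helper, and the example values are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TextRecord:
    """Hypothetical per-text documentation record."""
    text_id: str
    title: str
    source: str                # where the file was obtained
    selection_reason: str      # why this text is in the corpus
    processing: List[str] = field(default_factory=list)  # steps applied, in order
    sha256: str = ""           # fingerprint of the exact content analyzed


def describe(text_id: str, title: str, source: str, selection_reason: str,
             content: str, processing: List[str]) -> TextRecord:
    """Build a record and fingerprint the content so later changes are detectable."""
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return TextRecord(text_id, title, source, selection_reason,
                      list(processing), digest)


record = describe(
    text_id="t0001",
    title="Example Novel",                      # placeholder, not a real holding
    source="example-archive",                   # placeholder source label
    selection_reason="sampled from the archive's fiction holdings",
    content="Full text of the example novel.",
    processing=["OCR (uncorrected)", "NFC normalization", "dehyphenation"],
)
print(json.dumps(asdict(record), indent=2))
```

Storing the checksum alongside the processing steps lets a later user confirm that the file they analyze is the one the record describes.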

Key theories

Data as constructed, not given: Gitelman and contributors argued that data is always made (selected, cleaned, framed), so 'raw data' is an oxymoron and every dataset carries the assumptions of its construction.
Representativeness and the literary corpus: Underwood discussed how the composition and biases of digital collections shape claims about literary change, making sampling and provenance central methodological concerns.
Collections as scholarly arguments: Bode argued that the digital collections underlying computational literary history are themselves interpretive constructs, and that scholars must account for how a collection was built.

History

As computational text analysis grew, scholars increasingly recognized that results depend on the corpora behind them. Gitelman's 2013 volume challenged the idea of neutral data; Bode (2018) and Underwood (2019) made the construction and bias of literary collections explicit, establishing corpus curation as a methodological and critical concern.

Debates

Representativeness versus availability: Corpora are often built from whatever has been digitized, which skews toward certain languages, periods, and canonical works, raising the question of how far conclusions can generalize.
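
One simple way to make such skew visible is to tabulate what share of a corpus falls into each metadata category before drawing conclusions from it. The sketch below assumes each text is described by a dictionary with language and decade keys; the composition function and the four toy records are invented for illustration and describe no real collection.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable, Mapping


def composition(records: Iterable[Mapping[str, Hashable]],
                key: str) -> Dict[Hashable, float]:
    """Return the share of records that carry each value of one metadata field."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


# Toy metadata, invented for this example.
corpus = [
    {"id": "t1", "language": "en", "decade": 1850},
    {"id": "t2", "language": "en", "decade": 1850},
    {"id": "t3", "language": "en", "decade": 1890},
    {"id": "t4", "language": "fr", "decade": 1890},
]

by_language = composition(corpus, "language")
by_decade = composition(corpus, "decade")
print(by_language)
print(by_decade)
```

A table like this does not establish representativeness by itself, since that requires a defensible account of the population the corpus is meant to stand for, but it shows which categories dominate the texts that happened to be available.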

Key figures

Ted Underwood
Katherine Bode
Lisa Gitelman

Seminal works

gitelman2013
bode2018
underwood2019

Frequently asked questions

Why can't I just download a big pile of texts and analyze them?: Because the composition of that pile determines your results. Available collections are uneven and biased toward what has been digitized, and uncorrected OCR introduces errors. Documenting selection, provenance, and processing is essential for interpreting and trusting any computational finding.