Computational Text Analysis

When literary and historical questions are posed at the scale of thousands or millions of texts, computation becomes a method of reading. This area gathers the quantitative techniques the digital humanities use to find patterns in large textual corpora — and the lively debate over what those patterns mean.

Definition

The application of quantitative and computational techniques to large collections of humanities texts in order to detect patterns, model literary or historical change, and pose interpretive questions at scales beyond close reading.

Scope

Covers quantitative and computational methods applied to humanities texts: distant reading and macroanalysis, stylometry and authorship attribution, topic modeling and text mining, and the building of the corpora these methods require. Includes methodological debates about the validity and interpretive value of computational literary studies. Distinct from corpus linguistics and natural language processing, which sit in linguistics and computer science respectively.
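
As one illustration of the stylometric methods in scope, here is a minimal Python sketch of a distance in the style of Burrows's Delta: texts are compared by the z-scored relative frequencies of the corpus's most frequent words. The function names, the crude tokenizer, the use of the population standard deviation, and the toy texts are all assumptions made for illustration, not a reference implementation.

```python
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    # Assumption: lowercase alphabetic runs are an adequate notion of "word".
    return re.findall(r"[a-z]+", text.lower())


def relative_freqs(tokens, vocab):
    # Share of the text's tokens taken up by each vocabulary word.
    counts = Counter(tokens)
    total = len(tokens)  # assumes the text is not empty
    return [counts[w] / total for w in vocab]


def delta_distances(texts, n_features=5):
    """Map each pair of text names to a Delta-style distance."""
    tokenized = {name: tokenize(t) for name, t in texts.items()}

    # Features: the most frequent words across the whole corpus.
    corpus_counts = Counter()
    for toks in tokenized.values():
        corpus_counts.update(toks)
    vocab = [w for w, _ in corpus_counts.most_common(n_features)]

    names = list(texts)
    freqs = {n: relative_freqs(tokenized[n], vocab) for n in names}

    # z-score each feature across texts so no single word dominates.
    z = {n: [] for n in names}
    for j in range(len(vocab)):
        column = [freqs[n][j] for n in names]
        mu, sigma = mean(column), pstdev(column)
        for n in names:
            z[n].append((freqs[n][j] - mu) / sigma if sigma else 0.0)

    # Delta: mean absolute difference between two texts' z-scores.
    distances = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            distances[(a, b)] = mean([abs(x - y) for x, y in zip(z[a], z[b])])
    return distances


# Hypothetical toy corpus; real studies use long texts and far more features.
toy = {
    "text_a": "the cat and the dog",
    "text_b": "the cat and the dog",
    "text_c": "a bird or a fish or a worm",
}
for pair, d in delta_distances(toy).items():
    print(pair, round(d, 3))
```

Smaller distances indicate more similar word-frequency profiles. Whether similarity of this kind indicates shared authorship is exactly the sort of interpretive question the debates covered here address.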

Core questions

What can large-scale quantitative analysis reveal that close reading cannot?
How reliable and interpretable are the patterns computation finds in texts?
How do corpus construction and preprocessing shape results?
How should computational evidence relate to literary and historical interpretation?
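
The question about preprocessing can be made concrete with a small Python sketch. It counts the most frequent words in the same sentence under two preprocessing settings and shows that the resulting "pattern" differs. The stopword list, the function name `top_words`, and the sample sentence are hypothetical choices made for illustration only.

```python
import re
from collections import Counter

# Assumption: a tiny stopword list; real projects use longer, documented lists.
STOPWORDS = {"the", "a", "of", "and", "to", "in", "it", "was"}


def top_words(text, lowercase=True, drop_stopwords=True, n=3):
    # Tokenize on alphabetic runs, then apply the optional preprocessing steps.
    tokens = re.findall(r"[A-Za-z]+", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if drop_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    # Ties are returned in order of first appearance in the text.
    return [w for w, _ in Counter(tokens).most_common(n)]


# Made-up sample sentence, not drawn from any corpus.
sample = "The whale was the terror of the sea and the sea was the home of the whale."

print("stopwords kept:   ", top_words(sample, drop_stopwords=False))
print("stopwords dropped:", top_words(sample, drop_stopwords=True))
```

With stopwords kept, function words lead the list; with them dropped, content words do. Neither setting is neutral, which is why preprocessing decisions are usually reported alongside results.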

Key concepts

Distant reading
Corpus
Feature
Statistical model
Preprocessing
Interpretation at scale

Key theories

Distant reading: Moretti proposed studying literature through large-scale patterns and abstractions rather than the close reading of a canonical few, reframing literary history as a problem of scale.
Macroanalysis: Jockers argued that digital methods enable a statistical literary history of entire corpora, revealing influence and stylistic structure invisible at the level of single works.
Modeling literary change: Underwood used predictive modeling of large collections to argue that categories such as genre and prestige often change gradually and continuously.

History

Roots lie in mid-twentieth-century concordance building and humanities computing. Moretti's distant reading (2000s), Jockers's Macroanalysis (2013), and Underwood's Distant Horizons (2019) consolidated computational literary studies, while Da's 2019 critique sharpened debate over statistical rigor and interpretive payoff.

Debates

Statistical rigor versus interpretive value: Da argued that much computational literary work is statistically weak or interpretively thin; defenders contend that the methods open genuinely new questions when used carefully.

Key figures

Franco Moretti
Matthew L. Jockers
Ted Underwood
Nan Z. Da

Seminal works

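Franco Moretti, Distant Reading (2013) [moretti2013]
Matthew L. Jockers, Macroanalysis: Digital Methods and Literary History (2013) [jockers2013]
Ted Underwood, Distant Horizons: Digital Evidence and Literary Change (2019) [underwood2019]
Nan Z. Da, "The Computational Case against Computational Literary Studies," Critical Inquiry (2019) [da2019]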

Frequently asked questions

How is this different from corpus linguistics or NLP? Computational text analysis shares techniques with corpus linguistics and natural language processing but is driven by humanistic questions (literary history, authorship, cultural change) rather than by modeling language itself or building applications. These interpretive goals, and the debates about them, are characteristic of the digital humanities.