Compare methods

Review your selected methods side by side; rows that differ are highlighted.

	Handwritten Text Recognition for Archives	Historical Corpus Text Mining
Field	Digital History	Digital History
Family≠	Machine learning	Process / pipeline
Year of origin≠	2019	2013
Originator≠	Transkribus and the READ project	Franco Moretti
Type≠	ml-recognition-pipeline	text-analysis-pipeline
Seminal source≠	Muehlberger, G., Seaward, L., Terras, M., et al. (2019). Transforming scholarship in the archives through handwritten text recognition: Transkribus as a case study. Journal of Documentation, 75(5), 954-976.	Moretti, F. (2013). Distant Reading. Verso. ISBN: 9781781680841
Aliases	HTR, Manuscript transcription AI, Automatic handwriting transcription, Neural archival transcription	Distant reading, Computational historical text analysis, Macroanalysis of corpora, Corpus-scale historical NLP
Related	3	3
Summary≠	Handwritten text recognition for archives converts digital images of manuscript pages into searchable, machine-readable text, unlocking the vast holdings of handwritten material that optical character recognition, designed for print, cannot read. Exemplified by platforms such as Transkribus, developed in the READ project, modern HTR uses deep neural networks trained on transcribed examples to recognize the highly variable scripts of letters, registers, charters, and notebooks across centuries and languages. The pipeline first analyzes page layout and segments the image into text regions and lines; a recurrent or transformer-based recognizer then decodes each line into characters, typically using connectionist temporal classification to align pixels with text without needing character-level segmentation. Crucially, recognition models are trained and improved on ground-truth transcriptions supplied by scholars, so accuracy rises as more material is annotated. By making manuscripts machine-readable at scale, HTR is the gateway technology of digital archival history, feeding full-text search, named-entity recognition, and large-corpus text mining of sources that were previously legible only page by page.	Historical corpus text mining applies computational methods to thousands or millions of historical documents at once, seeking macro-scale patterns that close reading of individual texts could never reveal. Associated above all with Franco Moretti's program of distant reading, the approach treats large bodies of text (newspapers, parliamentary records, novels, correspondence) as data to be measured rather than works to be interpreted one by one. By counting word frequencies, computing weighted term importance, fitting topic models, and tracking how vocabulary shifts across decades, researchers can chart the rise and fall of concepts, the diffusion of ideas, and the changing texture of public discourse over long spans. The method is explicitly quantitative and aggregative: its claims concern populations of documents, not exemplary passages. Adapting modern natural-language processing to historical material, however, requires confronting archaic spelling, OCR noise, and shifting word meanings. Done carefully, corpus text mining turns vast unread archives into evidence about how language, and the thought it carries, evolved historically.
ScholarGateDataset	v1, 2 Sources, PUBLISHED	v1, 2 Sources, PUBLISHED
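
The HTR summary mentions connectionist temporal classification (CTC) as the usual way to turn a recognizer's per-frame outputs into a line of text without character-level segmentation. The sketch below illustrates only the simplest decoding rule (greedy, best-path decoding): take the top label in each frame, merge consecutive repeats, then drop the blank symbol. It is not Transkribus's implementation; the function name, the toy alphabet, and the choice of index 0 for the blank are assumptions made for illustration.

```python
from itertools import groupby

BLANK = 0  # assumption: index 0 is reserved for the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Greedy (best-path) CTC decoding of one text line.

    frame_scores: list of per-frame score lists, one score per label.
    alphabet: list mapping label index to character (index `blank` is unused).
    """
    # 1. Pick the highest-scoring label in every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    # 2. Merge runs of the same label into a single occurrence.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3. Drop blanks; what remains is the decoded character sequence.
    return "".join(alphabet[label] for label in collapsed if label != blank)


if __name__ == "__main__":
    # Hypothetical toy alphabet and scores, invented for this example.
    alphabet = ["<blank>", "a", "n"]
    frames = [
        [0.1, 0.8, 0.1],    # a
        [0.2, 0.7, 0.1],    # a (repeat, merged)
        [0.9, 0.05, 0.05],  # blank
        [0.1, 0.1, 0.8],    # n
        [0.6, 0.2, 0.2],    # blank separates the double letter
        [0.1, 0.2, 0.7],    # n
        [0.2, 0.1, 0.7],    # n (repeat, merged)
    ]
    print(ctc_greedy_decode(frames, alphabet))
```

The blank between the two "n" frames is what lets a double letter survive the merge step; without it, adjacent identical labels collapse into one character.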

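The text-mining summary describes counting word frequencies and tracking how vocabulary shifts across decades. A minimal sketch of that aggregation, assuming a corpus supplied as (year, text) pairs, might look like the following. The function name and the sample documents are hypothetical, and the tokenizer is deliberately crude: it keeps only unaccented Latin letters, so a real historical corpus would need spelling normalization and OCR cleanup first.

```python
import re
from collections import Counter


def relative_frequency_by_decade(documents, term):
    """Share of all tokens in each decade that equal `term`.

    documents: iterable of (year, text) pairs.
    Returns a dict {decade_start_year: relative_frequency}.
    """
    term = term.lower()
    term_counts = Counter()
    token_counts = Counter()
    for year, text in documents:
        decade = (year // 10) * 10
        # Crude tokenizer (assumption): lowercase runs of a-z only.
        tokens = re.findall(r"[a-z]+", text.lower())
        token_counts[decade] += len(tokens)
        term_counts[decade] += tokens.count(term)
    # Normalize by decade size so larger decades do not dominate.
    return {
        decade: term_counts[decade] / token_counts[decade]
        for decade in sorted(token_counts)
        if token_counts[decade] > 0
    }


if __name__ == "__main__":
    # Invented toy corpus, for illustration only.
    corpus = [
        (1851, "The railway opened"),
        (1858, "Railway, railway shares"),
        (1862, "The harvest failed"),
    ]
    print(relative_frequency_by_decade(corpus, "railway"))
```

Dividing by the total token count per decade reflects the aggregative nature of the method: the result describes a population of documents, not any single text.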