ScholarGate

Compare methods

View the selected methods side by side; rows that differ are highlighted.

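Methods compared: Handwritten Text Recognition for Archives; Historical Corpus Text Mining

Field
  Handwritten Text Recognition for Archives: Digital History
  Historical Corpus Text Mining: Digital History
Family
  Handwritten Text Recognition for Archives: Machine learning
  Historical Corpus Text Mining: Process / pipeline
Year of origin
  Handwritten Text Recognition for Archives: 2019
  Historical Corpus Text Mining: 2013
Creator
  Handwritten Text Recognition for Archives: Transkribus and the READ project
  Historical Corpus Text Mining: Franco Moretti
Type
  Handwritten Text Recognition for Archives: ml-recognition-pipeline
  Historical Corpus Text Mining: text-analysis-pipeline
Original source
  Handwritten Text Recognition for Archives: Muehlberger, G., Seaward, L., Terras, M., et al. (2019). Transforming scholarship in the archives through handwritten text recognition: Transkribus as a case study. Journal of Documentation, 75(5), 954-976.
  Historical Corpus Text Mining: Moretti, F. (2013). Distant Reading. Verso. ISBN: 9781781680841
Alternative names
  Handwritten Text Recognition for Archives: HTR, Manuscript transcription AI, Automatic handwriting transcription, Neural archival transcription
  Historical Corpus Text Mining: Distant reading, Computational historical text analysis, Macroanalysis of corpora, Corpus-scale historical NLP
Related
  Handwritten Text Recognition for Archives: 3
  Historical Corpus Text Mining: 3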
Summary

Handwritten Text Recognition for Archives:
Handwritten text recognition for archives converts digital images of manuscript pages into searchable, machine-readable text, unlocking the vast holdings of handwritten material that optical character recognition, designed for print, cannot read. Exemplified by platforms such as Transkribus, developed in the READ project, modern HTR uses deep neural networks trained on transcribed examples to recognize the highly variable scripts of letters, registers, charters, and notebooks across centuries and languages. The pipeline first analyzes page layout and segments the image into text regions and lines; a recurrent or transformer-based recognizer then decodes each line into characters, typically using connectionist temporal classification to align pixels with text without needing character-level segmentation. Crucially, recognition models are trained and improved on ground-truth transcriptions supplied by scholars, so accuracy rises as more material is annotated. By making manuscripts machine-readable at scale, HTR is the gateway technology of digital archival history, feeding full-text search, named-entity recognition, and large-corpus text mining of sources that were previously legible only page by page.

Historical Corpus Text Mining:
Historical corpus text mining applies computational methods to thousands or millions of historical documents at once, seeking macro-scale patterns that close reading of individual texts could never reveal. Associated above all with Franco Moretti's program of distant reading, the approach treats large bodies of text (newspapers, parliamentary records, novels, correspondence) as data to be measured rather than works to be interpreted one by one. By counting word frequencies, computing weighted term importance, fitting topic models, and tracking how vocabulary shifts across decades, researchers can chart the rise and fall of concepts, the diffusion of ideas, and the changing texture of public discourse over long spans.
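The counting and weighting steps named above can be illustrated with a minimal Python sketch. It is not taken from the cited sources: the toy corpus DOCS, the function names, and the particular TF-IDF variant (term frequency divided by document length, times the natural log of N over document frequency) are all assumptions chosen for brevity, and topic modelling is left out.

```python
import math
import re
from collections import Counter

# Hypothetical toy corpus: (year, text) pairs standing in for dated documents.
DOCS = [
    (1848, "The railway company opened a new railway line to the port."),
    (1852, "Parliament debated the railway bill and the corn tariff."),
    (1871, "The telegraph office reported news from the port."),
    (1879, "Telegraph wires now follow the railway across the county."),
]


def tokenize(text):
    """Lowercase and keep runs of ASCII letters (no spelling normalization)."""
    return re.findall(r"[a-z]+", text.lower())


def relative_frequency_by_decade(docs, term):
    """Share of all tokens in each decade that equal `term`."""
    hits, totals = Counter(), Counter()
    for year, text in docs:
        decade = year // 10 * 10
        tokens = tokenize(text)
        totals[decade] += len(tokens)
        hits[decade] += tokens.count(term)
    return {decade: hits[decade] / totals[decade] for decade in sorted(totals)}


def tf_idf(docs):
    """Per-document TF-IDF scores; assumed variant: (count / length) * ln(N / df)."""
    tokenized = [tokenize(text) for _, text in docs]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(tokenized)
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores
```

Calling relative_frequency_by_decade(DOCS, "telegraph") gives one value per decade, which is the kind of series used to track vocabulary over time. In tf_idf(DOCS), a word that occurs in every document (here "the") scores zero, which is why weighting surfaces distinctive terms rather than merely frequent ones. Real historical corpora would first need the spelling and OCR clean-up that the summary mentions.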
The method is explicitly quantitative and aggregative: its claims concern populations of documents, not exemplary passages. Adapting modern natural-language processing to historical material, however, requires confronting archaic spelling, OCR noise, and shifting word meanings. Done carefully, corpus text mining turns vast unread archives into evidence about how language, and the thought it carries, evolved historically.
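For the HTR side of the comparison, the summary says the recognizer typically uses connectionist temporal classification (CTC) to turn per-position scores into text without character-level segmentation. The sketch below shows the simplest way to decode such output, greedy (best-path) decoding: take the top-scoring symbol per frame, merge consecutive repeats, and drop the blank symbol. It is an illustration only, not the Transkribus implementation; the blank index, the three-symbol alphabet, and the scores in FRAMES are hypothetical.

```python
BLANK = 0  # assumption: index 0 of the alphabet is the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best_path = [
        max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores
    ]
    decoded = []
    previous = None
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame's symbol
        # and is not the blank; a blank between two equal symbols keeps both.
        if index != previous and index != BLANK:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Hypothetical scores for six frames over the alphabet [blank, "a", "n"].
ALPHABET = ["", "a", "n"]
FRAMES = [
    [0.10, 0.80, 0.10],  # a
    [0.10, 0.70, 0.20],  # a again: merged with the previous frame
    [0.20, 0.10, 0.70],  # n
    [0.90, 0.05, 0.05],  # blank: separates the two n's
    [0.20, 0.10, 0.70],  # n
    [0.10, 0.80, 0.10],  # a
]
```

With these scores, ctc_greedy_decode(FRAMES, ALPHABET) returns "anna": the blank frame is what lets the doubled letter survive the merge step. Production systems often replace greedy decoding with beam search, sometimes combined with a language model.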
ScholarGate dataset
  Handwritten Text Recognition for Archives: v1, 2 sources, PUBLISHED
  Historical Corpus Text Mining: v1, 2 sources, PUBLISHED


ScholarGate. Compare methods: Handwritten Text Recognition for Archives · Historical Corpus Text Mining. Retrieved 2026-06-25 from https://scholargate.app/et/compare