ScholarGate

Text Encoding and Markup

Before a text can be analyzed, searched, or rendered by a computer, it must be represented in a machine-readable form. Text encoding is the practice of adding structured markup to documents so that their features — structure, language, editorial apparatus, named entities — become explicit and computable.

Definition

The systematic application of structured, machine-readable markup to a text in order to make its features explicit, interchangeable, and amenable to computational processing and scholarly analysis.

Scope

Covers the theory and practice of representing humanities texts in machine-readable form: the Text Encoding Initiative (TEI) and its guidelines, markup languages such as XML, document modeling and schema design, metadata standards and controlled vocabularies, and the encoding of born-digital and electronic literature. Includes foundational debates about the nature of text and the consequences of treating documents as ordered hierarchies.

Core questions

  • What is a text, and which of its features should encoding make explicit?
  • How do markup standards such as TEI balance expressive power against interoperability?
  • What interpretive choices are embedded in any decision to encode a document one way rather than another?
  • How should metadata and controlled vocabularies describe and connect encoded resources?

Key concepts

  • Markup
  • Schema
  • Element and attribute
  • Document Type Definition
  • Overlapping hierarchies
  • Interoperability

Key theories

Text as an Ordered Hierarchy of Content Objects (OHCO)
DeRose, Durand, Mylonas, and Renear argued in 1990 that texts are best modeled as nested hierarchies of logical objects such as chapters, paragraphs, and sentences. This view underpinned descriptive markup but also provoked debate about structures that overlap rather than nest.
Descriptive markup
Encoding should describe what a textual feature is rather than how it should appear, separating logical structure from presentation so that the same source can support analysis, search, and rendering.
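
A minimal sketch of this separation, assuming a small invented sample sentence: the element names (p, persName, placeName, note) follow TEI conventions, but the fragment is not a complete or validated TEI document, and the helper names `people` and `plain_text` are hypothetical. The same encoded source yields both an index of personal names and a reading text with editorial notes suppressed.

```python
import xml.etree.ElementTree as ET

# Invented sample: the markup says what each feature is, not how it looks.
SOURCE = (
    "<p><persName>Ada</persName> wrote to <persName>Charles</persName> "
    "from <placeName>London</placeName>.<note>Letter undated.</note></p>"
)


def people(root):
    """Analysis view: list every personal name in document order."""
    return [elem.text for elem in root.iter("persName")]


def plain_text(elem, skip=("note",)):
    """Rendering view: reading text with the content of skipped elements dropped."""
    parts = []
    if elem.tag not in skip:
        parts.append(elem.text or "")
        for child in elem:
            parts.append(plain_text(child, skip))
    # The tail is the text after the closing tag; it belongs to the parent's flow.
    parts.append(elem.tail or "")
    return "".join(parts)


root = ET.fromstring(SOURCE)
name_index = people(root)
reading_text = plain_text(root)
```

Neither function changes the source. A different rendering, for instance one that prints notes as footnotes, would be another function over the same tree.
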
Interchange through community standards
The TEI provides a shared, extensible vocabulary so that encoded texts can be exchanged and reused across projects, making interoperability a core goal of humanities markup.
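
As a hedged illustration of what a shared vocabulary makes possible, the sketch below reads a title and person references from a document that uses the TEI namespace (http://www.tei-c.org/ns/1.0). The sample document, its content, and the function name `describe` are invented for illustration, and the sample has not been validated against a TEI schema. Because element names and header layout are shared, the same function should work on other documents that follow the same conventions.

```python
import xml.etree.ElementTree as ET

# Namespace prefix mapping for queries; the URI is the TEI namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample document (not validated against a TEI schema).
DOC = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter</title></titleStmt>
      <publicationStmt><p>Unpublished example.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Dear <persName ref="#ada">Ada</persName>, thank you.</p></body></text>
</TEI>"""


def describe(xml_string):
    """Return the title and the person references found in a TEI-style document."""
    root = ET.fromstring(xml_string)
    title = root.findtext(
        "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title",
        namespaces=TEI_NS,
    )
    refs = [p.get("ref") for p in root.iterfind(".//tei:persName", TEI_NS)]
    return {"title": title, "person_refs": refs}


summary = describe(DOC)
```
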

History

Structured text markup emerged from publishing and computing in the 1960s and 1970s, leading to SGML (standardized as ISO 8879 in 1986) and later XML (a W3C Recommendation since 1998). The Text Encoding Initiative, founded in 1987, produced community guidelines for encoding humanities texts; the OHCO debates of the early 1990s clarified what it means to model a text. TEI P5, first released in 2007, and its successive revisions consolidated encoding practice across digital editing, corpus building, and archival projects.

Debates

Whether text is fundamentally hierarchical
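The OHCO thesis was challenged by the prevalence of overlapping structures, such as quotations spanning paragraph boundaries, which cannot be nested inside a single tree. This prompted alternative models and standoff markup, in which annotations are stored separately from the text and point into it.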
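
The overlap problem can be sketched with standoff annotations, assuming a simple model in which each annotation is a character range over an unmarked base text. The sample sentence, the `Span` class, and the helper names are hypothetical and do not represent a particular standoff format. The quotation crosses both paragraphs, so no single hierarchy can hold all three spans.

```python
from dataclasses import dataclass

# Base text carries no markup; all structure lives in the annotations.
TEXT = 'He began: "I will go. But not today." Then he left.'


@dataclass(frozen=True)
class Span:
    kind: str
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def span_of(kind, snippet):
    """Build a span by locating a snippet in the base text."""
    start = TEXT.index(snippet)
    return Span(kind, start, start + len(snippet))


ANNOTATIONS = [
    span_of("paragraph", 'He began: "I will go.'),
    span_of("paragraph", 'But not today." Then he left.'),
    span_of("quote", '"I will go. But not today."'),
]


def crosses(a, b):
    """True if two spans overlap without one containing the other."""
    overlap = a.start < b.end and b.start < a.end
    nested = (a.start <= b.start and b.end <= a.end) or (
        b.start <= a.start and a.end <= b.end
    )
    return overlap and not nested


def crossing_pairs(spans):
    """List the kinds of every pair of spans that cannot be nested in one tree."""
    return [
        (a.kind, b.kind)
        for i, a in enumerate(spans)
        for b in spans[i + 1:]
        if crosses(a, b)
    ]


conflicts = crossing_pairs(ANNOTATIONS)
```

An empty result from `crossing_pairs` would mean the spans can be arranged as one ordered hierarchy; a non-empty result identifies the pairs that would force an encoder to split an element or fall back on standoff annotation.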

Key figures

  • Allen Renear
  • Lou Burnard
  • Steven DeRose
  • C. M. Sperberg-McQueen

Seminal works

  • delittle1990
  • tei2024
  • renear2004
  • burnard2014

Frequently asked questions

Why not just store texts as plain files or Word documents?
Plain-text files leave structure implicit, and word-processor files mix content with presentation. Encoding makes features such as headings, names, and editorial notes explicit and machine-readable, so the same source can be searched, analyzed, rendered in many ways, and shared across projects.