Stylometry and Authorship Attribution
Writers leave statistical fingerprints. The frequencies of small, unconsciously chosen words such as 'the', 'of', and 'and' vary little within an author's work but differ between authors, and stylometry exploits this to settle disputed authorship and to study style quantitatively.
Definition
The statistical analysis of measurable features of writing style to characterize authors and to attribute texts of uncertain or disputed authorship.
Scope
Covers the quantitative measurement of literary style and its use in attributing texts to authors: the choice of stylistic features, distance and classification measures such as Burrows's Delta, and the validation of attribution claims. Includes the field's history from the Federalist Papers to modern machine-learning methods, and its forensic applications.
Core questions
- Which textual features best capture an author's distinctive style?
- How can attribution claims be tested and validated?
- Why are function-word frequencies so effective for attribution?
- What are the limits of stylometry across genres, periods, and translation?
Key concepts
- Function words
- Burrows's Delta
- Feature selection
- Classification
- Cross-validation
Key theories
- Function-word frequency as authorial signal
- Mosteller and Wallace showed that frequencies of common function words could discriminate authors, using Bayesian inference to attribute the twelve disputed Federalist Papers to James Madison rather than Alexander Hamilton.
- Burrows's Delta
- Burrows introduced Delta, a distance measure computed from standardized (z-scored) frequencies of the most frequent words; it has become a standard method for ranking candidate authors.
- Modern attribution as classification
- Stamatatos surveyed how authorship attribution is framed as a text-classification problem, comparing feature sets and machine-learning methods.
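The Delta measure described above can be sketched in a few lines of Python. This is a minimal illustration and not Burrows's published procedure in full: it assumes one writing sample per candidate, a simple regular-expression tokenizer, and z-scores computed from the candidate samples only. The names `tokenize`, `relative_freqs`, and `burrows_delta` are hypothetical helpers introduced here.

```python
from collections import Counter
import re
import statistics


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_freqs(text, vocab):
    """Relative frequency of each vocabulary word in one text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return [counts[w] / total for w in vocab]


def burrows_delta(candidates, disputed, n_words=30):
    """Rank candidate authors by Delta distance to the disputed text.

    candidates: dict mapping author name -> one writing sample (assumed layout).
    Returns a list of (author, delta) pairs, smallest distance first.
    """
    # 1. Take the n most frequent words across the candidate samples.
    corpus_counts = Counter()
    for text in candidates.values():
        corpus_counts.update(tokenize(text))
    vocab = [w for w, _ in corpus_counts.most_common(n_words)]

    # 2. Relative frequencies per candidate, then mean and std per word.
    profiles = {a: relative_freqs(t, vocab) for a, t in candidates.items()}
    columns = list(zip(*profiles.values()))
    means = [statistics.mean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]

    # 3. Keep only words whose frequency actually varies between candidates.
    usable = [i for i, s in enumerate(stds) if s > 0]
    if not usable:
        raise ValueError("need at least two candidates with differing word frequencies")

    def z_scores(freqs):
        return [(freqs[i] - means[i]) / stds[i] for i in usable]

    # 4. Delta = mean absolute difference between z-score profiles.
    target = z_scores(relative_freqs(disputed, vocab))
    ranking = []
    for author, freqs in profiles.items():
        z = z_scores(freqs)
        delta = sum(abs(a - b) for a, b in zip(z, target)) / len(usable)
        ranking.append((author, delta))
    return sorted(ranking, key=lambda pair: pair[1])
```

The candidate with the smallest Delta is the closest stylistic match, but the score is only a ranking: it does not by itself say whether the true author is among the candidates. In practice the samples would need to be far longer than a toy example, and the choice of `n_words` is a tuning decision.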
History
Quantitative authorship study dates to the nineteenth century, but Mosteller and Wallace's 1964 study of the Federalist Papers established the modern statistical approach. Burrows's Delta (2002) gave the field a widely adopted measure, and surveys such as Stamatatos (2009) mapped the shift to machine-learning classification and forensic use.
Debates
- Reliability and confidence of attributions
- Stylometric methods can be powerful yet sensitive to corpus size, genre, and preprocessing, raising questions about how much confidence attributions deserve, especially in forensic contexts.
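One common way to gauge how much confidence a method deserves is to test it on texts whose authors are already known. The sketch below shows leave-one-out cross-validation around a deliberately simple nearest-centroid attributor. It is an illustration under stated assumptions, not a forensic-grade protocol: the short `FUNCTION_WORDS` list, the Manhattan distance, and the helper names `profile`, `attribute`, and `leave_one_out_accuracy` are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict
import re

# Illustrative list only; real studies select function words empirically.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "with"]


def profile(text):
    """Relative frequency of each function word in one text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = max(len(tokens), 1)  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def attribute(train, text):
    """Return the author whose average profile is nearest to the text.

    train: list of (author, sample_text) pairs with known authorship.
    """
    by_author = defaultdict(list)
    for author, sample in train:
        by_author[author].append(profile(sample))
    target = profile(text)
    best_author, best_dist = None, float("inf")
    for author, profiles in by_author.items():
        centroid = [sum(col) / len(profiles) for col in zip(*profiles)]
        dist = sum(abs(c - t) for c, t in zip(centroid, target))  # Manhattan distance
        if dist < best_dist:
            best_author, best_dist = author, dist
    return best_author


def leave_one_out_accuracy(samples):
    """Hold out each known text in turn and check whether it is attributed correctly."""
    correct = 0
    for i, (author, text) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # every sample except the held-out one
        if attribute(train, text) == author:
            correct += 1
    return correct / len(samples)
```

A high held-out accuracy on one corpus does not guarantee the same performance on texts of a different genre, length, or period, so the validation set should resemble the disputed text as closely as possible.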
Key figures
- Frederick Mosteller
- David Wallace
- John Burrows
- Efstathios Stamatatos
Related topics
Seminal works
- mosteller1964
- burrows2002
- stamatatos2009
Frequently asked questions
- Why focus on tiny words like 'the' instead of distinctive vocabulary?
- Distinctive vocabulary often reflects a text's topic rather than its author. Common function words are used largely unconsciously, at rates that are stable within an author's writing but differ between authors, which makes them a relatively reliable, largely topic-independent signal of style.
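As a small illustration of the answer above, function-word usage is often reported as a rate per thousand words so that texts of different lengths can be compared. The sketch below assumes a short, hand-picked word list and a simple tokenizer; `rates_per_thousand` and `FUNCTION_WORDS` are hypothetical names, not part of any library.

```python
from collections import Counter
import re

# Illustrative list only; published studies choose their words empirically.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in", "upon", "by")


def rates_per_thousand(text, words=FUNCTION_WORDS):
    """Occurrences of each listed word per 1,000 word tokens of the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        raise ValueError("text contains no word tokens")
    counts = Counter(tokens)
    return {w: 1000 * counts[w] / total for w in words}
```

Comparing these rates between two long samples on different subjects gives a rough sense of how stable an author's function-word habits are, whereas the content words in the same samples would mostly track the subject matter.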