Computational Morphology
Modeling the internal structure of words by machine — analysis, generation, stemming, lemmatization, and subword segmentation — from finite-state morphology to the byte-pair encoding used by modern neural systems.
Definition
Computational morphology is the algorithmic analysis and generation of word forms in terms of their constituent morphemes and morphological features.
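As a rough illustration of analysis and generation as inverse mappings, the sketch below uses plain dictionary lookup rather than a finite-state transducer. The lexicon, the feature tags (`+Pl`, `+Past`, and so on), and the function names `generate` and `analyze` are all hypothetical, chosen only for this example.

```python
# Toy lexicon: lemma -> part of speech. Hypothetical data for illustration only.
LEXICON = {"fox": "N", "cat": "N", "walk": "V"}

# Suffixes paired with the feature tag they realize, per part of speech.
SUFFIXES = {
    "N": [("", "+Sg"), ("s", "+Pl")],
    "V": [("", "+Inf"), ("s", "+3Sg"), ("ed", "+Past"), ("ing", "+Prog")],
}


def generate(lemma, pos, feature):
    """Map a lexical description (lemma, POS, feature) to a surface form."""
    for suffix, tag in SUFFIXES[pos]:
        if tag == feature:
            # Simplified e-insertion: fox + s -> foxes after a sibilant ending.
            if suffix == "s" and lemma.endswith(("s", "x", "z", "ch", "sh")):
                return lemma + "es"
            return lemma + suffix
    raise ValueError(f"unknown feature {feature!r} for part of speech {pos!r}")


def analyze(surface):
    """Return every lexical analysis whose generated form equals `surface`."""
    return [
        f"{lemma}+{pos}{tag}"
        for lemma, pos in LEXICON.items()
        for _, tag in SUFFIXES[pos]
        if generate(lemma, pos, tag) == surface
    ]


print(generate("fox", "N", "+Pl"))  # foxes
print(analyze("foxes"))             # ['fox+N+Pl']
```

Analysis is implemented here by brute-force generation over the whole lexicon, which is only workable for a toy vocabulary; it is meant to show the two directions of the mapping, not an efficient method.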
Scope
Covers the computational treatment of word structure: morphological analysis and generation with finite-state transducers, two-level morphology, stemming and lemmatization, and data-driven subword segmentation such as byte-pair encoding. It addresses inflection, derivation, and compounding across typologically diverse languages. The underlying finite-state machinery is detailed in the foundations area.
Core questions
- How are morphological alternations modeled with finite-state transducers?
- What is the difference between stemming and lemmatization?
- How does subword segmentation handle rare and unseen words in neural models?
- Why is morphology harder for agglutinative and templatic languages?
Key concepts
- morpheme
- inflection and derivation
- two-level morphology
- finite-state transducer
- stemming
- lemmatization
- byte-pair encoding
- agglutination
Key theories
- Two-level morphology
- Koskenniemi's model relating surface and lexical word forms through parallel finite-state rules, enabling a single grammar to both analyze and generate forms.
- Data-driven subword segmentation
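- Learning a vocabulary of frequent character sequences from data, as in byte-pair encoding, which repeatedly merges the most frequent adjacent pair of symbols, so that neural models can represent any word, including unseen ones, as a sequence of subword units.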
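The merge-learning idea behind byte-pair encoding can be sketched in a few lines. This is a minimal sketch, not a production tokenizer. It assumes a word-frequency dictionary as input and an end-of-word marker written `</w>`. The function names `learn_bpe`, `merge_pair`, and `segment`, and the toy corpus, are illustrative choices rather than any library's API.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merge operations from a {word: count} dict."""
    # Start from single characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties go to the pair counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def segment(word, merges):
    """Segment a (possibly unseen) word by replaying the merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus counts.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges = learn_bpe(corpus, num_merges=5)
print(segment("lowest", merges))  # ['low', 'est</w>']
```

The word "lowest" never occurs in the toy corpus, yet it is segmented into units learned from other words. This is the sense in which subword vocabularies cover rare and unseen words.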
History
Koskenniemi's 1983 two-level morphology established finite-state methods as the standard for morphological processing, an approach later consolidated in Beesley and Karttunen's 2003 book Finite State Morphology. With the rise of neural models, hand-built morphological analyzers were complemented by learned subword segmentation such as byte-pair encoding, applied to neural machine translation by Sennrich and colleagues in 2016, which sidesteps explicit morphology while still handling rare words.
Debates
- Explicit morphology versus subword units
- Whether neural systems need linguistically informed morphological analysis or whether statistical subword segmentation suffices; the answer appears to depend on language type (for example, how morphologically rich the language is) and on the amount of training data available.
Key figures
- Kimmo Koskenniemi
- Lauri Karttunen
- Kenneth Beesley
- Rico Sennrich
Seminal works
- koskenniemi1983
- beesley2003
- sennrich2016
Frequently asked questions
- What is the difference between stemming and lemmatization?
- Stemming heuristically strips affixes to reach a common stem, which need not be a real word (e.g., 'studies' to 'studi'), while lemmatization maps a word to its dictionary form using morphological knowledge (e.g., 'studies' to 'study').
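The contrast can be made concrete with a small sketch. The stemmer below applies only four plural-suffix rules, modeled on step 1a of the Porter stemmer, and the lemmatizer is a lookup in a hypothetical hand-written table. The names `toy_stem`, `toy_lemmatize`, and `LEMMA_TABLE` are assumptions for illustration, not a real library's interface.

```python
# Ordered (suffix, replacement) rules; the first matching suffix wins.
STEM_RULES = (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", ""))

# Hypothetical lemma dictionary standing in for real morphological knowledge.
LEMMA_TABLE = {"studies": "study", "mice": "mouse", "went": "go"}


def toy_stem(word):
    """Strip a plural-like suffix by rule, with no dictionary lookup."""
    for suffix, replacement in STEM_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    """Look up the dictionary form; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for w in ("studies", "mice"):
    print(w, toy_stem(w), toy_lemmatize(w))
# studies studi study
# mice mice mouse
```

The rule-based stemmer leaves the irregular plural "mice" unchanged, while the table-driven lemmatizer resolves it. The trade-off is that the lemmatizer only knows forms that someone has listed or that a fuller morphological analyzer can derive.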