Regular Expressions and Finite-State Methods
Practical techniques built on regular languages — pattern matching with regular expressions and string-to-string mapping with finite-state transducers — that handle tokenization, normalization, and morphological analysis efficiently.
Definition
Finite-state methods are language-processing techniques in which patterns and mappings are expressed as regular expressions or finite-state automata and transducers, guaranteeing efficient linear-time recognition.
Scope
Covers regular expressions as a pattern language over strings, finite-state automata and transducers as their computational realization, and their application to text normalization, tokenization, spelling, and computational morphology. It includes weighted finite-state methods used in speech and shallow processing. Full phonological theory and deep syntactic parsing are out of scope.
Core questions
- How can regular expressions specify and extract textual patterns precisely?
- How do finite-state transducers map surface forms to lexical analyses, as in morphology?
- Why are finite-state methods preferred for tokenization and normalization?
Key concepts
- regular expression
- finite-state transducer
- tokenization
- text normalization
- morphological analysis
- two-level morphology
- weighted automata
- edit distance
Key theories
- Regular models of morphology and phonology
- The result that phonological rewrite rules and morphological alternations can be compiled into finite-state transducers, making analysis and generation a single efficient framework.
- Equivalence of regular expressions and finite automata
- Regular expressions, regular grammars, and finite-state automata all describe exactly the regular languages, so a declarative pattern can be compiled into an efficient recognizer.
History
Regular expressions entered computing from Kleene's work and became ubiquitous in text tools. In the 1980s Koskenniemi's two-level morphology and Kaplan and Kay's compilation of phonological rules into transducers established finite-state technology as the workhorse of morphological processing, an approach consolidated in Beesley and Karttunen's handbook.
Debates
- How far can finite-state methods scale?
- Finite-state techniques are extremely efficient but limited to regular phenomena; the debate concerns which language-processing tasks remain best served by them versus richer statistical or neural models.
Key figures
- Martin Kay
- Ronald Kaplan
- Kimmo Koskenniemi
- Lauri Karttunen
Related topics
Seminal works
- kaplan1994
- beesley2003
Frequently asked questions
- Why use a finite-state transducer instead of just a lookup table for morphology?
- A transducer compactly encodes systematic alternations and can analyze or generate word forms it has never seen, whereas a table only stores forms explicitly listed in it.