String Matching
String matching finds all occurrences of a pattern within a text; classic algorithms preprocess the pattern to skip redundant comparisons and achieve linear or sublinear search time.
Definition
String matching is the problem of finding all positions in a text where a given pattern string occurs; an exact string-matching algorithm reports every offset at which the pattern aligns identically with a substring of the text.
Scope
This topic covers exact single-pattern matching algorithms — the naive method, Knuth-Morris-Pratt, Boyer-Moore, and the hashing-based Rabin-Karp — together with the preprocessing (failure functions, shift tables, rolling hashes) that makes them efficient, and the extension to multiple-pattern matching via the Aho-Corasick automaton. Approximate matching and index-based search are covered in related topics.
Core questions
- Why does naive matching take time proportional to the product of pattern and text lengths?
- How does preprocessing the pattern avoid re-examining text characters?
- How do the Knuth-Morris-Pratt failure function and Boyer-Moore shift rules work?
- How does Rabin-Karp use hashing to test alignments quickly?
- How is matching extended to many patterns simultaneously?
Key concepts
- naive matching
- Knuth-Morris-Pratt algorithm
- failure function
- Boyer-Moore algorithm
- bad-character and good-suffix rules
- Rabin-Karp algorithm
- rolling hash
- Aho-Corasick automaton
Key theories
- Knuth-Morris-Pratt failure function
- By precomputing, for each pattern prefix, the longest proper prefix that is also a suffix, the Knuth-Morris-Pratt algorithm shifts the pattern intelligently after a mismatch and never re-examines text characters, giving linear-time matching.
- Boyer-Moore right-to-left scanning
- Boyer-Moore compares the pattern from its last character backward and uses bad-character and good-suffix rules to skip large portions of the text, achieving sublinear average-case performance on typical inputs.
Clinical relevance
String matching underlies everyday tools and infrastructure: search in text editors and command-line utilities, content filtering and intrusion-detection signature scanning, plagiarism detection, and the location of motifs in biological sequences. Multiple-pattern matching scales these to dictionaries of many patterns at once.
History
The Knuth-Morris-Pratt algorithm (published 1977, developed earlier) gave the first widely known linear-time exact matcher, and Boyer and Moore (1977) introduced sublinear average-case matching the same year. Rabin and Karp's hashing approach and the Aho-Corasick multi-pattern automaton (1975) further broadened the field.
Key figures
- Donald Knuth
- James H. Morris
- Vaughan Pratt
- Robert Boyer
- J Strother Moore
- Michael Rabin
- Richard Karp
Related topics
Seminal works
- knuth1977
- boyer1977
- cormen2009
Frequently asked questions
- Which exact string-matching algorithm is fastest?
- It depends on the input. Knuth-Morris-Pratt guarantees linear worst-case time; Boyer-Moore is often fastest in practice on natural text because it skips characters and runs sublinearly on average; Rabin-Karp shines when searching for many patterns or doing fingerprint comparisons. There is no single universal winner.
- How does Rabin-Karp avoid comparing every character at every position?
- It hashes the pattern and each text window of the same length using a rolling hash that updates in constant time as the window slides. Only when hashes match does it verify the actual characters, so most non-matching positions are rejected by a single cheap comparison.