How does a search engine handle a wildcard like 'comput*'?

It uses an auxiliary dictionary structure, such as a permuterm or k-gram index, to find all terms matching the pattern (computer, computing, computation, and so on), then evaluates the original query as if those terms had been listed explicitly.

What is edit distance and why is it used for spelling correction?

Edit distance counts the minimum single-character insertions, deletions, and substitutions needed to turn one word into another. A small edit distance between a misspelled query term and a dictionary term suggests the dictionary term is a likely intended correction.

Tolerant and Wildcard Retrieval

Tolerant retrieval lets a search system match queries despite spelling variation, wildcards, and phonetic differences, so that users still find relevant documents when the query and text do not match exactly.

Najít téma v PaperMindJiž brzyFind papers & topics

Tools & resources

Stáhnout prezentaci

Learn & explore

VideoJiž brzy

Definition

Tolerant retrieval comprises dictionary-level techniques that match query terms to indexed terms despite incomplete, misspelled, or phonetically varying input, including wildcard expansion, edit-distance-based spelling correction, and phonetic encoding.

Scope

This topic covers techniques that relax exact term matching at the dictionary level: wildcard query processing using permuterm and k-gram indexes, spelling correction by edit distance and context, and phonetic matching such as Soundex. It treats how the term dictionary is augmented to support these approximate lookups and how candidate terms are generated and ranked, distinct from semantic matching, which addresses meaning rather than surface form.

Core questions

How are wildcard queries such as prefix, suffix, and infix patterns evaluated against the dictionary?
How do permuterm and k-gram indexes support wildcard lookups?
How is the closest correctly spelled term found for a misspelled query term?
How does edit (Levenshtein) distance quantify the difference between two strings?
How does phonetic matching such as Soundex group terms that sound alike?

Key concepts

wildcard query
permuterm index
k-gram index
edit (Levenshtein) distance
spelling correction
phonetic matching (Soundex)
approximate string matching
candidate term generation

Key theories

Wildcard indexing with permuterm and k-gram indexes: Rotating terms so a wildcard always falls at the end (permuterm) or indexing terms by their character k-grams lets the system convert a wildcard pattern into ordinary dictionary lookups that retrieve candidate terms.
Edit-distance spelling correction: The minimum number of single-character insertions, deletions, and substitutions needed to transform one string into another (edit distance) provides a principled measure for proposing correctly spelled alternatives to a query term, often combined with term frequency and context.

Clinical relevance

Tolerant retrieval powers everyday search affordances: 'did you mean' spelling suggestions, autocomplete and prefix search, and forgiving matching of names and product terms. It substantially improves recall and user experience when queries contain typos or when users do not know exact spellings.

History

Approximate matching and spelling correction have long histories in computing, with Soundex dating to early twentieth-century records indexing. Kukich's 1992 survey consolidated automatic spelling-correction techniques, and Navarro's 2001 survey systematized approximate string matching. These methods became standard components of search dictionaries as web search made forgiving query handling essential.

Key figures

Karen Kukich
Gonzalo Navarro

Seminal works

manning2008
kukich1992
navarro2001

Frequently asked questions

How does a search engine handle a wildcard like 'comput*'?: It uses an auxiliary dictionary structure, such as a permuterm or k-gram index, to find all terms matching the pattern (computer, computing, computation, and so on), then evaluates the original query as if those terms had been listed explicitly.
What is edit distance and why is it used for spelling correction?: Edit distance counts the minimum single-character insertions, deletions, and substitutions needed to turn one word into another. A small edit distance between a misspelled query term and a dictionary term suggests the dictionary term is a likely intended correction.