Indexing and Query Processing
Indexing and query processing comprise the data structures and algorithms that let a search system answer queries over large text collections quickly, principally through the inverted index.
Definition
Indexing is the construction of data structures, chiefly the inverted index mapping terms to the documents containing them, that support efficient lookup, while query processing is the set of algorithms that traverse these structures to compute the documents matching or best ranked for a query.
Scope
This area covers how text collections are turned into searchable structures and how queries are evaluated against them: building the inverted index, the tokenization and term-vocabulary decisions behind it, compressing postings to save space and speed access, processing queries efficiently including ranked retrieval and early termination, and tolerant retrieval techniques such as wildcard, spelling-correction, and phonetic matching. It addresses the systems engineering of fast retrieval, distinct from the retrieval models that define ranking and the evaluation methods that measure quality.
Sub-topics
Core questions
- How is an inverted index built and updated for a large, changing collection?
- How can postings lists be compressed without slowing query evaluation?
- How are queries evaluated efficiently, especially ranked queries over millions of documents?
- How can a system retrieve good results without scoring every document?
- How does a system handle misspellings, wildcards, and approximate matches?
Key concepts
- inverted index
- postings list
- tokenization and term vocabulary
- index construction (BSBI, SPIMI)
- index compression
- document-at-a-time and term-at-a-time evaluation
- dynamic pruning and early termination
- tolerant retrieval
Key theories
- Inverted index as the core data structure
- Mapping each term to a posting list of the documents (and positions) where it occurs lets retrieval touch only documents containing query terms, making it the foundational structure for scalable text search.
- Compression-efficiency trade-off
- Encoding document-id gaps and term frequencies with compact integer codes shrinks the index dramatically and, by reducing input/output and improving cache behavior, can also speed query processing.
- Efficient ranked query evaluation
- Document-at-a-time and term-at-a-time strategies, combined with dynamic pruning and early-termination techniques, allow systems to return the top-ranked results without fully scoring the entire collection.
Clinical relevance
Inverted indexes and efficient query processing are the engine room of every production search system, from web search engines and open-source search platforms to enterprise and database full-text search. Their efficiency directly determines query latency, hardware cost, and the scale of collections that can be searched interactively.
History
Inverted files have been used for text search since the earliest information systems, but the modern theory of index construction, compression, and efficient evaluation was consolidated in the 1990s, notably by the Managing Gigabytes work of Witten, Moffat, and Bell. Zobel and Moffat's 2006 survey synthesized two decades of inverted-index research as web-scale search made efficiency paramount.
Key figures
- Justin Zobel
- Alistair Moffat
- Ian H. Witten
- W. Bruce Croft
Related topics
Seminal works
- zobel2006
- wittenmgb1999
- manning2008
Frequently asked questions
- Why is the inverted index preferred over scanning documents?
- Scanning every document for each query is far too slow at scale. The inverted index lets the system jump straight to the small set of documents that contain the query terms, so query time depends on the postings lists involved rather than the size of the whole collection.
- Does compressing the index slow down search?
- Usually the opposite. A smaller index reduces disk and memory traffic, and modern integer codes decompress very fast, so the time saved on input/output and improved cache behavior typically outweighs the decoding cost, making compressed indexes both smaller and faster.