Inverted Indexes
An inverted index maps each term in a collection to a postings list of the documents that contain it, enabling a search system to find matching documents without scanning every document.
Definition
An inverted index is a data structure consisting of a dictionary of indexed terms, each pointing to a postings list that enumerates the documents containing the term. Postings are often annotated with term frequencies and within-document positions, and retrieval is performed by intersecting or merging postings lists.
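As an illustration of this definition, the following minimal sketch builds a small in-memory index in which each dictionary term maps to a postings list of (document identifier, term frequency, positions) entries. The names `tokenize` and `build_index`, the whitespace tokenizer, and the sample documents are assumptions made for the example, not part of any particular system.

```python
from collections import defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase, then split on whitespace only.
    return text.lower().split()


def build_index(docs):
    """Map each term to a postings list of (doc_id, term_frequency, positions)."""
    by_term = defaultdict(dict)  # term -> {doc_id: [positions]}
    for doc_id in sorted(docs):
        for pos, term in enumerate(tokenize(docs[doc_id])):
            by_term[term].setdefault(doc_id, []).append(pos)
    # Emit each postings list in increasing doc_id order.
    return {
        term: [(doc_id, len(plist), plist) for doc_id, plist in sorted(per_doc.items())]
        for term, per_doc in by_term.items()
    }


docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}
index = build_index(docs)
print(index["july"])  # [(2, 1, [4]), (3, 1, [5])]
print(index["in"])    # [(2, 1, [3]), (3, 2, [1, 4])]
```

Keeping each postings list sorted by document identifier is the property that later makes linear-time intersection and merging possible.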
Scope
This topic covers the structure and construction of the inverted index: the dictionary of terms, the postings lists recording document identifiers, term frequencies, and positions, and the algorithms that build and update indexes over large collections, including blocked sort-based indexing and single-pass in-memory indexing. It addresses positional information for phrase queries and the engineering of index maintenance, while leaving compression and query-evaluation strategy to adjacent topics.
Core questions
- What do a dictionary entry and its postings list contain?
- How are positions stored to support phrase and proximity queries?
- How is an inverted index built when the collection is too large for memory?
- How is an index updated as documents are added, changed, or deleted?
- How do postings lists support efficient intersection for conjunctive queries?
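The last question can be made concrete with a short sketch. Assuming postings lists are stored as lists of document identifiers in increasing order, a conjunctive (AND) query can be answered by walking the lists in step, in time proportional to their combined length. The function names and sample lists below are illustrative assumptions.

```python
def intersect(p1, p2):
    """Intersect two postings lists of doc ids sorted in increasing order."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller doc id
        else:
            j += 1
    return out


def intersect_many(lists):
    """AND several postings lists, shortest first to keep intermediates small."""
    if not lists:
        return []
    ordered = sorted(lists, key=len)
    result = ordered[0]
    for postings in ordered[1:]:
        result = intersect(result, postings)
        if not result:
            break  # no document can match once the result is empty
    return result


brutus = [1, 2, 4, 11, 31, 45, 173, 174]
caesar = [1, 2, 4, 5, 6, 16, 57, 132]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))                   # [2, 31]
print(intersect_many([brutus, caesar, calpurnia]))    # [2]
```

Processing the shortest list first is a common heuristic rather than a requirement: every intermediate result is at most as long as the shortest input.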
Key concepts
- term dictionary
- postings list
- document identifiers
- positional index
- term frequency storage
- blocked sort-based indexing (BSBI)
- single-pass in-memory indexing (SPIMI)
- index merging and updates
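To illustrate the last concept, here is a minimal sketch of one common textbook strategy for updates: a large main index, a small auxiliary index for newly added documents, a set of invalidated document identifiers for deletions, and a periodic merge. The class name `UpdatableIndex`, the `merge_threshold` parameter, and the assumption that document identifiers only increase and are never reused are all choices made for this example.

```python
class UpdatableIndex:
    def __init__(self, merge_threshold=2):
        self.main = {}        # term -> sorted doc ids (stands in for the on-disk index)
        self.aux = {}         # term -> sorted doc ids of recently added documents
        self.deleted = set()  # doc ids invalidated since the last merge
        self.aux_docs = 0
        self.merge_threshold = merge_threshold

    def add(self, doc_id, text):
        # Assumes doc ids are assigned in increasing order and never reused.
        for term in set(text.lower().split()):
            self.aux.setdefault(term, []).append(doc_id)
        self.aux_docs += 1
        if self.aux_docs >= self.merge_threshold:
            self.merge()

    def delete(self, doc_id):
        # Deletion only marks the document; postings are dropped at merge time.
        self.deleted.add(doc_id)

    def update(self, old_doc_id, new_doc_id, text):
        # A changed document is handled as a delete followed by an add.
        self.delete(old_doc_id)
        self.add(new_doc_id, text)

    def merge(self):
        """Fold the auxiliary index into the main index, dropping deleted docs."""
        for term in set(self.main) | set(self.aux):
            merged = self.main.get(term, []) + self.aux.get(term, [])
            live = [d for d in merged if d not in self.deleted]
            if live:
                self.main[term] = live
            else:
                self.main.pop(term, None)
        self.aux, self.aux_docs = {}, 0
        self.deleted.clear()

    def search(self, term):
        # Queries consult both indexes and filter out invalidated documents.
        hits = self.main.get(term, []) + self.aux.get(term, [])
        return [d for d in hits if d not in self.deleted]


idx = UpdatableIndex(merge_threshold=3)
idx.add(1, "home sales rise")
idx.add(2, "home prices fall")
idx.delete(1)
print(idx.search("home"))  # [2]
```

Because new identifiers are always larger than existing ones, appending auxiliary postings after main postings keeps every list sorted without an explicit sort.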
Key theories
- Dictionary and postings organization
  - Separating a compact term dictionary from variable-length postings lists lets the system look up a term quickly and then stream only the postings for that term, which is the structural basis of all inverted-index retrieval.
- Scalable index construction
  - Disk-based methods such as blocked sort-based indexing and single-pass in-memory indexing build inverted files for collections far larger than memory by building memory-sized partial indexes, writing each to disk, and merging them into a single index.
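The construction idea can be sketched in the style of single-pass in-memory indexing: postings are accumulated per term in an in-memory block, each full block is written to disk with its terms sorted, and the block files are then merged. This is a simplified sketch under stated assumptions: the block limit is a posting count (`max_postings_per_block`, a hypothetical parameter) rather than available memory, blocks are plain text files, and the merged index is returned as a dictionary instead of being written back to disk. Blocked sort-based indexing differs mainly in that it collects and sorts termID-docID pairs per block rather than growing postings lists directly.

```python
import heapq
import os
import tempfile


def write_block(block, directory, block_no):
    """Write one in-memory block to disk with terms in sorted order."""
    path = os.path.join(directory, f"block{block_no}.txt")
    with open(path, "w", encoding="utf-8") as f:
        for term in sorted(block):
            f.write(term + "\t" + ",".join(map(str, block[term])) + "\n")
    return path


def read_block(path):
    """Stream (term, doc_ids) pairs from a block file, one line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, ids = line.rstrip("\n").split("\t")
            yield term, [int(x) for x in ids.split(",")]


def spimi_build(docs, directory, max_postings_per_block=4):
    """docs: iterable of (doc_id, text) pairs in increasing doc_id order."""
    paths, block, size = [], {}, 0
    for doc_id, text in docs:
        for term in text.lower().split():
            postings = block.setdefault(term, [])
            if not postings or postings[-1] != doc_id:
                postings.append(doc_id)
                size += 1
        if size >= max_postings_per_block:  # flush only at document boundaries
            paths.append(write_block(block, directory, len(paths)))
            block, size = {}, 0
    if block:
        paths.append(write_block(block, directory, len(paths)))

    # k-way merge: every block is sorted by term, so the streams merge lazily.
    index = {}
    for term, ids in heapq.merge(*(read_block(p) for p in paths)):
        index.setdefault(term, []).extend(ids)
    return index


docs = [
    (1, "new home sales"),
    (2, "home sales rise"),
    (3, "sales rise in july"),
    (4, "july home sales"),
]
with tempfile.TemporaryDirectory() as d:
    index = spimi_build(docs, d)
print(index["home"])   # [1, 2, 4]
print(index["sales"])  # [1, 2, 3, 4]
```

Since documents arrive in identifier order and blocks are flushed only between documents, every identifier in an earlier block is smaller than every identifier in a later one, so concatenating a term's postings during the merge leaves them sorted.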
Practical relevance
The inverted index is the central data structure of virtually all text search systems, including web search engines, open-source search platforms such as Lucene and its derivatives, and database full-text search. Its design governs which query types are supported and how quickly and cheaply they can be answered.
History
Inverted files were used in early bibliographic retrieval systems and became the standard structure for full-text search as collections grew. Research in the 1990s and 2000s, including scalable construction methods such as single-pass in-memory indexing, made it practical to index web-scale corpora, and the structure now anchors widely used open-source search libraries.
Key figures
- Justin Zobel
- Alistair Moffat
- Steffen Heinz
Related topics
Seminal works
- zobel2006
- heinz2003
- manning2008
Frequently asked questions
- Why is it called an 'inverted' index?
  - A normal (forward) index lists, for each document, the terms it contains. The inverted index reverses this mapping to list, for each term, the documents that contain it. This inversion is exactly what makes term-based lookup fast.
- What is a positional index used for?
  - A positional index stores the positions at which each term occurs within each document. This lets the system answer phrase queries and proximity queries, where the order or closeness of terms matters, rather than only whether the terms appear somewhere in the document.
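The use of positions described in the last answer can be sketched as follows. The example assumes a positional index held in memory as a mapping from term to document identifier to a list of positions; the function names, the whitespace tokenizer, and the sample documents are illustrative assumptions. A phrase matches when each successive term occurs exactly one position after the previous one, and a proximity query matches when two terms occur within `k` positions of each other.

```python
def build_positional_index(docs):
    """term -> {doc_id: [positions]}, positions counted from 0."""
    index = {}
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index.setdefault(term, {}).setdefault(doc_id, []).append(pos)
    return index


def phrase_match(index, phrase):
    """Return doc ids in which the terms of `phrase` occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Only documents containing every term can contain the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        later = [set(index[t][doc_id]) for t in terms[1:]]
        starts = index[terms[0]][doc_id]
        # Term number k+1 of the phrase must sit at position start + k + 1.
        if any(all(s + k + 1 in ps for k, ps in enumerate(later)) for s in starts):
            hits.append(doc_id)
    return hits


def near(index, a, b, k):
    """Return doc ids where terms a and b occur within k positions of each other."""
    common = set(index.get(a, {})) & set(index.get(b, {}))
    return [
        doc_id
        for doc_id in sorted(common)
        if any(abs(pa - pb) <= k for pa in index[a][doc_id] for pb in index[b][doc_id])
    ]


docs = {
    1: "to be or not to be",
    2: "not to be is to exist",
    3: "be to or not",
}
index = build_positional_index(docs)
print(phrase_match(index, "to be"))      # [1, 2]
print(phrase_match(index, "or not to"))  # [1]
print(near(index, "be", "not", 2))       # [1, 2]
```

Document 3 contains both "to" and "be" but in the wrong order, which is exactly the case a non-positional index cannot distinguish.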