Index Compression
Index compression encodes the postings lists of an inverted index compactly so that a search system stores less data and answers queries faster.
Definition
Index compression is the application of integer- and string-coding methods to the dictionary and postings of an inverted index so as to reduce its storage footprint while keeping postings rapidly decodable during query processing.
Scope
This topic covers techniques for compressing inverted indexes, especially the encoding of document-identifier gaps and term frequencies with variable-length and word-aligned integer codes. It treats dictionary compression, gap (delta) encoding, classic codes such as unary, gamma, and Golomb-Rice, byte-aligned and block-based schemes such as variable-byte and PForDelta, and the trade-off between compression ratio and decoding speed. It excludes the construction of the index itself and the query-evaluation strategy that consumes it.
Core questions
- Why does encoding gaps between document identifiers compress postings effectively?
- What integer codes are used, and how do they trade compression ratio against decoding speed?
- How is the term dictionary itself compressed?
- How can compressed postings be decoded fast enough to keep query latency low?
- How does compression interact with cache behavior and input/output cost?
Key concepts
- gap (delta) encoding
- variable-byte encoding
- gamma and Golomb-Rice codes
- PForDelta and block-based codes
- dictionary compression
- compression ratio
- decoding throughput
- SIMD / vectorized decoding
Key theories
- Gap encoding of postings
- Because document identifiers in a postings list are increasing, storing the differences (gaps) between consecutive identifiers yields small numbers that compress well, especially for frequent terms with dense postings.
- Compression-speed trade-off
- Bit-aligned codes such as gamma and Golomb maximize compression but decode slowly, whereas byte-aligned and block-based codes such as variable-byte and PForDelta sacrifice some ratio for much faster, vectorizable decoding, which often wins overall query performance.
Clinical relevance
Compression is essential to operating search at scale: it shrinks indexes so they fit in memory or smaller storage, reduces input/output and improves cache locality, and thereby lowers both query latency and hardware cost. Production search engines and open-source search libraries all rely on compressed postings.
History
Compact coding of text indexes was developed alongside inverted files, with classic bit-aligned codes (unary, gamma, Golomb) systematized in the Managing Gigabytes work of the 1990s. As web-scale search demanded ever-faster decoding, byte-aligned and block-based schemes such as variable-byte and PForDelta, and later vectorized decoders capable of billions of integers per second, shifted the emphasis toward speed.
Key figures
- Alistair Moffat
- Ian H. Witten
- Daniel Lemire
- Justin Zobel
Related topics
Seminal works
- wittenmgb1999
- lemire2015
- manning2008
Frequently asked questions
- How can a compressed index be faster than an uncompressed one?
- Compression reduces the amount of data read from disk or memory, which is often the bottleneck. Modern integer codes decode very quickly, frequently using vector instructions, so the time saved on input/output and the better cache behavior more than compensate for the decoding work.
- Why store gaps instead of raw document identifiers?
- Document identifiers in a postings list are sorted and increasing, so consecutive ones differ by small amounts. Storing those small gaps instead of large absolute identifiers produces values that compact codes can represent in very few bits.