ScholarGate

Retrieval Models

Retrieval models are the formal frameworks that define what it means for a document to match a query and how documents are scored and ranked in response to an information need.

Definition

A retrieval model is a precise specification of document and query representations together with a ranking or matching function that, given a query, assigns each document a score reflecting its estimated relevance to the underlying information need.

Scope

This area covers the principal mathematical models used to match queries against documents and to rank results: set-theoretic Boolean and extended Boolean retrieval, the algebraic vector space model with term weighting such as tf-idf, probabilistic models including the binary independence model and BM25, and statistical language models for retrieval. It treats how relevance is formalized, how term weights are assigned, and how a similarity or probability score induces a ranking. It excludes the data structures that make retrieval efficient (covered under indexing and query processing) and the empirical measurement of how well a model performs (covered under evaluation).

Core questions

  • What formal representation of documents and queries does a model assume?
  • How does a model translate a representation into a relevance score or a matching decision?
  • How are individual terms weighted to reflect their importance within a document and across a collection?
  • How does a model account for the uncertainty inherent in relevance?
  • What assumptions (such as term independence) does a model make, and when do they break down?

Key concepts

  • relevance
  • term weighting and tf-idf
  • Boolean retrieval
  • vector space and cosine similarity
  • probability ranking principle
  • binary independence model and BM25
  • query likelihood and smoothing
  • term independence assumption
  • ranking function

Key theories

Vector space model
Documents and queries are represented as vectors in a high-dimensional term space, typically with tf-idf weights, and relevance is estimated by a geometric similarity such as the cosine of the angle between the query and document vectors.
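The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the toy documents, whitespace tokenization, raw term frequency, and the unsmoothed `log(N / df)` form of idf are all assumptions chosen for brevity, and the helper names `tfidf_vectors` and `cosine` are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return sparse tf-idf vectors (dict: term -> weight) and the idf table."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(N / df), with no smoothing.
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy collection (illustrative only).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock markets fell sharply".split(),
]
vectors, idf = tfidf_vectors(docs)

# The query is weighted with the same idf table; unknown terms get weight 0.
query = "cat mat".split()
query_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}

scores = [cosine(query_vec, v) for v in vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Here `ranking` lists document indices from most to least similar. The first document shares both query terms and ranks first, while the third shares none and scores zero.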
Probability ranking principle and probabilistic retrieval
Ranking documents in decreasing order of their estimated probability of relevance to a query is optimal in expected effectiveness, provided those probabilities are estimated accurately and each document's relevance is judged independently of the others; the binary independence model and its practical descendant BM25 operationalize this with term weights derived from relevance probabilities.
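As a rough illustration of how BM25 turns these ideas into a score, the sketch below implements one common formulation. Several details are assumptions rather than anything fixed by the text above: the `log(1 + ...)` idf variant (used in some search libraries to keep weights non-negative), the parameter values `k1=1.2` and `b=0.75`, the toy documents, and the hypothetical function name `bm25_scores`.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with a BM25 variant.

    k1 controls term-frequency saturation and b controls length normalization;
    the defaults here are commonly quoted values, not tuned ones.
    """
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n          # average document length
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if t not in tf:
                continue  # terms absent from the document contribute nothing
            # Assumed idf variant: always non-negative.
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Saturating term frequency with document-length normalization.
            denom = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy collection (illustrative only).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock markets fell sharply".split(),
]
scores = bm25_scores("cat mat".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Unlike cosine similarity, the score is an unnormalized sum of per-term contributions, so only the relative order of documents for a given query is meaningful.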
Language modeling approach to retrieval
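Each document is treated as a sample from a generative language model, and documents are ranked by the probability that their model would have generated the query, with smoothing used to assign nonzero probability to query terms that do not occur in the document, so that a single missing term does not zero out the score.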
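A minimal sketch of query-likelihood ranking follows, assuming a unigram model and Jelinek-Mercer smoothing, which interpolates the document's term distribution with the collection's. The mixing weight `lam=0.5`, the toy documents, the decision to skip query terms unseen anywhere in the collection, and the hypothetical function name `query_log_likelihood` are all illustrative assumptions; Dirichlet smoothing is a frequently used alternative.

```python
import math
from collections import Counter

def query_log_likelihood(query, docs, lam=0.5):
    """Log-probability of the query under each document's smoothed unigram model.

    Assumes every document is non-empty. `lam` is the weight given to the
    collection model in the Jelinek-Mercer interpolation.
    """
    collection = Counter(t for d in docs for t in d)
    total = sum(collection.values())
    scores = []
    for d in docs:
        tf = Counter(d)
        log_p = 0.0
        for t in query:
            if collection[t] == 0:
                continue  # assumption: ignore terms unknown to the whole collection
            p_doc = tf[t] / len(d)           # maximum-likelihood document estimate
            p_coll = collection[t] / total   # background collection estimate
            log_p += math.log((1 - lam) * p_doc + lam * p_coll)
        scores.append(log_p)
    return scores

# Toy collection (illustrative only).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock markets fell sharply".split(),
]
scores = query_log_likelihood("cat mat".split(), docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Working in log space avoids underflow for longer queries. Because of the collection term, the second document still receives a finite score even though it does not contain "mat".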

Practical relevance

Retrieval models are the scoring core of essentially every search system, from library catalogs and enterprise search to web search engines and the candidate-ranking stages of question answering and retrieval-augmented generation. tf-idf and BM25 in particular remain strong, widely deployed baselines.

History

The vector space model emerged from Salton's SMART project in the 1960s and 1970s, giving retrieval an algebraic foundation. In parallel, Robertson and Spärck Jones developed a probabilistic theory of relevance weighting in the 1970s, which later matured into the BM25 ranking function. The language modeling approach, introduced by Ponte and Croft in 1998, reframed retrieval as statistical generation and broadened the modeling toolkit.

Key figures

  • Gerard Salton
  • Stephen E. Robertson
  • Karen Spärck Jones
  • W. Bruce Croft
  • C. J. van Rijsbergen

Seminal works

  • salton1975
  • robertson1976
  • ponte1998
  • manning2008

Frequently asked questions

What is the difference between a retrieval model and a ranking function?
A retrieval model is the overall framework that specifies how documents and queries are represented and how relevance is conceived; the ranking function is the concrete scoring formula the model produces, such as cosine similarity in the vector space model or the BM25 formula in the probabilistic family.
Why is BM25 still used when neural models exist?
BM25 is fast, requires no training data, has only a couple of tunable parameters (commonly written k1 and b), and remains a strong baseline that neural rankers are often measured against and combined with. Many modern systems use BM25 to retrieve an initial candidate set that a more expensive model then re-ranks.
