ScholarGate

Vector Space Model

The vector space model represents documents and queries as vectors of term weights in a high-dimensional space and ranks documents by their geometric similarity to the query.

Definition

The vector space model embeds documents and queries as vectors whose components are term weights, and estimates relevance by a vector similarity measure, most commonly the cosine of the angle between the document and query vectors after length normalization.
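
As a worked form of the definition above, the cosine measure can be written out explicitly. The symbols are assumptions introduced here for illustration: V is the vocabulary size, and w_{i,d} and w_{i,q} are the weights of term i in the document and the query.

```latex
\[
\operatorname{sim}(d, q) = \cos\theta
= \frac{\mathbf{d} \cdot \mathbf{q}}{\lVert \mathbf{d} \rVert \, \lVert \mathbf{q} \rVert}
= \frac{\sum_{i=1}^{V} w_{i,d}\, w_{i,q}}
       {\sqrt{\sum_{i=1}^{V} w_{i,d}^{2}} \; \sqrt{\sum_{i=1}^{V} w_{i,q}^{2}}}
\]
```

Dividing by the two vector norms is the length normalization step: with non-negative term weights the score lies between 0 (no shared terms) and 1 (identical direction).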

Scope

This topic covers the algebraic model of retrieval in which each term defines a dimension and documents and queries become weighted vectors. It addresses term weighting schemes (especially term frequency, inverse document frequency, and their tf-idf product), length normalization, and the cosine similarity used to score documents. It treats the geometric intuition of relevance as proximity in term space and the practical scoring of ranked retrieval, and leaves the probabilistic justification of weights to the probabilistic models topic.

Core questions

  • How are documents and queries turned into vectors over a shared term vocabulary?
  • Why does combining term frequency with inverse document frequency produce useful weights?
  • How does cosine similarity measure closeness while controlling for document length?
  • What does it mean geometrically for a document to be relevant to a query?
  • What are the limitations of treating terms as independent orthogonal dimensions?

Key concepts

  • term-document vector
  • term frequency (tf)
  • inverse document frequency (idf)
  • tf-idf weighting
  • cosine similarity
  • length normalization
  • high-dimensional term space
  • bag-of-words assumption

Key theories

Vector representation and cosine similarity
Representing documents and queries as vectors in term space allows relevance to be estimated by the cosine of the angle between them, which normalizes for length and rewards documents whose term distribution aligns with the query.
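
The idea can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes whitespace tokenization and raw term counts as weights, and the helper names (`to_vector`, `cosine`, `rank`) and the toy documents are hypothetical.

```python
import math
from collections import Counter

def to_vector(text, vocabulary):
    """Map text to a list of raw term counts, one slot per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, documents):
    """Score every document against the query and sort by descending cosine."""
    # Shared vocabulary: every distinct term in the documents and the query.
    vocabulary = sorted({t for text in documents + [query] for t in text.lower().split()})
    q = to_vector(query, vocabulary)
    scored = [(cosine(to_vector(d, vocabulary), q), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy collection, invented for illustration.
documents = [
    "vector space retrieval model",
    "probabilistic retrieval model",
    "cooking pasta at home",
]
ranking = rank("vector space model", documents)
for score, doc in ranking:
    print(f"{score:.3f}  {doc}")
```

The document sharing three of the query's terms ranks first, the one sharing a single term ranks second, and the document with no shared terms scores zero.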
tf-idf term weighting
A term's weight grows with its frequency in a document but is dampened by how common the term is across the collection, captured by inverse document frequency, so that discriminating terms dominate the score.
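
A small sketch of one common tf-idf variant follows. It assumes raw term counts for tf and idf = ln(N / df), which is only one of several weightings in use; the function name `tf_idf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return (idf, weights): idf per term, and one {term: tf * idf} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed variant: natural log of N / df, with no smoothing.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({term: tf[term] * idf[term] for term in tf})
    return idf, weights

# Toy collection, invented for illustration.
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
idf, weights = tf_idf_weights(documents)
for term in ("the", "cat", "mat"):
    print(term, round(weights[0][term], 3))
```

In this toy collection "the" appears in every document, so its weight is zero even though it is the most frequent term in the first document, while "mat", which occurs in only one document, receives the largest weight of the three.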

Practical relevance

The vector space model and tf-idf weighting underpin a vast range of search and text-analysis systems and remain a default scoring baseline. The same vector representation is the conceptual ancestor of modern dense embedding retrieval, where learned vectors replace hand-crafted term weights.

History

Salton introduced vector-based indexing through the SMART system, formalized in the 1975 paper with Wong and Yang. Spärck Jones's 1972 statistical interpretation of term specificity supplied the inverse document frequency component, and Salton and Buckley's 1988 study systematized tf-idf weighting variants. The model dominated experimental IR for decades and shaped how text is represented numerically across computing.

Key figures

  • Gerard Salton
  • Karen Spärck Jones
  • Chris Buckley

Seminal works

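  • Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11).
  • Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1).
  • Salton, G., and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5).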

Frequently asked questions

Why use cosine similarity rather than Euclidean distance?
Cosine similarity compares the direction of the document and query vectors rather than their magnitude, which makes it robust to document length: a long document and a short one on the same topic can both score highly against the same query, whereas Euclidean distance on unnormalized vectors would penalize the longer one simply because its vector is longer.
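
The contrast can be checked numerically. The sketch below uses made-up three-term count vectors; the variable names and values are illustrative assumptions, not data from any collection.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

query = [1, 1, 0]
short_doc = [2, 2, 0]    # on topic
long_doc = [20, 20, 0]   # same term proportions, ten times longer
off_topic = [0, 1, 2]    # mostly about a term the query lacks

for name, doc in [("short", short_doc), ("long", long_doc), ("off-topic", off_topic)]:
    print(f"{name:10s} cosine={cosine(query, doc):.3f}  euclidean={euclidean(query, doc):.3f}")
```

Both on-topic documents point in the same direction as the query, so their cosine is 1 regardless of length. By Euclidean distance, however, the long on-topic document is farther from the query than the off-topic one.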
What does inverse document frequency accomplish?
Inverse document frequency downweights terms that appear in many documents, such as common words, and boosts rare, discriminating terms. This prevents ubiquitous words from dominating similarity scores and focuses matching on content-bearing terms.
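
A short worked example makes the effect concrete. It assumes the common variant idf(t) = ln(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t; the collection size of 1000 is hypothetical.

```latex
\[
\mathrm{idf}(t) = \ln\frac{N}{\mathrm{df}(t)}, \qquad
\ln\frac{1000}{1000} = 0, \qquad
\ln\frac{1000}{10} = \ln 100 \approx 4.61
\]
```

A term found in all 1000 documents contributes nothing to the score, while a term found in only 10 of them is weighted about 4.61 per occurrence.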
