Vector Space Model

The vector space model represents documents and queries as vectors of term weights in a high-dimensional space and ranks documents by their geometric similarity to the query.

Definition

The vector space model embeds documents and queries as vectors whose components are term weights, and estimates relevance by a vector similarity measure, most commonly the cosine of the angle between the document and query vectors after length normalization.
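As a worked form of the definition above, the cosine score can be written out as follows. The symbols are assumptions introduced here for illustration: $w_{t,d}$ and $w_{t,q}$ stand for the weight of term $t$ in document $d$ and query $q$, and the sums run over the shared vocabulary.

```latex
\[
\operatorname{sim}(d, q)
  = \cos\theta
  = \frac{\vec{d} \cdot \vec{q}}{\lVert \vec{d} \rVert \, \lVert \vec{q} \rVert}
  = \frac{\sum_{t} w_{t,d}\, w_{t,q}}
         {\sqrt{\sum_{t} w_{t,d}^{2}} \; \sqrt{\sum_{t} w_{t,q}^{2}}}
\]
```

Dividing by the two vector norms is the length-normalization step: if both vectors are first scaled to unit length, the score reduces to a plain dot product.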

Scope

This topic covers the algebraic model of retrieval in which each term defines a dimension and documents and queries become weighted vectors. It addresses term weighting schemes (especially term frequency, inverse document frequency, and their tf-idf product), length normalization, and the cosine similarity used to score documents. It treats the geometric intuition of relevance as proximity in term space and the practical scoring of ranked retrieval, while leaving the probabilistic justification of weights to the probabilistic models topic.

Core questions

How are documents and queries turned into vectors over a shared term vocabulary?
Why does combining term frequency with inverse document frequency produce useful weights?
How does cosine similarity measure closeness while controlling for document length?
What does it mean geometrically for a document to be relevant to a query?
What are the limitations of treating terms as independent orthogonal dimensions?
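The first and last of these questions can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw counts as weights; the helper names `build_vocabulary` and `to_count_vector` and the toy texts are hypothetical, chosen only for this example.

```python
from collections import Counter


def build_vocabulary(texts):
    """Return a sorted list of every distinct term in the texts."""
    return sorted({term for text in texts for term in text.lower().split()})


def to_count_vector(text, vocabulary):
    """Map a text to raw term counts, one component per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


# Toy collection (assumed data, not from any real corpus).
docs = [
    "the dog chased the cat",
    "the cat chased the dog",
    "automobile repair",
]
query = "car repair"

# Documents and the query share one vocabulary, so their vectors are comparable.
vocabulary = build_vocabulary(docs + [query])
doc_vectors = [to_count_vector(d, vocabulary) for d in docs]
query_vector = to_count_vector(query, vocabulary)

# Overlap between the query and the third document, term by term.
overlap = [q * d for q, d in zip(query_vector, doc_vectors[2])]
```

Two consequences show up in this sketch. The first two documents receive identical vectors even though they describe opposite events, because word order is discarded under the bag-of-words assumption. And since "car" and "automobile" occupy separate, orthogonal dimensions, only "repair" contributes to the overlap between the query and the third document.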

Key concepts

term-document vector
term frequency (tf)
inverse document frequency (idf)
tf-idf weighting
cosine similarity
length normalization
high-dimensional term space
bag-of-words assumption

Key theories

Vector representation and cosine similarity: Representing documents and queries as vectors in term space allows relevance to be estimated by the cosine of the angle between them, which normalizes for length and rewards documents whose term distribution aligns with the query.
tf-idf term weighting: A term's weight grows with its frequency in a document but is dampened by how common the term is across the collection, captured by inverse document frequency, so that discriminating terms dominate the score.
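The two ideas above can be combined in a short sketch. It assumes one particular weighting variant, raw term frequency multiplied by idf = log(N / df), which is only one of many variants in use; the function names and the toy collection are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Build sparse tf-idf vectors (dicts) with raw tf and idf = log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized_docs for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized_docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is zero."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_tokens, doc_vectors, idf):
    """Return document indices ordered by decreasing cosine score."""
    # Query terms unseen in the collection get weight 0 (an assumption here).
    query_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vectors)]
    return [i for _, i in sorted(scored, reverse=True)]


# Toy collection (assumed data).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
doc_vectors, idf = tfidf_vectors(docs)
ranking = rank("the cat mat".split(), doc_vectors, idf)
```

In this toy collection "the" occurs in every document, so its idf is log(3/3) = 0 and it contributes nothing to any score, while the rare term "mat" carries the largest weight. The first document, which contains both "cat" and "mat", is ranked first.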

Practical relevance

The vector space model and tf-idf weighting underpin a vast range of search and text-analysis systems and remain a default scoring baseline. The same vector representation is the conceptual ancestor of modern dense embedding retrieval, where learned vectors replace hand-crafted term weights.

History

Salton introduced vector-based indexing through the SMART system, formalized in the 1975 paper with Wong and Yang. Spärck Jones's 1972 statistical interpretation of term specificity supplied the inverse document frequency component, and Salton and Buckley's 1988 study systematized tf-idf weighting variants. The model dominated experimental IR for decades and shaped how text is represented numerically across computing.

Key figures

Gerard Salton
Karen Spärck Jones
Chris Buckley

Seminal works

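Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.
Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11-21.
Salton, G., and Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523.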

Frequently asked questions

Why use cosine similarity rather than Euclidean distance? Cosine similarity compares the direction of the document and query vectors rather than their magnitude, which makes it robust to document length: a long document and a short one on the same topic can still score highly, whereas raw distance would penalize the longer one.
What does inverse document frequency accomplish? Inverse document frequency downweights terms that appear in many documents, such as common words, and boosts rare, discriminating terms. This prevents ubiquitous words from dominating similarity scores and focuses matching on content-bearing terms.
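The first answer can be checked numerically with a small sketch. The vectors below are made-up term weights over a three-term vocabulary, and the long document is assumed to be the short one repeated ten times, which keeps its direction while scaling its magnitude.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical term-weight vectors (assumed data).
query = [1.0, 2.0, 0.0]
short_doc = [1.0, 2.0, 0.0]      # on topic, short
long_doc = [10.0, 20.0, 0.0]     # same topic, ten times longer
off_topic = [0.0, 0.0, 3.0]      # shares no terms with the query

# Cosine depends only on direction, so both on-topic documents score alike.
cos_short = cosine(query, short_doc)
cos_long = cosine(query, long_doc)
cos_off = cosine(query, off_topic)

# Euclidean distance also reflects magnitude (math.dist needs Python 3.8+).
dist_long = math.dist(query, long_doc)
dist_off = math.dist(query, off_topic)
```

Under these assumptions the long on-topic document gets the same cosine score as the short one, while its Euclidean distance from the query is larger than that of the off-topic document, which shares no terms with the query at all.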