Document Representation and Weighting

Document representation turns raw text into a structured set of weighted features, deciding what counts as a term and how much each term should contribute.

Definition

Document representation and weighting is the process of transforming raw document text into a vector of features, typically terms, by tokenizing and normalizing text and assigning each feature a weight that reflects its importance within the document and across the collection.

Scope

This topic covers the steps that convert documents into searchable representations: tokenization, normalization, stop-word handling, stemming and lemmatization, and the construction of bag-of-words or n-gram feature vectors, together with term-weighting schemes such as raw and logarithmic term frequency, inverse document frequency, and tf-idf with length normalization. It treats the choices that shape the representation feeding retrieval, classification, and clustering, while leaving the ranking models and latent representations to adjacent topics.
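The first steps of that pipeline can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the regular-expression tokenizer, the lowercasing rule, and the tiny stop-word list are hypothetical choices made for the example, and stemming and lemmatization are left out because they normally rely on a dedicated library.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Normalize by lowercasing, then split into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text, stop_words=STOP_WORDS):
    """Count the terms in the text, ignoring order and dropping stop words."""
    return Counter(t for t in tokenize(text) if t not in stop_words)


bow = bag_of_words("The cat sat on the mat, and the cat slept.")
print(bow)
```

Each choice here changes the resulting vocabulary. For example, "on" survives only because the illustrative stop-word list omits it, and the tokenizer splits hyphenated or apostrophized words into separate terms.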

Core questions

How is raw text tokenized and normalized into terms?
What is the effect of stop-word removal, stemming, and lemmatization?
Why does term frequency alone make a poor weight, and how is it transformed?
How does inverse document frequency capture term importance across a collection?
How does length normalization keep long and short documents comparable?

Key concepts

tokenization and normalization
stop words
stemming and lemmatization
bag-of-words and n-grams
term frequency (raw and log)
inverse document frequency
tf-idf variants
length normalization
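
As a minimal sketch of the n-gram concept listed above, the function below extracts contiguous word n-grams from a list of tokens. The function name word_ngrams and the sample tokens are assumptions made for the example; treating each n-gram as one feature restores a little of the local word order that a plain bag-of-words discards.

```python
def word_ngrams(tokens, n):
    """Return the contiguous n-token sequences in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["new", "york", "city", "hall"]
bigrams = word_ngrams(tokens, 2)
print(bigrams)
```

When n exceeds the number of tokens, the function returns an empty list.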

Key theories

Bag-of-words representation: Treating a document as an unordered multiset of terms yields a simple, effective feature vector that underpins classical retrieval, classification, and clustering, even though it discards word order and syntax.
tf-idf weighting schemes: Combining an (often dampened) term-frequency component with inverse document frequency and length normalization produces weights that emphasize terms that are frequent in a document but rare in the collection. Many variants are documented.
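
One possible tf-idf variant can be sketched as follows. It is an assumption chosen for illustration, not the only or canonical form: term frequency is dampened as 1 + ln(tf), inverse document frequency is ln(N / df), and each vector is scaled to unit Euclidean length (cosine normalization). The function name tfidf_vectors and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Log-dampened term frequency times inverse document frequency.
        weights = {
            term: (1.0 + math.log(tf)) * math.log(n_docs / df[term])
            for term, tf in counts.items()
        }
        # Cosine (L2) length normalization; skipped if every weight is zero.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


docs = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
    ["cat", "chased", "dog"],
]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print({term: round(w, 3) for term, w in vec.items()})
```

In the first toy document, "mat" receives the largest weight because it occurs in no other document, while "cat" outranks "sat" only because it is repeated. Under this variant a term that appears in every document gets a weight of zero.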

Practical relevance

Representation and weighting choices directly affect the quality of every downstream task, from search ranking to spam filtering and clustering. tf-idf representations remain a strong, interpretable baseline, and the same design questions of tokenization and normalization persist in modern pipelines that feed learned embeddings.

History

Document representation matured alongside the vector space model in the 1960s and 1970s, with Spärck Jones introducing inverse document frequency in 1972 and Salton and Buckley systematizing term-weighting variants in 1988. The bag-of-words representation and tf-idf became the default substrate for text processing across information retrieval and machine learning for decades.

Key figures

Gerard Salton
Chris Buckley
Karen Spärck Jones

Seminal works

salton1988
sparckjones1972
manning2008

Frequently asked questions

What is the bag-of-words model?: The bag-of-words model represents a document as the set or multiset of terms it contains, ignoring word order and grammar. Despite discarding sequence information, it is simple, efficient, and surprisingly effective for retrieval, classification, and clustering.
Why apply a logarithm to term frequency?: A term appearing ten times is not ten times as important as one appearing once. Taking a logarithm of term frequency dampens this effect, so additional occurrences add progressively less weight, better reflecting how repetition relates to relevance.
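
The dampening effect can be seen in a short sketch. It assumes one common form of the transform, 1 + log10(tf) for tf > 0 and 0 otherwise; the base of the logarithm and the function name log_tf are choices made for this example.

```python
import math


def log_tf(tf):
    """Sublinear term-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1.0 + math.log10(tf) if tf > 0 else 0.0


for tf in (0, 1, 2, 10, 100, 1000):
    print(f"raw tf = {tf:>4}  ->  log-weighted tf = {log_tf(tf):.2f}")
```

Under this form, a term that occurs ten times weighs twice as much as a term that occurs once, not ten times as much, and each further tenfold increase in occurrences adds only one more unit of weight.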