Document Representation and Weighting
Document representation turns raw text into a structured set of weighted features, deciding what counts as a term and how much each term should contribute.
Definition
Document representation and weighting is the process of transforming raw document text into a vector of features, typically terms, by tokenizing and normalizing text and assigning each feature a weight that reflects its importance within the document and across the collection.
Scope
This topic covers the steps that convert documents into searchable representations: tokenization, normalization, stop-word handling, stemming and lemmatization, and the construction of bag-of-words or n-gram feature vectors, together with term-weighting schemes such as raw and logarithmic term frequency, inverse document frequency, and tf-idf with length normalization. It treats the choices that shape the representation feeding retrieval, classification, and clustering, while leaving the ranking models and latent representations to adjacent topics.
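The preprocessing steps named above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference pipeline: the stop-word list is a tiny hypothetical sample, `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's, and the function names are invented for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def tokenize(text):
    """Normalize by lowercasing, then split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def crude_stem(token):
    """Toy suffix stripper; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def to_terms(text, n=1):
    """Tokenize, drop stop words, stem, and form word n-grams of size n."""
    tokens = [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_words(text, n=1):
    """Count term occurrences, discarding their order."""
    return Counter(to_terms(text, n))
```

Under these assumptions, `to_terms("Indexing the indexed documents")` yields `["index", "index", "document"]`: the stop word is dropped and both inflected forms conflate to one term, so the bag-of-words count for `index` is 2.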
Core questions
- How is raw text tokenized and normalized into terms?
- What effects do stop-word removal, stemming, and lemmatization have on the representation?
- Why does term frequency alone make a poor weight, and how is it transformed?
- How does inverse document frequency capture term importance across a collection?
- How does length normalization keep long and short documents comparable?
Key concepts
- tokenization and normalization
- stop words
- stemming and lemmatization
- bag-of-words and n-grams
- term frequency (raw and log)
- inverse document frequency
- tf-idf variants
- length normalization
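As an illustration of how these concepts combine, one commonly documented variant can be written as follows. The notation is an assumption made for this sketch: tf_{t,d} is the raw count of term t in document d, df_t is the number of documents containing t, and N is the number of documents in the collection. Other variants differ in the damping function, the idf smoothing, and the choice of norm.

```latex
\[
w_{t,d} =
\begin{cases}
\bigl(1 + \log \mathrm{tf}_{t,d}\bigr)\,\log\dfrac{N}{\mathrm{df}_t} & \text{if } \mathrm{tf}_{t,d} > 0,\\[4pt]
0 & \text{otherwise,}
\end{cases}
\qquad
\hat{w}_{t,d} = \frac{w_{t,d}}{\sqrt{\sum_{t'} w_{t',d}^{2}}}
\]
```

The first expression multiplies a logarithmic term-frequency component by inverse document frequency. The second divides by the Euclidean length of the document vector (cosine normalization), so that long and short documents yield vectors of comparable magnitude.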
Key theories
- Bag-of-words representation
- Treating a document as an unordered multiset of terms, ignoring word order, yields a simple, effective feature vector that underpins classical retrieval, classification, and clustering despite discarding syntax.
- tf-idf weighting schemes
- Combining an (often dampened) term-frequency component with inverse document frequency and length normalization produces weights that emphasize terms frequent in a document but rare in the collection, with many documented variants.
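Both ideas can be sketched together in a short Python function. This is one possible variant, chosen for illustration: logarithmic term frequency, unsmoothed idf with the natural logarithm, and cosine length normalization. The function name `tfidf_vectors` and the input format (each document as a list of already-tokenized terms) are assumptions of this sketch.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # bag of words: order discarded, counts kept
        weights = {
            term: (1 + math.log(tf)) * math.log(n_docs / df[term])
            for term, tf in counts.items()
        }
        # Cosine (L2) length normalization; guard against an all-zero vector.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return vectors

docs = [
    ["the", "cat", "cat", "sat"],
    ["the", "dog"],
    ["the", "bird", "sat"],
]
vectors = tfidf_vectors(docs)
```

In this toy collection, `the` occurs in every document, so its idf is log(3/3) = 0 and it receives zero weight everywhere. In the first document, `cat` (repeated, and found in one document only) outweighs `sat` (found in two).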
Practical relevance
Representation and weighting choices directly affect the quality of every downstream task, from search ranking to spam filtering and clustering. tf-idf representations remain a strong, interpretable baseline, and the same design questions of tokenization and normalization persist in modern pipelines that feed learned embeddings.
History
Document representation matured alongside the vector space model in the 1960s and 1970s, with Spärck Jones introducing inverse document frequency in 1972 and Salton and Buckley systematizing term-weighting variants in 1988. The bag-of-words representation and tf-idf became the default substrate for text processing across IR and machine learning for decades.
Key figures
- Gerard Salton
- Chris Buckley
- Karen Spärck Jones
Related topics
Seminal works
- salton1988
- sparckjones1972
- manning2008
Frequently asked questions
- What is the bag-of-words model?
- The bag-of-words model represents a document as the set (term presence only) or multiset (term counts) of the terms it contains, ignoring word order and grammar. Despite discarding sequence information, it is simple, efficient, and effective enough to underpin classical retrieval, classification, and clustering.
- Why apply a logarithm to term frequency?
- A term appearing ten times is not ten times as important as one appearing once. Taking a logarithm of term frequency dampens this effect, so additional occurrences add progressively less weight, better reflecting how repetition relates to relevance.
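The dampening effect can be checked numerically with a small sketch. It assumes the common form 1 + ln(tf) with the natural logarithm. Other bases and offsets appear in the literature, and the helper name `log_tf` is chosen for this example.

```python
import math

def log_tf(tf):
    """Dampened term frequency: 1 + ln(tf) for tf > 0, otherwise 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

# Compare raw counts with their dampened weights.
for tf in (1, 2, 10, 100):
    print(f"raw tf = {tf:>3}   log tf = {log_tf(tf):.2f}")
```

Under this form, ten occurrences weigh about 3.3 times as much as a single occurrence rather than ten times, and one hundred occurrences about 5.6 times.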