Text Clustering

Text clustering groups documents into clusters of similar content without predefined categories, revealing structure in a collection and supporting browsing and retrieval.

Definition

Text clustering is the unsupervised partitioning of a document collection into groups such that documents within a group are more similar to one another than to documents in other groups, using a similarity measure over document representations and no predefined labels.
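As a rough illustration of the similarity measure this definition relies on, the sketch below builds TF-IDF vectors and compares them with cosine similarity. The whitespace tokenizer, the `log(N/df) + 1` weighting, the function names, and the three toy documents are all assumptions made for the example; the definition itself does not prescribe any of them.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn whitespace-tokenized documents into sparse TF-IDF dicts."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw term count times (log(N / df) + 1).
        vectors.append(
            {t: c * (math.log(n_docs / df[t]) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection (invented for illustration).
docs = [
    "stock markets fell sharply today",
    "markets rallied as stock prices rose",
    "the team won the football match",
]
vecs = tfidf_vectors(docs)
sim_finance = cosine(vecs[0], vecs[1])  # two documents sharing terms
sim_mixed = cosine(vecs[0], vecs[2])    # two documents with no shared terms
print(sim_finance, sim_mixed)
```

Here the first two documents share the terms "stock" and "markets", so their similarity is positive, while the first and third share no terms and score exactly zero. A clustering method would use such pairwise scores to decide which documents belong together.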

Scope

This topic covers unsupervised grouping of documents: flat partitioning methods such as k-means on document vectors, hierarchical agglomerative clustering, the similarity measures and criterion functions involved, and the evaluation of cluster quality both internally and against external labels. It also covers retrieval-specific motivations, notably the cluster hypothesis and search-results clustering. It treats clustering as it serves information retrieval, distinct from supervised classification and from latent topic models.

Core questions

How is similarity between documents measured for clustering?
How do flat methods such as k-means differ from hierarchical agglomerative clustering?
How is the number of clusters chosen?
How is cluster quality evaluated without ground-truth labels?
What does the cluster hypothesis imply for retrieval?

Key concepts

unsupervised clustering
document similarity (cosine)
k-means clustering
hierarchical agglomerative clustering
criterion functions
cluster hypothesis
internal and external cluster evaluation
search-results clustering

Key theories

Cluster hypothesis: Documents that are relevant to the same query tend to be similar to each other, so clustering can group relevant documents together, motivating cluster-based retrieval and results organization.
Flat and hierarchical clustering: Flat methods such as k-means partition documents into a chosen number of clusters by optimizing a criterion function, while hierarchical agglomerative methods build a nested tree of clusters by repeatedly merging the most similar pair. The choice of criterion function strongly affects document-clustering quality.
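The contrast between flat and hierarchical clustering can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes dense toy vectors, Euclidean distance, caller-supplied seed indices for k-means, and average linkage for the agglomerative step. None of these choices comes from the text above, and all function names are hypothetical.

```python
import math


def kmeans(vectors, seeds, max_iters=20):
    """Flat clustering: alternate nearest-centroid assignment and centroid update."""
    # Assumption: initial centroids are copies of the vectors at the given seed indices.
    centroids = [list(vectors[i]) for i in seeds]
    labels = None
    for _ in range(max_iters):
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable, so the procedure has converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def agglomerative(vectors):
    """Hierarchical clustering: repeatedly merge the two closest clusters (average link)."""
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                # Average-link distance: mean pairwise distance between members.
                d = sum(math.dist(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        merges.append(sorted(merged))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)] + [merged]
    return merges


# Toy document vectors (invented): two about one topic, two about another.
vectors = [[5, 1, 0], [4, 0, 1], [0, 1, 5], [1, 1, 4]]
flat_labels = kmeans(vectors, seeds=[0, 2])
merges = agglomerative(vectors)
print(flat_labels)
print(merges)
```

The flat method returns a single label per document for the chosen number of clusters, whereas the agglomerative method returns the full merge history, from which a partition at any level can be read off by cutting the tree.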

Practical relevance

Clustering supports exploring and organizing large document sets: grouping search results by subtopic, deduplicating and organizing news, structuring digital libraries, and providing overviews for exploratory search. The cluster hypothesis also informs retrieval methods that exploit document similarity.
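One simple way to probe whether the cluster hypothesis holds for a given query is to check how often the nearest neighbours of a relevant document are themselves relevant. The sketch below does this on invented toy data; the vectors, the relevance judgments, the neighbourhood size, and the function names are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def neighbour_relevance_rate(vectors, relevant, n_neighbours=2):
    """Mean fraction of each relevant document's nearest neighbours that are relevant."""
    rates = []
    for i in sorted(relevant):
        others = [j for j in range(len(vectors)) if j != i]
        # Rank the other documents by decreasing similarity to document i.
        others.sort(key=lambda j: cosine(vectors[i], vectors[j]), reverse=True)
        nearest = others[:n_neighbours]
        rates.append(sum(j in relevant for j in nearest) / len(nearest))
    return sum(rates) / len(rates)


# Toy data (invented): documents 0-2 are judged relevant to some query.
vectors = [[5, 1, 0], [4, 0, 1], [4, 1, 0], [0, 1, 5], [1, 1, 4], [0, 0, 3]]
relevant = {0, 1, 2}
rate = neighbour_relevance_rate(vectors, relevant, n_neighbours=2)
print(rate)
```

A rate well above the share of relevant documents in the collection suggests that relevant documents sit close together, which is the situation in which cluster-based retrieval and results clustering are expected to help.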

History

Clustering was applied to retrieval early, with van Rijsbergen articulating the cluster hypothesis in the 1970s as a rationale for cluster-based retrieval. As collections grew, scalable methods such as k-means and its bisecting variants became standard, as did careful comparisons of clustering criteria. Search-results clustering later emerged as a way to organize web search output.

Key figures

C. J. van Rijsbergen
George Karypis
Christopher Manning

Seminal works

vanrijsbergen1979
manning2008
zhao2004

Frequently asked questions

What is the cluster hypothesis? The cluster hypothesis states that documents relevant to the same information need tend to be similar to one another. If true, grouping similar documents brings relevant ones together, which can be exploited to improve or organize retrieval results.
How do you evaluate clustering when there are no labels? Internal measures assess cluster cohesion and separation directly from the data, so they need no labels. When a reference categorization is available, external measures compare the clusters against it. Both are used in practice, since clustering is unsupervised and 'correctness' depends on the intended purpose.
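To make the two kinds of evaluation concrete, the sketch below computes one internal measure (the mean silhouette coefficient) and one external measure (purity) for two candidate clusterings of the same toy data. The choice of these two measures, the use of Euclidean distance, the toy vectors, and the gold labels are assumptions for illustration; other measures are equally common.

```python
import math
from collections import Counter


def purity(labels, gold):
    """External measure: share of documents in the majority gold class of their cluster."""
    total = 0
    for cluster in set(labels):
        classes = [g for lab, g in zip(labels, gold) if lab == cluster]
        total += Counter(classes).most_common(1)[0][1]
    return total / len(labels)


def _mean_dist(vectors, labels, i, cluster):
    """Mean distance from document i to the other members of the given cluster."""
    members = [w for j, w in enumerate(vectors) if labels[j] == cluster and j != i]
    if not members:
        return None
    return sum(math.dist(vectors[i], w) for w in members) / len(members)


def silhouette(vectors, labels):
    """Internal measure: mean silhouette coefficient (needs at least two clusters)."""
    scores = []
    for i in range(len(vectors)):
        a = _mean_dist(vectors, labels, i, labels[i])  # cohesion
        if a is None:  # singleton cluster: score conventionally set to 0
            scores.append(0.0)
            continue
        # Separation: mean distance to the closest other cluster.
        b = min(_mean_dist(vectors, labels, i, c) for c in set(labels) if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (invented): four documents, two candidate clusterings, and gold classes.
vectors = [[5, 1, 0], [4, 0, 1], [0, 1, 5], [1, 1, 4]]
gold = ["finance", "finance", "sport", "sport"]
good = [0, 0, 1, 1]
bad = [0, 1, 0, 1]

print(silhouette(vectors, good), purity(good, gold))
print(silhouette(vectors, bad), purity(bad, gold))
```

On this data both measures prefer the first clustering, but they need not agree in general: the silhouette only reflects geometry under the chosen distance, while purity reflects agreement with one particular categorization.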