Cluster Analysis

Cluster analysis groups multivariate observations into clusters so that members of a cluster are more similar to one another than to members of other clusters, without predefined labels.

Definition

Cluster analysis is the unsupervised partitioning or hierarchical organization of objects into groups based on a measure of similarity or dissimilarity, with the groups discovered from the data rather than specified in advance.
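As an illustration of the hierarchical case, the following is a minimal sketch of agglomerative clustering with single linkage and Euclidean distance. The function name `single_linkage`, the toy data, and the choice of linkage are assumptions made for this example; the naive search over all cluster pairs is only practical for small datasets.

```python
import math


def single_linkage(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (naive version)."""
    # Start with every point in its own cluster, stored as lists of indices.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair across two clusters.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Merge cluster j into cluster i and drop the now-redundant entry.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical toy data: two well-separated groups of three points each.
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
groups = single_linkage(toy_points, n_clusters=2)
print(groups)
```

Recording the order and height of each merge, rather than stopping at a fixed number of clusters, would give the nested tree (dendrogram) that hierarchical methods are usually summarized by.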

Scope

This area covers unsupervised grouping of data. It includes hierarchical methods that build a nested tree of clusters, partitioning methods such as k-means that optimize a within-cluster criterion for a fixed number of clusters, and model-based methods that treat clusters as components of a mixture distribution. It also addresses choice of distance, linkage, and the number of clusters, and the validation of clustering solutions.
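The partitioning approach can be sketched with Lloyd's algorithm, the standard iterative heuristic for k-means. The version below is a minimal illustration under stated assumptions: the function name `kmeans`, the explicit `init_centroids` argument (used here in place of random initialization so the run is reproducible), and the toy data are all choices made for this example.

```python
import math


def kmeans(points, init_centroids, n_iter=100):
    """Lloyd's algorithm: alternate an assignment step and a centroid update."""
    centroids = list(init_centroids)
    k = len(centroids)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Within-cluster sum of squares: the criterion that k-means tries to reduce.
    wcss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, wcss


# Hypothetical toy data: two compact groups; one starting centroid taken from each.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids, wcss = kmeans(toy_points, init_centroids=[toy_points[0], toy_points[3]])
print(labels, round(wcss, 3))
```

Lloyd's algorithm only reaches a local optimum of the criterion, so in practice it is usually restarted from several initializations and the solution with the lowest within-cluster sum of squares is kept.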

Core questions

How can natural groupings be discovered in unlabeled multivariate data?
How are similarity and dissimilarity defined for the objects?
How many clusters are present, and how is that number chosen?
How is a clustering solution validated and interpreted?

Key theories

Distance-based grouping: Most clustering methods rest on a dissimilarity measure between objects and a rule, such as a linkage or a within-cluster sum of squares, that turns those dissimilarities into groups.
Mixture-model view of clusters: Model-based clustering regards each cluster as a component of a probability mixture, so that clustering becomes parameter estimation and the number of clusters becomes a model-selection problem.
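
In symbols, the two views can be written as follows. The notation is an assumption introduced for this illustration: x_i are the observations, C_k the clusters, mu_k the cluster means, pi_k the mixing proportions, f the component density with parameters theta_k, L-hat the maximized likelihood, p the number of free parameters, and n the sample size.

```latex
% Distance-based view: k-means seeks a partition that minimizes the
% within-cluster sum of squares for a fixed number of clusters K.
W(C_1,\dots,C_K) = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad \mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i .

% Mixture-model view: each cluster is one component of a finite mixture.
p(x) = \sum_{k=1}^{K} \pi_k \, f(x \mid \theta_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1 .

% Observations are assigned through posterior membership probabilities.
\tau_{ik} = \frac{\pi_k \, f(x_i \mid \theta_k)}{\sum_{j=1}^{K} \pi_j \, f(x_i \mid \theta_j)} .

% One common model-selection criterion for K (smaller is better in this form).
\mathrm{BIC}(K) = -2 \log \hat{L}_K + p_K \log n .
```

The parameters of the mixture are typically estimated by maximum likelihood, and fitted models with different values of K are then compared using a criterion such as the one above.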

Applications

Cluster analysis is used to discover structure in unlabeled data across fields, including market segmentation, taxonomy, gene-expression grouping, image segmentation, and the identification of patient subtypes.

History

Numerical clustering grew out of mid-twentieth-century numerical taxonomy and was systematized into hierarchical and partitioning algorithms. Probabilistic model-based clustering, built on finite mixture models and the expectation-maximization algorithm, later placed the field on a likelihood footing.

Debates

Determining the number of clusters: There is no single agreed method for choosing the number of clusters. Criteria range from the gap statistic and the average silhouette width to information criteria such as BIC for mixture models, and different criteria can point to different numbers of clusters for the same data.
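
One of the criteria named above, the average silhouette width, can be sketched as follows. For each point it compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b) and scores (b - a) / max(a, b). The function name `mean_silhouette`, the Euclidean distance, the toy data, and the two hand-written labelings are assumptions made for this example.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette width of a labeling; assumes at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself adds nothing to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)


# Hypothetical toy data with two compact groups.
toy_points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels_two = [0, 0, 0, 1, 1, 1]    # one label per visible group
labels_three = [0, 0, 1, 2, 2, 2]  # artificially splits the first group
s_two = mean_silhouette(toy_points, labels_two)
s_three = mean_silhouette(toy_points, labels_three)
print(round(s_two, 3), round(s_three, 3))
```

On this toy data the two-cluster labeling scores higher than the three-cluster one, which is how the criterion is used to compare candidate numbers of clusters. It remains one heuristic among several and need not agree with the others.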

Key figures

Leonard Kaufman
Peter Rousseeuw
Brian Everitt

Seminal works

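Everitt, B. S., Landau, S., Leese, M., and Stahl, D. (2011). Cluster Analysis, 5th ed. Wiley. [everitt2011]
Kaufman, L., and Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley. [kaufman1990]
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. [hastie2009]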

Frequently asked questions

How is clustering different from classification?: Clustering is unsupervised and discovers groups from unlabeled data, whereas classification is supervised and assigns observations to groups that are known and labeled in advance.
Does clustering always find meaningful groups?: No. Clustering algorithms will partition any dataset, so solutions must be validated and interpreted; apparent clusters may reflect the method or distance choice rather than genuine structure.