ScholarGate

Cluster Analysis

Cluster analysis groups multivariate observations, without using predefined labels, so that members of a cluster are more similar to one another than to members of other clusters.

Definition

Cluster analysis is the unsupervised partitioning or hierarchical organization of objects into groups based on a measure of similarity or dissimilarity, with the groups discovered from the data rather than specified in advance.

Scope

This area covers the unsupervised grouping of data. It includes hierarchical methods that build a nested tree of clusters, partitioning methods such as k-means that optimize a within-cluster criterion for a fixed number of clusters, and model-based methods that treat clusters as components of a mixture distribution. It also addresses the choice of distance measure, linkage rule, and number of clusters, as well as the validation of clustering solutions.
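
As an illustration of the partitioning approach, the following minimal Python sketch alternates the two steps usually attributed to Lloyd's algorithm for k-means: assign each point to its nearest center, then move each center to the mean of its members. The function name `kmeans`, the toy points, and the hand-picked initial centers are assumptions made for this example and are not taken from any particular library.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centers, max_iter=100):
    """Plain k-means sketch; returns (labels, centers, within-cluster SS)."""
    centers = [tuple(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:  # no change in assignments: converged
            break
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Within-cluster sum of squares, the criterion k-means tries to reduce.
    wcss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wcss


# Hypothetical toy data: two visibly separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centers, wcss = kmeans(points, initial_centers=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
print(centers, wcss)
```

The result depends on the initial centers; practical implementations typically run several random starts and keep the solution with the smallest within-cluster sum of squares.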

Core questions

  • How can natural groupings be discovered in unlabeled multivariate data?
  • How are similarity and dissimilarity defined for the objects?
  • How many clusters are present, and how is that number chosen?
  • How is a clustering solution validated and interpreted?

Key theories

Distance-based grouping
Most clustering methods rest on a dissimilarity measure between objects and a rule, such as a linkage or a within-cluster sum of squares, that turns those dissimilarities into groups.
Mixture-model view of clusters
Model-based clustering regards each cluster as a component of a probability mixture, so that clustering becomes parameter estimation and the number of clusters becomes a model-selection problem.
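
To make the mixture-model view concrete, the sketch below fits one-dimensional Gaussian mixtures by expectation-maximization and compares one and two components using the Bayesian information criterion, written here as -2 log L + p log n so that lower values are preferred. The function name `fit_mixture`, the quantile-based starting means, the variance floor, and the toy data are all assumptions of this example; it is an illustration of the idea, not a robust implementation.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture(data, k, n_iter=200, var_floor=1e-6):
    """EM for a k-component 1-D Gaussian mixture (illustrative sketch)."""
    n = len(data)
    ordered = sorted(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    # Assumed initialization: means at evenly spaced sample quantiles.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            joint = [w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: weighted updates of weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_floor, spread)  # guard against collapse
    log_lik = sum(
        math.log(sum(w * normal_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )
    n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
    bic = -2.0 * log_lik + n_params * math.log(n)
    return {"weights": weights, "means": means, "variances": variances,
            "log_lik": log_lik, "bic": bic}


# Hypothetical toy data with two well-separated groups.
data = [0.0, 0.2, -0.1, 0.1, -0.2, 0.05, 5.0, 5.2, 4.9, 5.1, 4.8, 5.05]
fits = {k: fit_mixture(data, k) for k in (1, 2)}
best_k = min(fits, key=lambda k: fits[k]["bic"])
print(best_k, sorted(fits[best_k]["means"]))
```

Here choosing the number of clusters reduces to comparing the criterion across candidate values of k, which is the model-selection framing described above.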

Applications

Cluster analysis is used to discover structure in unlabeled data across fields, including market segmentation, taxonomy, gene-expression grouping, image segmentation, and the identification of patient subtypes.

History

Numerical clustering grew out of mid-twentieth-century numerical taxonomy and was systematized into hierarchical and partitioning algorithms. Probabilistic model-based clustering, built on finite mixture models and the expectation-maximization algorithm, later placed the field on a likelihood footing.

Debates

Determining the number of clusters
There is no single agreed method for choosing the number of clusters; criteria range from gap statistics and silhouette widths to information criteria for mixture models, and they can disagree.
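
One of the criteria mentioned above, the average silhouette width, can be sketched in a few lines. For each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette is (b - a) / max(a, b), with singletons scored as zero. The function name `mean_silhouette`, the toy points, and the two candidate labelings are assumptions of this example; the sketch also assumes no duplicate points, so that max(a, b) is never zero.

```python
import math


def mean_silhouette(points, labels):
    """Average silhouette width of a labeling (Euclidean distance)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical toy data and two candidate partitions of it.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
two_clusters = [0, 0, 0, 1, 1, 1]
three_clusters = [0, 0, 0, 1, 1, 2]  # splits the second group

sil_two = mean_silhouette(points, two_clusters)
sil_three = mean_silhouette(points, three_clusters)
print(sil_two, sil_three)
```

On this toy data the two-cluster labeling scores higher than the three-cluster one. On less clear-cut data, a different criterion applied to the same partitions may rank them differently.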

Key figures

  • Leonard Kaufman
  • Peter Rousseeuw
  • Brian Everitt

Seminal works

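  • Everitt, B. S., Landau, S., Leese, M., & Stahl, D. (2011). Cluster Analysis (5th ed.). Wiley. [everitt2011]
  • Kaufman, L., & Rousseeuw, P. J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley. [kaufman1990]
  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. [hastie2009]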

Frequently asked questions

How is clustering different from classification?
Clustering is unsupervised and discovers groups from unlabeled data, whereas classification is supervised and assigns observations to groups that are known and labeled in advance.
Does clustering always find meaningful groups?
No. Clustering algorithms will partition any dataset, so solutions must be validated and interpreted; apparent clusters may reflect the method or distance choice rather than genuine structure.
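
The dependence on method choice can be seen in a small sketch of agglomerative clustering, in which the same one-dimensional values are cut into two clusters under single linkage and under complete linkage. The function name `agglomerate` and the toy values, a chain of points with slowly widening gaps and no obvious group structure, are assumptions made for this example.

```python
def agglomerate(values, n_clusters, linkage):
    """Naive agglomerative clustering of 1-D values (illustrative sketch)."""
    if linkage not in ("single", "complete"):
        raise ValueError("linkage must be 'single' or 'complete'")
    clusters = [[i] for i in range(len(values))]  # start from singletons

    def cluster_distance(a, b):
        pairwise = [abs(values[i] - values[j]) for i in a for j in b]
        # Single linkage uses the closest pair, complete the farthest pair.
        return min(pairwise) if linkage == "single" else max(pairwise)

    while len(clusters) > n_clusters:
        # Find and merge the pair of clusters with the smallest distance.
        _, p, q = min(
            (cluster_distance(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(values[i] for i in c) for c in clusters]


# Hypothetical chain of values with gradually increasing gaps.
values = [0.0, 1.0, 2.1, 3.3, 4.6, 6.0]
single = agglomerate(values, 2, "single")
complete = agglomerate(values, 2, "complete")
print(single)    # [[0.0, 1.0, 2.1, 3.3, 4.6], [6.0]]
print(complete)  # [[0.0, 1.0, 2.1, 3.3], [4.6, 6.0]]
```

Both runs return two clusters, yet they disagree on where the boundary lies, and neither boundary corresponds to a pronounced gap in the data.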
