Cluster Analysis
Cluster analysis groups multivariate observations into clusters so that members of a cluster are more similar to one another than to members of other clusters, without predefined labels.
Definition
Cluster analysis is the unsupervised partitioning or hierarchical organization of objects into groups based on a measure of similarity or dissimilarity, with the groups discovered from the data rather than specified in advance.
Scope
This area covers the unsupervised grouping of data. It includes hierarchical methods that build a nested tree of clusters, partitioning methods such as k-means that optimize a within-cluster criterion for a fixed number of clusters, and model-based methods that treat clusters as components of a mixture distribution. It also addresses the choice of distance measure, linkage rule, and number of clusters, as well as the validation of clustering solutions.
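As an illustration of the partitioning approach, the following is a minimal sketch of k-means using Lloyd's alternating assignment and update steps. It assumes squared Euclidean distance and random initialization from the data points; the function names (`squared_distance`, `kmeans`) and parameters (`n_iter`, `seed`) are illustrative choices, not taken from any particular library.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0):
    """Partition `points` into `k` clusters with Lloyd's algorithm.

    Returns (labels, centers, wcss), where wcss is the within-cluster
    sum of squares that the algorithm tries to reduce.
    """
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct observations chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    wcss = sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))
    return labels, centers, wcss
```

The result depends on the starting centers, so in practice the procedure is usually repeated from several random starts and the solution with the smallest within-cluster sum of squares is kept.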
Core questions
- How can natural groupings be discovered in unlabeled multivariate data?
- How are similarity and dissimilarity defined for the objects?
- How many clusters are present, and how is that number chosen?
- How is a clustering solution validated and interpreted?
Key theories
- Distance-based grouping
- Most clustering methods rest on a dissimilarity measure between objects and a rule, such as a linkage or a within-cluster sum of squares, that turns those dissimilarities into groups.
- Mixture-model view of clusters
- Model-based clustering regards each cluster as a component of a probability mixture, so that clustering becomes parameter estimation and the number of clusters becomes a model-selection problem.
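The two views above can be written compactly. The notation here is an assumption introduced for illustration: K clusters C_1, ..., C_K with means \mu_k, observations x_i, mixing proportions \pi_k, and component densities f_k with parameters \theta_k. The first expression is the within-cluster sum of squares minimized by k-means; the second is the finite mixture density whose components play the role of clusters.

```latex
% Within-cluster sum of squares (distance-based view)
W(C_1,\dots,C_K) = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad \mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i

% Finite mixture density (model-based view)
f(x) = \sum_{k=1}^{K} \pi_k \, f_k(x \mid \theta_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1
```

Under the mixture view, an observation is typically assigned to the component with the highest posterior probability, and choosing K amounts to comparing fitted models.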
Clinical relevance
Cluster analysis is used to discover structure in unlabeled data across fields, including market segmentation, taxonomy, gene-expression grouping, image segmentation, and the identification of patient subtypes.
History
Numerical clustering grew out of mid-twentieth-century numerical taxonomy and was systematized into hierarchical and partitioning algorithms. Probabilistic model-based clustering, built on finite mixture models and the expectation-maximization algorithm, later placed the field on a likelihood footing.
Debates
- Determining the number of clusters
- There is no single agreed method for choosing the number of clusters. Criteria range from gap statistics and silhouette widths to information criteria for mixture models, and different criteria can point to different numbers of clusters for the same data.
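One of the criteria mentioned above, the silhouette width, can be sketched as follows. For each observation it compares the mean distance to its own cluster, a, with the smallest mean distance to any other cluster, b, giving s = (b - a) / max(a, b). This is a minimal sketch that assumes Euclidean distance and the common convention that a singleton cluster scores 0; the function names are illustrative.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_silhouette(points, labels):
    """Average silhouette width of a clustering (higher is better)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            continue  # singleton cluster: s(i) = 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:  # guard against coincident points
            total += (b - a) / denom
    return total / len(points)
```

A typical use is to compute the average silhouette width for clusterings with different numbers of clusters and prefer the number that maximizes it, keeping in mind that other criteria may favor a different choice.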
Key figures
- Leonard Kaufman
- Peter Rousseeuw
- Brian Everitt
Seminal works
- everitt2011
- kaufman1990
- hastie2009
Frequently asked questions
- How is clustering different from classification?
- Clustering is unsupervised and discovers groups from unlabeled data, whereas classification is supervised and assigns observations to groups that are known and labeled in advance.
- Does clustering always find meaningful groups?
- No. Clustering algorithms will partition any dataset, so solutions must be validated and interpreted; apparent clusters may reflect the method or distance choice rather than genuine structure.
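The point that an algorithm partitions any dataset can be seen in a small sketch. The hypothetical helper `two_means_1d` below runs a two-cluster version of k-means on one-dimensional values, with the initial centers placed at the minimum and maximum as a simplifying assumption. Applied to an evenly spaced grid, which has no group structure at all, it still returns two tidy "clusters".

```python
def two_means_1d(values, n_iter=100):
    """Split 1-D values into two groups with a two-center k-means.

    Assumption: initial centers are the minimum and maximum value.
    """
    lo, hi = min(values), max(values)
    for _ in range(n_iter):
        # Assignment step: each value joins the nearer center (ties go left).
        left = [v for v in values if abs(v - lo) <= abs(v - hi)]
        right = [v for v in values if abs(v - lo) > abs(v - hi)]
        # Update step: centers move to the group means.
        new_lo = sum(left) / len(left)
        new_hi = sum(right) / len(right) if right else hi
        if (new_lo, new_hi) == (lo, hi):  # centers stable: converged
            break
        lo, hi = new_lo, new_hi
    return left, right


# Evenly spaced values: no gaps, no natural groups.
grid = [float(i) for i in range(10)]
left, right = two_means_1d(grid)
print(left)   # lower half of the grid
print(right)  # upper half of the grid
```

The split at the midpoint is produced by the method, not by any structure in the data, which is why a clustering solution needs validation before it is interpreted.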