Model-Based Clustering
Model-based clustering treats the data as arising from a finite mixture of probability distributions, with each component representing a cluster, and estimates the model parameters, typically by maximum likelihood.
Definition
Model-based clustering is an approach that models the population as a mixture of component distributions, assigns each observation a posterior probability of belonging to each component, and thereby derives clusters as the estimated mixture components.
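As a sketch of the definition above in symbols (the notation is an assumption introduced here for illustration: K components, mixing proportions \pi_k, component densities f_k with parameters \theta_k, and posterior memberships \tau_{ik}), a finite mixture and its posterior membership probabilities can be written as:

```latex
% Mixture density for an observation x
f(x) = \sum_{k=1}^{K} \pi_k \, f_k(x \mid \theta_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1

% Posterior probability that observation x_i belongs to component k (Bayes' rule)
\tau_{ik} = \frac{\pi_k \, f_k(x_i \mid \theta_k)}
                 {\sum_{j=1}^{K} \pi_j \, f_j(x_i \mid \theta_j)}
```

A hard clustering, when one is needed, is usually obtained by assigning each observation to the component with the largest \tau_{ik}.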
Scope
This topic covers finite mixture models, most commonly Gaussian mixtures, the expectation-maximization algorithm for estimating mixture parameters and posterior cluster memberships, parameterizations of component covariances that control cluster shape and orientation, and the use of information criteria to select the number of components.
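One common way to write the covariance parameterizations mentioned above for Gaussian mixtures is through an eigenvalue decomposition of each component covariance matrix. The symbols below are notation assumed here for illustration:

```latex
% Covariance of component k, decomposed into volume, orientation, and shape
\Sigma_k = \lambda_k \, D_k \, A_k \, D_k^{\top}

% \lambda_k > 0 : scalar controlling the volume of the cluster
% D_k           : orthogonal matrix of eigenvectors, controlling orientation
% A_k           : diagonal matrix with \det(A_k) = 1, controlling shape
```

Constraining some of these factors to be equal across components, or fixing A_k to the identity for spherical clusters, yields a family of models that trade flexibility against the number of parameters to estimate.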
Core questions
- How can clustering be framed as a statistical estimation problem?
- How are mixture parameters and soft cluster memberships estimated?
- How do covariance parameterizations control the geometry of clusters?
- How is the number of mixture components selected?
Key theories
- Finite mixture formulation
  - Each observation is assumed to be drawn from one of several component distributions with unknown mixing proportions, so clustering reduces to estimating the components and assigning posterior membership probabilities.
- Expectation-maximization estimation
  - Treating cluster labels as missing data, the EM algorithm alternates between computing expected memberships (E-step) and re-estimating component parameters (M-step). Each iteration does not decrease the likelihood, and the algorithm typically converges to a local maximum of the mixture likelihood, so the fit can depend on the starting values.
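The two alternating steps described above can be sketched for the simplest case, a univariate Gaussian mixture, using only the Python standard library. This is a minimal illustration under stated assumptions, not a reference implementation. The function names (`normal_pdf`, `em_gmm_1d`), the variance floor, the stopping rule, and the simulated data are all choices made here for the example.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(data, init_means, n_iter=200, tol=1e-8, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from the given initial means.

    Returns (weights, means, variances, responsibilities, log_likelihood).
    The responsibilities and log-likelihood are those computed in the last
    E-step, i.e. before the final M-step update.
    """
    n, k = len(data), len(init_means)
    means = list(init_means)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k   # start every component at the pooled variance
    weights = [1.0 / k] * k         # start with equal mixing proportions
    prev_ll = -math.inf
    resp, ll = [], prev_ll

    for _ in range(n_iter):
        # E-step: posterior membership probabilities under the current parameters.
        resp, ll = [], 0.0
        for x in data:
            joint = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([p / total for p in joint])

        # M-step: weighted updates of proportions, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, var_floor)  # guard against collapsing components

        # Stop when the log-likelihood has stabilized.
        if ll - prev_ll < tol:
            break
        prev_ll = ll

    return weights, means, variances, resp, ll


# Simulated data for illustration only: two well-separated normal clusters.
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)] + [random.gauss(6.0, 1.0) for _ in range(200)]
weights, means, variances, resp, log_lik = em_gmm_1d(data, [min(data), max(data)])
```

Because EM can stop at a local maximum, practical use often runs it from several starting values and keeps the fit with the highest log-likelihood. The sketch omits that step, as well as safeguards for numerical underflow and empty components.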
Clinical relevance
Model-based clustering provides probabilistic cluster assignments and principled model selection, and is applied in density estimation, classification of subpopulations, and settings where overlapping or differently shaped clusters require a statistical model.
History
Finite mixture models have a long statistical history, but their use as a clustering framework expanded with the expectation-maximization algorithm and with covariance parameterizations and model-selection criteria that made Gaussian mixture clustering practical and widely available.
Debates
- Selecting the number of components
  - Information criteria such as the Bayesian information criterion (BIC) are commonly used to choose the number of mixture components, but the choice can be sensitive to violations of the component model (for example, a single non-Gaussian cluster may be fitted by several Gaussian components) and to strongly overlapping components.
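To make the selection rule concrete, the sketch below computes BIC for Gaussian mixtures from a fitted log-likelihood and a parameter count. The helper names (`n_free_parameters`, `bic`, `select_n_components`) and the three covariance labels are assumptions made for this example. The formula uses the "smaller is better" convention, -2 log L + p log n; some software reports the negated quantity, in which case larger values are preferred.

```python
import math


def n_free_parameters(k, d, covariance="full"):
    """Count free parameters of a k-component Gaussian mixture in d dimensions.

    Counts k - 1 mixing proportions, k * d mean coordinates, and the
    covariance parameters of each component (no sharing across components).
    """
    per_component_cov = {
        "spherical": 1,             # one variance per component
        "diagonal": d,              # one variance per dimension
        "full": d * (d + 1) // 2,   # unrestricted symmetric matrix
    }[covariance]
    return (k - 1) + k * d + k * per_component_cov


def bic(log_likelihood, n_params, n_obs):
    """BIC in the 'smaller is better' convention."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def select_n_components(log_likelihoods, d, n_obs, covariance="full"):
    """Return the k with the smallest BIC, given a mapping {k: maximized log-likelihood}."""
    scores = {
        k: bic(ll, n_free_parameters(k, d, covariance), n_obs)
        for k, ll in log_likelihoods.items()
    }
    return min(scores, key=scores.get), scores


# Made-up log-likelihoods, for illustration only (not results from real data).
hypothetical_fits = {1: -1050.0, 2: -900.0, 3: -898.0}
best_k, scores = select_n_components(hypothetical_fits, d=1, n_obs=400)
```

In this made-up example the third component raises the log-likelihood only slightly, so the penalty term outweighs the gain and the two-component model has the smallest BIC.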
Key figures
- Geoffrey McLachlan
- Adrian Raftery
- Chris Fraley
Related topics
Seminal works
- mclachlan2000
- fraley2002
- hastie2009
Frequently asked questions
- How does model-based clustering differ from k-means?
  - K-means makes hard assignments that minimize the within-cluster sum of squared distances to the cluster centers and implicitly favors roughly spherical clusters of similar size, whereas model-based clustering fits a probability mixture, gives soft memberships, and can model clusters of different shapes, sizes, and orientations.
- What does the EM algorithm do here?
  - It iteratively estimates the probability that each observation belongs to each cluster and then updates the cluster distributions, repeating until the mixture likelihood stabilizes.
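The contrast between hard and soft assignment in the first answer can be illustrated with a small univariate sketch. The function names and all numbers are assumptions chosen for the example: two components with equal mixing proportions, one narrow (mean 0, variance 1) and one broad (mean 4, variance 9).

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def hard_assignment(x, centers):
    """K-means style rule: index of the center with the smallest squared distance."""
    return min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)


def soft_memberships(x, weights, means, variances):
    """Mixture-model rule: posterior probability of each component given x."""
    joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [p / total for p in joint]


# Illustrative parameters: a narrow component at 0 and a broad component at 4.
weights, means, variances = [0.5, 0.5], [0.0, 4.0], [1.0, 9.0]

x = -3.0
nearest = hard_assignment(x, means)                        # distance-based label
posterior = soft_memberships(x, weights, means, variances)  # probability per component
```

For the point x = -3, the nearest center is the narrow component at 0, yet the posterior probability is larger for the broad component, because a value three standard deviations from the narrow component is more plausible under the wider one. This is the sense in which a mixture model accounts for cluster size and shape where a pure distance rule does not.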