Model-Based Clustering

Model-based clustering treats the data as arising from a finite mixture of probability distributions, with each component representing a cluster, and estimates the model by maximum likelihood.

Definition

Model-based clustering is an approach that models the population as a finite mixture of component distributions, estimates the posterior probability that each observation belongs to each component, and identifies the clusters with the estimated mixture components. When a hard partition is needed, each observation is assigned to its most probable component.
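
The definition can be written compactly in symbols. The notation below is an assumption introduced for illustration and does not come from the source: K components, mixing proportions \pi_k, component densities f_k with parameters \theta_k, posterior membership probabilities \tau_{ik}, and hard labels \hat{z}_i.

```latex
\begin{align}
f(x) &= \sum_{k=1}^{K} \pi_k \, f_k(x \mid \theta_k),
  \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1, \\
\tau_{ik} &= \frac{\pi_k \, f_k(x_i \mid \theta_k)}
                  {\sum_{j=1}^{K} \pi_j \, f_j(x_i \mid \theta_j)}, \\
\hat{z}_i &= \arg\max_{k} \, \tau_{ik}.
\end{align}
```

The first line is the mixture density, the second is the posterior probability that observation x_i belongs to component k, and the third is the hard assignment obtained by taking the most probable component.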

Scope

This topic covers finite mixture models, most commonly Gaussian mixtures, the expectation-maximization algorithm for estimating mixture parameters and posterior cluster memberships, parameterizations of component covariances that control cluster shape and orientation, and the use of information criteria to select the number of components.
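
To make the role of covariance parameterizations concrete, the sketch below counts the free parameters of a Gaussian mixture under four common covariance structures. The structure labels ("spherical", "diag", "tied", "full") are borrowed from scikit-learn's `covariance_type` option purely as convenient names, and the function names are hypothetical. More constrained structures restrict cluster shape and orientation but need far fewer parameters.

```python
def covariance_param_count(n_components, n_features, structure):
    """Free covariance parameters for a Gaussian mixture (illustrative helper)."""
    k, d = n_components, n_features
    per_full_matrix = d * (d + 1) // 2  # entries of one symmetric d x d matrix
    counts = {
        "spherical": k,            # one variance per component: round clusters
        "diag": k * d,             # axis-aligned ellipsoids
        "tied": per_full_matrix,   # one full matrix shared by all components
        "full": k * per_full_matrix,  # free shape and orientation per component
    }
    return counts[structure]


def total_param_count(n_components, n_features, structure):
    """Mixing proportions + means + covariance parameters."""
    k, d = n_components, n_features
    return (k - 1) + k * d + covariance_param_count(k, d, structure)


# Example: 3 components in 4 dimensions.
for name in ("spherical", "diag", "tied", "full"):
    print(name, total_param_count(3, 4, name))
```

The parameter totals matter later for model selection, because information criteria penalize each additional free parameter.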

Core questions

How can clustering be framed as a statistical estimation problem?
How are mixture parameters and soft cluster memberships estimated?
How do covariance parameterizations control the geometry of clusters?
How is the number of mixture components selected?

Key theories

Finite mixture formulation: Each observation is assumed drawn from one of several component distributions with unknown mixing proportions, so clustering reduces to estimating the components and assigning posterior membership probabilities.
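Expectation-maximization estimation: Treating cluster labels as missing data, the EM algorithm alternates between computing expected memberships (the E-step) and re-estimating component parameters (the M-step). Each iteration cannot decrease the likelihood, but EM is only guaranteed to reach a local maximum or stationary point of the mixture likelihood, so the fit can depend on initialization.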
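
The two ideas above can be sketched together as a small EM routine for a one-dimensional Gaussian mixture. This is a minimal illustration under stated assumptions, not a reference implementation: the function names, the quantile-based initialization, the variance floor, and the synthetic data are all choices made here, and only the Python standard library is used.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def e_step(data, weights, means, variances):
    """Return posterior memberships and the log-likelihood of the current fit."""
    resp, loglik = [], 0.0
    for x in data:
        joint = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)  # mixture density at x
        loglik += math.log(total)
        resp.append([p / total for p in joint])
    return resp, loglik


def m_step(data, resp, min_var=1e-6):
    """Re-estimate weights, means, and variances from the soft memberships."""
    n, k = len(data), len(resp[0])
    weights, means, variances = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)  # effective number of points in component j
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        weights.append(nj / n)
        means.append(mean)
        variances.append(max(var, min_var))  # floor guards against collapse
    return weights, means, variances


def fit_gmm_1d(data, k, max_iter=500, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative sketch)."""
    n, ordered = len(data), sorted(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    # Assumed initialization: equal weights, means at spread-out quantiles.
    weights = [1.0 / k] * k
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [grand_var] * k
    resp, loglik = e_step(data, weights, means, variances)
    for _ in range(max_iter):
        weights, means, variances = m_step(data, resp)
        resp, new_loglik = e_step(data, weights, means, variances)
        converged = new_loglik - loglik < tol  # likelihood has stabilized
        loglik = new_loglik
        if converged:
            break
    return weights, means, variances, resp, loglik


# Synthetic data: two well-separated groups (an assumption for the demo).
random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(200)]
data += [random.gauss(8.0, 1.0) for _ in range(200)]
weights, means, variances, resp, loglik = fit_gmm_1d(data, k=2)
print("weights:", weights)
print("means:", means)
print("variances:", variances)
```

Each row of `resp` holds the soft memberships of one observation and sums to one. A production implementation would normally work in log space to avoid underflow and would try several starting points, since EM can stop at a local maximum.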

Clinical relevance

Model-based clustering provides probabilistic cluster assignments and principled model selection, and is applied in density estimation, classification of subpopulations, and settings where overlapping or differently shaped clusters require a statistical model.

History

Finite mixture models date back at least to Karl Pearson's 1894 analysis of a two-component normal mixture, but their use as a clustering framework expanded with the expectation-maximization algorithm, formalized by Dempster, Laird, and Rubin in 1977, and with covariance parameterizations and model-selection criteria that made Gaussian mixture clustering practical and widely available.

Debates

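Selecting the number of components: Information criteria such as the Bayesian information criterion (BIC) are commonly used to choose the number of mixture components, but the choice can be sensitive to model assumptions. When a cluster is not well described by a single component distribution, several components may be needed to fit it, so the selected number of components can exceed the number of meaningful clusters. Heavily overlapping components are also hard to distinguish from a single broader one.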
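
For reference, the criterion under discussion can be written as follows. The symbols are assumptions introduced here: n observations in d dimensions, a K-component model with maximized likelihood L(\hat{\theta}_K), and p_K free parameters. The parameter count shown is for a Gaussian mixture with unconstrained covariance matrices.

```latex
\begin{align}
\mathrm{BIC}(K) &= -2 \log L(\hat{\theta}_K) + p_K \log n, \\
p_K &= (K - 1) + K d + K \, \frac{d(d+1)}{2}.
\end{align}
```

Under this sign convention the preferred model has the smallest BIC. Some software, for example the mclust package for R, reports the negated quantity 2 \log L - p_K \log n, in which case larger values are preferred.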

Key figures

Geoffrey McLachlan
Adrian Raftery
Chris Fraley

Seminal works

mclachlan2000
fraley2002
hastie2009

Frequently asked questions

How does model-based clustering differ from k-means?: K-means makes hard assignments minimizing squared distance and implicitly assumes spherical clusters, whereas model-based clustering fits a probability mixture, gives soft memberships, and can model clusters of different shapes, sizes, and orientations.
What does the EM algorithm do here?: It iteratively estimates the probability that each observation belongs to each cluster and then updates the cluster distributions, repeating until the mixture likelihood stabilizes.
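
The contrast with k-means in the first answer above can be illustrated with a single point and two clusters of very different spread. All numbers below are made up for illustration: a tight cluster centred at 0, a wide cluster centred at 10, equal mixing proportions, and a point at 4. The nearest-centre rule, which is how k-means assigns points, and the mixture posterior disagree about where the point belongs.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


# Hypothetical cluster descriptions, chosen only for this example.
centers = [0.0, 10.0]
sds = [1.0, 5.0]        # cluster 0 is tight, cluster 1 is wide
weights = [0.5, 0.5]
x = 4.0

# k-means style hard assignment: nearest centre by squared distance.
nearest = min(range(2), key=lambda j: (x - centers[j]) ** 2)

# Mixture-model soft assignment: posterior probability of each component.
joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, centers, sds)]
posterior = [p / sum(joint) for p in joint]
most_probable = max(range(2), key=lambda j: posterior[j])

print("nearest centre:", nearest)
print("posterior memberships:", posterior)
print("most probable component:", most_probable)
```

The point is closer to the centre of the tight cluster, yet it lies four standard deviations from that centre and only 1.2 standard deviations from the centre of the wide one, so the posterior strongly favours the wide cluster.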