Dirichlet Process and Mixture Models
The Dirichlet process is a prior over distributions whose discreteness makes it a natural basis for mixture models that infer the number of clusters from the data.
Definition
The Dirichlet process is a stochastic process whose realizations are almost surely discrete probability measures. A Dirichlet process mixture model uses such a random measure as the mixing distribution over the parameters of a kernel, yielding a mixture with countably many potential components, of which the data determine how many are occupied.
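The definition can be written out in symbols. The notation below is an assumption introduced for illustration and does not come from the source: \alpha > 0 is the concentration parameter, G_0 the base measure, F(\cdot \mid \theta) the kernel, and (A_1, \dots, A_k) any finite measurable partition of the parameter space.

```latex
% Defining property of the Dirichlet process (finite-partition marginals)
G \sim \mathrm{DP}(\alpha, G_0)
\iff
\bigl(G(A_1), \dots, G(A_k)\bigr) \sim
\mathrm{Dir}\bigl(\alpha G_0(A_1), \dots, \alpha G_0(A_k)\bigr)
\quad \text{for every finite measurable partition } (A_1, \dots, A_k).

% Dirichlet process mixture model as a hierarchy
G \sim \mathrm{DP}(\alpha, G_0), \qquad
\theta_i \mid G \overset{\text{iid}}{\sim} G, \qquad
x_i \mid \theta_i \sim F(\cdot \mid \theta_i), \qquad i = 1, \dots, n.

% Because G is almost surely discrete, G = \sum_k \pi_k \delta_{\theta_k^*},
% the induced density is a countable mixture
f(x) = \int f(x \mid \theta)\, dG(\theta) = \sum_{k=1}^{\infty} \pi_k\, f(x \mid \theta_k^*).
```

Since G is discrete, several \theta_i coincide with positive probability, and observations that share a value form a cluster.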
Scope
This topic covers the Dirichlet process and its concentration parameter and base measure, the Polya urn and Chinese restaurant process representations, the clustering they induce, and the Dirichlet process mixture model used for density estimation and clustering with an unbounded number of components.
Core questions
- What are the concentration parameter and base measure of a Dirichlet process?
- How do the Polya urn and Chinese restaurant process describe its clustering?
- How does a Dirichlet process mixture infer the number of clusters?
- How is posterior inference for these models carried out?
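As a partial illustration of the second question, the following is a minimal sketch of the Chinese restaurant process seating rule, assuming the standard formulation in which customer i + 1 joins an existing table with probability proportional to its size and opens a new table with probability proportional to the concentration parameter. The function name `sample_crp` and its arguments are hypothetical, chosen only for this example.

```python
import random


def sample_crp(n, alpha, rng):
    """Seat n customers by the Chinese restaurant process.

    Returns (assignments, table_sizes), where assignments[i] is the table
    index of customer i and table_sizes[k] is the number seated at table k.
    """
    table_sizes = []
    assignments = []
    for i in range(n):  # i customers are already seated
        # Existing table k: probability table_sizes[k] / (alpha + i).
        # New table:        probability alpha / (alpha + i).
        weights = table_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(table_sizes):
            table_sizes.append(1)   # open a new table
        else:
            table_sizes[k] += 1     # join an occupied table
        assignments.append(k)
    return assignments, table_sizes


rng = random.Random(0)
assignments, table_sizes = sample_crp(n=200, alpha=2.0, rng=rng)
print("occupied tables:", len(table_sizes))
print("table sizes:", sorted(table_sizes, reverse=True))
```

Larger tables attract new customers more strongly, so a draw typically shows a few large clusters and many small ones. Raising `alpha` tends to open more tables.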
Key concepts
- Dirichlet process
- concentration parameter
- base measure
- Chinese restaurant process
- Polya urn scheme
- infinite mixture model
- clustering
Key theories
- Dirichlet process
- Ferguson defined the Dirichlet process so that the probabilities it assigns to the sets of any finite measurable partition are jointly Dirichlet-distributed, giving a conjugate, almost surely discrete prior over distributions.
- Dirichlet process mixtures
- Mixing a continuous kernel over a Dirichlet-process-distributed measure yields flexible density estimates and clustering with an unbounded number of components, with posterior inference commonly carried out by Gibbs sampling.
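The conjugacy of the Dirichlet process can be stated as a worked formula. The symbols are assumptions introduced for illustration: \alpha is the concentration parameter, G_0 the base measure, \theta_1, \dots, \theta_n are draws from G, and \delta_\theta is a point mass at \theta.

```latex
% Posterior of G after observing draws from it: again a Dirichlet process
G \sim \mathrm{DP}(\alpha, G_0), \quad
\theta_1, \dots, \theta_n \mid G \overset{\text{iid}}{\sim} G
\;\Longrightarrow\;
G \mid \theta_1, \dots, \theta_n \sim
\mathrm{DP}\!\left(\alpha + n,\;
\frac{\alpha G_0 + \sum_{i=1}^{n} \delta_{\theta_i}}{\alpha + n}\right).

% Integrating G out gives the Polya urn predictive rule
\theta_{n+1} \mid \theta_1, \dots, \theta_n \sim
\frac{\alpha}{\alpha + n}\, G_0
+ \frac{1}{\alpha + n} \sum_{i=1}^{n} \delta_{\theta_i}.
```

The predictive rule shows where the clustering comes from: a new draw repeats an earlier value with probability n / (\alpha + n) and is a fresh draw from G_0 otherwise.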
Clinical relevance
Dirichlet process mixtures perform model-based clustering and density estimation without fixing the number of groups, which is valuable in genomics, population subtyping, and other settings where the number of clusters is unknown.
History
Ferguson defined the Dirichlet process in 1973 and Antoniak introduced mixtures of Dirichlet processes in 1974. Escobar and West's 1995 Gibbs-sampling approach made Dirichlet process mixtures a practical tool for density estimation and clustering.
Debates
- Sensitivity to the concentration parameter
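- The number of inferred clusters depends on the concentration parameter and the base measure. A priori, the expected number of occupied clusters grows roughly as the concentration parameter times the logarithm of the sample size, so prior choices materially affect clustering conclusions. Common safeguards are a sensitivity analysis over the concentration parameter or placing a prior on it.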
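The prior sensitivity can be made concrete with a small calculation. Under the Chinese restaurant process, the i-th customer (counting from zero) opens a new table with probability alpha / (alpha + i), so the prior expected number of clusters among n observations is the sum of these probabilities. The sketch below evaluates that sum. The function name `expected_num_clusters` is hypothetical, and the calculation describes the prior only, not the posterior after seeing data.

```python
def expected_num_clusters(n, alpha):
    """Prior expected number of occupied clusters among n observations.

    Sums the probability alpha / (alpha + i) that observation i
    (with i earlier observations already clustered) starts a new cluster.
    """
    return sum(alpha / (alpha + i) for i in range(n))


n = 1000
for alpha in (0.1, 1.0, 10.0):
    print(f"alpha={alpha:>4}: E[K_n] = {expected_num_clusters(n, alpha):.2f}")
```

Comparing the printed values across `alpha` shows how strongly the prior alone can shift the expected cluster count at a fixed sample size.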
Key figures
- Thomas Ferguson
- Charles Antoniak
- Michael Escobar
- Mike West
Seminal works
- ferguson1973
- escobar1995
Frequently asked questions
- How does a Dirichlet process mixture decide how many clusters there are?
- It does not fix the number of clusters; the Dirichlet process allows arbitrarily many, and the posterior, driven by the data and the concentration parameter, places probability over different numbers of occupied clusters.
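A minimal sketch of how such a posterior over the number of occupied clusters could be approximated is given below. It uses a collapsed Gibbs sampler and is not necessarily the scheme of any cited work. Everything specific here is an assumption made for illustration: one-dimensional data, a Gaussian kernel with known variance `sigma2`, a conjugate Gaussian base measure with mean `mu0` and variance `tau2`, and the hypothetical function names.

```python
import math
import random
from collections import Counter


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def predictive(x, n_k, sum_k, sigma2, mu0, tau2):
    """Posterior predictive density of x for a cluster holding n_k points
    with total sum_k (n_k = 0 gives the prior predictive under the base measure)."""
    precision = 1.0 / tau2 + n_k / sigma2
    post_mean = (mu0 / tau2 + sum_k / sigma2) / precision
    return normal_pdf(x, post_mean, 1.0 / precision + sigma2)


def gibbs_cluster_counts(data, alpha, sigma2, mu0, tau2, sweeps, burn_in, rng):
    """Tally the number of occupied clusters over post-burn-in Gibbs sweeps."""
    z = [0] * len(data)              # start with every point in one cluster
    counts = {0: len(data)}          # cluster label -> number of points
    sums = {0: sum(data)}            # cluster label -> sum of its points
    next_label = 1
    tally = Counter()
    for sweep in range(sweeps):
        for i, x in enumerate(data):
            # Remove point i from its current cluster.
            old = z[i]
            counts[old] -= 1
            sums[old] -= x
            if counts[old] == 0:
                del counts[old], sums[old]
            # Weight for each occupied cluster: size times predictive density.
            labels = list(counts)
            weights = [
                counts[c] * predictive(x, counts[c], sums[c], sigma2, mu0, tau2)
                for c in labels
            ]
            # Weight for a new cluster: alpha times prior predictive density.
            weights.append(alpha * predictive(x, 0, 0.0, sigma2, mu0, tau2))
            choice = rng.choices(range(len(weights)), weights=weights)[0]
            if choice == len(labels):
                new = next_label
                next_label += 1
                counts[new] = 0
                sums[new] = 0.0
            else:
                new = labels[choice]
            z[i] = new
            counts[new] += 1
            sums[new] += x
        if sweep >= burn_in:
            tally[len(counts)] += 1
    return tally


rng = random.Random(1)
# Synthetic data from two well-separated groups (an illustrative assumption).
data = [rng.gauss(-3.0, 1.0) for _ in range(30)] + [rng.gauss(3.0, 1.0) for _ in range(30)]
tally = gibbs_cluster_counts(data, alpha=1.0, sigma2=1.0, mu0=0.0, tau2=25.0,
                             sweeps=200, burn_in=50, rng=rng)
total = sum(tally.values())
for k in sorted(tally):
    print(f"K={k}: estimated posterior probability {tally[k] / total:.2f}")
```

The output is a distribution over cluster counts, not a single number, which matches the answer above. Rerunning with a different `alpha` or `tau2` gives a direct check of how much the prior settings move that distribution.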