Dirichlet Process and Mixture Models

The Dirichlet process is a prior over distributions whose discreteness makes it a natural basis for mixture models that infer the number of clusters from the data.

Definition

The Dirichlet process is a stochastic process whose realizations are probability measures, and those realizations are almost surely discrete. A Dirichlet process mixture model convolves such a discrete random measure with a kernel, yielding a mixture with countably many potential components, of which the data determine how many are actually occupied.
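
As an illustration, the mixture model can be written as a three-level hierarchy. The notation is an assumption introduced here rather than taken from the text above: \alpha is the concentration parameter, G_0 the base measure, k(x | \theta) the kernel, \pi_j the random weights of G, and \theta_j^* its atoms.

```latex
\begin{aligned}
G &\sim \mathrm{DP}(\alpha, G_0), \\
\theta_i \mid G &\overset{\text{iid}}{\sim} G, \qquad i = 1, \dots, n, \\
x_i \mid \theta_i &\sim k(\,\cdot \mid \theta_i),
\end{aligned}
\qquad\text{so that}\qquad
f(x \mid G) = \int k(x \mid \theta)\, dG(\theta)
            = \sum_{j=1}^{\infty} \pi_j \, k(x \mid \theta_j^*).
```

Because G is discrete, several \theta_i coincide with positive probability, and observations that share a value of \theta form a cluster.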

Scope

This topic covers the Dirichlet process and its concentration parameter and base measure, the Polya urn and Chinese restaurant process representations, the clustering they induce, and the Dirichlet process mixture model used for density estimation and clustering with an unbounded number of components.
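
The Chinese restaurant process can be sketched in a few lines of code. This is a minimal illustration rather than a reference implementation; the function name sample_crp and its arguments are hypothetical. Each new customer joins an existing table with probability proportional to the number of customers already seated there, or opens a new table with probability proportional to the concentration parameter alpha.

```python
import random


def sample_crp(n, alpha, seed=None):
    """Seat n customers by the Chinese restaurant process.

    Returns (assignments, counts): the table index of each customer
    and the number of customers at each table.
    """
    rng = random.Random(seed)
    assignments = []  # table index for each customer, in arrival order
    counts = []       # current number of customers at each table
    for i in range(n):
        # Customer i (0-based) sees i seated customers.
        # Existing table k has weight counts[k]; a new table has weight alpha.
        u = rng.random() * (i + alpha)
        table = len(counts)  # default: open a new table
        cumulative = 0.0
        for k, c in enumerate(counts):
            cumulative += c
            if u < cumulative:
                table = k
                break
        if table == len(counts):
            counts.append(1)
        else:
            counts[table] += 1
        assignments.append(table)
    return assignments, counts


if __name__ == "__main__":
    labels, sizes = sample_crp(100, alpha=2.0, seed=1)
    print("number of tables:", len(sizes))
    print("table sizes:", sizes)
```

Running the sketch with larger values of alpha tends to produce more, smaller tables, which is the clustering behaviour the concentration parameter controls.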

Core questions

  • What are the concentration parameter and base measure of a Dirichlet process?
  • How do the Polya urn and Chinese restaurant process describe its clustering?
  • How does a Dirichlet process mixture infer the number of clusters?
  • How is posterior inference for these models carried out?

Key concepts

  • Dirichlet process
  • concentration parameter
  • base measure
  • Chinese restaurant process
  • Polya urn scheme
  • infinite mixture model
  • clustering

Key theories

Dirichlet process
Ferguson defined the Dirichlet process so that the probabilities a random measure assigns to the sets of any finite measurable partition are jointly Dirichlet-distributed, giving a conjugate, almost-surely discrete prior over distributions.
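
In symbols, and with notation assumed here for illustration (\alpha for the concentration parameter, G_0 for the base measure, A_1, ..., A_k for a finite measurable partition, \delta_\theta for a point mass at \theta), the defining property, the conjugate posterior, and the resulting Polya urn predictive rule can be written as:

```latex
\begin{aligned}
\bigl(G(A_1), \dots, G(A_k)\bigr)
  &\sim \mathrm{Dirichlet}\bigl(\alpha G_0(A_1), \dots, \alpha G_0(A_k)\bigr), \\[4pt]
G \mid \theta_1, \dots, \theta_n
  &\sim \mathrm{DP}\!\left(\alpha + n,\;
     \frac{\alpha G_0 + \sum_{i=1}^{n} \delta_{\theta_i}}{\alpha + n}\right), \\[4pt]
\theta_{n+1} \mid \theta_1, \dots, \theta_n
  &\sim \frac{\alpha}{\alpha + n}\, G_0
     + \frac{1}{\alpha + n} \sum_{i=1}^{n} \delta_{\theta_i}.
\end{aligned}
```

The last line says that a new draw repeats a previously seen value with probability proportional to how often it has occurred, and is a fresh draw from G_0 with probability proportional to \alpha.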
Dirichlet process mixtures
Mixing a continuous kernel over a Dirichlet-process-distributed measure yields flexible density estimates and clustering with an unbounded number of components. Posterior inference is commonly carried out by Gibbs sampling.
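
A Gibbs sampler for such a model can be sketched as follows. This is a simplified, collapsed sampler written for illustration and is not claimed to be the exact scheme of any cited paper. It assumes a one-dimensional Gaussian kernel with known variance sigma2 and a conjugate Gaussian base measure with mean mu0 and variance tau2, so that cluster means can be integrated out. The names dpm_gibbs and log_predictive and all default values are hypothetical.

```python
import math
import random


def log_predictive(x, n_k, sum_k, sigma2, mu0, tau2):
    """Log density of x under a cluster's posterior predictive.

    The cluster mean is integrated out; n_k and sum_k are the count and
    sum of the points currently in the cluster (both 0 for a new cluster).
    """
    precision = 1.0 / tau2 + n_k / sigma2
    mean = (mu0 / tau2 + sum_k / sigma2) / precision
    var = 1.0 / precision + sigma2
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def dpm_gibbs(data, alpha, sigma2=1.0, mu0=0.0, tau2=10.0, sweeps=100, seed=None):
    """Collapsed Gibbs sampling of cluster labels in a Gaussian DP mixture.

    Returns the final labels and the number of occupied clusters per sweep.
    """
    rng = random.Random(seed)
    z = [0] * len(data)      # start with every point in one cluster
    counts = [len(data)]
    sums = [sum(data)]
    trace = []
    for _ in range(sweeps):
        for i, x in enumerate(data):
            # Remove point i from its current cluster.
            k = z[i]
            counts[k] -= 1
            sums[k] -= x
            if counts[k] == 0:
                # Drop the emptied cluster and shift later labels down.
                del counts[k], sums[k]
                z = [j - 1 if j > k else j for j in z]
            # Existing cluster: weight n_k times predictive density of x.
            logw = [math.log(c) + log_predictive(x, c, s, sigma2, mu0, tau2)
                    for c, s in zip(counts, sums)]
            # New cluster: weight alpha times prior predictive density of x.
            logw.append(math.log(alpha)
                        + log_predictive(x, 0, 0.0, sigma2, mu0, tau2))
            m = max(logw)
            weights = [math.exp(v - m) for v in logw]
            k = rng.choices(range(len(weights)), weights=weights)[0]
            if k == len(counts):
                counts.append(0)
                sums.append(0.0)
            counts[k] += 1
            sums[k] += x
            z[i] = k
        trace.append(len(counts))
    return z, trace


if __name__ == "__main__":
    points = [-10.2, -10.0, -9.8, -10.1, -9.9, 9.8, 10.0, 10.2, 10.1, 9.9]
    labels, n_clusters = dpm_gibbs(points, alpha=1.0, sweeps=50, seed=0)
    print("final labels:", labels)
    print("clusters per sweep:", n_clusters)
```

Tallying the values in the returned trace after a burn-in period gives a Monte Carlo estimate of the posterior distribution over the number of occupied clusters.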

Clinical relevance

Dirichlet process mixtures perform model-based clustering and density estimation without fixing the number of groups, which is valuable in genomics, population subtyping, and other settings where the number of clusters is unknown.

History

Ferguson defined the Dirichlet process in 1973 and Antoniak introduced mixtures of Dirichlet processes in 1974. Escobar and West's 1995 Gibbs-sampling approach made Dirichlet process mixtures a practical tool for density estimation and clustering.

Debates

Sensitivity to the concentration parameter
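The number of inferred clusters depends on the concentration parameter and the base measure. Under the prior alone, the expected number of clusters among n observations grows roughly like α log(1 + n/α), where α is the concentration parameter, so larger values favor more clusters. Prior choices therefore materially affect clustering conclusions and must be handled carefully.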
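
The prior sensitivity can be made concrete with a short calculation. Under the Chinese restaurant process, customer i (counting from zero) opens a new table with probability alpha / (alpha + i), so the prior expected number of clusters is the sum of these probabilities. The function names below are hypothetical, and the logarithmic formula is only a rough approximation.

```python
import math


def expected_num_clusters(n, alpha):
    """Exact prior expected number of clusters among n observations."""
    return sum(alpha / (alpha + i) for i in range(n))


def approx_num_clusters(n, alpha):
    """Rough logarithmic approximation to the same expectation."""
    return alpha * math.log(1.0 + n / alpha)


if __name__ == "__main__":
    n = 1000
    for alpha in (0.1, 1.0, 10.0):
        exact = expected_num_clusters(n, alpha)
        approx = approx_num_clusters(n, alpha)
        print(f"alpha={alpha:5.1f}  exact={exact:7.2f}  approx={approx:7.2f}")
```

Comparing the rows for different values of alpha shows how strongly the prior alone can shift the number of clusters before any data are seen.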

Key figures

  • Thomas Ferguson
  • Charles Antoniak
  • Michael Escobar
  • Mike West

Seminal works

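  • Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1(2), 209-230.
  • Escobar, M. D., and West, M. (1995). Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association, 90(430), 577-588.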

Frequently asked questions

How does a Dirichlet process mixture decide how many clusters there are?
It does not fix the number of clusters; the Dirichlet process allows arbitrarily many, and the posterior, driven by the data and the concentration parameter, places probability over different numbers of occupied clusters.
