Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA — Blei, Ng & Jordan 2003) · Also known as: LDA, topic model, Blei-Ng-Jordan model, probabilistic topic modeling, generative topic model

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data, introduced by Blei, Ng, and Jordan in 2003. It treats each document as a mixture of latent topics and each topic as a probability distribution over words, enabling unsupervised discovery of thematic structure across large text corpora. The original paper is among the most cited in machine learning and natural language processing.

When to use it

LDA is appropriate when the goal is to discover latent thematic structure in a large collection of text documents without supervision, and when documents are plausibly generated by mixtures of topics. Common scenarios include summarising corpora; exploratory analysis of scientific literature, social media, or policy documents; and feature extraction ahead of downstream supervised tasks. The key assumptions are the bag-of-words assumption (word order is ignored), exchangeability of words within a document and of documents within the corpus, and a fixed number of topics K specified by the analyst. The model works best with at least several hundred documents; very short documents (tweets, single sentences) contain few word co-occurrences and may produce unstable topic estimates.
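The generative assumptions above can be made concrete with a small sampler. The following is a minimal sketch, not a reference implementation: the function names, the symmetric hyperparameters `alpha` and `eta`, and the fixed document length are illustrative assumptions, and placing a Dirichlet prior on the topic-word distributions corresponds to the smoothed variant of the model.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_corpus(n_docs, doc_len, n_topics, vocab, alpha=0.5, eta=0.1, seed=0):
    """Sample a toy corpus from the LDA generative process (illustrative only)."""
    rng = random.Random(seed)
    # One word distribution per topic, drawn from a symmetric Dirichlet(eta).
    topics = [sample_dirichlet([eta] * len(vocab), rng) for _ in range(n_topics)]
    corpus, proportions = [], []
    for _ in range(n_docs):
        # Per-document topic proportions theta ~ Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # topic of this token
            w = rng.choices(vocab, weights=topics[z])[0]        # word drawn from that topic
            words.append(w)
        corpus.append(words)
        proportions.append(theta)
    return corpus, proportions, topics
```

Because each token is drawn independently given the document's proportions, shuffling the words of a generated document leaves its probability unchanged, which is the exchangeability assumption in practice.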

Strengths & limitations

Strengths

Principled probabilistic foundation: provides a full generative story and uncertainty estimates over topic assignments.
Scales to large corpora with efficient variational EM or online variational Bayes inference.
Produces human-interpretable topics as ranked word lists.
Document-topic proportions serve as compact, low-dimensional feature vectors (one entry per topic) for downstream classification or retrieval.
Extensible: the basic model has spawned hundreds of variants (Author-Topic, Correlated Topic Model, Dynamic LDA, Supervised LDA) addressing specific domain needs.

Limitations

The number of topics K must be specified in advance; model selection (e.g., held-out perplexity, coherence scores) adds cost.
The bag-of-words assumption discards word order and syntactic structure, limiting nuanced semantic capture.
Topics can be difficult to interpret, especially in short or noisy text.
Inference via variational EM is sensitive to initialisation and may converge to local optima.
Gibbs sampling can be slow on very large corpora and vocabularies without efficient implementations, because every sweep revisits every token.
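To show where the cost of Gibbs sampling comes from, here is a minimal sketch of a collapsed Gibbs sampler. It assumes documents are already encoded as lists of integer word ids and uses symmetric hyperparameters `alpha` and `eta`; the function name and defaults are hypothetical, and production implementations add sparsity tricks that this sketch omits.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, eta=0.01, n_sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA on docs given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # tokens of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # tokens of word w in topic k
    topic_total = [0] * n_topics                              # tokens in topic k
    assignments = []
    # Random initialisation of the topic assignment of every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Full conditional over topics, up to a normalising constant.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + eta)
                    / (topic_total[j] + vocab_size * eta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word
```

Each sweep computes K weights for every token, so the work per sweep grows with the number of tokens times the number of topics. Different seeds can also settle on different topic assignments, which is worth checking before interpreting any single run.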

Frequently asked

How do I choose the number of topics K?

There is no single correct answer. Common approaches include plotting topic coherence (e.g., the C_v coherence score) across a range of K values and selecting the elbow or peak, evaluating held-out perplexity on a test set, and qualitative inspection of the top words per topic. It is advisable to fit models for several K values (e.g., 10, 20, 30, 50) and compare both quantitative metrics and interpretability before committing.
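As an illustration of the held-out perplexity criterion, the sketch below computes perplexity from point estimates of the document-topic proportions `theta` and topic-word distributions `phi`. It assumes that `theta` for the held-out documents has already been inferred by some separate step, and it approximates the document likelihood with those point estimates rather than integrating over them; all names are hypothetical.

```python
import math

def perplexity(docs, theta, phi):
    """Perplexity of docs (lists of word ids) under point estimates theta, phi.

    theta[d][k] is the proportion of topic k in document d;
    phi[k][w] is the probability of word w under topic k.
    """
    log_lik = 0.0
    n_tokens = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Token probability is a mixture over topics.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)
```

Lower values indicate better predictive fit. A model that spreads probability uniformly over V words has perplexity V, which gives a useful baseline when comparing values of K.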

What is the difference between LDA and pLSA?

Probabilistic latent semantic analysis (pLSA; Hofmann 1999) is a closely related model but treats document-topic proportions as fixed parameters, one set per training document, rather than as random variables. LDA places a Dirichlet prior on those proportions, so the model defines a complete generative process for any document. As a result, LDA generalises to unseen documents through predictive inference, while pLSA cannot directly assign topic distributions to new documents without additional fitting.
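The difference is easiest to see by writing the two models side by side. The notation below is an assumption chosen for this comparison: d indexes a training document, w a word, z a topic, θ the topic proportions of a document with N words, α the Dirichlet parameter, and β the topic-word probabilities.

```latex
% pLSA: p(z | d) is a separate parameter vector for each training document d
p(d, w) = p(d) \sum_{z} p(w \mid z)\, p(z \mid d)

% LDA: the proportions \theta are drawn from a Dirichlet prior and integrated out
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta
```

In pLSA the number of parameters grows with the number of training documents, and a new document has no entry p(z | d). In LDA the prior p(θ | α) applies to every document, seen or unseen.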

Does LDA work on languages other than English?

Yes. LDA is language-agnostic: it treats each document as a bag of tokens and learns co-occurrence patterns in the vocabulary. The key preprocessing steps (tokenisation, stop-word removal, and stemming or lemmatisation) must be adapted for the target language, but the model itself is unchanged.
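A minimal sketch of the language-dependent part is shown below. The function name and the tiny German stop-word set are illustrative assumptions; the regular-expression tokeniser only suits languages that separate words with spaces or punctuation, and stemming or lemmatisation is left out.

```python
import re
from collections import Counter

def to_bag_of_words(text, stop_words):
    """Lowercase, tokenise on Unicode word characters, and drop stop words."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

# Hypothetical stop-word list for a German example.
german_stop_words = {"der", "die", "das", "und"}
bag = to_bag_of_words("Der Hund und die Katze", german_stop_words)
```

Languages written without word boundaries, such as Chinese or Japanese, need a dedicated word segmenter in place of the regular expression; the resulting counts feed into LDA in the same way.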

How should I evaluate whether my topics are good?

Evaluation combines quantitative and qualitative methods. Quantitatively, topic coherence scores (particularly C_v or UMass coherence) tend to correlate with human judgements of topic quality, while held-out perplexity measures predictive fit. Qualitatively, inspect the top 10–20 words per topic, read representative documents assigned high probability to each topic, and confirm that topics are distinctive and non-redundant. If many high-probability words are shared across topics, K may be too large.
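For a sense of what a coherence score measures, the following is a minimal sketch of UMass-style coherence computed from document co-occurrence counts. The function name is hypothetical, the smoothing constant of 1 follows the common definition, and library implementations may differ in normalisation and in how the reference corpus is chosen.

```python
import math

def umass_coherence(top_words, docs):
    """UMass-style coherence of one topic.

    top_words: the topic's top words, ordered from most to least probable.
    docs: reference corpus as a list of token lists.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # skip words that never occur in the reference corpus
            score += math.log((doc_freq(top_words[m], top_words[l]) + 1) / denom)
    return score
```

Scores are higher for topics whose top words tend to appear in the same documents. Values are only comparable between topics scored on the same corpus with the same number of top words.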

Sources

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022. DOI: 10.5555/944919.944937
Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84. DOI: 10.1145/2133806.2133826
Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Ch. 9). Springer. ISBN: 978-0-387-31073-2

How to cite this page

ScholarGate. (2026, June 3). Latent Dirichlet Allocation (LDA — Blei, Ng & Jordan 2003). ScholarGate. https://scholargate.app/en/machine-learning/latent-dirichlet-allocation

Related methods

K-Means Clustering (Machine learning)
Non-negative Matrix Factorization (Machine learning)
Word2Vec (Text mining)

Referenced by

Bayesian single-cell RNA-seq analysis
Dirichlet Process Mixture Model
Explainable LDA Topic Model
Multimodal NMF Topic Model
Non-negative Matrix Factorization
Self-supervised NMF Topic Model
Semi-supervised Topic Modeling
Variational Inference

Related reference concepts

Latent Semantic and Topic Models
Topic Modeling and Text Mining
Text Classification
Text Clustering
Text Representation and Classification
Text Classification and Sentiment Analysis
