Language Modeling

Language modeling is the task of assigning probabilities to sequences of words. It is the foundational task that lets systems predict, score, and generate text, from classical n-gram counters to neural language models.

Definition

A language model is a probability distribution over sequences of words or tokens, typically defined by predicting each token from its preceding context.
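
Written out, this definition corresponds to the chain rule of probability. In the notation below, w_t stands for the t-th token and T for the sequence length; both symbols are chosen here for illustration and do not come from the original text.

```latex
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```

Each factor is one next-token prediction, so scoring a whole sequence reduces to predicting every token from the tokens before it.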

Scope

Covers the language-modeling task itself: estimating the probability of a word given its context, n-gram models and their smoothing techniques, evaluation by perplexity, and the transition to neural and distributed representations. It situates large language models as the modern incarnation of the same task. Detailed neural architectures are treated in the statistical-and-neural NLP area.

Core questions

How can the probability of a sentence be decomposed into conditional word probabilities?
How does smoothing handle word sequences never seen in training?
How is perplexity used to evaluate and compare language models?
What did neural language models change relative to n-gram models?

Key concepts

n-gram
Markov assumption
smoothing
perplexity
backoff and interpolation
distributed word representations
cross-entropy
next-token prediction
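
Of the concepts listed above, interpolation is compact enough to show in a few lines. The following is a minimal sketch, assuming the bigram and unigram probabilities have already been estimated elsewhere; the function name `interpolated_prob` and the default weight `lam=0.7` are hypothetical choices for illustration, and in practice the weight would be tuned on held-out data.

```python
def interpolated_prob(p_bigram, p_unigram, lam=0.7):
    """Linear interpolation: lam * P(w | prev) + (1 - lam) * P(w).

    lam is an illustrative mixing weight, not a recommended value.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * p_bigram + (1.0 - lam) * p_unigram


# An unseen bigram (probability 0) still receives nonzero probability
# because the unigram estimate contributes to the mixture.
p = interpolated_prob(p_bigram=0.0, p_unigram=0.1)
```

Backoff differs in that it consults the lower-order estimate only when the higher-order count is missing, whereas interpolation always mixes the two.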

Key theories

N-gram Markov modeling: Approximating the probability of a word by conditioning only on the previous n−1 words, turning language modeling into a tractable counting-and-smoothing problem.
Neural probabilistic language model: Replacing sparse n-gram counts with a neural network that learns distributed word representations, mitigating the curse of dimensionality and enabling generalization to unseen contexts.
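
The counting view of n-gram modeling can be sketched for the bigram case (n = 2), where each word is conditioned only on the single previous word. This is an illustrative sketch under simple assumptions: sentences are already tokenized, `<s>` and `</s>` are hypothetical boundary markers, and the function names are invented for this example.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts


def bigram_prob(prev, word, context_counts, bigram_counts):
    """Maximum-likelihood estimate P(word | prev) = c(prev, word) / c(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
ctx, bi = train_bigram(corpus)
p_cat = bigram_prob("the", "cat", ctx, bi)     # "the cat" occurs in 1 of 2 "the" contexts
p_unseen = bigram_prob("cat", "dog", ctx, bi)  # never observed, so the estimate is zero
```

The zero assigned to the unseen pair is the sparsity problem that smoothing, and later learned distributed representations, are meant to address.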

History

Shannon's information theory framed language as a predictable stochastic source, and the speech-recognition community at IBM made n-gram modeling central in the 1980s. Bengio and colleagues introduced neural probabilistic language models in 2003, seeding the distributed-representation approach that, scaled up, produced today's large language models.

Debates

Counting versus learned representations: Whether language is best modeled by smoothed counts over discrete sequences or by neural networks that learn continuous representations; neural methods now dominate but inherit the same probabilistic objective.

Key figures

Claude Shannon
Frederick Jelinek
Yoshua Bengio
Daniel Jurafsky

Seminal works

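Shannon, C. E. (1948). A Mathematical Theory of Communication. Bell System Technical Journal.
Bengio, Y., et al. (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research.
Jurafsky, D., and Martin, J. H. (2025). Speech and Language Processing (3rd ed. draft).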

Frequently asked questions

What is perplexity? Perplexity measures how surprised a language model is by held-out text. It is the exponential of the average negative log-probability the model assigns per token, so lower perplexity means the model assigns higher probability to the observed words, indicating a better fit.
Why does language modeling need smoothing? Any finite corpus omits many valid word sequences, so an unsmoothed count-based model would assign them zero probability. Smoothing redistributes a little probability mass from seen to unseen events so the model can handle novel text.
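
Both answers can be illustrated together with a small sketch. It assumes a bigram model with add-one (Laplace) smoothing, which is only one of several smoothing schemes and is chosen here for brevity. All function names, the `<s>` and `</s>` markers, and the toy corpus are hypothetical, and every held-out word is assumed to appear in the training vocabulary.

```python
import math
from collections import Counter


def train_counts(sentences):
    """Collect context counts, bigram counts, and the vocabulary."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts, vocab


def laplace_prob(prev, word, context_counts, bigram_counts, vocab_size):
    """Add-one estimate: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)


def perplexity(sentences, context_counts, bigram_counts, vocab_size):
    """Exponential of the average negative log-probability per predicted token."""
    log_prob_sum, n_tokens = 0.0, 0
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            p = laplace_prob(prev, word, context_counts, bigram_counts, vocab_size)
            log_prob_sum += math.log(p)
            n_tokens += 1
    return math.exp(-log_prob_sum / n_tokens)


train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
ctx, bi, vocab = train_counts(train)
seen = perplexity([["the", "cat", "sat"]], ctx, bi, len(vocab))
novel = perplexity([["the", "cat", "dog"]], ctx, bi, len(vocab))
# The sentence with unseen bigrams gets a higher, but still finite, perplexity.
```

Without the added counts, the second sentence would contain a zero-probability bigram and its perplexity would be undefined (the logarithm of zero), which is the practical reason smoothing and perplexity-based evaluation go together.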