Language Modeling
Assigning probabilities to sequences of words: the foundational task that lets systems predict, score, and generate text, from classical n-gram count models to neural language models.
Definition
A language model is a probability distribution over sequences of words or tokens, typically defined by predicting each token from its preceding context.
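The chain rule of probability makes this definition concrete. As a sketch, writing w_1, ..., w_T for the tokens of a sequence (notation assumed here for illustration), the joint probability factors exactly into next-token predictions:

```latex
P(w_1, \dots, w_T) \;=\; \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})
```

No approximation is involved at this step; modeling choices enter only in how each conditional factor is estimated.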
Scope
Covers the language-modeling task itself: estimating the probability of a word given its context, n-gram models and their smoothing techniques, evaluation by perplexity, and the transition to neural and distributed representations. It situates large language models as the modern incarnation of the same task. Detailed neural architectures are treated in the statistical-and-neural NLP area.
Core questions
- How can the probability of a sentence be decomposed into conditional word probabilities?
- How does smoothing handle word sequences never seen in training?
- How is perplexity used to evaluate and compare language models?
- What did neural language models change relative to n-gram models?
Key concepts
- n-gram
- Markov assumption
- smoothing
- perplexity
- backoff and interpolation
- distributed word representations
- cross-entropy
- next-token prediction
Key theories
- N-gram Markov modeling
- Approximating the probability of a word by conditioning only on the previous n−1 words, turning language modeling into a tractable counting-and-smoothing problem.
- Neural probabilistic language model
- Replacing sparse n-gram counts with a neural network that learns distributed word representations, mitigating the curse of dimensionality and enabling generalization to unseen contexts.
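The counting view of n-gram modeling can be sketched with a bigram model (n = 2), where each word is conditioned only on the previous word. This is a minimal illustration, not a reference implementation: the toy corpus, the `<s>` and `</s>` boundary markers, and all function names are assumptions made for the example.

```python
from collections import Counter

# Toy corpus; the sentences are assumptions made for this illustration.
corpus = [
    "the cat sat",
    "the cat ran",
    "the dog sat",
]

def train_bigram(sentences):
    """Count bigrams and their left contexts, adding boundary markers."""
    bigrams, contexts = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            bigrams[(prev, word)] += 1
            contexts[prev] += 1
    return bigrams, contexts

def bigram_prob(bigrams, contexts, prev, word):
    """Maximum-likelihood estimate: count(prev, word) / count(prev)."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

def sentence_prob(bigrams, contexts, sentence):
    """Multiply the conditional probabilities under the Markov assumption."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(bigrams, contexts, prev, word)
    return p

bigrams, contexts = train_bigram(corpus)
p_seen = sentence_prob(bigrams, contexts, "the cat sat")    # every bigram was observed
p_unseen = sentence_prob(bigrams, contexts, "the dog ran")  # "dog ran" was never observed
```

With these counts, `p_seen` is 1/3 while `p_unseen` is exactly 0, because a single unobserved bigram zeroes out the whole product. That sparsity problem is what smoothing and, later, distributed representations are meant to address.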
History
Shannon's information theory treated language as a stochastic source whose next symbol is partly predictable from what came before. The speech-recognition group at IBM, led by Frederick Jelinek, made n-gram modeling central in the 1980s. Bengio and colleagues introduced neural probabilistic language models in 2003, seeding the distributed-representation approach that, scaled up, produced today's large language models.
Debates
- Counting versus learned representations
- Whether language is best modeled by smoothed counts over discrete sequences or by neural networks that learn continuous representations; neural methods now dominate but inherit the same probabilistic objective.
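The learned-representation side of this contrast can be sketched as a single forward pass of a small feed-forward language model. This is only loosely in the spirit of a neural probabilistic language model: the weights are random and untrained, biases are omitted, and the vocabulary, layer sizes, and function names are all assumptions made for illustration. The point is that the output is still a probability distribution over the next word, the same kind of object an n-gram model produces.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained weights are reproducible

vocab = ["<s>", "the", "cat", "dog", "sat", "ran", "</s>"]  # toy vocabulary (assumed)
word_to_id = {w: i for i, w in enumerate(vocab)}

EMBED_DIM, HIDDEN_DIM, CONTEXT = 4, 8, 2  # arbitrary illustrative sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

embeddings = rand_matrix(len(vocab), EMBED_DIM)          # one dense vector per word
w_hidden = rand_matrix(HIDDEN_DIM, CONTEXT * EMBED_DIM)  # concatenated context -> hidden
w_out = rand_matrix(len(vocab), HIDDEN_DIM)              # hidden -> one score per word

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(context_words):
    """Return P(word | context) for every vocabulary word."""
    x = []
    for w in context_words:
        x.extend(embeddings[word_to_id[w]])  # look up and concatenate embeddings
    hidden = [math.tanh(h) for h in matvec(w_hidden, x)]
    return softmax(matvec(w_out, hidden))

probs = next_word_distribution(["the", "dog"])
p_ran = probs[word_to_id["ran"]]  # nonzero even though no counts were ever collected
```

Because the softmax assigns every word a positive probability, no context receives a hard zero. Training would adjust the embeddings and weights to maximize the probability of observed next words, which is the same likelihood objective that count-based models optimize.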
Key figures
- Claude Shannon
- Frederick Jelinek
- Yoshua Bengio
- Daniel Jurafsky
Related topics
Seminal works
- shannon1948
- bengio2003
- jurafsky2025
Frequently asked questions
- What is perplexity?
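- Perplexity measures how surprised a language model is by held-out text. Formally, it is the exponential of the average negative log-probability per token, that is, the exponentiated cross-entropy. Lower perplexity means the model assigns higher probability to the observed words, indicating a better fit.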
- Why does language modeling need smoothing?
- Any finite corpus omits many valid word sequences, so an unsmoothed model would assign them zero probability, which in turn zeroes out the probability of any sentence containing them. Smoothing moves a little probability mass from seen events to unseen ones so the model can handle novel text.
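Both answers can be illustrated together with a toy bigram model. The sketch below uses add-k smoothing, chosen here only because it is the simplest scheme to show; stronger methods based on backoff and interpolation exist. The data, the boundary markers, and the function names are assumptions made for the example.

```python
import math
from collections import Counter

train = ["the cat sat", "the cat ran", "the dog sat"]  # toy training data (assumed)
held_out = ["the dog ran"]                             # contains the unseen bigram "dog ran"

def tokenize(sentence):
    return ["<s>"] + sentence.split() + ["</s>"]

bigrams, contexts = Counter(), Counter()
vocab = set()
for s in train:
    tokens = tokenize(s)
    vocab.update(tokens[1:])  # words that can be predicted; <s> is context only
    for prev, word in zip(tokens, tokens[1:]):
        bigrams[(prev, word)] += 1
        contexts[prev] += 1

def prob(prev, word, k):
    """Add-k estimate of P(word | prev); k = 0 gives the unsmoothed estimate."""
    denom = contexts[prev] + k * len(vocab)
    return (bigrams[(prev, word)] + k) / denom if denom > 0 else 0.0

def perplexity(sentences, k):
    """Exponential of the average negative log-probability per predicted token."""
    log_sum, n = 0.0, 0
    for s in sentences:
        tokens = tokenize(s)
        for prev, word in zip(tokens, tokens[1:]):
            p = prob(prev, word, k)
            if p == 0.0:
                return math.inf  # one zero-probability event makes perplexity infinite
            log_sum += math.log(p)
            n += 1
    return math.exp(-log_sum / n)

ppl_unsmoothed = perplexity(held_out, k=0)
ppl_add_one = perplexity(held_out, k=1)
```

On this data the unsmoothed model has infinite perplexity on the held-out sentence, while add-one smoothing gives a finite value of roughly 3.97. The numbers themselves mean little on a corpus this small; the contrast between infinite and finite is the point.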