Neural Language Models and Word Embeddings

Learning dense vector representations of words and contexts from raw text, from static word2vec embeddings to contextual representations produced by models like BERT, so that meaning is encoded as geometry.

Definition

A word embedding is a dense real-valued vector representing a word's meaning, learned so that distributional similarity is reflected in vector-space proximity; contextual embeddings extend this to representations that depend on the surrounding text.
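To make "vector-space proximity" concrete, here is a minimal sketch of cosine similarity, one common way to compare embeddings. The three-dimensional vectors below are invented for illustration and do not come from any trained model; real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumed values, for illustration only).
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

# With these made-up vectors, "cat" lies closer to "dog" than to "car".
print(sim_cat_dog > sim_cat_car)
```

Cosine similarity ignores vector length and compares direction only, which is one reason it is often preferred over raw Euclidean distance for this purpose.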

Scope

Covers distributed and neural representations of language: the distributional hypothesis, static word embeddings such as word2vec and GloVe, neural language models, and contextual embeddings from pretrained transformers like BERT. It addresses how representations are trained, evaluated, and transferred to downstream tasks. Transformer architecture details and generation are covered in a sibling topic.

Core questions

What is the distributional hypothesis and how do embeddings operationalize it?
How does word2vec learn word vectors from co-occurrence?
How do contextual embeddings differ from static ones?
Why did pretraining and transfer learning transform NLP?

Key concepts

distributional hypothesis
word embedding
word2vec
skip-gram
contextual embedding
pretraining and fine-tuning
transfer learning
masked language modeling
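
As an illustration of the skip-gram concept listed above, the following sketch generates the (center word, context word) training pairs that a skip-gram model learns to predict. It covers only pair extraction; the function name, the window size, and the example sentence are assumptions made for illustration, and the actual word2vec training (for example negative sampling and frequent-word subsampling) is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical example sentence.
pairs = skipgram_pairs("the cat sat on the mat".split(), window=1)
for center, context in pairs:
    print(center, "->", context)
```

A larger window tends to capture more topical similarity, while a smaller one emphasizes words that behave alike syntactically; the right setting depends on the downstream use.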

Key theories

Distributional hypothesis: The idea that words occurring in similar contexts tend to have similar meanings. It underlies embedding methods, which derive word representations from co-occurrence statistics.
Contextual pretraining: Pretraining deep bidirectional models on large unlabeled text, as in BERT with its masked language modeling objective, to produce context-sensitive representations that transfer to many downstream tasks with comparatively little task-specific fine-tuning.
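
The distributional hypothesis can be sketched with simple co-occurrence counts, in the spirit of count-based vector-space models. The three-sentence corpus, the function name, and the window size are assumptions made for illustration; practical systems use far larger corpora and usually reweight the counts (for example with PMI) before comparing words.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count, for each word, the words appearing within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

# Hypothetical toy corpus.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the car needs fuel",
]
counts = cooccurrence_counts(corpus, window=2)

# Words used in similar contexts share more context words.
shared_cat_dog = set(counts["cat"]) & set(counts["dog"])
shared_cat_car = set(counts["cat"]) & set(counts["car"])
print(sorted(shared_cat_dog), sorted(shared_cat_car))
```

In this toy corpus "cat" and "dog" share the contexts "the" and "drinks", while "cat" and "car" share only "the", which is the kind of signal that embedding methods compress into dense vectors.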

History

Harris's distributional hypothesis (1954) was first operationalized by count-based vector-space models, then by the neural language model of Bengio and colleagues (2003) and the efficient word2vec of Mikolov and colleagues (2013). The arrival of contextual models such as ELMo and BERT in 2018–2019 made pretraining and fine-tuning the dominant paradigm.

Debates

What do embeddings actually encode?: Whether learned representations capture genuine semantic and syntactic structure or merely co-occurrence regularities and biases present in training data, a central question for interpretability.

Key figures

Yoshua Bengio
Tomas Mikolov
Jacob Devlin
Zellig Harris

Seminal works

bengio2003
mikolov2013
devlin2019

Frequently asked questions

What is the difference between static and contextual embeddings?: A static embedding gives a word one fixed vector regardless of context, so 'bank' has a single representation. A contextual embedding produces a different vector for each occurrence, distinguishing a river bank from a financial bank.
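
The contrast can be sketched with a deliberately crude toy. The static lookup returns the same vector for 'bank' everywhere, while the toy contextual function averages in the neighbouring words' vectors, so the result changes with the sentence. The two-dimensional vectors and both function names are invented for illustration; real contextual models such as BERT use deep transformer layers, not neighbour averaging.

```python
# Hypothetical 2-dimensional static vectors (assumed values, for illustration).
STATIC = {
    "river": [1.0, 0.0],
    "money": [0.0, 1.0],
    "bank": [0.5, 0.5],
}

def static_embedding(tokens, i):
    """Look up the token's single fixed vector; the context is ignored."""
    return STATIC[tokens[i]]

def toy_contextual_embedding(tokens, i, window=1):
    """Average the token's static vector with its neighbours' vectors.

    A crude stand-in for a contextual encoder, not how BERT works.
    """
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    span = [STATIC[t] for t in tokens[lo:hi]]
    dims = len(span[0])
    return [sum(vec[d] for vec in span) / len(span) for d in range(dims)]

river_sentence = ["river", "bank"]
money_sentence = ["money", "bank"]

# Static: identical vectors for both occurrences of "bank".
print(static_embedding(river_sentence, 1) == static_embedding(money_sentence, 1))

# Toy contextual: the two occurrences of "bank" now differ.
print(toy_contextual_embedding(river_sentence, 1))
print(toy_contextual_embedding(money_sentence, 1))
```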