Sequence-to-Sequence Models and Transformers
Neural architectures that map an input sequence to an output sequence (recurrent encoder-decoders, attention, and the transformer) and that underpin translation, summarization, and modern generative language models.
Definition
A sequence-to-sequence model is a neural network that encodes an input sequence and generates an output sequence, possibly of a different length, typically using an attention mechanism to align output positions with input positions.
Scope
Covers the neural sequence-modeling architectures central to current NLP: recurrent networks including LSTMs, the encoder-decoder framework, attention mechanisms, and the transformer. It addresses how these models are trained and decoded and why the transformer's self-attention enabled scaling to large language models. Embeddings and specific applications are covered in sibling topics.
Core questions
- How does the encoder-decoder framework transform one sequence into another?
- How did attention overcome the bottleneck of a single fixed-size encoding?
- What does self-attention compute, and why is the transformer so scalable?
- How are LSTMs and transformers trained and used for generation?
Key concepts
- recurrent neural network
- LSTM
- encoder-decoder
- attention mechanism
- self-attention
- transformer
- positional encoding
- decoding
Key theories
- Long short-term memory
- A recurrent architecture with gated memory cells that mitigates the vanishing-gradient problem, enabling learning of long-range dependencies in sequences.
- Encoder-decoder with attention
- Mapping an input sequence to an output sequence via an encoder and a decoder, with attention letting the decoder focus on the most relevant input positions at each output step.
- Self-attention transformer
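- Replacing recurrence with self-attention so that every token attends directly to every other token. Because no position has to wait for the previous one to be processed, all positions can be computed in parallel during training, which enabled the scaling behind large language models.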
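The self-attention computation described above can be illustrated with a minimal sketch of single-head scaled dot-product self-attention, written in plain Python for readability rather than speed. The function names (`softmax`, `matmul`, `self_attention`), the identity projection matrices, and the toy token vectors are assumptions chosen for illustration. A real transformer learns its projections, uses several heads, and, in a generative decoder, adds a causal mask that this sketch omits.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (no mask).

    x is a list of n token vectors; w_q, w_k, w_v are hypothetical
    projection matrices. Returns (outputs, weights), where weights[i][j]
    is how strongly token i attends to token j.
    """
    q = matmul(x, w_q)  # queries
    k = matmul(x, w_k)  # keys
    v = matmul(x, w_v)  # values
    d_k = len(k[0])
    weights = []
    for q_i in q:
        # Every query is scored against every key, so each token can
        # draw on every other token in a single step.
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d_k)
                  for k_j in k]
        weights.append(softmax(scores))
    outputs = matmul(weights, v)  # weighted sum of value vectors
    return outputs, weights


# Toy input: three tokens with two features each (illustrative values).
identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, identity, identity, identity)
```

The rows of `weights` are computed independently of one another, which is why the loop over queries can be replaced by one batched matrix product on parallel hardware. A recurrent network cannot do this, because each hidden state depends on the previous one.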
History
LSTMs (1997) made recurrent networks practical for long sequences. Sequence-to-sequence learning with attention (2014–2015) transformed machine translation, and the 2017 transformer replaced recurrence with self-attention, enabling the large pretrained generative models that now dominate the field.
Debates
- Recurrence versus attention
- Whether sequential recurrence or fully parallel attention is the better inductive bias for language. Transformers largely won on scalability, though efficiency concerns, notably the cost of self-attention growing quadratically with sequence length, keep alternative architectures alive.
Key figures
- Ashish Vaswani
- Ilya Sutskever
- Sepp Hochreiter
- Jürgen Schmidhuber
Related topics
Seminal works
- hochreiter1997
- sutskever2014
- vaswani2017
Frequently asked questions
- What problem does attention solve?
- Earlier encoder-decoder models compressed the entire input into a single fixed-size vector, which lost information on long sequences. Attention lets the decoder look back at all encoder states and weight the most relevant ones at each output step.
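As a rough illustration of the answer above, the sketch below computes one decoder step of attention over a set of encoder states. It assumes dot-product scoring for simplicity, whereas early attention models for translation used a small learned scoring network instead. The function name `attend` and all numeric values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Return (context, weights) for one decoder step.

    Scoring is a plain dot product (an assumption of this sketch).
    The context vector is the attention-weighted average of all
    encoder states, rebuilt at every output step instead of being
    one fixed summary of the input.
    """
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[i] for w, state in zip(weights, encoder_states))
               for i in range(dim)]
    return context, weights


# Three encoder states, one per input position (illustrative values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [2.0, 0.0]  # current decoder hidden state
context, weights = attend(decoder_state, encoder_states)
```

With these toy values the first input position scores highest and therefore dominates the context vector. A different decoder state would shift the weights toward other positions, which is the step-by-step alignment that a single fixed-size vector cannot provide.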