Sequence-to-Sequence Models and Transformers

Neural architectures that map an input sequence to an output sequence — recurrent encoder-decoders, attention, and the transformer — which underpin translation, summarization, and modern generative language models.

Definition

A sequence-to-sequence model is a neural network that encodes an input sequence and generates an output sequence, typically using an attention mechanism to align the two.

Scope

Covers the neural sequence-modeling architectures central to current NLP: recurrent networks including LSTMs, the encoder-decoder framework, attention mechanisms, and the transformer. It addresses how these models are trained and decoded and why the transformer's self-attention enabled scaling to large language models. Embeddings and specific applications are covered in sibling topics.

Core questions

How does the encoder-decoder framework transform one sequence into another?
Why did attention overcome the bottleneck of fixed-size encodings?
What does self-attention compute, and why is the transformer so scalable?
How are LSTMs and transformers trained and used for generation?

Key concepts

recurrent neural network
LSTM
encoder-decoder
attention mechanism
self-attention
transformer
positional encoding
decoding

Key theories

Long short-term memory: A recurrent architecture with gated memory cells that mitigates the vanishing-gradient problem, enabling learning of long-range dependencies in sequences.
Encoder-decoder with attention: Mapping an input to an output sequence via an encoder and decoder, with attention letting the decoder focus on relevant input positions at each step.
Self-attention transformer: Replacing recurrence with self-attention so that every token directly attends to every other, enabling parallel training and the scaling behind large language models.
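The self-attention idea in the last entry can be sketched in a few lines. This is a minimal illustration, not the transformer's actual implementation: it assumes a single head and uses the token vectors themselves as queries, keys, and values, whereas a real transformer applies learned projection matrices and multiple heads. The names `softmax`, `self_attention`, and `tokens` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are all the raw token vectors
    in x (no learned projections). Returns (outputs, weights).
    """
    d = len(x[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in x:
        # Score this token against every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in x]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, x)) for i in range(d)])
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each token's output is computed independently of the others, which is why the loop over `x` could in principle run in parallel. A recurrent network cannot do this, because each step depends on the previous hidden state.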

History

LSTMs (1997) made recurrent networks practical for long sequences. Sequence-to-sequence learning with attention (2014–2015) transformed machine translation, and the 2017 transformer replaced recurrence with self-attention, enabling the large pretrained generative models that now dominate the field.

Debates

Recurrence versus attention: Whether sequential recurrence or fully parallel attention is the better inductive bias for language; transformers largely won on scalability, though efficiency concerns keep alternative architectures alive.

Key figures

Ashish Vaswani
Ilya Sutskever
Sepp Hochreiter
Jürgen Schmidhuber

Seminal works

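hochreiter1997: Hochreiter, S., and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8), 1735–1780.
sutskever2014: Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks. Advances in Neural Information Processing Systems 27.
vaswani2017: Vaswani, A., et al. (2017). Attention Is All You Need. Advances in Neural Information Processing Systems 30.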

Frequently asked questions

What problem does attention solve? Earlier encoder-decoder models compressed the entire input into a single fixed-size vector, which lost information on long sequences. Attention lets the decoder look back at all encoder states and weight the most relevant ones at each output step.
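As a rough sketch of that answer, the snippet below computes one decoder step of attention over a set of encoder states. It assumes plain dot-product scoring for brevity; other scoring functions, such as a small feed-forward network, are also used in practice. The function name `attend` and the toy vectors are hypothetical.

```python
import math


def attend(decoder_state, encoder_states):
    """Return (context, weights) for one decoder step.

    Assumption: relevance is scored with a plain dot product between the
    decoder state and each encoder state.
    """
    scores = [sum(d * e for d, e in zip(decoder_state, h)) for h in encoder_states]
    # Softmax turns scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted sum of all encoder states,
    # so no single fixed vector has to carry the whole input.
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states)) for i in range(dim)]
    return context, weights


encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context, weights = attend([0.0, 2.0], encoder_states)
```

In this toy input the decoder state points along the second dimension, so the second encoder state receives the largest weight. A different decoder state at the next output step would produce a different weighting over the same encoder states.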