Multi-Head Self-Attention

Multi-Head Self-Attention (Transformer Core) · Also known as: Öz-Dikkat ve Çok Başlı Dikkat (Multi-Head Self-Attention), öz-dikkat, multi-head attention, scaled dot-product attention

Multi-head self-attention, introduced by Vaswani and colleagues in 2017, is the mechanism that lets every position in a sequence compute its relationship to all other positions in parallel. It is the core of the Transformer architecture and the foundation underneath BERT, GPT, and T5.

When to use it

Use self-attention when modelling text or other sequential data where relationships between distant elements matter and you have enough data to train a large model: at least about 100 observations, with 500 or more strongly preferred. A GPU is recommended, positional encoding is required, and compute cost grows quadratically with sequence length. On very small samples it cannot learn reliable representations; simpler machine-learning models are safer there.

Strengths & limitations

Strengths

Captures long-range dependencies directly, without the bottleneck of step-by-step recurrence.
Fully parallel computation across positions, which scales well on modern hardware.
Multiple heads can attend to different relationship patterns simultaneously.
Serves as the shared backbone for BERT, GPT, and T5, with extensive transfer-learning support.

Limitations

Compute and memory cost grow quadratically with sequence length, making very long sequences expensive.
A GPU is effectively required for practical training.
Needs positional encoding because attention alone is order-agnostic.
On small datasets (n below about 500) it overfits and fails to learn reliable representations.
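
To make the quadratic growth concrete, the sketch below estimates the memory taken by the attention weights alone. The helper name `attention_matrix_bytes`, the choice of 8 heads, and 4-byte (float32) elements are illustrative assumptions, and the estimate assumes the full weight matrix is materialised for a single sequence.

```python
def attention_matrix_bytes(seq_len, num_heads=1, bytes_per_element=4):
    """Bytes needed for the (seq_len x seq_len) attention weights of one sequence.

    Hypothetical helper: assumes the full matrix is stored, one per head,
    with float32 elements by default.
    """
    return num_heads * seq_len * seq_len * bytes_per_element


# Doubling the sequence length quadruples the memory for the weights.
for n in (512, 1024, 2048, 4096):
    mib = attention_matrix_bytes(n, num_heads=8) / 2**20
    print(f"seq_len={n:5d}  attention weights ~ {mib:7.1f} MiB")
```

The same factor of four applies to the number of query-key dot products, which is why very long sequences become expensive in both time and memory.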

Frequently asked

Why is the attention score scaled by the square root of the key dimension?

Without scaling, dot products grow large as the key dimension increases, pushing the softmax into regions with vanishing gradients. Dividing by the square root of the key dimension keeps the scores in a stable range.
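
A minimal NumPy sketch of scaled dot-product attention, followed by a quick numerical check of the scaling argument. It assumes query and key components are independent with zero mean and unit variance, in which case the raw dot product has a standard deviation of about the square root of the key dimension. The function names and the choice of a key dimension of 512 are illustrative, not taken from a particular library.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Q, K: (n, d_k); V: (n, d_v). Returns the output and the attention weights."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) scaled similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
d_k = 512

# Spread of raw vs. scaled dot products for random unit-variance vectors.
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))
raw = (q * k).sum(axis=1)
scaled = raw / np.sqrt(d_k)
print("std of raw dot products:   ", raw.std())     # close to sqrt(512)
print("std of scaled dot products:", scaled.std())  # close to 1

# Small attention call on random inputs.
Q = rng.standard_normal((6, d_k))
K = rng.standard_normal((6, d_k))
V = rng.standard_normal((6, 16))
out, weights = scaled_dot_product_attention(Q, K, V)
```

With scores of unit scale, the softmax stays away from near one-hot outputs, where its gradient with respect to the scores is close to zero.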

What do the multiple heads add?

Each head learns its own query, key, and value projections, so different heads can attend to different kinds of relationships at once. Their outputs are concatenated and projected back together into a single representation.
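
The split, attend, concatenate, and project steps can be sketched as follows. This is a simplified single-sequence version with no batching, masking, or biases, and it packs the per-head projections into one matrix per role; the weight names `W_q`, `W_k`, `W_v`, `W_o` and the sizes used are assumptions for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); all weight matrices: (d_model, d_model)."""
    n, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(M):
        # (n, d_model) -> (num_heads, n, d_head)
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    # Each head gets its own slice of the query, key, and value projections.
    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (num_heads, n, d_head)

    # Concatenate the heads along the feature axis and project back.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights


rng = np.random.default_rng(0)
n, d_model, num_heads = 5, 16, 4
X = rng.standard_normal((n, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_self_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5): one attention pattern per head
```

Because every head has a separate attention matrix, inspecting `weights[h]` for different `h` shows how the heads can distribute attention differently over the same sequence.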

Why is positional encoding needed?

Self-attention treats positions as interchangeable, so on its own it has no notion of order. Positional encoding injects information about each token's position so the model can use word order.
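
The sketch below illustrates the point numerically. It uses a stripped-down self-attention with identity projections (an assumption made to keep the example short) and shows that reordering the input tokens merely reorders the outputs. It then adds the sinusoidal positional encoding described by Vaswani et al. (2017), which is one common choice among several, and shows that the same reordering now changes the result.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X):
    # Simplified: queries, keys, and values are all X (no learned projections).
    scores = X @ X.T / np.sqrt(X.shape[-1])
    return softmax(scores, axis=-1) @ X


def sinusoidal_positional_encoding(n, d_model):
    """Sinusoidal encoding; assumes d_model is even."""
    pos = np.arange(n)[:, None]                  # (n, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # even feature indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((n, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


rng = np.random.default_rng(0)
n, d_model = 5, 8
X = rng.standard_normal((n, d_model))
perm = np.array([4, 3, 2, 1, 0])   # reverse the token order

# Without positions: permuting the inputs only permutes the outputs.
order_blind = np.allclose(self_attention(X)[perm], self_attention(X[perm]))

# With positions: the same tokens in a different order give different outputs.
pe = sinusoidal_positional_encoding(n, d_model)
order_aware = not np.allclose(self_attention(X + pe)[perm],
                              self_attention(X[perm] + pe))

print("order-blind without positional encoding:", order_blind)
print("order-aware with positional encoding:   ", order_aware)
```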

How much data does it need?

It is data-hungry: with fewer than about 500 examples it overfits and cannot learn reliable representations, and below roughly 100, training a Transformer is not meaningful. For small samples, simpler models such as random forest or XGBoost are preferable.

Sources

Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS.
Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.

How to cite this page

ScholarGate. (2026, June 1). Multi-Head Self-Attention (Transformer Core). ScholarGate. https://scholargate.app/en/deep-learning/self-attention-transformer

Referenced by

Attention Mechanism
Bidirectional RNN
Retrieval-Augmented Generation
Sequence-to-Sequence Model

Related reference concepts

Sequence-to-Sequence Models and Transformers
Neural Language Models and Word Embeddings
Convolutional and Sequence Models
Self-Supervised and Representation Learning
Statistical and Neural NLP
Deep Learning
