Multi-Head Self-Attention

Multi-head self-attention, introduced by Vaswani et al. (2017), lets every position in a sequence weigh its relevance to every other position, with all positions processed in parallel and several attention heads run side by side. It is the core of the Transformer architecture and the foundation of models such as BERT, GPT, and T5.
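
To make the idea concrete, here is a minimal NumPy sketch of multi-head self-attention using the scaled dot-product form from Vaswani et al. (2017). It is an illustration under stated assumptions, not a reference implementation: the function name, the argument names, the random placeholder weights, and the toy sizes are all hypothetical, and it omits batching, masking, biases, and dropout.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Hypothetical helper: unbatched, unmasked multi-head self-attention.

    x has shape (seq_len, d_model); each weight matrix is (d_model, d_model).
    """
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Queries, keys, and values are all projections of the same input.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Every position scores every other position in one matrix product.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # each row sums to 1

    context = weights @ v                                # (heads, seq, d_head)
    # Concatenate the heads again and apply the output projection.
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights


# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (seq_len, d_model)
print(weights.shape)  # (num_heads, seq_len, seq_len)
```

The `weights` array holds one seq_len by seq_len attention matrix per head. Row i of each matrix is the distribution that position i places over all positions, which is the "relationship to every other position" described above.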

Sources

  1. Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS.
  2. Devlin, J. et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL.

ScholarGate. Self-Attention (Multi-Head Self-Attention (Transformer Core)). Retrieved 2026-06-04 from https://scholargate.app/en/deep-learning/self-attention-transformer