Multi-Head Self-Attention
Multi-head self-attention, introduced by Vaswani and colleagues in 2017, is the mechanism that lets every position in a sequence compute its relationship to every other position in parallel. It is the core of the Transformer architecture and underlies models such as BERT, GPT, and T5.
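To make the idea concrete, here is a minimal NumPy sketch of one multi-head self-attention pass over a single unbatched sequence. The function name `multi_head_self_attention`, the weight names `w_q`, `w_k`, `w_v`, `w_o`, and the toy sizes are illustrative assumptions, not taken from any particular library, and the sketch leaves out masking, biases, and dropout. It assumes the usual scaled dot-product form, where each head applies a softmax over query-key scores and the heads are concatenated and projected.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Illustrative sketch: x has shape (seq_len, d_model); all weights are (d_model, d_model)."""
    seq_len, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Queries, keys, and values are all projections of the same input (self-attention).
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Every position scores every position, for all heads at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, seq_len, seq_len)
    weights = softmax(scores, axis=-1)                   # each row sums to 1

    context = weights @ v                                # (num_heads, seq_len, d_head)
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights


# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5)
```

The `weights` array holds one `seq_len` by `seq_len` attention matrix per head, which is where the "every position to every other position" relationship shows up. The whole pass is a handful of matrix products with no loop over positions, which is what makes the computation parallel.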