Vision Transformer

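The Vision Transformer (ViT), introduced by Dosovitskiy et al. (2021), splits an image into fixed-size patches (16×16 pixels in the paper's title configuration), treats the flattened patches as a sequence of tokens, and applies the Transformer self-attention mechanism to that sequence for image classification. When pre-trained on sufficiently large datasets, it matches or surpasses strong convolutional neural networks (CNNs); when trained only on mid-sized datasets without strong regularization, it tends to fall short of comparable CNNs.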
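
The patch-to-sequence step and the attention step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `patchify` and `self_attention` and the random projection matrices are hypothetical, and the sketch uses a single attention head while omitting the learned patch projection, class token, position embeddings, and MLP blocks of the full model.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row (hypothetical helper for illustration).
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence.

    x has shape (num_tokens, d_in); each weight matrix has shape (d_in, d_head).
    Returns the attended tokens and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Subtract the row maximum before exponentiating for numerical stability.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((224, 224, 3))      # stand-in for a real image
    tokens = patchify(image, 16)           # 14 x 14 = 196 patches of 16*16*3 = 768 values
    # Random matrices stand in for learned parameters (assumption).
    w_q, w_k, w_v = (rng.normal(scale=0.02, size=(768, 64)) for _ in range(3))
    attended, weights = self_attention(tokens, w_q, w_k, w_v)
    print(tokens.shape, attended.shape, weights.shape)
```

For a 224×224 RGB image and 16×16 patches, the sequence has 196 tokens of 768 values each, and every token attends to every other token, which is what lets the model relate distant image regions within a single layer.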

Sources

  1. Dosovitskiy, A. et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR.
  2. Touvron, H. et al. (2021). Training Data-Efficient Image Transformers & Distillation Through Attention. ICML.

ScholarGate. Vision Transformer (ViT). Retrieved 2026-06-04 from https://scholargate.app/en/deep-learning/vision-transformer