Vision Transformer

Vision Transformer (ViT) · Also known as: Visual Transformer, ViT, patch transformer for images

The Vision Transformer (ViT), introduced by Dosovitskiy and colleagues (2021), splits an image into fixed-size patches, treats those patches as a sequence of tokens, and applies Transformer self-attention to image classification. Given enough training data or large-scale pre-training, it can match or surpass convolutional neural networks (CNNs).
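The patch-splitting step can be illustrated with a short NumPy sketch. It is a minimal illustration, not the authors' implementation: the function name `image_to_patches` is hypothetical, the image is assumed to be an (H, W, C) array whose height and width are divisible by the patch size, and the learned linear projection and position embeddings that follow in a real ViT are left out.

```python
import numpy as np

def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, P, cols, P, C) -> (rows, cols, P, P, C) -> (rows * cols, P * P * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)

# A 224x224 RGB image with 16x16 patches gives a sequence of 14 * 14 = 196 tokens,
# each holding 16 * 16 * 3 = 768 raw pixel values.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
patches = image_to_patches(image, patch_size=16)
print(patches.shape)  # (196, 768)
```

Each row of `patches` is one token of the sequence that the Transformer then processes.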

When to use it

Use ViT for image classification or prediction on image (pixel) data when you have a large dataset (on the order of 1,000 images or more, and ideally far more) and access to a GPU. It works best with a large pre-training corpus or with transfer learning from a pre-trained model. On small image datasets (below a few hundred images) ViT cannot learn its patch-based attention reliably, and a CNN or a classical machine-learning method such as Random Forest or SVM is the safer choice.
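The rules of thumb above can be written down as a small decision helper. This is only a sketch of the guidance on this page, not a validated selection procedure: the function name `suggest_approach`, its return labels, and the default `small_data_cutoff=300` (one possible reading of "a few hundred") are all assumptions made for illustration.

```python
def suggest_approach(
    num_images: int,
    has_gpu: bool,
    has_pretrained_checkpoint: bool,
    small_data_cutoff: int = 300,  # assumed stand-in for "a few hundred"
    large_data_cutoff: int = 1000,  # "on the order of 1,000 images or more"
) -> str:
    """Return a coarse suggestion based on this page's rules of thumb."""
    # Without a GPU, or on very small data, prefer a CNN or a classical method.
    if not has_gpu or num_images < small_data_cutoff:
        return "cnn_or_classical"
    # A pre-trained checkpoint makes fine-tuning the preferred route.
    if has_pretrained_checkpoint:
        return "fine_tune_pretrained_vit"
    # Training from scratch is only suggested on large datasets.
    if num_images >= large_data_cutoff:
        return "train_vit_from_scratch"
    return "cnn_or_classical"

print(suggest_approach(5000, has_gpu=True, has_pretrained_checkpoint=True))
print(suggest_approach(150, has_gpu=True, has_pretrained_checkpoint=False))
```

The cutoffs are parameters rather than constants because the page gives them only as rough orders of magnitude.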

Strengths & limitations

Strengths

Can surpass CNNs on large image datasets by modelling global relationships through self-attention.
Treats an image as a sequence of patches, applying the proven Transformer architecture to vision.
Captures long-range dependencies across the whole image, not just local neighbourhoods.
Benefits strongly from large-scale pre-training and transfer learning from pre-trained checkpoints.
Does not assume normally distributed data.

Limitations

Requires a large training set (about 1000 images or more); on small data it underperforms CNNs.
A GPU is required, and training from scratch is data- and compute-hungry.
On very small datasets (below a few hundred images) patch-based attention fails to learn reliably.
Strong results typically depend on a large pre-training corpus or transfer learning rather than training from scratch.

Frequently asked

How is ViT different from a CNN?

A CNN processes an image through local convolutional filters, while ViT splits the image into fixed-size patches, treats them as a sequence, and uses self-attention so any patch can relate to any other. This lets ViT model global relationships and, on large datasets, surpass CNNs.
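To make "any patch can relate to any other" concrete, here is a minimal single-head self-attention sketch in NumPy. It is an illustration under assumptions, not a full ViT layer: the weights are random rather than learned, the sizes (196 tokens of dimension 64) are chosen only for the example, and multi-head attention, residual connections, and normalisation are omitted.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a (num_tokens, dim) sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # scores[i, j] measures how strongly token i attends to token j.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
num_patches, dim = 196, 64  # illustrative sizes, not taken from the paper
tokens = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

output, weights = self_attention(tokens, w_q, w_k, w_v)
print(output.shape, weights.shape)  # (196, 64) (196, 196)
```

The 196 x 196 weight matrix has a non-zero entry for every pair of patches, so each output token mixes information from the whole image in a single layer, whereas a convolutional filter only sees its local neighbourhood.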

How much data does ViT need?

ViT is data-hungry: it works best with around 1000 or more images and a large pre-training corpus. Below a few hundred images its patch-based attention cannot learn reliably, and a CNN or classical method is preferable.

Do I need a GPU?

Yes. ViT requires a GPU, and training it is data- and compute-intensive. Using a pre-trained checkpoint with transfer learning reduces the burden considerably.

What if my image dataset is small?

On small datasets a ViT trained from scratch underperforms. Below roughly 500 images, consider a classical method such as Random Forest; below a few hundred, a CNN or SVM is the safer choice. Alternatively, fine-tune a pre-trained ViT checkpoint instead of training from scratch.

Sources

Dosovitskiy, A. et al. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR.
Touvron, H. et al. (2021). Training Data-Efficient Image Transformers & Distillation Through Attention. ICML.

How to cite this page

ScholarGate. (2026, June 1). Vision Transformer (ViT). ScholarGate. https://scholargate.app/en/deep-learning/vision-transformer

Related reference concepts

Object Recognition and Detection
Convolutional and Sequence Models
Self-Supervised and Representation Learning
Computer Vision
Deep Learning
Image Segmentation

