Explainable Vision Transformer

Explainable Vision Transformer combines the strong image-recognition performance of Vision Transformers (ViT) with post-hoc attribution techniques, such as relevance propagation, attention rollout, or gradient-weighted attention, that highlight which image regions drive each prediction. Because these techniques are applied to an already trained model, researchers and practitioners can audit model decisions and support transparency requirements without changing the model's predictions or its accuracy.
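
Of the attribution techniques named above, attention rollout is the simplest to illustrate. The sketch below is a minimal NumPy version under stated assumptions, not a reference implementation: it uses random attention matrices as a stand-in for the per-layer attention of a real ViT, assumes token 0 is the class token, and mixes each layer's head-averaged attention equally with the identity matrix to account for residual connections. The function names attention_rollout and cls_heatmap are hypothetical and chosen only for this example.

```python
import numpy as np


def attention_rollout(attentions):
    """Combine per-layer attention into one token-to-token relevance matrix.

    attentions: list of arrays shaped (heads, tokens, tokens), one per layer,
    where each row is a softmax distribution (sums to 1).
    """
    num_tokens = attentions[0].shape[-1]
    identity = np.eye(num_tokens)
    rollout = identity.copy()
    for layer_attn in attentions:
        # Average over heads to get a single (tokens, tokens) matrix.
        a = layer_attn.mean(axis=0)
        # Assumption: model the residual connection as an equal mix with identity.
        a = 0.5 * a + 0.5 * identity
        # Re-normalize rows so each remains a distribution.
        a = a / a.sum(axis=-1, keepdims=True)
        # Propagate relevance from earlier layers through this layer.
        rollout = a @ rollout
    return rollout


def cls_heatmap(rollout, grid_size):
    """Relevance of each image patch to the class token, as a 2D map."""
    # Row 0 is the class token; columns 1.. are the patch tokens.
    scores = rollout[0, 1:]
    heat = scores.reshape(grid_size, grid_size)
    return heat / heat.max()  # scale so the most relevant patch is 1.0


# Stand-in data: random softmax attention instead of a trained ViT.
rng = np.random.default_rng(0)
num_layers, num_heads, grid = 4, 3, 2
tokens = 1 + grid * grid  # class token + patches
logits = rng.normal(size=(num_layers, num_heads, tokens, tokens))
attn = np.exp(logits)
attn = attn / attn.sum(axis=-1, keepdims=True)

rollout = attention_rollout(list(attn))
heatmap = cls_heatmap(rollout, grid)
print(heatmap)
```

With a real model, the random matrices would be replaced by the attention weights recorded at each transformer block, and the resulting patch map would be upsampled to the input resolution and overlaid on the image. Rollout uses attention alone and is not class-specific; relevance propagation and gradient-weighted variants add gradient or relevance information to produce per-class maps.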

Sources

  1. Chefer, H., Gur, S., & Wolf, L. (2021). Transformer interpretability beyond attention visualization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 782–791. DOI: 10.1109/CVPR46437.2021.00084
  2. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR).

ScholarGate. Explainable Vision Transformer (XViT / ViT with Post-hoc Attribution). Retrieved 2026-06-04 from https://scholargate.app/en/deep-learning/explainable-vision-transformer