ScholarGate

Multimodal Vision Transformer

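A Multimodal Vision Transformer (Multimodal ViT) extends the Vision Transformer (ViT) architecture to jointly process and align representations from multiple modalities, typically images and text, using self-attention and cross-attention mechanisms. By learning a shared or aligned embedding space across modalities, it supports tasks such as visual question answering, image-text retrieval, visual grounding, and image captioning.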
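To make the two mechanisms named above concrete, the following is a minimal sketch in plain Python, not the implementation of any particular model. It assumes toy 4-dimensional embeddings and omits the learned query, key, and value projections, multiple heads, and positional encodings that a real Multimodal ViT would use. All names (`cross_attention`, `mean_pool`, `image_patches`, `text_tokens`, `captions`) are hypothetical and chosen for illustration only.

```python
import math


def softmax(xs):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    Passing the same sequence as queries, keys, and values gives
    self-attention; passing text tokens as queries and image patches as
    keys/values gives text-to-image cross-attention.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        w = softmax([dot(q, k) * scale for k in keys])
        outputs.append([sum(wi * v[j] for wi, v in zip(w, values)) for j in range(dim)])
        weights.append(w)
    return outputs, weights


def mean_pool(tokens):
    # Collapse a token sequence into a single embedding vector.
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy inputs (assumed values): three image patch embeddings, two text tokens.
image_patches = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
text_tokens = [[4.0, 0.0, 0.0, 0.0], [0.0, 0.0, 4.0, 0.0]]

# Self-attention within the image modality.
image_context, _ = cross_attention(image_patches, image_patches, image_patches)

# Cross-attention: each text token gathers information from the image patches.
fused_text, attn = cross_attention(text_tokens, image_patches, image_patches)

# Image-text retrieval in a shared embedding space via cosine similarity.
image_embedding = mean_pool(image_patches)
captions = {"caption_a": [1.0, 1.0, 1.0, 0.0], "caption_b": [0.0, 0.0, 0.0, 1.0]}
best_caption = max(captions, key=lambda name: cosine_similarity(image_embedding, captions[name]))
```

In this sketch each row of `attn` is a probability distribution over image patches, so a text token aligned with a patch places most of its weight there. The retrieval step ranks captions by their similarity to the pooled image embedding, which is the basic idea behind contrastively trained image-text models.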

Sources

  1. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations (ICLR).
  2. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139.

How to cite this page

ScholarGate. (2026, June 3). Multimodal Vision Transformer (Multimodal ViT). ScholarGate. https://scholargate.app/zh/deep-learning/multimodal-vision-transformer

Cited in

ScholarGate. Multimodal Vision Transformer (Multimodal ViT). Retrieved 2026-06-15 from https://scholargate.app/zh/deep-learning/multimodal-vision-transformer · Dataset: https://doi.org/10.5281/zenodo.20539026