
Multilingual Vision Transformer

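The Multilingual Vision Transformer (Multilingual ViT) extends the Vision Transformer architecture to operate across multiple languages, enabling image understanding and image-text reasoning in multilingual or cross-lingual settings. It combines patch-based image encoding with multilingual text representations, so that a single model can serve different language communities on tasks such as image captioning, visual question answering, and cross-lingual image retrieval.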
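The two ingredients named above, patch-based image encoding and a text representation shared across languages, can be illustrated with a minimal pure-Python sketch. It is not the model itself: `patchify`, `cosine`, and `retrieve` are hypothetical helper names, and the embedding vectors are hand-written toy values standing in for what trained image and multilingual text encoders would produce.

```python
import math


def patchify(image, patch_size):
    """Split an H x W grid (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch in row-major order, as a ViT does before
            # applying its learned linear projection (omitted here).
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, image_vecs):
    """Return the name of the image whose embedding is closest to the query."""
    return max(image_vecs, key=lambda name: cosine(query_vec, image_vecs[name]))


# Toy 4 x 4 single-channel "image" split into four 2 x 2 patches.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)

# ASSUMPTION: hand-written toy embeddings in one shared space. In a real
# system these would come from trained image and multilingual text encoders.
image_vecs = {"cat.jpg": [0.9, 0.1, 0.0], "car.jpg": [0.0, 0.2, 0.9]}
text_vecs = {
    "a cat": [0.8, 0.2, 0.1],        # English
    "un gato": [0.85, 0.15, 0.05],   # Spanish
    "ein Auto": [0.1, 0.1, 0.9],     # German
}

for caption, vec in text_vecs.items():
    print(caption, "->", retrieve(vec, image_vecs))
```

Because captions in different languages are assumed to land near each other in the shared space, the same nearest-neighbour lookup serves every language without a per-language model. That property is what cross-lingual image retrieval relies on.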

Related methods

Classification with multilingual RoBERTa; multilingual sentence embeddings; multimodal Vision Transformer; Vision Transformer; multilingual image classification

Sources

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations (ICLR 2021).
Bugliarello, E., Liu, F., Pfeiffer, J., Reddy, S., Elliott, D., Erdem, E., Erdem, A., & Lukasiewicz, T. (2022). IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages. International Conference on Machine Learning (ICML 2022).

How to cite this page

ScholarGate. (2026, June 3). Multilingual Vision Transformer (Multilingual ViT). ScholarGate. https://scholargate.app/zh/deep-learning/multilingual-vision-transformer

Cited in

Multilingual image classification