Multilingual Vision Transformer

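The Multilingual Vision Transformer (Multilingual ViT) extends the Vision Transformer architecture to multiple languages, enabling image understanding and image-text reasoning in multilingual or cross-lingual settings. By combining patch-based image encoding with multilingual text representations, it lets a single model serve diverse language communities on tasks such as image captioning, visual question answering, and cross-lingual image retrieval.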
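
As a rough illustration of how patch-based image encoding can be combined with multilingual text representations, the following NumPy sketch splits an image into 16x16 patches, projects them linearly, and concatenates the result with token embeddings drawn from a vocabulary shared across languages. Everything here is an assumption made for illustration: the names `patchify`, `embed_text`, and `d_model`, the toy vocabulary, the random weights, and the choice of simple concatenation as the fusion step. It is not the architecture of any particular published model, and positional embeddings and the Transformer encoder itself are omitted.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)  # (num_patches, patch * patch * C)


rng = np.random.default_rng(0)
d_model = 64  # hypothetical shared embedding width

# Image side: 224x224 RGB image -> 196 patches of 16*16*3 = 768 values each.
image = rng.random((224, 224, 3))
patches = patchify(image)
w_patch = rng.normal(scale=0.02, size=(patches.shape[1], d_model))  # random stand-in weights
image_tokens = patches @ w_patch  # (196, d_model)

# Text side: a toy vocabulary shared across languages (hypothetical).
vocab = {"<unk>": 0, "a": 1, "cat": 2, "猫": 3, "un": 4, "gato": 5}
text_table = rng.normal(scale=0.02, size=(len(vocab), d_model))


def embed_text(tokens: list[str]) -> np.ndarray:
    """Look up one embedding row per token; unknown tokens map to <unk>."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
    return text_table[ids]


text_tokens = embed_text(["un", "gato"])  # (2, d_model)

# Assumed fusion: one joint sequence that a Transformer encoder could attend over.
joint = np.concatenate([image_tokens, text_tokens], axis=0)  # (198, d_model)
```

Because both modalities land in the same `d_model`-wide space, the same sequence layout works regardless of which language the text tokens come from; in a real system the random tables would be replaced by learned weights.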

Related methods

Multilingual RoBERTa-based classification
Multilingual sentence embeddings
Multimodal Vision Transformer
Vision Transformer
Multilingual image classification

Sources

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. International Conference on Learning Representations (ICLR 2021).
Bugliarello, E., Liu, F., Pfeiffer, J., Reddy, S., Elliott, D., Erdem, E., Erdem, A., & Lukasiewicz, T. (2022). IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages. International Conference on Machine Learning (ICML 2022).

How to cite this page

ScholarGate. (2026, June 3). Multilingual Vision Transformer (Multilingual ViT). ScholarGate. https://scholargate.app/ja/deep-learning/multilingual-vision-transformer

Entries that reference this method

Multilingual image classification
