Machine learningDeep learning / NLP / CV

Multimodal Transformer

A Multimodal Transformer extends the standard Transformer architecture to process and jointly reason over two or more input modalities — most commonly text and images, but also audio, video, or structured data. Cross-modal attention layers allow information from one modality to inform representations in another, enabling tasks such as visual question answering, image captioning, and multimodal sentiment analysis.

MethodMind'de açSoonVideoSoon

Tam yöntemi oku

Members only

Sign in with a free account to read this section.

Sign in

Sources

  1. Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Advances in Neural Information Processing Systems (NeurIPS), 32. link
  2. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139. link

Related methods

Referenced by

ScholarGateMultimodal Transformer (Multimodal Transformer (Cross-Modal Attention-Based Architecture)). Retrieved 2026-06-04 from https://scholargate.app/tr/deep-learning/multimodal-transformer