Multimodal Transformer
A Multimodal Transformer extends the standard Transformer architecture to process and jointly reason over two or more input modalities — most commonly text and images, but also audio, video, or structured data. Cross-modal attention layers allow information from one modality to inform representations in another, enabling tasks such as visual question answering, image captioning, and multimodal sentiment analysis.
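As a rough illustration of the cross-modal attention described above, the following minimal NumPy sketch lets text tokens (queries) attend over image patches (keys and values). It is a single-head toy under stated assumptions, not the layer used by any particular model: the function name `cross_modal_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and all dimensions are hypothetical, and the inputs are random placeholders rather than real embeddings.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_modal_attention(text_tokens, image_patches, w_q, w_k, w_v):
    """Single-head cross-attention: text queries attend over image patches.

    All names and shapes here are illustrative assumptions.
    """
    q = text_tokens @ w_q      # (n_text, d_k) queries from the text modality
    k = image_patches @ w_k    # (n_img, d_k) keys from the image modality
    v = image_patches @ w_v    # (n_img, d_k) values from the image modality
    scores = q @ k.T / np.sqrt(q.shape[-1])  # scaled dot-product scores
    weights = softmax(scores, axis=-1)       # each text token's weights over patches
    return weights @ v, weights


# Random placeholder inputs: 4 text tokens and 9 image patches.
rng = np.random.default_rng(0)
n_text, n_img, d_text, d_img, d_k = 4, 9, 16, 32, 8
text_tokens = rng.normal(size=(n_text, d_text))
image_patches = rng.normal(size=(n_img, d_img))
w_q = rng.normal(size=(d_text, d_k))
w_k = rng.normal(size=(d_img, d_k))
w_v = rng.normal(size=(d_img, d_k))

attended, weights = cross_modal_attention(text_tokens, image_patches, w_q, w_k, w_v)
print(attended.shape)  # (4, 8): one image-informed vector per text token
print(weights.shape)   # (4, 9): attention of each text token over the patches
```

Each row of `weights` sums to 1, so every text token receives a convex combination of image-patch values. Swapping the roles of the two inputs would give the image-to-text direction; two-stream designs of this kind typically apply both.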
Sources
- Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. Advances in Neural Information Processing Systems (NeurIPS), 32.
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139.
Related methods
Referenced by
- Explainable Transformer
- Multimodal Convolutional Neural Network
- Multimodal Diffusion Model
- Multimodal Doc2Vec
- Multimodal GAN
- Multimodal Graph Neural Network
- Multimodal GRU
- Multimodal Image Classification
- Multimodal LDA topic model
- Multimodal LSTM
- Multimodal Multilayer Perceptron
- Multimodal Named Entity Recognition
- Multimodal Object Detection
- Multimodal question answering
- Multimodal Recurrent Neural Network
- Multimodal Reinforcement Learning
- Multimodal RoBERTa-based Classification
- Multimodal Text Summarization
- Multimodal Topic Modeling
- Multimodal Word2Vec