Multimodal Question Answering

Multimodal question answering (Multimodal QA) is a class of deep-learning methods that answer natural-language questions by jointly reasoning over information from multiple modalities: most commonly text and images, but also video, audio, and structured tables. Brought to prominence by the VQA benchmark in 2015, it has since grown into a broad research area with applications in document understanding, medical diagnosis assistance, and embodied AI.

Sources

  1. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual Question Answering. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2425–2433. DOI: 10.1109/ICCV.2015.279
  2. Xu, P., Zhu, X., & Clifton, D. A. (2023). Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10), 12113–12132. DOI: 10.1109/TPAMI.2023.3275156

ScholarGate. Multimodal question answering (Multimodal Question Answering (Cross-Modal QA)). Retrieved 2026-06-04 from https://scholargate.app/tr/deep-learning/multimodal-question-answering