Multimodal Text Summarization

Multimodal text summarization generates a concise textual summary by jointly processing multiple input modalities — most commonly text and images, but also video frames or audio — using deep learning models that align visual and linguistic representations. The output is a natural-language summary that captures salient content from all available modalities.
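
One common way to align visual and linguistic representations is cross-modal attention, in which each text token attends over image regions and the attended visual context is fused with the token's own features. The sketch below illustrates that general idea only; it is not the method of any cited paper. The function name `cross_modal_attention`, the feature dimensions, and the random inputs are all assumptions made for illustration, and a real system would obtain the features from trained text and image encoders and pass the fused output to a summary decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, image_feats):
    """Hypothetical helper: each text token attends over image regions.

    text_feats:  (n_tokens, d) token embeddings
    image_feats: (n_regions, d) region embeddings, assumed to share the same space
    Returns (fused, weights) with shapes (n_tokens, 2 * d) and (n_tokens, n_regions).
    """
    d = text_feats.shape[1]
    # Scaled dot-product similarity between every token and every region.
    scores = text_feats @ image_feats.T / np.sqrt(d)
    # Each row is a probability distribution over image regions.
    weights = softmax(scores, axis=1)
    # Weighted average of region features: the visual context per token.
    visual_context = weights @ image_feats
    # Simple fusion by concatenation (an assumption; gating is another option).
    fused = np.concatenate([text_feats, visual_context], axis=1)
    return fused, weights

# Random stand-ins for encoder outputs: 6 tokens, 4 image regions, dimension 8.
rng = np.random.default_rng(0)
text_feats = rng.normal(size=(6, 8))
image_feats = rng.normal(size=(4, 8))
fused, weights = cross_modal_attention(text_feats, image_feats)
print(fused.shape, weights.shape)
```

The attention weights also indicate which image regions each token relies on, which is one reason this style of fusion is often preferred over concatenating a single global image vector.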

Sources

  1. Zhu, J., Li, H., Liu, T., Zhou, Y., Zhang, J., & Zong, C. (2018). MSMO: Multimodal Summarization with Multimodal Output. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 4154–4164.
  2. Zhu, J., Zhou, Y., Zhang, J., Li, H., Zong, C., & Li, C. (2020). Multimodal Summarization with Guidance of Multimodal Reference. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05), 9749–9756.

ScholarGate. Multimodal Text Summarization (Cross-Modal Abstractive and Extractive Summarization). Retrieved 2026-06-04 from https://scholargate.app/tr/deep-learning/multimodal-text-summarization