Multimodal Doc2Vec

Multimodal Doc2Vec extends the Doc2Vec paragraph-vector framework to incorporate information from more than one modality — typically text alongside images, audio, or structured metadata — producing a shared document-level embedding that captures semantics from multiple sources simultaneously. It is used for cross-modal retrieval, multi-source classification, and document representation where text alone is insufficient.
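One simple way to picture a shared document-level embedding is late fusion: normalize one vector per modality and concatenate them. The sketch below is an illustration under stated assumptions, not the method described in the sources. The text vectors stand in for the output of an already trained Doc2Vec model, the image vectors stand in for the output of some image encoder, and the names `l2_normalize`, `fuse`, `cosine`, and `text_weight` are hypothetical. All numbers are hand-written toy values.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(text_vec, image_vec, text_weight=0.5):
    """Concatenate weighted, unit-length per-modality vectors into one document vector.

    text_weight is a hypothetical knob: 1.0 ignores the image, 0.0 ignores the text.
    """
    text_part = [text_weight * x for x in l2_normalize(text_vec)]
    image_part = [(1.0 - text_weight) * x for x in l2_normalize(image_vec)]
    return text_part + image_part


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy placeholder vectors: (text vector, image vector) per document.
docs = {
    "doc_a": fuse([0.9, 0.1, 0.0], [0.2, 0.8]),
    "doc_b": fuse([0.8, 0.2, 0.1], [0.1, 0.9]),
    "doc_c": fuse([0.0, 0.1, 0.9], [0.9, 0.1]),
}

# Retrieval: rank the other documents by similarity to doc_a in the fused space.
query = docs["doc_a"]
ranked = sorted(
    (name for name in docs if name != "doc_a"),
    key=lambda name: cosine(query, docs[name]),
    reverse=True,
)
print(ranked)
```

Normalizing each modality before concatenation keeps one modality from dominating the similarity score simply because its raw vectors have larger magnitudes. Jointly trained variants, in which the modalities influence each other during learning, would differ from this post-hoc concatenation.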

Sources

  1. Le, Q. V., & Mikolov, T. (2014). Distributed Representations of Sentences and Documents. Proceedings of the 31st International Conference on Machine Learning (ICML), PMLR 32(2), 1188–1196.
  2. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal Deep Learning. Proceedings of the 28th International Conference on Machine Learning (ICML), 689–696.

ScholarGate. Multimodal Doc2Vec (Paragraph Vector with Multi-Source Input). Retrieved 2026-06-04 from https://scholargate.app/en/deep-learning/multimodal-doc2vec