Machine learningDeep learning / NLP / CV

Multimodal Sentence Embeddings

Multimodal sentence embeddings map text and images (and sometimes audio or video) into a shared continuous vector space, so that semantically related pairs from different modalities land close together. Trained by contrastive objectives on large paired corpora, these representations power cross-modal retrieval, zero-shot classification, and vision-language reasoning.

Open in MethodMindSoonVideoSoon

Read the full method

Members only

Sign in with a free account to read this section.

Sign in

Sources

  1. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning (ICML), pp. 8748–8763. PMLR. link
  2. Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., & Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 26. link

Related methods

Referenced by

ScholarGateMultimodal Sentence Embeddings (Multimodal Sentence Embeddings (Joint Vision-Language Representation Learning)). Retrieved 2026-06-04 from https://scholargate.app/en/deep-learning/multimodal-sentence-embeddings