
Multimodal Natural Language Processing: Vision-Language Understanding

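Multimodal natural language processing (multimodal NLP) is a class of NLP pipelines that combine text with one or more additional data modalities, most commonly images but also audio and video, to perform understanding and generation tasks such as visual question answering, image captioning, and multimodal emotion recognition. The field took its modern form with the introduction of CLIP (Radford et al., 2021) and has since advanced through architectures such as BLIP-2 (Li et al., 2023), which connect frozen image encoders to large language models.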
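To make the idea of combining text with images concrete, here is a minimal sketch of CLIP-style image-text matching: both modalities are mapped to vectors in a shared space, and captions are ranked by temperature-scaled cosine similarity to the image. Everything specific here is an assumption for illustration: the three-dimensional embeddings are invented toy values rather than the output of any real encoder, `rank_captions` is a hypothetical helper name, and the temperature is a fixed illustrative constant (CLIP learns this value during training).

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(scores):
    """Turn raw scores into probabilities (max subtracted for numerical stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rank_captions(image_emb, caption_embs, temperature=0.07):
    """Hypothetical helper: return one probability per caption for one image."""
    image_unit = l2_normalize(image_emb)
    sims = []
    for emb in caption_embs:
        cap_unit = l2_normalize(emb)
        # Cosine similarity between the image and this caption.
        sims.append(sum(a * b for a, b in zip(image_unit, cap_unit)))
    # Dividing by a small temperature sharpens the distribution.
    return softmax([s / temperature for s in sims])

# Toy embeddings, invented for illustration only (not produced by any real model).
image_embedding = [0.9, 0.1, 0.2]
captions = ["a photo of a dog", "a photo of a cat", "a diagram of a circuit"]
caption_embeddings = [
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.3],
    [0.2, 0.1, 0.9],
]

probs = rank_captions(image_embedding, caption_embeddings)
best_caption = captions[probs.index(max(probs))]
print(best_caption)
```

With these toy vectors the first caption is closest to the image and receives nearly all of the probability mass. In a real pipeline the vectors would come from trained image and text encoders; the ranking step itself would look much the same.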


Related methods

Attention mechanism, BERT, embeddings, sentiment analysis, Vision Transformer

Sources

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning (ICML), 8748–8763.
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. Proceedings of the 40th International Conference on Machine Learning (ICML), 19730–19742.

How to cite this page

ScholarGate. (2026, June 1). Multimodal Natural Language Processing. ScholarGate. https://scholargate.app/zh/text-mining/multimodal-nlp
