Multimodal Image Classification

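Multimodal image classification extends standard visual classification by integrating modalities beyond image features, such as text descriptions, audio, or structured metadata. A separate encoder processes each modality, the resulting representations are fused, and a joint classifier assigns the target label. Models such as CLIP show that image-text alignment enables zero-shot and few-shot image classification at scale.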
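
The encode, fuse, classify pipeline described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the method of any particular paper: the class name `LateFusionClassifier`, the helper functions, the feature dimensions, and the randomly initialised weights are all hypothetical, and fusion is done by simple concatenation (one common choice among several). A real system would use learned encoders such as a CNN or transformer and train all weights end to end.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Hypothetical random initialisation; a trained model would learn these."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


class LateFusionClassifier:
    """Toy two-modality classifier: separate encoders, concatenation, joint head."""

    def __init__(self, image_dim, text_dim, hidden_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.image_encoder = init_layer(rng, hidden_dim, image_dim)
        self.text_encoder = init_layer(rng, hidden_dim, text_dim)
        # The joint classifier sees both encoded modalities side by side.
        self.classifier = init_layer(rng, num_classes, 2 * hidden_dim)

    def predict_proba(self, image_features, text_features):
        z_img = relu(linear(*self.image_encoder, image_features))
        z_txt = relu(linear(*self.text_encoder, text_features))
        fused = z_img + z_txt  # list concatenation acts as feature concatenation
        return softmax(linear(*self.classifier, fused))


# Assumed toy inputs: 4 image features and 3 text features for one sample.
model = LateFusionClassifier(image_dim=4, text_dim=3, hidden_dim=5, num_classes=3)
probs = model.predict_proba([0.2, 0.7, 0.1, 0.9], [1.0, 0.0, 0.5])
predicted_label = max(range(len(probs)), key=probs.__getitem__)
```

Concatenation keeps the sketch short; alternatives such as averaging per-modality predictions or attention-based fusion would change only the `fused` line and the classifier input size.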

Sources

  1. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139, 8748–8763.
  2. Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. Proceedings of the 28th International Conference on Machine Learning (ICML), 689–696.

How to cite this page

ScholarGate. (2026, June 3). Multimodal Image Classification (Vision + Auxiliary Modality Fusion). ScholarGate. https://scholargate.app/zh/deep-learning/multimodal-image-classification

Cited in

ScholarGate. Multimodal Image Classification (Vision + Auxiliary Modality Fusion). Retrieved 2026-06-15 from https://scholargate.app/zh/deep-learning/multimodal-image-classification · Dataset: https://doi.org/10.5281/zenodo.20539026