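Multimodal Image Classification
Multimodal image classification extends standard visual classification by integrating modalities beyond image features, such as text descriptions, audio, or structured metadata. A separate encoder processes each modality, the resulting representations are fused, and a joint classifier assigns the target label. Models such as CLIP show that image-text alignment enables zero-shot and few-shot image classification at scale.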
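The pipeline described above (per-modality encoders, fusion, joint classifier) and CLIP-style zero-shot scoring can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the embeddings are hand-written toy vectors standing in for real encoder outputs, the classifier weights are untrained, concatenation is only one of several possible fusion strategies, and the names `fuse_concat`, `linear_classifier`, `zero_shot_classify`, and the `logit_scale` value are all assumptions introduced here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_concat(image_emb, text_emb):
    """Late fusion by concatenation (an assumed, simple fusion choice)."""
    return image_emb + text_emb

def linear_classifier(features, weights, biases):
    """Joint classifier: one weight row per class, followed by softmax."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    return softmax(logits)

def zero_shot_classify(image_emb, class_text_embs, logit_scale=100.0):
    """CLIP-style zero-shot scoring: cosine similarity between the image
    embedding and one text embedding per class prompt, then softmax.
    logit_scale is an assumed value, not taken from the source."""
    img = l2_normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, l2_normalize(t)))
            for t in class_text_embs]
    return softmax([logit_scale * s for s in sims])

# Toy embeddings standing in for real encoder outputs (hypothetical values).
image_emb = [0.9, 0.1, 0.2]
text_emb = [0.3, 0.8]

# Fusion + joint classifier with hand-picked, untrained weights.
fused = fuse_concat(image_emb, text_emb)
weights = [[1.0, 0.0, 0.0, 0.0, 1.0],   # class 0
           [0.0, 1.0, 1.0, 1.0, 0.0]]   # class 1
biases = [0.0, 0.0]
fusion_probs = linear_classifier(fused, weights, biases)

# Zero-shot: pick the class whose prompt embedding is closest to the image.
class_text_embs = [[1.0, 0.0, 0.0],   # stands in for a prompt such as "a photo of a cat"
                   [0.0, 1.0, 0.0]]   # stands in for a prompt such as "a photo of a dog"
zero_shot_probs = zero_shot_classify(image_emb, class_text_embs)
predicted = max(range(len(zero_shot_probs)), key=zero_shot_probs.__getitem__)
```

In a real system the toy vectors would come from trained image and text encoders, and the fused classifier would be learned from labeled data. The zero-shot path needs no classifier training at all: the class set is defined entirely by the text prompts.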
Sources
- Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning (ICML), PMLR 139, 8748–8763.
- Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., & Ng, A. Y. (2011). Multimodal deep learning. Proceedings of the 28th International Conference on Machine Learning (ICML), 689–696.
How to cite this page
ScholarGate. (2026, June 3). Multimodal Image Classification (Vision + Auxiliary Modality Fusion). ScholarGate. https://scholargate.app/zh/deep-learning/multimodal-image-classification
Which method?
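- Fine-tuned image classification (deep learning)
- Image classification (deep learning)
- Multimodal BERT classification (deep learning)
- Multimodal object detection (deep learning)
- Multimodal sentence embeddings (deep learning)
- Multimodal Transformer (deep learning)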