CLIP — Contrastive Language-Image Pretraining

Contrastive Language-Image Pretraining · Also known as: CLIP, Contrastive Language-Image Pre-training, zero-shot image classifier, visual-language model

CLIP (Contrastive Language-Image Pretraining) is a vision-language model introduced by Radford et al. at OpenAI in 2021. It trains an image encoder and a text encoder jointly on 400 million internet-sourced image-text pairs with a contrastive objective, so that matching images and captions receive similar embeddings. The resulting aligned representations enable zero-shot transfer to image classification tasks without task-specific fine-tuning.
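
The contrastive objective can be sketched as a symmetric cross-entropy over a batch's image-text similarity matrix, where the matched pairs sit on the diagonal. This is a minimal pure-Python illustration, not the authors' training code: the function names, the toy 2-d embeddings, and the fixed temperature of 0.07 are assumptions made for the example (Radford et al. treat the temperature as a learned parameter).

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix.

    image_embs[i] and text_embs[i] are assumed to be a matched pair.
    The temperature is fixed here for simplicity (an assumption).
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each row should favour its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same, taken over columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch (hypothetical embeddings): perfectly aligned vs. swapped captions.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)  # the aligned batch has the lower loss
```

Minimising this loss pulls each image towards its own caption and pushes it away from the other captions in the batch, which is what produces the shared embedding space.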

When to use it

CLIP is appropriate when labelled training data for the target task are scarce or unavailable, when the set of classes may change at deployment time (open-vocabulary recognition), or when a single pretrained backbone is needed for multiple downstream tasks such as image classification, image-text retrieval, or visual question answering. It assumes access to a pretrained CLIP checkpoint (OpenAI provides several) and that candidate class names can be expressed as natural-language phrases. CLIP does not replace task-specific fine-tuning when labelled data are plentiful: supervised fine-tuning on the target distribution typically outperforms zero-shot CLIP. It is also not designed for dense prediction tasks (detection, segmentation) without additional adaptation.

Strengths & limitations

Strengths

Zero-shot transfer to new image classification tasks without any labelled examples for those tasks.
Open-vocabulary recognition: class sets can be specified or changed at inference time using natural language.
More robust to distribution shift than ImageNet-supervised models, because training covers diverse internet images and captions.
Pretrained image and text encoders can be reused as powerful feature extractors for retrieval, ranking, and multimodal downstream tasks.
Scales predictably: both zero-shot accuracy and downstream transfer improve with larger models and more training data.

Limitations

Zero-shot performance lags behind task-specific supervised fine-tuning when labelled data for the target domain are available.
Expensive to pretrain: training CLIP from scratch demands hundreds of GPUs and 400M curated image-text pairs, although running a released checkpoint is far cheaper.
Inherits biases present in web-scale training data, which can surface as demographic and cultural skew in its predictions.
Struggles with fine-grained or abstract tasks (e.g., counting objects, reading text, spatial reasoning) that are underrepresented in natural captions.
Sensitive to prompt wording: the exact phrasing of class descriptions meaningfully affects accuracy, so results depend on prompt engineering choices.

Frequently asked

How does CLIP classify images if it was never shown the target classes during training?

CLIP's contrastive objective aligns image and text representations in a shared space. At inference time, each candidate class label is converted into a natural-language prompt and encoded by the text encoder. The image is encoded by the image encoder, and the class whose text embedding has the highest cosine similarity to the image embedding is selected. Because the encoders learned general visual-semantic associations from 400 million diverse pairs, this generalises to unseen classes described in natural language.
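
The inference step described here can be sketched as a cosine-similarity argmax. The example below is a minimal illustration under stated assumptions: the 3-d vectors are hypothetical stand-ins for the outputs of the image and text encoders, and `zero_shot_classify` is an illustrative name, not part of any CLIP library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_embedding, class_text_embeddings):
    """Pick the prompt whose text embedding is most similar to the image embedding.

    class_text_embeddings maps each class prompt to its (precomputed) text embedding.
    Returns the winning prompt and the full score table.
    """
    scores = {
        prompt: cosine_similarity(image_embedding, emb)
        for prompt, emb in class_text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical embeddings standing in for encoder outputs.
class_text_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_embedding, class_text_embeddings)
print(label)  # a photo of a dog
```

Changing the class set only requires encoding new prompts; the image embedding and the comparison step stay the same, which is what makes the approach open-vocabulary.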

What is prompt engineering in the context of CLIP, and why does it matter?

Prompt engineering refers to the choice of text template used to encode class names (e.g., 'a photo of a {class}' vs. simply '{class}'). Radford et al. found that contextual templates consistently improve zero-shot accuracy over bare class names, sometimes by several percentage points. Ensembling multiple prompt templates further boosts performance. The sensitivity to prompt wording reflects the fact that the text encoder was trained on naturally occurring sentences, not isolated nouns.
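
Prompt ensembling can be sketched as averaging the normalised text embeddings of several templates and renormalising the result. This is a sketch under stated assumptions: the template list is illustrative, and `toy_encode_text` is a hypothetical stand-in (letter counts folded into a few buckets) for a real text encoder, which would be passed in through the same `encode_text` argument.

```python
import math

# Illustrative templates; the wording is an assumption for this example.
TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ensemble_class_embedding(class_name, encode_text, templates=TEMPLATES):
    """Average the unit-length embeddings of every prompt, then renormalise."""
    embs = [l2_normalize(encode_text(t.format(class_name))) for t in templates]
    dim = len(embs[0])
    mean = [sum(e[k] for e in embs) / len(embs) for k in range(dim)]
    return l2_normalize(mean)

def toy_encode_text(prompt, dim=8):
    """Hypothetical text encoder: letter counts folded into `dim` buckets."""
    vec = [0.0] * dim
    for ch in prompt.lower():
        if ch.isalpha():
            vec[(ord(ch) - ord("a")) % dim] += 1.0
    return vec

dog_embedding = ensemble_class_embedding("dog", toy_encode_text)
print(len(dog_embedding))  # 8
```

The ensembled vector is used exactly like a single-prompt class embedding, so the extra cost is paid once per class rather than once per image.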

Is CLIP a generative model?

No. CLIP is a discriminative contrastive model that produces matched image and text embeddings; it does not generate images or text. However, CLIP embeddings have been used to guide generative models such as DALL-E 2, whose diffusion-based decoder generates images conditioned on CLIP image embeddings.

When should I fine-tune CLIP rather than use it zero-shot?

Zero-shot CLIP is competitive when labelled data are absent or when the task's images and class descriptions resemble the web images and captions CLIP was trained on. Fine-tuning (either the full model or a linear probe) is recommended when labelled data are available for the target domain, especially for fine-grained or domain-specific tasks such as medical imaging, satellite imagery, or product recognition, where zero-shot accuracy can lag significantly behind supervised alternatives.
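
A linear probe, the lighter of the two fine-tuning options, can be sketched as logistic regression trained on frozen image embeddings. The example below is a minimal binary version under stated assumptions: the 2-d feature vectors are hypothetical stand-ins for image-encoder outputs, and the learning rate and epoch count are arbitrary choices for the toy data.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit binary logistic regression by batch gradient descent.

    Only the probe's weights are learned; the features are treated as
    fixed outputs of a frozen encoder.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(wk * xk for wk, xk in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            for k in range(dim):
                grad_w[k] += err * x[k] / n
            grad_b += err / n
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return 1 if the probe's logit is non-negative, else 0."""
    return 1 if sum(wk * xk for wk, xk in zip(w, x)) + b >= 0 else 0

# Hypothetical frozen features for two classes.
features = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
labels = [1, 1, 0, 0]

w, b = train_linear_probe(features, labels)
print([predict(w, b, x) for x in features])  # [1, 1, 0, 0]
```

Because the encoder is never updated, features can be computed once and cached, which keeps a probe cheap compared with full-model fine-tuning.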

Sources

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 8748–8763.
Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. arXiv:2103.00020.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. ISBN: 978-0-262-03561-3

How to cite this page

ScholarGate. (2026, June 3). Contrastive Language-Image Pretraining. ScholarGate. https://scholargate.app/en/deep-learning/clip

Referenced by

Multimodal BERT-based Classification
Multimodal Sentence Embeddings

Related reference concepts

Object Recognition and Detection
Self-Supervised and Representation Learning
Neural Language Models and Word Embeddings
Image Segmentation
Deep Generative Models
Deep Learning
