CLIP
In the paper "Learning Transferable Visual Models From Natural Language Supervision," OpenAI introduces CLIP, short for Contrastive Language-Image Pre-training. The model learns how sentences and images relate: during training it sees a batch of images and sentences and must work out which image belongs to which sentence. What sets CLIP apart is that it is trained on complete sentences rather than single category labels such as "car" or "dog". This lets the model learn richer associations between images and text. Once trained on a large dataset of images paired with their texts, CLIP can also act as a classifier without any task-specific training. The paper reports zero-shot ImageNet accuracy that matches the original ResNet-50, a model trained directly on ImageNet. The paper covers the method and its results in more detail.
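The matching objective described above can be sketched as a symmetric contrastive loss. This is a minimal illustration in plain Python, not the code used to train this model. The function names are hypothetical, and the temperature is fixed at 0.07 here for simplicity, whereas CLIP learns it during training.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy(logits, target):
    """Negative log-probability of the target index under a softmax over logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i and text i are assumed to be a pair."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] is the scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Each image should select its own text (rows) ...
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # ... and each text should select its own image (columns)
    loss_texts = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_images + loss_texts) / 2
```

The loss is small when each image embedding is closest to its own sentence embedding and large when the pairs are mismatched, which is what pushes matching images and sentences together in the shared embedding space.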
Usage
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the pretrained model and its matching preprocessor
model = CLIPModel.from_pretrained("SRDdev/CLIP")
processor = CLIPProcessor.from_pretrained("SRDdev/CLIP")

# Download a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Tokenize the candidate captions and preprocess the image
inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # image-text similarity scores
probs = logits_per_image.softmax(dim=1)  # softmax over the captions gives label probabilities
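To read a prediction out of these probabilities, one option is to pair each caption with its score and sort. The helper below is a hypothetical convenience, not part of transformers, and the numbers in the example call are made up for illustration rather than real model output.

```python
def rank_labels(probabilities, labels):
    """Return (label, probability) pairs sorted from most to least likely."""
    if len(probabilities) != len(labels):
        raise ValueError("expected one probability per label")
    return sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)


# Illustrative values only. With the snippet above, the first argument would be
# probs[0].detach().tolist() and the labels would be the captions passed to the processor.
ranking = rank_labels([0.2, 0.8], ["a photo of a cat", "a photo of a dog"])
best_label, best_probability = ranking[0]
```

Because the softmax is taken only over the captions supplied, the probabilities are relative to that candidate list. Adding or removing a caption changes every score.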