Model Details

VisMin-CLIP is a fine-tuned version of the pretrained CLIP model, designed to enhance fine-grained and compositional abilities beyond the base model. Fine-tuning was conducted using the OpenCLIP library, an open-source implementation of OpenAI’s CLIP.

Model Summary

Usage

Like any other OpenCLIP model, VisMin-CLIP can be loaded directly from its checkpoint:

import torch
import open_clip

# select a device for inference
device = "cuda" if torch.cuda.is_available() else "cpu"

model_cls_name = "ViT-L-14"
checkpoint_path = "path/to/checkpoint"
model, _, preprocess = open_clip.create_model_and_transforms(
    model_name=model_cls_name, pretrained=checkpoint_path, device=device
)
tokenizer = open_clip.get_tokenizer(model_cls_name)

model = model.to(device).eval()

Once loaded, you can encode images and text to perform zero-shot image classification:

import requests
import torch
from PIL import Image

# download and preprocess an example image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0).to(device)
text = tokenizer(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # normalize embeddings to unit length before computing cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # scaled cosine similarities, converted to probabilities over the candidate labels
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs) 

BibTeX

If you use VisMin-CLIP in your work, please cite it as follows:

@article{vismin2024,
    title={VisMin: Visual Minimal-Change Understanding},
    author={Awal, Rabiul and Ahmadi, Saba and Zhang, Le and Agrawal, Aishwarya},
    year={2024}
}