Model Details

VisMin-CLIP is a fine-tuned version of the pretrained CLIP model, designed to enhance fine-grained and compositional abilities beyond the base model. Fine-tuning was conducted using the OpenCLIP library, an open-source implementation of OpenAI’s CLIP.

Model Summary

Usage

Like any other OpenCLIP model, VisMin-CLIP can be loaded directly from its checkpoint:

import torch
import open_clip

# select a device for inference
device = "cuda" if torch.cuda.is_available() else "cpu"

model_cls_name = "ViT-L-14"
checkpoint_path = "path/to/checkpoint"
model, _, preprocess = open_clip.create_model_and_transforms(
    model_name=model_cls_name, pretrained=checkpoint_path, device=device
)
tokenizer = open_clip.get_tokenizer(model_cls_name)

model = model.to(device).eval()

Once loaded, you can encode images and text to perform zero-shot image classification:

import requests
import torch
from PIL import Image

# download and preprocess an example image
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0).to(device)
text = tokenizer(["a diagram", "a dog", "a cat"]).to(device)

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # normalize embeddings to unit length before computing cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # scaled cosine similarities, converted to probabilities over the candidate labels
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs) 

BibTeX

If you use VisMin-CLIP in your work, please cite it as follows:

@article{vismin2024,
    title={VisMin: Visual Minimal-Change Understanding},
    author={Awal, Rabiul and Ahmadi, Saba and Zhang, Le and Agrawal, Aishwarya},
    year={2024}
}