---
license: mit
---

# CLIP ViT-B/32 xlm roberta base - LAION-5B

[CLIP ViT-B/32 xlm roberta base - LAION-5B](https://huggingface.co/laion/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k) model converted from OpenCLIP to HuggingFace Transformers.

See https://gist.github.com/calpt/8e3555bd11f1916b5169c8125117e5ee for the conversion script and more info.

## Usage

The model uses custom code. Make sure to pass `trust_remote_code=True` when loading it.

Example:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoFeatureExtractor, AutoTokenizer

# The converted model ships custom code, so trust_remote_code=True is required.
model = AutoModel.from_pretrained("calpt/CLIP-ViT-B-32-xlm-roberta-base-laion5B-s13B-b90k", trust_remote_code=True)
# Image preprocessing and tokenization are loaded from separate repositories.
processor = AutoFeatureExtractor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

image_input = processor(Image.open("CLIP.png"), return_tensors="pt")
text_input = tokenizer(["a diagram", "a dog", "a cat"], return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**image_input, **text_input)

# Softmax over the candidate texts, scaled by 100 so the values read as percentages.
text_probs = 100.0 * outputs.logits_per_image.softmax(dim=-1)
print("Label probs (%):", text_probs)
```
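
The example prints a raw tensor of scores. As a minimal sketch of how those scores relate to the candidate texts, the snippet below applies a softmax to one row of logits and ranks the labels by probability, using only the Python standard library. The helper names `softmax` and `rank_labels` and the values in `example_logits` are hypothetical and chosen for illustration; real values would come from something like `outputs.logits_per_image[0].tolist()`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(labels, logits):
    """Pair each label with its probability, highest probability first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


labels = ["a diagram", "a dog", "a cat"]
# Made-up logits for illustration only, not actual model output.
example_logits = [25.0, 18.0, 19.5]

ranking = rank_labels(labels, example_logits)
for label, prob in ranking:
    print(f"{label}: {prob:.4f}")
```

The probabilities sum to 1 across the candidate texts, so adding or removing a candidate changes every score. They are best read as a relative ranking among the supplied texts rather than as absolute confidences.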