---
license: mit
language:
- ar
- kn
- ar
- ka
- af
- kk
- am
- km
- ar
- ky
- ar
- ko
- as
- lo
- az
- ml
- az
- mr
- be
- mk
- bn
- my
- bs
- nl
- bg
- ca
- 'no'
- cs
- ne
- ku
- pl
- cy
- pt
- da
- ro
- de
- ru
- el
- sa
- en
- si
- eo
- sk
- et
- sl
- eu
- sd
- fi
- so
- fr
- es
- gd
- sr
- ga
- su
- gl
- sv
- gu
- sw
- ha
- ta
- he
- te
- hi
- th
- hr
- tr
- hu
- ug
- hy
- uk
- id
- ur
- is
- vi
- it
- xh
- jv
- zh
- ja
pipeline_tag: zero-shot-image-classification
tags:
- siglip
- clip
- mexma
---

## Model Summary

MEXMA-SigLIP combines the [MEXMA](https://huggingface.co/facebook/MEXMA) multilingual text encoder with the image encoder from the [SigLIP](https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384) model. The result is a high-performance CLIP-style model that supports 80 languages.

MEXMA-SigLIP achieves state-of-the-art results on the [Crossmodal-3600](https://google.github.io/crossmodal-3600/) dataset among models whose licenses permit commercial use.

## How to use

```
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor
from PIL import Image
import requests
import torch

# Load the model in bfloat16 on the GPU; trust_remote_code is required
# because the model class is defined in the model repository.
model = AutoModel.from_pretrained("visheratin/mexma-siglip", torch_dtype=torch.bfloat16, trust_remote_code=True, optimized=True).to("cuda")
tokenizer = AutoTokenizer.from_pretrained("visheratin/mexma-siglip")
processor = AutoImageProcessor.from_pretrained("visheratin/mexma-siglip")

# Download and preprocess the image, matching the model's dtype and device.
img = Image.open(requests.get("https://static.independent.co.uk/s3fs-public/thumbnails/image/2014/03/25/12/eiffel.jpg", stream=True).raw)
img = processor(images=img, return_tensors="pt")["pixel_values"]
img = img.to(torch.bfloat16).to("cuda")

with torch.inference_mode():
    # Candidate labels can be written in different languages.
    text = tokenizer(["кошка", "a dog", "एफिल टॉवर"], return_tensors="pt", padding=True).to("cuda")
    image_logits, text_logits = model.get_logits(text["input_ids"], text["attention_mask"], img)
    # Convert the image-to-text logits into probabilities over the labels.
    probs = image_logits.softmax(dim=-1)
    print(probs)
```

## Acknowledgements

I thank [ML Collective](https://mlcollective.org/) and [Lambda](https://lambdalabs.com/) for providing compute resources to train the model.
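
To make the final `softmax` step of the usage example concrete, the sketch below shows how a row of logits could be turned into ranked label probabilities. It is a minimal illustration in plain Python, not part of the model's API: the helper names `softmax` and `rank_labels` and the logit values are hypothetical, and it assumes that each row of `image_logits` holds one score per text prompt.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtract the maximum to avoid overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(labels, logits):
    """Pair each label with its probability, highest probability first."""
    probs = softmax(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for one image against the three prompts used above.
# Real values come from model.get_logits(...) and will differ.
labels = ["кошка", "a dog", "एफिल टॉवर"]
example_logits = [-2.0, -1.5, 6.0]

ranked = rank_labels(labels, example_logits)
for label, prob in ranked:
    print(f"{label}: {prob:.4f}")
```

With real model output, a row of `image_logits` could be converted to a Python list (for example with `.float().cpu().tolist()`) before being passed to `rank_labels`.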