---
license: apache-2.0
tags:
- vision
---

# SigLIP 2 So400m

[SigLIP 2](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107) extends the pretraining objective of [SigLIP](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba) with prior, independently developed techniques into a unified recipe, for improved semantic understanding, localization, and dense features.

## Intended uses

You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks). A low-level retrieval sketch is included in the appendix at the end of this card.

Here is how to use this model to perform zero-shot image classification:

```python
from transformers import pipeline

# load the zero-shot image classification pipeline
ckpt = "google/siglip2-so400m-patch16-512"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# image and candidate labels (the pipeline accepts a URL, local path, or PIL image)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
candidate_labels = ["2 cats", "a plane", "a remote"]

# run inference
outputs = image_classifier(url, candidate_labels=candidate_labels)
print(outputs)
```

You can encode an image using the vision tower like so:

```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load the model and processor
ckpt = "google/siglip2-so400m-patch16-512"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# load the image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

# run inference
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
```

For more code examples, we refer to the [siglip documentation](https://huggingface.co/transformers/main/model_doc/siglip.html#).

## Training procedure

SigLIP 2 adds some clever training objectives on top of SigLIP:

1. Decoder loss
2. Global-local and masked prediction loss
3. Aspect ratio and resolution adaptability

### Training data

SigLIP 2 is pre-trained on the WebLI dataset [(Chen et al., 2023)](https://arxiv.org/abs/2209.06794).

### Compute

The model was trained on up to 2048 TPU-v5e chips.

## Evaluation results

Evaluation of SigLIP 2 is shown below (taken from the paper).

[Evaluation Table](TODO)

### BibTeX entry and citation info

```bibtex
TODO
```
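
## Appendix: zero-shot retrieval sketch

As a complement to the pipeline example above, the sketch below scores an image against a set of candidate captions with the low-level `AutoModel` API. It is a minimal sketch rather than a reference implementation: the checkpoint, image URL, and labels are reused from the examples above, and the `padding="max_length"` / `max_length=64` tokenization settings are an assumption based on how SigLIP-family text inputs are typically prepared.

```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load the model and processor (same checkpoint as above)
ckpt = "google/siglip2-so400m-patch16-512"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# load an image and define candidate captions
image = load_image("http://images.cocodataset.org/val2017/000000039769.jpg")
texts = ["2 cats", "a plane", "a remote"]

# preprocess image and text together
# (padding="max_length", max_length=64 is an assumed SigLIP-style setting, not taken from this card)
inputs = processor(
    text=texts,
    images=image,
    padding="max_length",
    max_length=64,
    return_tensors="pt",
).to(model.device)

# forward pass: logits_per_image holds the image-text similarity scores
with torch.no_grad():
    outputs = model(**inputs)

# SigLIP uses a sigmoid rather than a softmax, so each caption gets an independent probability
probs = torch.sigmoid(outputs.logits_per_image)
for label, p in zip(texts, probs[0].tolist()):
    print(f"{p:.2%}: {label}")
```

Because the loss is sigmoid-based, the per-caption probabilities do not need to sum to one, which is why the example reports them independently instead of normalizing across labels.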