---
license: apache-2.0
tags:
  - vision
---

# SigLIP 2 So400m

SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques, combining them into a unified recipe for improved semantic understanding, localization, and dense features.

## Intended uses

You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).

Here is how to use this model to perform zero-shot image classification:

```python
from transformers import pipeline
from transformers.image_utils import load_image

# load pipeline
ckpt = "google/siglip2-so400m-patch16-512"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# load image and candidate labels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(url)
candidate_labels = ["2 cats", "a plane", "a remote"]

# run inference
outputs = image_classifier(image, candidate_labels=candidate_labels)
print(outputs)
```
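
The pipeline returns one `{"score", "label"}` dictionary per candidate label, sorted by descending score, so the first entry is the model's best match.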

You can encode an image using the Vision Tower like so:

```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load the model and processor
ckpt = "google/siglip2-so400m-patch16-512"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# load the image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

# run inference
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
```
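
Since the model also targets image-text retrieval, here is a minimal sketch that scores an image against a few candidate captions by comparing the image and text embeddings. The captions and the `padding="max_length", max_length=64` text settings below are illustrative assumptions, not part of the original card:

```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# same checkpoint as above (assumption: reused here for a self-contained example)
ckpt = "google/siglip2-so400m-patch16-512"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
texts = ["a photo of a bear", "a photo of a plane"]  # illustrative candidate captions

image_inputs = processor(images=[image], return_tensors="pt").to(model.device)
# SigLIP-style models expect text padded to a fixed length (64 here, an assumption)
text_inputs = processor(text=texts, padding="max_length", max_length=64, return_tensors="pt").to(model.device)

# encode both modalities
with torch.no_grad():
    image_embeds = model.get_image_features(**image_inputs)
    text_embeds = model.get_text_features(**text_inputs)

# cosine similarity between the normalized image and text embeddings
image_embeds = image_embeds / image_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
similarity = image_embeds @ text_embeds.T
print(similarity)  # higher score = better caption match
```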

For more code examples, please refer to the SigLIP documentation.

## Training procedure

SigLIP 2 adds some clever training objectives on top of the base SigLIP sigmoid loss (sketched after this list):

  1. Decoder loss
  2. Global-local and masked prediction loss
  3. Aspect ratio and resolution adaptability
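
For reference, here is a minimal, self-contained sketch of the pairwise sigmoid loss from SigLIP that these objectives extend. The batch size, embedding dimension, and the temperature/bias values are illustrative assumptions; in the real model the temperature and bias are learnable:

```python
import torch
import torch.nn.functional as F

# illustrative shapes: a batch of 8 paired image/text embeddings of dimension 16
image_embeds = F.normalize(torch.randn(8, 16), dim=-1)
text_embeds = F.normalize(torch.randn(8, 16), dim=-1)
t, b = torch.tensor(10.0), torch.tensor(-10.0)  # temperature and bias (learnable in the real model)

logits = image_embeds @ text_embeds.T * t + b  # pairwise image-text similarities
labels = 2 * torch.eye(logits.size(0)) - 1     # +1 on the diagonal (matching pairs), -1 elsewhere
loss = -F.logsigmoid(labels * logits).sum() / logits.size(0)
print(loss)
```

Matching image-text pairs sit on the diagonal and are pushed toward high similarity, while all other pairs are pushed apart.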

## Training data

SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023).

## Compute

The model was trained on up to 2048 TPU-v5e chips.

## Evaluation results

Evaluation of SigLIP 2 is shown below (taken from the paper).

*(Evaluation table from the paper.)*

## BibTeX entry and citation info

TODO