---
license: apache-2.0
tags:
- vision
---

# SigLIP 2 So400m

[SigLIP 2](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107)
extends the pretraining objective of
[SigLIP](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba)
with prior, independently developed techniques into a unified recipe for improved semantic
understanding, localization, and dense features.

## Intended uses

You can use the raw model for tasks like zero-shot image classification and
image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).

Here is how to use this model to perform zero-shot image classification:

```python
from transformers import pipeline
from transformers.image_utils import load_image

# load pipeline
ckpt = "google/siglip2-so400m-patch16-512"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

# load image and candidate labels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(url)
candidate_labels = ["2 cats", "a plane", "a remote"]

# run inference
outputs = image_classifier(image, candidate_labels=candidate_labels)
print(outputs)
```
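
The pipeline returns one `{"score", "label"}` dictionary per candidate label, so you can inspect how well each prompt matches the image.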

You can encode an image using the Vision Tower like so:

```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image

# load the model and processor
ckpt = "google/siglip2-so400m-patch16-512"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)

# load the image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)

# run inference
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)

print(image_embeddings.shape)
```
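
For image-text retrieval, you can embed text with the text tower in the same way and rank candidates by similarity. Below is a minimal sketch that reuses the `model`, `processor`, and `image_embeddings` from the snippet above; the `padding="max_length"` / `max_length=64` tokenizer settings follow the usual SigLIP preprocessing convention, and the candidate captions are illustrative.

```python
import torch

# candidate captions to rank against the image
texts = ["a photo of a cat", "a photo of a plane", "a photo of a remote"]
text_inputs = processor(
    text=texts, padding="max_length", max_length=64, return_tensors="pt"
).to(model.device)

# encode the text with the text tower
with torch.no_grad():
    text_embeddings = model.get_text_features(**text_inputs)

# cosine similarity between the image and each caption
image_embeddings = image_embeddings / image_embeddings.norm(dim=-1, keepdim=True)
text_embeddings = text_embeddings / text_embeddings.norm(dim=-1, keepdim=True)
similarity = image_embeddings @ text_embeddings.T

print(similarity)  # shape (1, 3); the highest score is the best-matching caption
```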

For more code examples, we refer to the [siglip documentation](https://huggingface.co/transformers/main/model_doc/siglip.html#).

## Training procedure

SigLIP 2 adds some clever training objectives on top of SigLIP's base sigmoid objective (sketched below):

1. Decoder loss
2. Global-local and masked prediction loss
3. Aspect ratio and resolution adaptability
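
These objectives extend SigLIP's pairwise sigmoid loss, which scores every image-text pair in a batch independently instead of normalizing over the batch with a softmax. As a reference point, here is a minimal PyTorch sketch of that base loss; the function name and arguments are illustrative, and the decoder and masked-prediction losses above are not reproduced here.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(image_embeds, text_embeds, logit_scale, logit_bias):
    """Pairwise sigmoid loss over all image-text pairs in a batch.

    image_embeds, text_embeds: (batch, dim) tensors, assumed L2-normalized.
    logit_scale, logit_bias: learnable scalars (t and b in the SigLIP paper).
    """
    logits = logit_scale * image_embeds @ text_embeds.t() + logit_bias
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2 * torch.eye(logits.size(0), device=logits.device) - 1
    # sum the per-pair losses and average over the batch size
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```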

### Training data

SigLIP 2 is pre-trained on the WebLI dataset [(Chen et al., 2023)](https://arxiv.org/abs/2209.06794).

### Compute

The model was trained on up to 2048 TPU-v5e chips.

## Evaluation results

Evaluation of SigLIP 2 is shown below (taken from the paper).

[Evaluation Table](TODO)

### BibTeX entry and citation info

```bibtex
TODO
```