	---
	license: apache-2.0
	tags:
	- vision
	---

	# SigLIP 2 So400m

	[SigLIP 2](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107)
	extends the pretraining objective of
	[SigLIP](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba)
	with prior, independently developed techniques, combining them into a unified recipe for
	improved semantic understanding, localization, and dense features.

	## Intended uses

	You can use the raw model for tasks like zero-shot image classification and
	image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).

	Here is how to use this model to perform zero-shot image classification:

	```python
	from transformers import pipeline

	# load pipeline
	ckpt = "google/siglip2-so400m-patch16-384"
	image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")

	# load image and candidate labels
	url = "http://images.cocodataset.org/val2017/000000039769.jpg"
	candidate_labels = ["2 cats", "a plane", "a remote"]

	# run inference (the pipeline accepts an image URL, a local path, or a PIL image)
	outputs = image_classifier(url, candidate_labels)
	print(outputs)
	```

	You can encode an image using the Vision Tower like so:

	```python
	import torch
	from transformers import AutoModel, AutoProcessor
	from transformers.image_utils import load_image

	# load the model and processor
	ckpt = "google/siglip2-so400m-patch16-384"
	model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
	processor = AutoProcessor.from_pretrained(ckpt)

	# load the image
	image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
	inputs = processor(images=[image], return_tensors="pt").to(model.device)

	# run inference
	with torch.no_grad():
	    image_embeddings = model.get_image_features(**inputs)

	print(image_embeddings.shape)
	```
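
For image-text retrieval, embeddings like the ones printed above are usually L2-normalized and ranked by cosine similarity. The following is a minimal sketch of that ranking step only. It uses small hand-written NumPy arrays as stand-ins for real model embeddings, and the helper name `rank_by_cosine` is hypothetical, not part of the `transformers` API.

```python
import numpy as np


def rank_by_cosine(query, candidates):
    """Return (indices sorted from most to least similar, cosine scores)."""
    # L2-normalize so that a dot product equals cosine similarity
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = candidates @ query
    # argsort is ascending, so negate the scores to get a descending order
    return np.argsort(-scores), scores


# Placeholder embeddings (assumption: real ones would come from the model)
image_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
])
text_embedding = np.array([0.5, 0.9, 0.0])

order, scores = rank_by_cosine(text_embedding, image_embeddings)
print(order)   # candidate indices, best match first
print(scores)  # cosine similarity of each candidate to the query
```

With real embeddings, the query and candidates would need to come from the same checkpoint so that they live in the same embedding space.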

	For more code examples, see the [SigLIP documentation](https://huggingface.co/transformers/main/model_doc/siglip.html#).

	## Training procedure

	SigLIP 2 adds the following training objectives and capabilities on top of SigLIP:

	1. Decoder loss
	2. Global-local and masked prediction loss
	3. Aspect ratio and resolution adaptability
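
For context, the base SigLIP objective that the items above build on is a pairwise sigmoid loss over all image-text pairs in a batch: matching pairs are treated as positives and every other pair as a negative. The sketch below illustrates only that base loss, not the SigLIP 2 additions. The function name `sigmoid_pairwise_loss`, the toy one-hot embeddings, and the default temperature and bias values are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid_pairwise_loss(img, txt, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of image and text embeddings.

    img, txt: arrays of shape (batch, dim); row i of each forms a matching pair.
    t, b: temperature and bias (illustrative values, learnable in practice).
    """
    # L2-normalize both sets of embeddings
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = t * (img @ txt.T) + b
    n = logits.shape[0]
    # +1 on the diagonal (matching pairs), -1 everywhere else
    labels = 2.0 * np.eye(n) - 1.0
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably with logaddexp
    return np.sum(np.logaddexp(0.0, -labels * logits)) / n


# Toy batch: one-hot embeddings, first correctly paired, then shuffled
img = np.eye(3)
aligned = sigmoid_pairwise_loss(img, np.eye(3))
mismatched = sigmoid_pairwise_loss(img, np.eye(3)[[1, 2, 0]])
print(aligned, mismatched)  # the aligned batch should give the lower loss
```

Because each pair is scored independently, this loss needs no normalization across the batch, unlike a softmax-based contrastive loss.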

	### Training data

	SigLIP 2 is pre-trained on the WebLI dataset [(Chen et al., 2023)](https://arxiv.org/abs/2209.06794).

	### Compute

	The model was trained on up to 2048 TPU-v5e chips.

	## Evaluation results

	Evaluation of SigLIP 2 is shown below (taken from the paper).

	[Evaluation Table](TODO)

	### BibTeX entry and citation info

	```bibtex
	TODO
	```