Is SiglipImageProcessor configured correctly?

#9 opened by karby

Hey, just checking if you are sure the current image scaling behavior is correct.

At the moment, input images are scaled straight down to a fixed square, ignoring the aspect ratio of the source. That's in contrast to CLIP variants like clip-vit-large-patch14 or laion-CLIP-ViT-L-14-laion2B-s32B-b82K, which resize the shortest edge and then center-crop. Is that on purpose? Other implementations don't seem to be in line with this (see the sketch below the comparison images).

original:
cinemascope.jpg

siglip:
siglip-so400m-patch14-384.jpg

clip-336:
clip-vit-large-patch14-336.jpg

laion-vit-h-14:
laion--CLIP-ViT-H-14-laion2B-s32B-b79K.jpg
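
To make the difference concrete, here's a minimal sketch of the two resize strategies in plain PIL (the 384 target matches this checkpoint; the CLIP side is the usual shortest-edge resize plus center crop):

```python
from PIL import Image

TARGET = 384  # input resolution of siglip-so400m-patch14-384

img = Image.open("cinemascope.jpg")  # a wide source frame

# SiglipImageProcessor-style: resize straight to a fixed square,
# ignoring the source aspect ratio (wide frames get squashed).
siglip_style = img.resize((TARGET, TARGET), Image.BICUBIC)

# CLIP-style: resize the shortest edge to the target, then
# center-crop a square; the aspect ratio is preserved at the
# cost of cropping away the sides (or top/bottom).
w, h = img.size
scale = TARGET / min(w, h)
resized = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
left = (resized.width - TARGET) // 2
top = (resized.height - TARGET) // 2
clip_style = resized.crop((left, top, left + TARGET, top + TARGET))
```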

Let me know what you think.

PS: I think the configuration of the ImageProcessor scaling mode could be a bit less obscure. If anyone fancied a rewrite to make this nicer, I wouldn't stop them.
EDIT: That's because the scaling mode for SiglipImageProcessor can't actually be configured: size={"shortest_edge": 384} is not accepted.
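
For anyone hitting the same wall, a minimal repro of what I mean (model id as in this repo; the behavior is my reading of the processor source, so take it with a grain of salt):

```python
from transformers import SiglipImageProcessor

# Explicit height/width is the only size format the resize path handles:
proc = SiglipImageProcessor.from_pretrained(
    "google/siglip-so400m-patch14-384",
    size={"height": 384, "width": 384},
)

# There is no shortest-edge mode, so this errors out once an image is
# actually preprocessed, because resize() expects "height"/"width" keys:
# proc = SiglipImageProcessor.from_pretrained(
#     "google/siglip-so400m-patch14-384",
#     size={"shortest_edge": 384},
# )
```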

Yep, that's intended; this is how we trained SigLIP and how I think it should be used.

So I assume it can't tell thicc from thin? Is the need to choose between losing information (as with cropping or padding) and distorting the input considered an issue?
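
(To be concrete about the padding option: a letterbox sketch, where pad_to_square is a made-up helper, not something the processor offers:)

```python
from PIL import Image

def pad_to_square(img: Image.Image, fill=(0, 0, 0)) -> Image.Image:
    """Letterbox: pad the shorter side so nothing is cropped or distorted."""
    side = max(img.size)
    canvas = Image.new("RGB", (side, side), fill)
    canvas.paste(img, ((side - img.width) // 2, (side - img.height) // 2))
    return canvas

# Pad first, then resize: the aspect ratio survives, at the cost of
# spending model capacity on dead border pixels.
square = pad_to_square(Image.open("cinemascope.jpg")).resize((384, 384), Image.BICUBIC)
```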

Of course it can, and you can too, when you look at a distorted image :)

I believe you, but gotta go; my Tinder date is yelling from her mobility scooter.

Ha!

Screenshot from 2025-03-09 23-17-16.png

Don't get mad, I'm just kidding.
