You probably do not need this unless you are training your own IP Adapters.

Modified version of the vision encoder of CLIP-ViT-H-14-laion2B-s32B-b79K that handles 448 x 448 inputs instead of the original 224 x 224. It will probably not work for classification (as is), but it will work for IP+ adapters that use CLIP-ViT-H, though they will need a bit of additional fine-tuning.
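
A minimal sketch of loading the encoder and pulling the hidden states typically fed to IP-Adapter Plus models. The repo id and image path are placeholders, and configuring the processor for 448 x 448 is an assumption about how you want to preprocess inputs.

```python
import torch
from PIL import Image
from transformers import CLIPVisionModelWithProjection, CLIPImageProcessor

repo_id = "path/to/this-model"  # placeholder: substitute the actual repo id

image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    repo_id, torch_dtype=torch.float16
).to("cuda")

# Configure the processor for 448 x 448 inputs instead of CLIP's default 224 x 224.
processor = CLIPImageProcessor(
    size={"shortest_edge": 448}, crop_size={"height": 448, "width": 448}
)

image = Image.open("reference.png").convert("RGB")  # placeholder image
pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(
    "cuda", torch.float16
)

with torch.no_grad():
    outputs = image_encoder(pixel_values, output_hidden_states=True)

# Penultimate hidden states, as consumed by IP-Adapter Plus style resamplers.
hidden = outputs.hidden_states[-2]
print(hidden.shape)  # expected: torch.Size([1, 1025, 1280])
```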

Hidden-layer outputs go from (257, 1280) to (1025, 1280), which the Resampler can digest without modification or weight resizing.
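
A sketch of why the longer token sequence works, assuming the Resampler from the tencent-ailab/IP-Adapter repository; the constructor values shown follow its SDXL Plus training configuration and are assumptions about your setup. The resampler cross-attends a fixed set of learned latents against the image tokens, so sequence length does not touch any weights.

```python
import torch
from ip_adapter.resampler import Resampler  # from the tencent-ailab/IP-Adapter repo

# Hypothetical SDXL-style configuration; match these to the IP-Adapter
# checkpoint you are fine-tuning.
resampler = Resampler(
    dim=1280,
    depth=4,
    dim_head=64,
    heads=20,
    num_queries=16,
    embedding_dim=1280,  # encoder hidden size, unchanged by the resolution bump
    output_dim=2048,     # UNet cross-attention dim (SDXL)
    ff_mult=4,
)

# 1025 image tokens instead of the usual 257; the learned query latents are
# sequence-length agnostic, so no weight resizing is needed.
image_tokens = torch.randn(1, 1025, 1280)
ip_tokens = resampler(image_tokens)
print(ip_tokens.shape)  # torch.Size([1, 16, 2048])
```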
