Extracting `text_encoder` from `ViT-H-14` using `open_clip_torch`?
#9
by Chanuhf - opened
I've loaded the pre-trained CLIP model variant ViT-H-14 using open_clip_torch. While I can get the tokenizer with open_clip.get_tokenizer('ViT-H-14'), I'm unsure how to extract the text_encoder.

Can anyone guide me on obtaining the text_encoder from this model?
For example:
!pip install transformers
from transformers import CLIPTextModel, CLIPTokenizer
tokenizer = CLIPTokenizer.from_pretrained('openai/clip-vit-large-patch14')
text_encoder = CLIPTextModel.from_pretrained('openai/clip-vit-large-patch14')
Expecting:
!pip install open_clip_torch
import open_clip

model, train_transform, eval_transform = open_clip.create_model_and_transforms('ViT-H-14', pretrained='laion2b_s32b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-H-14')
text_encoder = ________________________________
@Chanuhf There is no method that creates the text or image encoder by itself, but it's easy enough to encode just text (or images), or to extract either tower. To extract the text tower, set the custom-text flag so that all of the text components are pushed into their own sub-module:
import open_clip
import torch.nn.functional as F

# force_custom_text=True packs all text components into model.text
model, train_transform, eval_transform = open_clip.create_model_and_transforms('ViT-H-14', pretrained='laion2b_s32b_b79k', force_custom_text=True)
tokenizer = open_clip.get_tokenizer('ViT-H-14')
text_encoder = model.text  # the standalone text tower
del model  # drop the rest of the model if only the text tower is needed
text_inputs = tokenizer(["a photo of a cat"])
x = text_encoder(text_inputs)
x = F.normalize(x, dim=-1)  # if normalized output desired
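If you want to keep only the extracted text tower around for later reuse, a minimal sketch using the standard PyTorch pattern (the file name here is just a placeholder):

import torch

# save only the text tower's weights; 'vit_h_14_text_encoder.pt' is a hypothetical path
torch.save(text_encoder.state_dict(), 'vit_h_14_text_encoder.pt')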
Alternatively, keep the full model and call encode_text directly:

model, train_transform, eval_transform = open_clip.create_model_and_transforms('ViT-H-14', pretrained='laion2b_s32b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-H-14')
text_inputs = tokenizer(["a photo of a cat"])
text_features = model.encode_text(text_inputs)
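For context, a minimal end-to-end sketch following the standard open_clip usage pattern, using both towers of the same model to score text prompts against an image; the image path 'cat.jpg' and the prompt strings are placeholders:

import torch
from PIL import Image
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms('ViT-H-14', pretrained='laion2b_s32b_b79k')
tokenizer = open_clip.get_tokenizer('ViT-H-14')
model.eval()

image = preprocess(Image.open('cat.jpg')).unsqueeze(0)  # placeholder image path
text = tokenizer(["a photo of a cat", "a photo of a dog"])  # placeholder prompts

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # normalize before computing cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability of each prompt matching the image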