InternViT-6B + QLLaMA can be used for image-text retrieval, like CLIP.
Can you provide an example? (using text and image)
Hi, please see the quick start section in the model card.
https://huggingface.co/OpenGVLab/InternVL-14B-224px#quick-start
It is not clear. It specifies how to load the image encoder but not the text encoder.
I agree with vitvit. Is there a way we can get CLIP-like embeddings out of the model that could be indexed into a vector database and searched later?
Below is a complete example that shows how to load the model and obtain the image and text embeddings separately. Note that the prefix `'summarize:'` for the text input and setting `tokenizer.pad_token_id = 0` are both necessary; omitting either may lead to abnormal results.
```python
import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor, AutoTokenizer

# 1. Load the model, image processor, and tokenizer
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    use_fast=False,
    add_eos_token=True
)
tokenizer.pad_token_id = 0  # Set pad_token_id to 0; necessary to avoid abnormal results

# 2. Prepare input data
# Load an image and convert it to RGB
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Prepare text input with the necessary 'summarize:' prefix
text = "summarize: a photo of a red panda"
input_ids = tokenizer(text, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# 3. Get image and text embeddings separately
# Get image embeddings using the model's encode_image method
image_embeds = model.encode_image(pixel_values)
# Get text embeddings using the model's encode_text method
text_embeds = model.encode_text(input_ids)

# Print the shapes of the embeddings as an example
print("Image embeddings shape:", image_embeds.shape)
print("Text embeddings shape:", text_embeds.shape)
```
Explanation

- Model Loading: The model is loaded using `AutoModel.from_pretrained` with the `trust_remote_code=True` flag to load the custom model code. The model is then moved to the GPU and set to evaluation mode.
- Image Processing: The `CLIPImageProcessor` preprocesses the image (converted to RGB) to generate `pixel_values`, which are then moved to the GPU.
- Text Processing: The `AutoTokenizer` tokenizes the input text. Note that the prefix `'summarize:'` is added to the text (as required by the model), and `tokenizer.pad_token_id` is explicitly set to 0. Both steps are crucial for correct processing.
- Embedding Extraction: The model's `encode_image` and `encode_text` methods are used to obtain the CLIP-style embeddings. These normalized embeddings can be used directly for vector indexing or similarity search.
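To answer the vector-database part of the question: once you have the embeddings, you can index them with any ANN library or hosted vector store. Below is a minimal sketch using FAISS (my choice here, not something the model card prescribes); `image_embeds` and `text_embeds` are the tensors produced in the example above, and in a real setup you would add the embeddings of your whole image collection to the index:

```python
import faiss  # assumes faiss-cpu or faiss-gpu is installed
import torch.nn.functional as F

# Convert to unit-norm float32 arrays (FAISS expects float32 vectors).
img_vecs = F.normalize(image_embeds.float(), dim=-1).detach().cpu().numpy()
txt_vecs = F.normalize(text_embeds.float(), dim=-1).detach().cpu().numpy()

# Inner product over unit-norm vectors is equivalent to cosine similarity.
index = faiss.IndexFlatIP(img_vecs.shape[-1])
index.add(img_vecs)  # here a single image; in practice, your whole collection

# Retrieve the best-matching image for the text query.
scores, ids = index.search(txt_vecs, 1)
print("Top match id:", ids[0][0], "score:", scores[0][0])
```

The same pattern works with hosted vector databases: store the normalized image embeddings, then query with the normalized text embedding using inner-product or cosine distance.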