InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP

#5
by vitvit

Can you provide an example? (using text and image)

OpenGVLab org

Hi, please see the quick start section in the model card.

https://huggingface.co/OpenGVLab/InternVL-14B-224px#quick-start

It is not clear. It shows how to load the image encoder but not the text encoder.

I agree with vitvit. Is there a way we can get CLIP-like embeddings out of the model that could be indexed in a vector database and searched later?

OpenGVLab org

Below is a complete example that shows how to load the model and obtain the image and text embeddings separately. Note that prepending 'summarize:' to the text input and setting tokenizer.pad_token_id = 0 are both required; omitting either may lead to abnormal results.

import torch
from PIL import Image
from transformers import AutoModel, CLIPImageProcessor, AutoTokenizer

# 1. Load the model, image processor, and tokenizer
model = AutoModel.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).cuda().eval()

image_processor = CLIPImageProcessor.from_pretrained('OpenGVLab/InternVL-14B-224px')

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVL-14B-224px',
    use_fast=False,
    add_eos_token=True
)
tokenizer.pad_token_id = 0  # Set pad_token_id to 0; necessary to avoid abnormal results

# 2. Prepare input data
# Load an image and convert it to RGB
image = Image.open('./examples/image1.jpg').convert('RGB')
pixel_values = image_processor(images=image, return_tensors='pt').pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# Prepare text input with the necessary prefix
text = "summarize: a photo of a red panda"
input_ids = tokenizer(text, return_tensors='pt', max_length=80,
                      truncation=True, padding='max_length').input_ids.cuda()

# 3. Get image and text embeddings separately
# Get image embeddings using the model's encode_image method
image_embeds = model.encode_image(pixel_values)

# Get text embeddings using the model's encode_text method
text_embeds = model.encode_text(input_ids)

# Print the shapes of the embeddings as an example
print("Image embeddings shape:", image_embeds.shape)
print("Text embeddings shape:", text_embeds.shape)

Explanation

  • Model Loading:
    The model is loaded using AutoModel.from_pretrained with the trust_remote_code=True flag to load custom model code. The model is then moved to the GPU and set to evaluation mode.

  • Image Processing:
    The CLIPImageProcessor preprocesses the image (converted to RGB) to generate pixel_values, which are then moved to the GPU.

  • Text Processing:
    The AutoTokenizer tokenizes the input text. Note that the prefix 'summarize:' is added to the text (as required by the model), and tokenizer.pad_token_id is explicitly set to 0. Both steps are crucial for correct processing.

  • Embedding Extraction:
    The model's encode_image and encode_text methods are used to obtain the CLIP-style embeddings. These normalized embeddings can be used directly for vector indexing or similarity search.
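
To answer the vector-database question above: once normalized, these embeddings can be stored in any index that supports inner-product or cosine search. Below is a minimal sketch using FAISS; FAISS is an assumption here rather than something the model card prescribes, and any other vector database can be used in the same way.

import faiss
import torch.nn.functional as F

# FAISS expects float32 NumPy arrays; normalize first so that inner product
# on unit vectors equals cosine similarity.
img = F.normalize(image_embeds.float(), dim=-1).detach().cpu().numpy()
txt = F.normalize(text_embeds.float(), dim=-1).detach().cpu().numpy()

index = faiss.IndexFlatIP(img.shape[1])  # flat inner-product index
index.add(img)                           # index the image embeddings

# Text-to-image retrieval: query the index with the text embeddings
scores, ids = index.search(txt, 1)       # top-1 neighbor per text
print("Nearest image per text:", ids.tolist(), scores.tolist())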
