ColQwenStella-2b-multilingual: Multilingual Visual Retriever combining the Qwen2 vision encoder with the stella_en_1.5B_v5 embedding model.

Ranked #1 among models <= 2B parameters and #8 overall on the Vidore benchmark (as of February 11, 2025). The reported scores on the Vidore Leaderboard correspond to checkpoint-1800.

This is the base version, trained on 4xA100 80GB GPUs with per_device_batch_size=128 for 5 epochs.

The ColQwenStella-2b-multilingual architecture combines the vision component of the Qwen2 model with stella_en_1.5B_v5 as its embedding model. Training follows the recipe from ColPali: Efficient Document Retrieval with Vision Language Models.
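For intuition, the ColPali recipe scores a query against a page image with a late-interaction (MaxSim) operation over multi-vector embeddings: each query-token embedding is matched to its most similar image-patch embedding, and those maxima are summed. The processor's score_multi_vector call in the usage example below performs this in batch; the single-pair sketch here is only illustrative (the function name and shapes are assumptions).

import torch

def late_interaction_score(query_embedding: torch.Tensor, image_embedding: torch.Tensor) -> torch.Tensor:
    # query_embedding: (num_query_tokens, dim) multi-vector for one query
    # image_embedding: (num_image_patches, dim) multi-vector for one page image
    sim = query_embedding @ image_embedding.T   # token-to-patch similarity matrix
    return sim.max(dim=1).values.sum()          # best patch per query token, summed over the query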

Data

  • Synthetic data: Selected and preprocessed from the openbmb/VisRAG-Ret-Train-Synthetic-data dataset.
  • In-domain VQA dataset: Drawn from openbmb/VisRAG-Ret-Train-In-domain-data.
  • Docmatix dataset: Extracted from the Metric-AI/rag_docmatix_100k dataset.
  • Colpali dataset: Taken from vidore/colpali_train_set.
  • Multilingual dataset: Taken from llamaindex/vdr-multilingual-train.
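All five sources are public Hugging Face datasets. A minimal loading sketch for one of them follows (the split name and any downstream filtering are assumptions, since the exact mixing and preprocessing recipe is not described here):

from datasets import load_dataset

# Illustrative only: inspect one training source before building (query, image) pairs.
# Some of the repositories above expose per-language or per-subset configs and may
# require an explicit config name.
colpali_train = load_dataset("vidore/colpali_train_set", split="train")
print(colpali_train)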

Model Training

Parameters

We train the models using low-rank adapters (LoRA) with alpha=128 and r=128 on the transformer layers of the language model, the MLP layers of the vision_model.merger, and the final randomly initialized projection layer, with an AdamW optimizer. Training runs on a 4xA100 GPU setup with distributed data parallelism (via accelerate), a learning rate of 5e-4 with cosine decay and 100 warmup steps, a per-device batch size of 128, and bfloat16 precision.
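A hedged sketch of the adapter and optimizer settings described above, using peft and transformers (the target_modules names are assumptions about this repository's layer naming, and the actual training script is not reproduced here):

from peft import LoraConfig
from transformers import TrainingArguments

# LoRA on the language-model transformer layers (and, per the description above,
# the vision_model.merger MLP and the final projection layer).
lora_config = LoraConfig(
    r=128,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # assumed module names
)

# Optimizer and schedule settings stated in this section.
training_args = TrainingArguments(
    output_dir="colqwenstella-2b-multilingual-lora",  # placeholder path
    per_device_train_batch_size=128,
    num_train_epochs=5,
    learning_rate=5e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    optim="adamw_torch",
    bf16=True,
)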

Installation

pip install "transformers>=4.46.3"

Usage

import torch
from PIL import Image

from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
        "Metric-AI/ColQwenStella-2b-multilingual",
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",  # or "mps" if on Apple Silicon
        trust_remote_code=True
    ).eval()
processor = AutoProcessor.from_pretrained("Metric-AI/ColQwenStella-2b-multilingual", trust_remote_code=True)

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
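
The scores tensor pairs every query with every image. A small illustrative follow-up, assuming score_multi_vector returns a (num_queries, num_images) tensor as in the colpali-engine convention:

# Rank images for each query (illustrative follow-up, not part of the original snippet)
print(scores)                               # shape: (num_queries, num_images)
best_image_per_query = scores.argmax(dim=1)
print(best_image_per_query)                 # index of the highest-scoring image per query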

License

The adapters attached to the model are under MIT license.
