---
license: cc-by-nc-4.0
datasets:
- openbmb/VisRAG-Ret-Train-Synthetic-data
- openbmb/VisRAG-Ret-Train-In-domain-data
- Metric-AI/rag_docmatix_100k
- vidore/colpali_train_set
- llamaindex/vdr-multilingual-train
- Metric-AI/tabfquad_train_set
language:
- en
- fr
- es
- it
- de
base_model:
- Metric-AI/ColQwenStella-base-2b
- Qwen/Qwen2-VL-2B
- NovaSearch/stella_en_1.5B_v5
tags:
- vidore
- multimodal_embedding
- multilingual_embedding
- Text-to-Visual Document (T→VD) retrieval
library_name: peft
pipeline_tag: visual-document-retrieval
---

# ColQwenStella-2b-multilingual: Multilingual Visual Retriever based on the combination of Qwen2 Vision and the stella_en_1.5B_v5 model

## Ranked #1 among models <= 2B parameters and #8 overall on the Vidore benchmark (as of February 11, 2025).

The scores reported on the [Vidore Leaderboard](https://huggingface.co/spaces/vidore/vidore-leaderboard) correspond to checkpoint-1800.

### This is the base version, trained on 4xA100 80GB with per_device_batch_size=128 for 5 epochs.

The ColQwenStella-2b-multilingual architecture combines the vision component of the Qwen2 model with stella_en_1.5B_v5 as its embedding model. Training follows the [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449) recipe.

## Data

- **Synthetic data**: Selected and preprocessed from the `openbmb/VisRAG-Ret-Train-Synthetic-data` dataset.
- **In-domain VQA dataset**: Drawn from `openbmb/VisRAG-Ret-Train-In-domain-data`.
- **Docmatix dataset**: Extracted from the `Metric-AI/rag_docmatix_100k` dataset.
- **Colpali dataset**: Taken from `vidore/colpali_train_set`.
- **Multilingual dataset**: Taken from `llamaindex/vdr-multilingual-train`.

## Model Training

### Parameters

We train the model using low-rank adapters ([LoRA](https://arxiv.org/abs/2106.09685)) with `alpha=128` and `r=128` on the transformer layers of the language model, the `mlp` layers of `vision_model.merger`, and the final randomly initialized projection layer, with the `adamw` optimizer. Training runs on a 4xA100 GPU setup with distributed data parallelism (via accelerate), a learning rate of 5e-4 with cosine decay and 100 warmup steps, a per-device batch size of 128, and `bfloat16` precision.

## Installation

```bash
pip install "transformers>=4.46.3"
```

## Usage

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model = AutoModel.from_pretrained(
    "Metric-AI/ColQwenStella-2b-multilingual",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0",  # or "mps" if on Apple Silicon
    trust_remote_code=True,
).eval()

processor = AutoProcessor.from_pretrained(
    "Metric-AI/ColQwenStella-2b-multilingual",
    trust_remote_code=True,
)

# Your inputs
images = [
    Image.new("RGB", (32, 32), color="white"),
    Image.new("RGB", (16, 16), color="black"),
]
queries = [
    "Is attention really all you need?",
    "What is the amount of bananas farmed in Salvador?",
]

# Process the inputs
batch_images = processor.process_images(images).to(model.device)
batch_queries = processor.process_queries(queries).to(model.device)

# Forward pass
with torch.no_grad():
    image_embeddings = model(**batch_images)
    query_embeddings = model(**batch_queries)

scores = processor.score_multi_vector(query_embeddings, image_embeddings)
```

## License

The adapters attached to the model are under the MIT license.

- **Developed by:** [Metric AI Research Lab](https://metric.am/)
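
## Ranking results

`processor.score_multi_vector` in the Usage example above returns one late-interaction similarity score per query/image pair. As a minimal sketch (assuming `scores` is a `(num_queries, num_images)` torch tensor, as produced by ColPali-style scorers), the images can be ranked per query like this:

```python
# Continuing the Usage example: rank images for each query by descending score.
# Assumes `scores` is a (num_queries, num_images) torch tensor.
ranking = scores.argsort(dim=-1, descending=True)

for q_idx, query in enumerate(queries):
    best = ranking[q_idx, 0].item()
    print(f"{query} -> image #{best} (score: {scores[q_idx, best].item():.2f})")
```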