---
license: mit
library_name: colpali
language:
- en
tags:
- vidore
---
# ColPali: Visual Retriever based on PaliGemma-3B with ColBERT strategy

ColPali is a model built on a novel architecture and training strategy that leverages Vision Language Models (VLMs) to efficiently index documents from their visual features.
It is a [PaliGemma-3B](https://huggingface.co/google/paligemma-3b-mix-448) extension that generates [ColBERT](https://arxiv.org/abs/2004.12832)-style multi-vector representations of text and images.
It was introduced in the paper [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449) and first released in [this repository](https://github.com/ManuelFay/colpali).
## Model Description

This model is built iteratively, starting from an off-the-shelf [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) model.
We finetuned it to create [BiSigLIP](https://huggingface.co/vidore/bisiglip), then fed the patch embeddings output by SigLIP to an LLM, [PaliGemma-3B](https://huggingface.co/google/paligemma-3b-mix-448), to create [BiPali](https://huggingface.co/vidore/bipali).

One benefit of passing image patch embeddings through a language model is that they are natively mapped to a latent space similar to that of the textual input (the query).
This enables leveraging the [ColBERT](https://arxiv.org/abs/2004.12832) strategy to compute interactions between text tokens and image patches, which yields a step-change improvement in performance compared to BiPali.
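For intuition, here is a minimal sketch of the ColBERT-style late-interaction (MaxSim) score between one query and one page, written in plain PyTorch. The function name, tensor shapes, and embedding dimension are illustrative assumptions, not the library's API:

```python
import torch

def late_interaction_score(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """ColBERT-style MaxSim: for each query token, take its best-matching
    page-patch embedding, then sum those maxima over all query tokens.

    q: (n_query_tokens, dim)  L2-normalized query token embeddings
    d: (n_doc_tokens, dim)    L2-normalized page patch embeddings
    """
    sim = q @ d.T                        # (n_query_tokens, n_doc_tokens) cosine similarities
    return sim.max(dim=1).values.sum()   # max over patches, sum over query tokens

# Toy example with random embeddings (shapes are illustrative)
q = torch.nn.functional.normalize(torch.randn(16, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(1024, 128), dim=-1)
print(late_interaction_score(q, d))
```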
## Model Training

### Dataset

Our training dataset of 127,460 query-page pairs comprises the train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents, augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%).
Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in [*ViDoRe*](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d) and in the train set, to prevent evaluation contamination.
A validation set is created with 2% of the samples to tune hyperparameters.

*Note: Multilingual data is present in the pretraining corpus of the language model (Gemma-2B) and potentially occurs during PaliGemma-3B's multimodal training.*
### Parameters

All models are trained for 1 epoch on the train set. Unless specified otherwise, we train models in `bfloat16` format, use low-rank adapters ([LoRA](https://arxiv.org/abs/2106.09685)) with `alpha=32` and `r=32` on the transformer layers of the language model, as well as on the final, randomly initialized projection layer, and use a `paged_adamw_8bit` optimizer.
We train on an 8-GPU setup with data parallelism, a learning rate of 5e-5 with linear decay and 2.5% warmup steps, and a batch size of 32.
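A minimal sketch of this setup using the `peft` and `transformers` libraries, assuming the hyperparameters above. The `target_modules` list and the dropout value are assumptions (typical projections of a Gemma-style language model), not copied from the actual training code, and the per-device batch size assumes an even split over 8 GPUs:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA config as described in the card; target_modules and dropout are assumptions
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,  # assumption: not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="FEATURE_EXTRACTION",
)

# Optimizer and schedule as described in the card
training_args = TrainingArguments(
    output_dir="./colpali-train",    # hypothetical path
    num_train_epochs=1,
    per_device_train_batch_size=4,   # 4 per GPU x 8 GPUs = global batch size of 32
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.025,              # 2.5% warmup steps
    bf16=True,
    optim="paged_adamw_8bit",
)
```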
38
-
39
- ## Usage
40
-
41
- ```python
42
- import torch
43
- import typer
44
- from torch.utils.data import DataLoader
45
- from tqdm import tqdm
46
- from transformers import AutoProcessor
47
- from PIL import Image
48
-
49
- from colpali_engine.models.paligemma_colbert_architecture import ColPali
50
- from colpali_engine.trainer.retrieval_evaluator import CustomEvaluator
51
- from colpali_engine.utils.colpali_processing_utils import process_images, process_queries
52
- from colpali_engine.utils.image_from_page_utils import load_from_dataset
53
-
54
-
55
- def main() -> None:
56
- """Example script to run inference with ColPali"""
57
-
58
- # Load model
59
- model_name = "vidore/colpali"
60
- model = ColPali.from_pretrained("google/paligemma-3b-mix-448", torch_dtype=torch.bfloat16, device_map="cuda").eval()
61
- model.load_adapter(model_name)
62
- processor = AutoProcessor.from_pretrained(model_name)
63
-
64
- # select images -> load_from_pdf(<pdf_path>), load_from_image_urls(["<url_1>"]), load_from_dataset(<path>)
65
- images = load_from_dataset("vidore/docvqa_test_subsampled")
66
- queries = ["From which university does James V. Fiorca come ?", "Who is the japanese prime minister?"]
67
-
68
- # run inference - docs
69
- dataloader = DataLoader(
70
- images,
71
- batch_size=4,
72
- shuffle=False,
73
- collate_fn=lambda x: process_images(processor, x),
74
- )
75
- ds = []
76
- for batch_doc in tqdm(dataloader):
77
- with torch.no_grad():
78
- batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
79
- embeddings_doc = model(**batch_doc)
80
- ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))
81
-
82
- # run inference - queries
83
- dataloader = DataLoader(
84
- queries,
85
- batch_size=4,
86
- shuffle=False,
87
- collate_fn=lambda x: process_queries(processor, x, Image.new("RGB", (448, 448), (255, 255, 255))),
88
- )
89
-
90
- qs = []
91
- for batch_query in dataloader:
92
- with torch.no_grad():
93
- batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
94
- embeddings_query = model(**batch_query)
95
- qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))
96
-
97
- # run evaluation
98
- retriever_evaluator = CustomEvaluator(is_multi_vector=True)
99
- scores = retriever_evaluator.evaluate(qs, ds)
100
- print(scores.argmax(axis=1))
101
-
102
-
103
- if __name__ == "__main__":
104
- typer.run(main)
105
-
106
- ```
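The `scores` matrix has one row per query and one column per document page, so `scores.argmax(axis=1)` prints, for each query, the index of its best-matching page.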
## Limitations

- **Focus**: The model primarily focuses on PDF-type documents and high-resource languages, potentially limiting its generalization to other document types or less represented languages.
- **Support**: The model relies on multi-vector retrieval derived from the ColBERT late-interaction mechanism, which may require engineering efforts to adapt to widely used vector retrieval frameworks that lack native multi-vector support; one naive workaround is sketched below.
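As an illustration of that last point, a common but lossy workaround is to pool each page's multi-vector embedding into a single vector before indexing it in a standard single-vector store, keeping the full multi-vectors only for a reranking pass. This is a hedged sketch under that assumption; `pool_page_embedding` is a hypothetical helper, not part of `colpali_engine`:

```python
import torch

def pool_page_embedding(page_embedding: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: collapse a (n_patches, dim) multi-vector page
    embedding into one L2-normalized vector via mean pooling, so it fits a
    single-vector index. This trades retrieval quality for compatibility;
    keep the full multi-vectors for late-interaction reranking."""
    pooled = page_embedding.mean(dim=0)
    return torch.nn.functional.normalize(pooled, dim=-1)
```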
## License

ColPali's vision-language backbone model (PaliGemma) is under the `gemma` license, as specified in its [model card](https://huggingface.co/google/paligemma-3b-mix-448). The adapters attached to the model are under the MIT license.
## Contact

- Manuel Faysse: [email protected]
- Hugues Sibille: [email protected]
- Tony Wu: [email protected]
## Citation

If you use any datasets or models from this organization in your research, please cite the original work as follows:

```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```