antoinelouis
committed on
Create README.md
README.md
ADDED
@@ -0,0 +1,132 @@
---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- feature-extraction
- sentence-similarity
library_name: colbert
inference: false
---

# colbertv2-camembert-L4-mmarcoFR

This is a lightweight [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488) model for French that can be used for semantic search. It encodes queries and passages into matrices of token-level embeddings and efficiently finds passages that contextually match the query using scalable vector-similarity (MaxSim) operators. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.

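To make the MaxSim operator concrete, here is a minimal pure-Python sketch of late-interaction scoring. It is not the colbert-ai implementation: the function names and the toy 2-dimensional embeddings are illustrative assumptions, and real embeddings would come from the model.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def maxsim_score(query_emb, passage_emb):
    """For each query token, keep its best cosine match among the passage
    tokens, then sum these maxima over all query tokens."""
    q = [l2_normalize(v) for v in query_emb]
    p = [l2_normalize(v) for v in passage_emb]
    return sum(
        max(sum(a * b for a, b in zip(q_tok, p_tok)) for p_tok in p)
        for q_tok in q
    )


# Toy token embeddings (made-up values, not model outputs).
query = [[1.0, 0.0], [0.0, 1.0]]
passage_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
passage_b = [[-1.0, 0.0], [1.0, 1.0]]

score_a = maxsim_score(query, passage_a)
score_b = maxsim_score(query, passage_b)
print(score_a, score_b)  # passage_a matches both query tokens exactly, so it scores higher
```

Because each query token is scored independently against the passage tokens, passage embeddings can be computed and indexed offline, which is what makes this operator scalable.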

## Usage

Here are some examples for using the model with [colbert-ai](https://github.com/stanford-futuredata/ColBERT) or [RAGatouille](https://github.com/bclavie/RAGatouille).

### Using ColBERT-AI

First, you will need to install the following libraries:

```bash
pip install git+https://github.com/stanford-futuredata/ColBERT.git torch faiss-gpu==1.7.2
```

Then, you can use the model like this:

```python
from colbert import Indexer, Searcher
from colbert.infra import Run, RunConfig

n_gpu: int = 1  # Set your number of available GPUs
experiment: str = "colbert"  # Name of the folder where the logs and created indices will be stored
index_name: str = "my_index"  # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."]  # Corpus

# Step 1: Indexing. This step encodes all passages into matrices, stores them on disk, and builds data structures for efficient search.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    indexer = Indexer(checkpoint="antoinelouis/colbertv2-camembert-L4-mmarcoFR")
    indexer.index(name=index_name, collection=documents)

# Step 2: Searching. Given the model and index, you can issue queries over the collection to retrieve the top-k passages for each query.
with Run().context(RunConfig(nranks=n_gpu, experiment=experiment)):
    searcher = Searcher(index=index_name)  # No need to specify the checkpoint again: the model name is stored in the index.
    results = searcher.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
    # results: tuple of three parallel lists of length k: (passage_ids, passage_ranks, passage_scores)
```
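The search output can be mapped back to the passage texts. The sketch below uses a hard-coded stand-in for `results` (made-up ids and scores) so that it runs without a GPU or an index. It assumes that `searcher.search` returns three parallel lists, as in the colbert-ai README, which iterates over `zip(*results)`, and that passage ids are positions in the `documents` list.

```python
documents = ["Ceci est un premier document.", "Voici un second document.", "etc."]

# Stand-in for the value returned by searcher.search(...): passage ids, ranks
# and scores as three parallel lists. The numbers are made up for illustration.
results = ([1, 0, 2], [1, 2, 3], [21.5, 18.2, 9.7])

# Assumption: a passage id is the position of the passage in `documents`.
hits = [
    {"rank": rank, "score": score, "text": documents[pid]}
    for pid, rank, score in zip(*results)
]

for hit in hits:
    print(f"{hit['rank']}. ({hit['score']:.1f}) {hit['text']}")
```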

### Using RAGatouille

First, you will need to install the following libraries:

```bash
pip install -U ragatouille
```

Then, you can use the model like this:

```python
from ragatouille import RAGPretrainedModel

index_name: str = "my_index"  # The name of your index, i.e. the name of your vector database
documents: list = ["Ceci est un premier document.", "Voici un second document.", "etc."]  # Corpus

# Step 1: Indexing.
RAG = RAGPretrainedModel.from_pretrained("antoinelouis/colbertv2-camembert-L4-mmarcoFR")
RAG.index(index_name=index_name, collection=documents)

# Step 2: Searching.
# Only needed if the model is not already loaded. from_index expects the path to the index;
# the path below assumes RAGatouille's default index location.
RAG = RAGPretrainedModel.from_index(f".ragatouille/colbert/indexes/{index_name}")
RAG.search(query="Comment effectuer une recherche avec ColBERT ?", k=10)
```

***

## Evaluation

The model is evaluated on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare its
performance with other publicly available 🇫🇷 ColBERT models (as well as one single-vector representation model) fine-tuned on the same dataset. We report the
mean reciprocal rank (MRR) and recall at various cut-offs (R@k).

| model                                                                                                      | #Param.(↓) |  Size | Dim. | Index | R@1000 | R@500 | R@100 | R@10 | MRR@10 |
|:-----------------------------------------------------------------------------------------------------------|-----------:|------:|-----:|------:|-------:|------:|------:|-----:|-------:|
| **colbertv2-camembert-L4-mmarcoFR**                                                                        |        54M | 216MB |   32 |    GB |   91.9 |  90.3 |  81.9 | 56.7 |   32.3 |
| [FraColBERTv2](https://huggingface.co/bclavie/FraColBERTv2)                                                |       110M | 443MB |  128 |    GB |        |       |       |      |        |
| [colbertv1-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/colbertv1-camembert-base-mmarcoFR) |       110M | 443MB |  128 |    GB |   89.7 |  88.4 |  80.0 | 54.2 |   29.5 |
| [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR) |       110M | 443MB |  128 |    GB |      - |  89.1 |  77.8 | 51.5 |   28.5 |

NB: The "Index" column reports the on-disk size of the mMARCO-fr index (8.8M passages).

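For readers who want to reproduce this kind of report, the following is a minimal sketch of how MRR@10 and R@k are commonly computed from a ranked run. It is not the evaluation script used for the table: the function names, the `qrels`/`run` dictionaries, and the toy ids are illustrative assumptions. Values are returned as fractions, whereas the table reports percentages.

```python
def reciprocal_rank(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant passage within the top k, else 0."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear within the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def evaluate(run, qrels, recall_cutoffs=(10, 100, 500, 1000), mrr_cutoff=10):
    """Average both metrics over all judged queries."""
    n = len(qrels)
    report = {
        f"MRR@{mrr_cutoff}": sum(
            reciprocal_rank(run.get(qid, []), rel, mrr_cutoff)
            for qid, rel in qrels.items()
        ) / n
    }
    for k in recall_cutoffs:
        report[f"R@{k}"] = sum(
            recall_at_k(run.get(qid, []), rel, k) for qid, rel in qrels.items()
        ) / n
    return report


# Toy data (hypothetical ids): relevant passages and ranked results per query.
qrels = {"q1": {"p3"}, "q2": {"p7", "p9"}}
run = {"q1": ["p1", "p3", "p5"], "q2": ["p9", "p2", "p4"]}
report = evaluate(run, qrels, recall_cutoffs=(1, 3))
print(report)
```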

***

## Training

#### Data

We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of
MS MARCO that contains 8.8M passages and 539K training queries. We do not employ the BM25 negatives provided by the official [triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset)
but instead sample 62 harder negatives mined from 12 distinct dense retrievers for each query, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
distillation dataset. Next, we collect the relevance scores of an expressive [cross-encoder reranker](https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2)
for all our (query, passage) pairs using the [cross-encoder-ms-marco-MiniLM-L-6-v2-scores](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#cross-encoder-ms-marco-minilm-l-6-v2-scorespklgz) dataset.
In the end, we obtain 10.4M different 64-way tuples of the form [query, (pos, pos_score), (neg1, neg1_score), ..., (neg62, neg62_score)] for training the model.

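As an illustration of the tuple format described above, here is a small sketch that assembles one training tuple from in-memory inputs. The function name, the input dictionaries, and the toy ids and scores are hypothetical. The sketch assumes uniform sampling without replacement from the pooled negatives, which the description does not specify.

```python
import random


def build_training_tuple(query, positive_id, negatives_by_retriever, ce_scores,
                         num_negatives=62, seed=0):
    """Return [query, (pos, pos_score), (neg1, neg1_score), ...] for one query."""
    rng = random.Random(seed)
    # Pool the negatives of all retrievers, dropping duplicates and the positive.
    pool = sorted(
        {pid for pids in negatives_by_retriever.values() for pid in pids}
        - {positive_id}
    )
    # Assumption: uniform sampling without replacement from the pooled negatives.
    negatives = rng.sample(pool, min(num_negatives, len(pool)))
    return [query, (positive_id, ce_scores[positive_id])] + [
        (pid, ce_scores[pid]) for pid in negatives
    ]


# Toy inputs (hypothetical ids and scores); real tuples would use 62 negatives.
negatives_by_retriever = {"retriever_a": [11, 12, 13], "retriever_b": [12, 14, 10]}
ce_scores = {10: 9.5, 11: 2.1, 12: -0.4, 13: -3.2, 14: 0.7}
example = build_training_tuple("exemple de requête", 10, negatives_by_retriever,
                               ce_scores, num_negatives=3)
print(example)
```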
#### Implementation

The model is initialized from the [camembert-L4](https://huggingface.co/antoinelouis/camembert-L4) checkpoint and optimized with a combination of two losses: a KL-divergence loss
that distills the cross-encoder scores into the model, and an in-batch sampled softmax cross-entropy loss applied to the positive score of each query against all
passages corresponding to other queries in the same batch (as in [ColBERTv2](https://doi.org/10.48550/arXiv.2112.01488)). The model is fine-tuned on one 80GB NVIDIA
H100 GPU for 325k steps using the AdamW optimizer with a batch size of 32 and a peak learning rate of 1e-5, with warm-up over the first 20k steps and linear scheduling.
The embedding dimension is set to 32, and the maximum sequence lengths for questions and passages are fixed to 32 and 160 tokens, respectively. We use
the cosine similarity to compute relevance scores.
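
The two loss terms can be sketched in pure Python as follows. This is a simplified illustration, not the colbert-ai training code: the function names are hypothetical, the two terms are simply added with equal weight (an assumption), and the toy scores are made up.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def kl_distillation_loss(student_scores, teacher_scores):
    """KL(teacher || student) over one query's candidate passages."""
    log_t = log_softmax(teacher_scores)
    log_s = log_softmax(student_scores)
    return sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))


def in_batch_cross_entropy(positive_score, other_scores):
    """Softmax cross-entropy with the query's own positive as the target and
    the passages of the other queries in the batch as negatives."""
    return -log_softmax([positive_score] + list(other_scores))[0]


def combined_loss(student_scores, teacher_scores, other_scores):
    """Assumption: both terms are added with equal weight. Index 0 of
    student_scores is the positive passage."""
    return (kl_distillation_loss(student_scores, teacher_scores)
            + in_batch_cross_entropy(student_scores[0], other_scores))


# Toy scores for one query: positive first, then two negatives.
student = [0.9, 0.4, 0.1]    # model (MaxSim) scores
teacher = [8.0, 1.0, -2.0]   # cross-encoder scores
in_batch = [0.3, 0.2]        # scores against passages of other queries
loss = combined_loss(student, teacher, in_batch)
print(loss)
```

The KL term is zero when the student's score distribution matches the teacher's, so it transfers the cross-encoder's ranking, while the in-batch term adds extra negatives at no additional encoding cost.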
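
The learning-rate schedule described in the implementation details (peak of 1e-5, 20k warm-up steps, 325k steps in total) can be written as a small function. This sketch assumes a linear warm-up from zero and a linear decay to zero at the last step, which is a common convention but is not stated explicitly; the function name is hypothetical.

```python
def learning_rate(step, peak_lr=1e-5, warmup_steps=20_000, total_steps=325_000):
    """Linear warm-up to peak_lr, then linear decay (assumed to reach zero)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


for step in (0, 10_000, 20_000, 172_500, 325_000):
    print(step, learning_rate(step))
```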

***

## Citation

```bibtex
@online{louis2023,
    author    = {Antoine Louis},
    title     = {colbertv2-camembert-L4-mmarcoFR: A Lightweight ColBERTv2 Model for French},
    publisher = {Hugging Face},
    month     = {mar},
    year      = {2024},
    url       = {https://huggingface.co/antoinelouis/colbertv2-camembert-L4-mmarcoFR},
}
```