---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
- unicamp-dl/mmarco
metrics:
- recall
tags:
- passage-retrieval
library_name: transformers
base_model: almanach/camembert-base
model-index:
- name: spladev2-camembert-base-mmarcoFR
  results:
  - task:
      type: sentence-similarity
      name: Passage Retrieval
    dataset:
      type: unicamp-dl/mmarco
      name: mMARCO-fr
      config: french
      split: validation
    metrics:
    - type: recall_at_1000
      name: Recall@1000
      value: 89.86
    - type: recall_at_500
      name: Recall@500
      value: 85.96
    - type: recall_at_100
      name: Recall@100
      value: 73.94
    - type: recall_at_10
      name: Recall@10
      value: 46.33
    - type: map_at_10
      name: MAP@10
      value: 24.15
    - type: ndcg_at_10
      name: nDCG@10
      value: 29.58
    - type: mrr_at_10
      name: MRR@10
      value: 24.68
---

# spladev2-camembert-base-mmarcoFR

This is a [SPLADE-max](https://doi.org/10.48550/arXiv.2109.10086) model for **French** that can be used for semantic search. The model maps queries and passages to 32k-dimensional sparse vectors which are used to compute relevance through cosine similarity.
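
Concretely, given the token-level logits $w_{ij}$ that the masked-language-modeling head assigns to vocabulary term $j$ at sequence position $i$, SPLADE-max builds each dimension of the representation by max-pooling a log-saturated ReLU over the sequence:

$$
w_j = \max_{i} \log\left(1 + \mathrm{ReLU}(w_{ij})\right)
$$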

## Usage

Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:

```python
import torch
from torch.nn.functional import relu, normalize
from transformers import AutoTokenizer, AutoModelForMaskedLM

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

tokenizer = AutoTokenizer.from_pretrained('antoinelouis/spladev2-camembert-base-mmarcoFR')
# The masked-language-modeling head provides the vocabulary-sized logits.
model = AutoModelForMaskedLM.from_pretrained('antoinelouis/spladev2-camembert-base-mmarcoFR')

q_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
p_input = tokenizer(passages, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    q_output = model(**q_input)
    p_output = model(**p_input)

# SPLADE-max: mask padded positions, apply a log-saturated ReLU, then max-pool over the sequence.
q_activations = torch.amax(torch.log1p(relu(q_output.logits * q_input['attention_mask'].unsqueeze(-1))), dim=1)
p_activations = torch.amax(torch.log1p(relu(p_output.logits * p_input['attention_mask'].unsqueeze(-1))), dim=1)

# L2-normalize so that the dot product below is a cosine similarity.
q_activations = normalize(q_activations, p=2, dim=1)
p_activations = normalize(p_activations, p=2, dim=1)

similarity = q_activations @ p_activations.T
print(similarity)
```
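
Since the representations live in the vocabulary space, their non-zero dimensions map back to actual tokens. Continuing from the snippet above, a minimal sketch for inspecting the top-weighted terms of the first query:

```python
# Continuing from the snippet above: map the highest-weighted dimensions of the
# first query's sparse vector back to vocabulary tokens (top-10 shown here).
weights, indices = torch.topk(q_activations[0], k=10)
for token, weight in zip(tokenizer.convert_ids_to_tokens(indices.tolist()), weights.tolist()):
    if weight > 0:
        print(f'{token}: {weight:.3f}')
```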

## Evaluation

The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of 8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k). To see how it compares to other neural retrievers in French, check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
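
For reference, the cut-off metrics reduce to simple rank computations over each query's ranked list. An illustrative sketch (the helper names and toy ids are hypothetical, not part of the evaluation code):

```python
# Hypothetical helpers illustrating R@k and MRR@k for a single query; the reported
# numbers are these values averaged over the 6,980 development queries.
def recall_at_k(ranking: list[str], relevant: set[str], k: int) -> float:
    return len(set(ranking[:k]) & relevant) / len(relevant)

def mrr_at_k(ranking: list[str], relevant: set[str], k: int) -> float:
    for rank, pid in enumerate(ranking[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0

ranking = ['p3', 'p7', 'p1']  # passage ids sorted by decreasing similarity
relevant = {'p1'}             # relevance judgments (qrels) for this query
print(recall_at_k(ranking, relevant, k=10), mrr_at_k(ranking, relevant, k=10))
```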

## Training

#### Data

The model is trained on the French training samples of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO that contains 8.8M passages and 539K training queries. We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset) with BM25 negatives.
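
As a rough sketch of what these triples look like, they can be streamed from the Hugging Face Hub; the `french` configuration name and the `query`/`positive`/`negative` field names below are assumptions, so check the dataset card for the exact layout:

```python
from datasets import load_dataset

# Assumed configuration and field names; verify against the dataset card.
triples = load_dataset('unicamp-dl/mmarco', 'french', split='train', streaming=True)
for i, triple in enumerate(triples):
    print({k: triple[k][:60] for k in ('query', 'positive', 'negative')})
    if i == 2:
        break
```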

#### Implementation

The model is initialized from the [almanach/camembert-base](https://huggingface.co/almanach/camembert-base) checkpoint and optimized via a combination of the InfoNCE ranking loss with a temperature of 0.05 and the FLOPS regularization loss, whose weights increase quadratically until step 33k and remain constant afterwards at lambda_q=3e-4 and lambda_d=1e-4. The model is fine-tuned on one 80GB NVIDIA H100 GPU for 100k steps using the AdamW optimizer with a batch size of 128 and a peak learning rate of 2e-5, warmed up over the first 4,000 steps and decayed with a linear schedule. The maximum sequence lengths for questions and passages were fixed to 32 and 128 tokens, respectively. Relevance scores are computed with the cosine similarity.
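
A minimal sketch of how the two objectives combine, assuming in-batch negatives where the i-th passage is the positive for the i-th query (the actual training code may organize batches differently):

```python
import torch
import torch.nn.functional as F

def flops_loss(activations: torch.Tensor) -> torch.Tensor:
    # FLOPS regularizer: squared mean activation per vocabulary dimension,
    # summed over the vocabulary, which pushes representations towards sparsity.
    return torch.sum(torch.mean(activations, dim=0) ** 2)

def reg_weight(step: int, target: float, ramp_steps: int = 33_000) -> float:
    # Quadratic increase of lambda until step 33k, constant afterwards.
    return target * min(1.0, (step / ramp_steps) ** 2)

def training_loss(q_act, p_act, step, temperature=0.05):
    # InfoNCE ranking loss over cosine similarities, positives on the diagonal.
    scores = F.normalize(q_act, dim=1) @ F.normalize(p_act, dim=1).T / temperature
    labels = torch.arange(scores.size(0), device=scores.device)
    ranking = F.cross_entropy(scores, labels)
    return (ranking
            + reg_weight(step, 3e-4) * flops_loss(q_act)
            + reg_weight(step, 1e-4) * flops_loss(p_act))
```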

## Citation

```bibtex
@online{louis2024decouvrir,
	author    = {Antoine Louis},
	title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
	publisher = {Hugging Face},
	month     = {mar},
	year      = {2024},
	url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}
```