antoinelouis committed on
Commit
0b332c6
1 Parent(s): 83e269b

Create README.md

  1. README.md +117 -0
README.md ADDED
@@ -0,0 +1,117 @@
---
pipeline_tag: sentence-similarity
language: fr
license: mit
datasets:
  - unicamp-dl/mmarco
metrics:
  - recall
tags:
  - passage-retrieval
library_name: transformers
base_model: almanach/camembert-base
model-index:
  - name: spladev2-camembert-base-mmarcoFR
    results:
      - task:
          type: sentence-similarity
          name: Passage Retrieval
        dataset:
          type: unicamp-dl/mmarco
          name: mMARCO-fr
          config: french
          split: validation
        metrics:
          - type: recall_at_1000
            name: Recall@1000
            value: 89.86
          - type: recall_at_500
            name: Recall@500
            value: 85.96
          - type: recall_at_100
            name: Recall@100
            value: 73.94
          - type: recall_at_10
            name: Recall@10
            value: 46.33
          - type: map_at_10
            name: MAP@10
            value: 24.15
          - type: ndcg_at_10
            name: nDCG@10
            value: 29.58
          - type: mrr_at_10
            name: MRR@10
            value: 24.68
---

# spladev2-camembert-base-mmarcoFR

This is a [SPLADE-max](https://doi.org/10.48550/arXiv.2109.10086) model for **French** that can be used for semantic search. The model maps queries and passages to
32k-dimensional sparse vectors, which are used to compute relevance through cosine similarity.

## Usage

Start by installing the [library](https://huggingface.co/docs/transformers): `pip install -U transformers`. Then, you can use the model like this:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

queries = ["Ceci est un exemple de requête.", "Voici un second exemple."]
passages = ["Ceci est un exemple de passage.", "Et voilà un deuxième exemple."]

# SPLADE reads the masked-language-modeling logits, so the checkpoint is loaded with its MLM head.
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/spladev2-camembert-base-mmarcoFR')
model = AutoModelForMaskedLM.from_pretrained('antoinelouis/spladev2-camembert-base-mmarcoFR')

q_input = tokenizer(queries, padding=True, truncation=True, return_tensors='pt')
p_input = tokenizer(passages, padding=True, truncation=True, return_tensors='pt')

with torch.no_grad():
    q_output = model(**q_input)
    p_output = model(**p_input)

# SPLADE-max pooling: zero out padding positions, apply log(1 + ReLU(logit)), then take the max over tokens.
q_activations = torch.amax(torch.log1p(torch.relu(q_output.logits * q_input['attention_mask'].unsqueeze(-1))), dim=1)
p_activations = torch.amax(torch.log1p(torch.relu(p_output.logits * p_input['attention_mask'].unsqueeze(-1))), dim=1)

# L2-normalize so that the dot product below equals the cosine similarity.
q_embeddings = torch.nn.functional.normalize(q_activations, p=2, dim=1)
p_embeddings = torch.nn.functional.normalize(p_activations, p=2, dim=1)

similarity = q_embeddings @ p_embeddings.T
print(similarity)
```
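
To make the pooling step concrete, here is a minimal pure-Python sketch of the same computation on a toy input. It is an illustration only: the helper names `splade_max` and `cosine` are hypothetical, and the 4-term vocabulary and logit values are made up.

```python
import math

def splade_max(logits, attention_mask):
    """Pool per-token vocabulary logits into a single sparse vector.

    logits: one row per token, each row holding one score per vocabulary term.
    attention_mask: 0/1 flags, with 0 marking padding tokens.
    """
    pooled = [0.0] * len(logits[0])
    for row, keep in zip(logits, attention_mask):
        for j, score in enumerate(row):
            # A masked logit becomes 0 and log1p(relu(0)) == 0, so padding never wins the max.
            weight = math.log1p(max(score * keep, 0.0))
            pooled[j] = max(pooled[j], weight)
    return pooled

def cosine(u, v):
    """Cosine similarity between two dense lists of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy input (made up): 3 token positions, the last one being padding, over 4 vocabulary terms.
logits = [
    [2.0, -1.0, 0.0, 0.5],
    [0.5, -3.0, 0.0, 4.0],
    [9.0, 9.0, 9.0, 9.0],  # padding row, ignored thanks to the mask
]
mask = [1, 1, 0]

vector = splade_max(logits, mask)
print(vector)  # only terms 0 and 3 receive a non-zero weight
```

Terms whose logits are never positive end up with a weight of exactly zero, which is what makes the resulting vectors sparse.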

## Evaluation

The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for a corpus of
8.8M candidate passages. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (nDCG), mean average precision (MAP), and recall at various cut-offs (R@k).
To see how it compares to other neural retrievers in French, check out the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
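
As a rough guide to how such ranking metrics are computed for a single query, the sketch below implements recall, MRR, and nDCG at a cut-off with binary relevance labels. The evaluation script behind the reported numbers is not shown in this card, so the function names and the toy ranking are assumptions for illustration; corpus-level scores are then obtained by averaging over all queries.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages that appear in the top k."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant passage in the top k, or 0 if none."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k):
    """nDCG with binary gains: DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, pid in enumerate(ranked_ids[:k], start=1)
        if pid in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

# Toy example (made up): the only relevant passage is retrieved at rank 2.
ranked = ["p7", "p2", "p9", "p4"]
relevant = {"p2"}

print(recall_at_k(ranked, relevant, 1))   # the relevant passage is not in the top 1
print(recall_at_k(ranked, relevant, 10))  # it is in the top 10
print(mrr_at_k(ranked, relevant, 10))
print(ndcg_at_k(ranked, relevant, 10))
```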

## Training

#### Data

The model is trained on the French training samples of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO that
contains 8.8M passages and 539K training queries. We sample 12.8M (q, p+, p-) triples from the official ~39.8M [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset)
with BM25 negatives.

#### Implementation

The model is initialized from the [almanach/camembert-base](https://huggingface.co/almanach/camembert-base) checkpoint and optimized via a combination of the InfoNCE
ranking loss with a temperature of 0.05 and the FLOPS regularization loss, whose weight lambda increases quadratically until step 33k and then remains constant at
lambda_q = 3e-4 for queries and lambda_d = 1e-4 for passages. The model is fine-tuned on one 80GB NVIDIA H100 GPU for 100k steps using the AdamW optimizer with a batch size
of 128 and a peak learning rate of 2e-5, with warm-up over the first 4,000 steps and linear scheduling. The maximum sequence lengths for queries and passages are fixed to
32 and 128 tokens, respectively. Relevance scores are computed with cosine similarity.
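
The training recipe above can be sketched in plain Python as follows. This is not the training code used for this model: the function names are hypothetical, the decay of the learning rate to zero at the last step is an assumption about what "linear scheduling" means, and the FLOPS and InfoNCE functions follow the usual textbook definitions (sum over vocabulary terms of the squared mean activation in the batch, and cross-entropy over temperature-scaled scores). Only the constants come from the card.

```python
import math

# Hyperparameters quoted in the card.
TOTAL_STEPS = 100_000
BATCH_SIZE = 128
WARMUP_STEPS = 4_000
PEAK_LR = 2e-5
LAMBDA_RAMP_STEPS = 33_000
LAMBDA_Q, LAMBDA_D = 3e-4, 1e-4
TEMPERATURE = 0.05

def learning_rate(step):
    """Linear warm-up to the peak, then linear decay (decay to zero is an assumption)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)

def flops_weight(step, final_weight):
    """Quadratic ramp of the regularization weight, constant once the ramp is over."""
    progress = min(step / LAMBDA_RAMP_STEPS, 1.0)
    return final_weight * progress ** 2

def flops_penalty(batch):
    """FLOPS regularizer: sum over vocabulary terms of the squared mean activation."""
    n = len(batch)
    return sum((sum(vec[j] for vec in batch) / n) ** 2 for j in range(len(batch[0])))

def info_nce(scores, positive_index, temperature=TEMPERATURE):
    """Cross-entropy of the positive passage against all candidates of one query."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)  # subtracted for numerical stability
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scaled))
    return log_sum_exp - scaled[positive_index]

# 128 triples per step over 100k steps covers 12,800,000 triples in total.
print(BATCH_SIZE * TOTAL_STEPS)

# Regularization weights halfway through the ramp and after it.
print(flops_weight(16_500, LAMBDA_Q), flops_weight(50_000, LAMBDA_Q))

# Loss for one query whose positive passage (index 0) has the highest cosine score.
print(info_nce([0.9, 0.2, 0.1], positive_index=0))
```

Under these assumptions, the total loss at a given step would be the ranking loss plus `flops_weight(step, LAMBDA_Q)` times the penalty on the query vectors and `flops_weight(step, LAMBDA_D)` times the penalty on the passage vectors.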

## Citation

```bibtex
@online{louis2024decouvrir,
  author    = {Antoine Louis},
  title     = {DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French},
  publisher = {Hugging Face},
  month     = mar,
  year      = {2024},
  url       = {https://huggingface.co/spaces/antoinelouis/decouvrir},
}
```