antoinelouis committed on
Commit 5dfb1a9 (1 parent: d8ccad3)

Update README.md

Files changed (1)
  1. README.md (+43, -25)
README.md CHANGED
@@ -10,6 +10,30 @@ tags:
  - passage-reranking
  library_name: sentence-transformers
  base_model: cmarkea/distilcamembert-base
  ---

  # crossencoder-distilcamembert-mmarcoFR

@@ -75,17 +99,10 @@ print(scores)

  ## Evaluation

- We evaluate the model on 500 random training queries from [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/) (which were excluded from training) by reranking
- subsets of candidate passages comprising at least one relevant and up to 200 BM25 negative passages for each query. Below, we compare the model's performance with other
- cross-encoder models fine-tuned on the same dataset. We report the R-precision (RP), mean reciprocal rank (MRR), and recall at various cut-offs (R@k).
-
- |   | model | Vocab. | #Param. | Size | RP | MRR@10 | R@10(↑) | R@20 | R@50 | R@100 |
- |--:|:------|:-------|--------:|-----:|---:|-------:|--------:|-----:|-----:|------:|
- | 1 | [crossencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-camembert-base-mmarcoFR) | fr | 110M | 443MB | 35.65 | 50.44 | 82.95 | 91.50 | 96.80 | 98.80 |
- | 2 | [crossencoder-mMiniLMv2-L12-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L12-mmarcoFR) | fr,99+ | 118M | 471MB | 34.37 | 51.01 | 82.23 | 90.60 | 96.45 | 98.40 |
- | 3 | **crossencoder-distilcamembert-mmarcoFR** | fr | 68M | 272MB | 27.28 | 43.71 | 80.30 | 89.10 | 95.55 | 98.60 |
- | 4 | [crossencoder-electra-base-french-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-electra-base-french-mmarcoFR) | fr | 110M | 443MB | 28.32 | 45.28 | 79.22 | 87.15 | 93.15 | 95.75 |
- | 5 | [crossencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/crossencoder-mMiniLMv2-L6-mmarcoFR) | fr,99+ | 107M | 428MB | 33.92 | 49.33 | 79.00 | 88.35 | 94.80 | 98.20 |

  ***

@@ -94,28 +111,29 @@ cross-encoder models fine-tuned on the same dataset. We report the R-precision (

  #### Data

  We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
- that contains 8.8M passages and 539K training queries. We sample 1M question-passage pairs from the official ~39.8M
- [training triples](https://microsoft.github.io/msmarco/Datasets.html#passage-ranking-dataset) with a positive-to-negative ratio of 4 (i.e., 25% of the pairs are
- relevant and 75% are irrelevant).

  #### Implementation

  The model is initialized from the [cmarkea/distilcamembert-base](https://huggingface.co/cmarkea/distilcamembert-base) checkpoint and optimized via the binary cross-entropy loss
- (as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 32GB NVIDIA V100 GPU for 10 epochs (i.e., 312.4k steps) using the AdamW optimizer
- with a batch size of 32 and a peak learning rate of 2e-5, with warm-up over the first 500 steps and linear scheduling. We set the maximum sequence length of the
- concatenated question-passage pairs to 512 tokens. We use the sigmoid function to get scores between 0 and 1.

  ***

  ## Citation

  ```bibtex
- @online{louis2023,
-   author = 'Antoine Louis',
-   title = 'crossencoder-distilcamembert-mmarcoFR: A Cross-Encoder Model Trained on 1M sentence pairs in French',
-   publisher = 'Hugging Face',
-   month = 'september',
-   year = '2023',
-   url = 'https://huggingface.co/antoinelouis/crossencoder-distilcamembert-mmarcoFR',
  }
- ```
 
  - passage-reranking
  library_name: sentence-transformers
  base_model: cmarkea/distilcamembert-base
+ model-index:
+ - name: crossencoder-distilcamembert-mmarcoFR
+   results:
+   - task:
+       type: text-classification
+       name: Passage Reranking
+     dataset:
+       type: unicamp-dl/mmarco
+       name: mMARCO-fr
+       config: french
+       split: validation
+     metrics:
+     - type: recall_at_500
+       name: Recall@500
+       value: 96.15
+     - type: recall_at_100
+       name: Recall@100
+       value: 84.39
+     - type: recall_at_10
+       name: Recall@10
+       value: 56.33
+     - type: mrr_at_10
+       name: MRR@10
+       value: 31.86
  ---

  # crossencoder-distilcamembert-mmarcoFR
 
  ## Evaluation

+ The model is evaluated on the smaller development set of [mMARCO-fr](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/), which consists of 6,980 queries for which
+ an ensemble of 1,000 passages containing the positive(s) and [ColBERTv2 hard negatives](https://huggingface.co/datasets/antoinelouis/msmarco-dev-small-negatives) needs
+ to be reranked. We report the mean reciprocal rank (MRR) and recall at various cut-offs (R@k). To see how it compares to other neural retrievers in French, check out
+ the [*DécouvrIR*](https://huggingface.co/spaces/antoinelouis/decouvrir) leaderboard.
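
As a rough illustration of how these reranking metrics can be computed from such a run, the sketch below derives MRR@10 and R@k from per-query candidate lists sorted by cross-encoder score. The `run`/`qrels` structures and function names are illustrative only, not part of the card's evaluation code.

```python
from typing import Dict, List, Set

def mrr_at_k(ranked: List[str], relevant: Set[str], k: int = 10) -> float:
    """Reciprocal rank of the first relevant passage in the top-k, 0 if none."""
    for rank, pid in enumerate(ranked[:k], start=1):
        if pid in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant passages found in the top-k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def evaluate(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], cutoffs=(10, 100, 500)) -> Dict[str, float]:
    """run: qid -> passage ids sorted by descending score; qrels: qid -> relevant passage ids."""
    totals = {f"R@{k}": 0.0 for k in cutoffs}
    totals["MRR@10"] = 0.0
    for qid, ranked in run.items():
        totals["MRR@10"] += mrr_at_k(ranked, qrels[qid], k=10)
        for k in cutoffs:
            totals[f"R@{k}"] += recall_at_k(ranked, qrels[qid], k)
    # Average over queries and report percentages, as in the table above.
    return {name: round(100 * value / len(run), 2) for name, value in totals.items()}
```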

  ***

  #### Data

  We use the French training samples from the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset, a multilingual machine-translated version of MS MARCO
+ that contains 8.8M passages and 539K training queries. We do not use the BM25 negatives provided by the official dataset but instead sample harder negatives mined from
+ 12 distinct dense retrievers, using the [msmarco-hard-negatives](https://huggingface.co/datasets/sentence-transformers/msmarco-hard-negatives#msmarco-hard-negativesjsonlgz)
+ distillation dataset. In total, we sample 2.6M training triplets of the form (query, passage, relevance) with a positive-to-negative ratio of 1 (i.e., 50% of the pairs are
+ relevant and 50% are irrelevant).
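
A minimal sketch of what such 1:1 sampling could look like with sentence-transformers' `InputExample` is shown below. The `queries`/`positives`/`hard_negatives` dictionaries and the helper name are hypothetical; in practice the negatives would come from the msmarco-hard-negatives file linked above.

```python
import random
from sentence_transformers import InputExample

def build_pairs(queries: dict, positives: dict, hard_negatives: dict, seed: int = 42) -> list:
    """For each query, pair every relevant passage (label=1.0) with one mined hard
    negative (label=0.0), giving a 50/50 split of relevant and irrelevant pairs."""
    rng = random.Random(seed)
    samples = []
    for qid, query in queries.items():
        negs = rng.sample(hard_negatives[qid], k=min(len(positives[qid]), len(hard_negatives[qid])))
        for pos in positives[qid]:
            samples.append(InputExample(texts=[query, pos], label=1.0))
        for neg in negs:
            samples.append(InputExample(texts=[query, neg], label=0.0))
    rng.shuffle(samples)
    return samples
```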

  #### Implementation

  The model is initialized from the [cmarkea/distilcamembert-base](https://huggingface.co/cmarkea/distilcamembert-base) checkpoint and optimized via the binary cross-entropy loss
+ (as in [monoBERT](https://doi.org/10.48550/arXiv.1910.14424)). It is fine-tuned on one 80GB NVIDIA H100 GPU for 20k steps using the AdamW optimizer
+ with a batch size of 128 and a constant learning rate of 2e-5. We set the maximum sequence length of the concatenated question-passage pairs to 256 tokens.
+ We use the sigmoid function to get scores between 0 and 1.
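
For concreteness, a comparable fine-tuning run can be sketched with the sentence-transformers `CrossEncoder` API as below. The hyperparameters come from the description above; everything else (the `train_samples` variable, single epoch, output path) is an assumption, not the exact training script.

```python
from torch.utils.data import DataLoader
from sentence_transformers.cross_encoder import CrossEncoder

# Assumption: `train_samples` is the list of labeled InputExample pairs built above.
model = CrossEncoder("cmarkea/distilcamembert-base", num_labels=1, max_length=256)
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=128)

# With num_labels=1, CrossEncoder.fit() trains with a binary cross-entropy objective
# (BCEWithLogitsLoss), and model.predict() applies a sigmoid to map scores to [0, 1].
model.fit(
    train_dataloader=train_dataloader,
    epochs=1,                      # ~2.6M pairs / batch size 128 ≈ 20k steps
    scheduler="constantlr",        # constant learning rate, as described above
    warmup_steps=0,
    optimizer_params={"lr": 2e-5},
    show_progress_bar=True,
)
model.save("output/crossencoder-distilcamembert-mmarcoFR")
```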

  ***

  ## Citation

  ```bibtex
+ @online{louis2024decouvrir,
+   author = 'Antoine Louis',
+   title = 'DécouvrIR: A Benchmark for Evaluating the Robustness of Information Retrieval Models in French',
+   publisher = 'Hugging Face',
+   month = 'mar',
+   year = '2024',
+   url = 'https://huggingface.co/spaces/antoinelouis/decouvrir',
  }
+ ```