Update README.md
For this purpose, [train_msmarco_v3_margin_MSE.py](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/training/train_msmarco_v3_margin_MSE.py) is provided by BEIR.
The unique feature here is the so-called "hard negatives", which were created by a special approach:

> We use the MSMARCO Hard Negatives File (provided by Nils Reimers): https://sbert.net/datasets/msmarco-hard-negatives.jsonl.gz
>
> Negative passages are hard negative examples that were mined using different dense embedding, cross-encoder and lexical search methods.
> Contains up to 50 negatives for each of the four retrieval systems: [bm25, msmarco-distilbert-base-tas-b, msmarco-MiniLM-L-6-v3, msmarco-distilbert-base-v3]
>
> Each positive and negative passage comes with a score from a Cross-Encoder (msmarco-MiniLM-L-6-v3). This allows denoising, i.e. removing false negative passages that are actually relevant for the query.

[Source](https://github.com/beir-cellar/beir/blob/main/examples/retrieval/training/train_msmarco_v3_margin_MSE.py)

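The denoising idea from the quote above can be sketched as follows. This is an illustrative example, not the actual BEIR script: the record layout, IDs, scores, and the `margin` threshold are all made up for demonstration.

```python
# Sketch: denoising mined hard negatives with cross-encoder scores.
# The record layout below is simplified and illustrative, NOT the exact
# schema of msmarco-hard-negatives.jsonl.gz.

def denoise_negatives(record, margin=3.0):
    """Keep only negatives whose cross-encoder score is at least `margin`
    below the positive's score; higher-scoring "negatives" are likely
    false negatives (i.e. actually relevant to the query)."""
    pos_score = max(score for _, score in record["pos"])
    return [
        (pid, score)
        for pid, score in record["neg"]
        if score < pos_score - margin
    ]

record = {
    "qid": 571018,                                  # made-up IDs and scores
    "pos": [(7349777, 9.2)],                        # (passage id, CE score)
    "neg": [(1234, 2.1), (5678, 8.9), (9012, -1.4)],
}

kept = denoise_negatives(record)
# (5678, 8.9) is dropped: its CE score is too close to the positive's,
# so it is treated as a false negative.
```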
> MarginMSELoss is based on the paper of Hofstätter et al. As for MultipleNegativesRankingLoss, we have triplets: (query, passage1, passage2). In contrast to MultipleNegativesRankingLoss, passage1 and passage2 do not have to be strictly positive/negative, both can be relevant or not relevant for a given query.
>
> We then compute the Cross-Encoder score for (query, passage1) and (query, passage2). We provide scores for 160 million such pairs in our msmarco-hard-negatives dataset. We then compute the distance: CE_distance = CEScore(query, passage1) - CEScore(query, passage2)
>
> For our bi-encoder training, we encode query, passage1, and passage2 into vector spaces and then measure the dot-product between (query, passage1) and (query, passage2). Again, we measure the distance: BE_distance = DotScore(query, passage1) - DotScore(query, passage2)
>
> We then want to ensure that the distance predicted by the bi-encoder is close to the distance predicted by the cross-encoder, i.e., we optimize the mean-squared error (MSE) between CE_distance and BE_distance.
>
> An advantage of MarginMSELoss compared to MultipleNegativesRankingLoss is that we don't require a positive and negative passage. As mentioned before, MS MARCO is redundant, and many passages contain the same or similar content. With MarginMSELoss, we can train on two relevant passages without issues: In that case, the CE_distance will be smaller and we expect that our bi-encoder also puts both passages closer in the vector space.
>
> A disadvantage of MarginMSELoss is the slower training time: We need way more epochs to get good results. In MultipleNegativesRankingLoss, with a batch size of 64, we compare one query against 128 passages. With MarginMSELoss, we compare a query only against two passages.
>
> [Source](https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/ms_marco/README.md)

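The loss described in the quote can be illustrated with a minimal sketch. The vectors and cross-encoder scores below are toy values; a real implementation (e.g. `MarginMSELoss` in sentence-transformers) operates on batches of model outputs.

```python
# Minimal sketch of the MarginMSE objective: push the bi-encoder's score
# margin toward the cross-encoder's score margin. All values are toy
# examples, not real model outputs.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def margin_mse(query_vec, p1_vec, p2_vec, ce_score_p1, ce_score_p2):
    """Squared error between the bi-encoder margin and the
    cross-encoder margin for one (query, passage1, passage2) triplet."""
    be_distance = dot(query_vec, p1_vec) - dot(query_vec, p2_vec)
    ce_distance = ce_score_p1 - ce_score_p2
    return (be_distance - ce_distance) ** 2

q  = [1.0, 0.0]   # toy query embedding
p1 = [0.8, 0.2]   # toy embedding of passage1
p2 = [0.1, 0.9]   # toy embedding of passage2

loss = margin_mse(q, p1, p2, ce_score_p1=4.5, ce_score_p2=3.0)
# BE_distance = 0.8 - 0.1 = 0.7; CE_distance = 1.5; loss ≈ (0.7 - 1.5)^2 = 0.64
```

During training this quantity would be averaged over a batch and minimized by gradient descent on the bi-encoder's parameters.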
Since the MSMARCO dataset has been translated into different languages and the "hard negatives" file only contains the IDs of queries and passages,
the approach just presented can also be applied to languages other than English.
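Because the mined hard negatives reference only IDs, they can simply be joined against a translated corpus. A minimal sketch, with made-up IDs and placeholder German texts:

```python
# Sketch: reusing English-mined hard-negative IDs with a translated
# (e.g. German) MSMARCO corpus. IDs and texts are placeholders.

german_queries = {
    571018: "was ist ein bi-encoder?",
}
german_passages = {
    7349777: "Ein Bi-Encoder kodiert Anfragen und Passagen getrennt ...",
    1234: "Eine nicht relevante Passage über ein anderes Thema ...",
}

def build_triplet(qid, pos_pid, neg_pid):
    """Resolve the language-independent IDs against translated texts,
    yielding a (query, positive, negative) training triplet."""
    return (
        german_queries[qid],
        german_passages[pos_pid],
        german_passages[neg_pid],
    )

triplet = build_triplet(571018, 7349777, 1234)
```

The same triplet structure can then be fed into the Margin MSE training script, reusing the English cross-encoder scores that come with the hard-negatives file.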

The following table shows the evaluation results for different approaches and models:

**model**|**NDCG@1**|**NDCG@10**|**NDCG@100**|**comment**
:-----:|:-----:|:-----:|:-----:|:-----:
bi-encoder_msmarco_bert-base_german (new) | 0.5300 <br /> 🏆 | 0.7196 <br /> 🏆 | 0.7360 <br /> 🏆 | "OUR model"
[deepset/gbert-base-germandpr-X](https://huggingface.co/deepset/gbert-base-germandpr-ctx_encoder) | 0.4828 | 0.6970 | 0.7147 | "has two encoder models (one for queries and one for corpus), is SOTA approach"
[distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) | 0.4561 | 0.6347 | 0.6613 | "trained on 15 languages"
[paraphrase-multilingual-mpnet-base-v2](https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2) | 0.4511 | 0.6328 | 0.6592 | "trained on huge corpus, support for 50+ languages"
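NDCG@k, the metric reported in the table, rewards relevant results ranked near the top and discounts those further down. The sketch below assumes binary relevance for illustration; the numbers in the table come from BEIR's own evaluation pipeline, not from this function.

```python
import math

# Sketch of NDCG@k with binary relevance (illustrative only).
def ndcg_at_k(ranked_relevances, k):
    """ranked_relevances: relevance of each retrieved doc, best-ranked first."""
    def dcg(rels):
        # Each hit is discounted by the log of its (1-based) rank + 1.
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))

    ideal = dcg(sorted(ranked_relevances, reverse=True)[:k])
    return dcg(ranked_relevances[:k]) / ideal if ideal > 0 else 0.0

ndcg_at_k([1, 0, 0], 1)   # 1.0: the relevant doc is ranked first
ndcg_at_k([0, 1, 0], 3)   # ~0.63: the relevant doc only appears at rank 2
```

This is why NDCG@1 is the hardest column in the table: only a relevant document at the very first rank scores at all.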