Update README.md
README.md
CHANGED
@@ -2622,7 +2622,8 @@ a SOTA instruction-tuned multi-lingual embedding model that ranked 2nd in MTEB a
 
 - **Developed by:** Institute for Intelligent Computing, Alibaba Group
 - **Model type:** Text Embeddings
-- **Paper:**
+- **Paper:** [mGTE: Generalized Long-Context Text Representation and Reranking
+Models for Multilingual Text Retrieval](https://arxiv.org/pdf/2407.19669)
 
 <!-- - **Demo [optional]:** [More Information Needed] -->
 
@@ -2717,7 +2718,7 @@ console.log(similarities); // [34.504930869007296, 64.03973265120138, 19.5200426
 ### Training Data
 
 - Masked language modeling (MLM): `c4-en`
-- Weak-supervised contrastive
+- Weak-supervised contrastive pre-training (CPT): [GTE](https://arxiv.org/pdf/2308.03281.pdf) pre-training data
 - Supervised contrastive fine-tuning: [GTE](https://arxiv.org/pdf/2308.03281.pdf) fine-tuning data
 
 ### Training Procedure
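The two contrastive stages named in the hunk above (weak-supervised CPT and supervised fine-tuning) typically optimize an InfoNCE-style objective with in-batch negatives over (query, document) pairs. The sketch below is only a rough PyTorch illustration of that objective, not the authors' training code; the embedding dimension and the temperature of 0.05 are assumed placeholder values.

```python
import torch
import torch.nn.functional as F

def infonce_in_batch(query_emb: torch.Tensor,
                     doc_emb: torch.Tensor,
                     temperature: float = 0.05) -> torch.Tensor:
    """Illustrative in-batch-negatives InfoNCE loss for (query, positive doc) pairs.

    query_emb, doc_emb: [batch_size, dim] encoder outputs; the i-th document is the
    positive for the i-th query and every other document in the batch is a negative.
    The temperature is an assumed placeholder, not a documented hyperparameter.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                     # cosine-similarity logits
    labels = torch.arange(q.size(0), device=q.device)  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy call with random tensors standing in for real text embeddings.
loss = infonce_in_batch(torch.randn(8, 768), torch.randn(8, 768))
print(loss.item())
```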
@@ -2728,8 +2729,8 @@ And then, we resample the data, reducing the proportion of short texts, and cont
 
 The entire training process is as follows:
 - MLM-2048: lr 5e-4, mlm_probability 0.3, batch_size 4096, num_steps 70000, rope_base 10000
-- MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 20000, rope_base 500000
-
+- [MLM-8192](https://huggingface.co/Alibaba-NLP/gte-en-mlm-base): lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 20000, rope_base 500000
+- CPT: max_len 512, lr 2e-4, batch_size 32768, num_steps 100000
 - Fine-tuning: TODO
 
 
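For orientation, the MLM hyperparameters listed in the hunk above map naturally onto the standard Hugging Face `transformers` masked-LM utilities. The sketch below is only an illustration of that mapping under an assumed placeholder tokenizer and a toy input; it is not the released training script, and the per-device batch size is a stand-in for the global batch of 4096.

```python
# Illustrative mapping of the MLM-2048 settings onto Hugging Face `transformers`.
# The tokenizer checkpoint and the toy sentence are placeholders, not the gte backbone.
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder tokenizer

# mlm_probability 0.3: 30% of tokens per sequence are selected for masking.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.3)

encoded = tokenizer("a toy sentence standing in for c4-en text",
                    truncation=True, max_length=2048)
batch = collator([encoded])
print(batch["labels"])  # -100 everywhere except the positions chosen for masking

# lr 5e-4 and num_steps 70000 come from the MLM-2048 stage above; the global batch
# size of 4096 would come from many devices and/or gradient accumulation.
args = TrainingArguments(output_dir="mlm-2048", learning_rate=5e-4, max_steps=70_000,
                         per_device_train_batch_size=32)
```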
@@ -2766,12 +2767,22 @@ The gte evaluation setting: `mteb==1.2.0, fp16 auto mix precision, max_length=81
 If you find our paper or models helpful, please consider citing them as follows:
 
 ```
-@
-  title={
-  author={
-
-
+@misc{zhang2024mgte,
+  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
+  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
+  year={2024},
+  eprint={2407.19669},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2407.19669},
+}
+@misc{li2023gte,
+  title={Towards General Text Embeddings with Multi-stage Contrastive Learning},
+  author={Zehan Li and Xin Zhang and Yanzhao Zhang and Dingkun Long and Pengjun Xie and Meishan Zhang},
+  year={2023},
+  eprint={2308.03281},
+  archivePrefix={arXiv},
+  primaryClass={cs.CL},
+  url={https://arxiv.org/abs/2308.03281},
 }
 ```
-
-