izhx committed on
Commit
40ced75
·
verified ·
1 Parent(s): e244c93
Files changed (2)
  1. README.md +26 -7
  2. README_zh.md +23 -5
README.md CHANGED
@@ -4,7 +4,9 @@ license: apache-2.0
4
 
5
  **English** | [中文](./README_zh.md)
6
 
7
- ## Code implementation of new GTE embeddings
 
 
8
 
9
  This model is a BERT-like encoder with the following optimizations implemented:
10
 
@@ -12,7 +14,6 @@ This model is a BERT-like encoder with the following optimizations implemented:
12
  2. Substituting the conventional activation functions with Gated Linear Units (GLU) [^2].
13
  3. Setting attention dropout to 0 to use `xformers` and `flash_attn`.
14
  4. Using unpadding to eliminate the needless computations for padding tokens [^3]. (this is off by default and should be used in conjunction with `xformers` for optimal acceleration).
15
- 5. Setting `vocab_size` as a multiple of 64.
16
 
17
  ### Recommendation: Enable Unpadding and Acceleration with `xformers`
18
 
@@ -31,7 +32,8 @@ elif pytorch is installed using pip:
31
  ```
32
  For more information, refer to [Installing xformers](https://github.com/facebookresearch/xformers?tab=readme-ov-file#installing-xformers).
33
 
34
- Then, when loading the model, set `unpad_inputs` and `use_memory_efficient_attention` to `true`, and enable `fp16` mixed precision computation to achieve the fastest acceleration.
 
35
 
36
  ```python
37
  import torch
@@ -45,15 +47,18 @@ model = AutoModel.from_pretrained(
45
  trust_remote_code=True,
46
  unpad_inputs=True,
47
  use_memory_efficient_attention=True,
 
48
  ).to(device)
49
 
50
- with torch.autocast(device_type=device.type, dtype=torch.float16): # or bfloat16
51
-     with torch.inference_mode():
52
-         outputs = model(**inputs.to(device))
 
53
 
54
  ```
55
 
56
- Alternatively, you can directly modify the `unpad_inputs` and `use_memory_efficient_attention` settings to `true` in the model's `config.json`, eliminating the need to set them in the code.
 
57
 
58
 
59
  ---
@@ -73,6 +78,20 @@ Without the outstanding work of `nomicai`, the release of `gte-v1.5` could have
73
 
74
  ---
75
 
76
  [^1]: Su, Jianlin, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. "Roformer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.
77
 
78
  [^2]: Shazeer, Noam. "Glu variants improve transformer." arXiv preprint arXiv:2002.05202 (2020).
 
4
 
5
  **English** | [中文](./README_zh.md)
6
 
7
+ [Arxiv PDF](https://arxiv.org/pdf/2407.19669), [HF paper page](https://huggingface.co/papers/2407.19669)
8
+
9
+ ## Code implementation of new GTE encoders
10
 
11
  This model is a BERT-like encoder with the following optimizations implemented:
12
 
 
14
  2. Substituting the conventional activation functions with Gated Linear Units (GLU) [^2].
15
  3. Setting attention dropout to 0 to use `xformers` and `flash_attn`.
16
  4. Using unpadding to eliminate the needless computations for padding tokens [^3]. (this is off by default and should be used in conjunction with `xformers` for optimal acceleration).
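The unpadding idea in item 4 can be illustrated with a minimal pure-Python sketch. This is not the model's implementation: the helper names `unpad` and `repad` and the token ids are hypothetical, and real implementations work on tensors. The sketch only shows how non-padding tokens are packed into one flat sequence with cumulative offsets, so that no computation is spent on padding positions, and how the batch shape is restored afterwards.

```python
from itertools import accumulate


def unpad(input_ids, attention_mask):
    """Pack the non-padding tokens of a padded batch into one flat list.

    Returns the flat token list and cumulative sequence offsets, where
    sequence i occupies flat[cu_seqlens[i]:cu_seqlens[i + 1]].
    """
    flat_tokens = []
    seq_lens = []
    for ids, mask in zip(input_ids, attention_mask):
        kept = [tok for tok, m in zip(ids, mask) if m == 1]
        flat_tokens.extend(kept)
        seq_lens.append(len(kept))
    cu_seqlens = [0] + list(accumulate(seq_lens))
    return flat_tokens, cu_seqlens


def repad(flat_values, cu_seqlens, max_len, pad_value=0):
    """Restore the padded (batch, max_len) layout, padding on the right."""
    batch = []
    for start, end in zip(cu_seqlens, cu_seqlens[1:]):
        row = list(flat_values[start:end])
        batch.append(row + [pad_value] * (max_len - len(row)))
    return batch


# Hypothetical batch of two right-padded sequences (0 is the pad id here).
input_ids = [[101, 7, 8, 102, 0, 0], [101, 9, 102, 0, 0, 0]]
attention_mask = [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]]

flat, cu = unpad(input_ids, attention_mask)
restored = repad(flat, cu, max_len=6)
```

In this example only 7 of the 12 positions carry real tokens, so the packed form skips the 5 padding positions entirely.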
 
17
 
18
  ### Recommendation: Enable Unpadding and Acceleration with `xformers`
19
 
 
32
  ```
33
  For more information, refer to [Installing xformers](https://github.com/facebookresearch/xformers?tab=readme-ov-file#installing-xformers).
34
 
35
+ Then, when loading the model, set `unpad_inputs` and `use_memory_efficient_attention` to `true`,
36
+ and set `torch_dtype` to `torch.float16` (or `torch.bfloat16`) to achieve the acceleration.
37
 
38
  ```python
39
  import torch
 
47
  trust_remote_code=True,
48
  unpad_inputs=True,
49
  use_memory_efficient_attention=True,
50
+ torch_dtype=torch.float16
51
  ).to(device)
52
 
53
+ inputs = tokenizer(['test input'], truncation=True, max_length=8192, padding=True, return_tensors='pt')
54
+
55
+ with torch.inference_mode():
56
+     outputs = model(**inputs.to(device))
57
 
58
  ```
59
 
60
+ Alternatively, you can directly modify the `unpad_inputs` and `use_memory_efficient_attention` settings to `true` in the model's `config.json`,
61
+ eliminating the need to set them in the code.
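As a rough sketch of the `config.json` route, the snippet below flips the two flags in a local copy of the file. The helper name `enable_acceleration_flags` and the demo file contents are assumptions made for illustration; only the keys `unpad_inputs` and `use_memory_efficient_attention` come from the text above.

```python
import json
import tempfile
from pathlib import Path


def enable_acceleration_flags(config_path):
    """Set both flags to true in a local config.json and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["unpad_inputs"] = True
    config["use_memory_efficient_attention"] = True
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo on a throwaway file; the contents are a placeholder, not the real config.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(
        json.dumps({"placeholder_key": "placeholder", "unpad_inputs": False}),
        encoding="utf-8",
    )
    updated = enable_acceleration_flags(demo_path)
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))
```

All other keys in the file are left untouched, so the edit is safe to apply to a downloaded copy of the model directory.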
62
 
63
 
64
  ---
 
78
 
79
  ---
80
 
81
+ ## Citation
82
+ ```
83
+ @misc{zhang2024mgte,
84
+ title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
85
+ author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
86
+ year={2024},
87
+ eprint={2407.19669},
88
+ archivePrefix={arXiv},
89
+ primaryClass={cs.CL},
90
+ url={https://arxiv.org/abs/2407.19669},
91
+ }
92
+ ```
93
+
94
+
95
  [^1]: Su, Jianlin, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. "Roformer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.
96
 
97
  [^2]: Shazeer, Noam. "Glu variants improve transformer." arXiv preprint arXiv:2002.05202 (2020).
README_zh.md CHANGED
@@ -4,6 +4,8 @@ license: apache-2.0
4
 
5
  [English](./README.md) | **中文**
6
 
 
 
7
  ## Code implementation of new GTE models
8
 
9
  This model is a BERT-like encoder with the following optimizations:
@@ -12,7 +14,6 @@ license: apache-2.0
12
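  2. Replacing the ordinary activation functions with GLU (Gated Linear Unit) [^2].
13
  3. Setting attention dropout to 0 so that optimizations such as `xformers` and `flash_attn` can be applied.
14
  4. Using unpadding to remove needless computation on padding tokens [^3] (off by default; combine it with `flash_attn` or `xformers` for the greatest speedup).
15
- 5. Setting `vocab_size % 64 = 0`.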
16
 
17
 
18
  ### Recommendation: Enable Unpadding and Acceleration with xformers
@@ -32,7 +33,7 @@ elif pytorch is installed using pip:
32
  ```
33
  For more information, refer to [installing-xformers](https://github.com/facebookresearch/xformers?tab=readme-ov-file#installing-xformers).
34
 
35
- Then, when loading the model, set `unpad_inputs` and `use_memory_efficient_attention` to `true` and enable `fp16` mixed-precision computation to get the fastest acceleration.
36
 
37
  ```python
38
  import torch
@@ -46,11 +47,13 @@ model = AutoModel.from_pretrained(
46
  trust_remote_code=True,
47
  unpad_inputs=True,
48
  use_memory_efficient_attention=True,
 
49
  ).to(device)
50
 
51
- with torch.autocast(device_type=device.type, dtype=torch.float16): # or bfloat16
52
-     with torch.inference_mode():
53
-         outputs = model(**inputs.to(device))
 
54
 
55
  ```
56
  Alternatively, you can set `unpad_inputs` and `use_memory_efficient_attention` to `true` directly in the model's `config.json`, which removes the need to set them in code.
@@ -72,6 +75,21 @@ with torch.autocast(device_type=device.type, dtype=torch.float16): # or bfloat1
72
 
73
  ---
74
 
75
  [^1]: Su, Jianlin, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. "Roformer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.
76
 
77
  [^2]: Shazeer, Noam. "Glu variants improve transformer." arXiv preprint arXiv:2002.05202 (2020).
 
4
 
5
  [English](./README.md) | **中文**
6
 
7
+ [Arxiv PDF](https://arxiv.org/pdf/2407.19669), [HF paper page](https://huggingface.co/papers/2407.19669)
8
+
9
  ## Code implementation of new GTE models
10
 
11
  This model is a BERT-like encoder with the following optimizations:
 
14
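  2. Replacing the ordinary activation functions with GLU (Gated Linear Unit) [^2].
15
  3. Setting attention dropout to 0 so that optimizations such as `xformers` and `flash_attn` can be applied.
16
  4. Using unpadding to remove needless computation on padding tokens [^3] (off by default; combine it with `flash_attn` or `xformers` for the greatest speedup).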
 
17
 
18
 
19
  ### Recommendation: Enable Unpadding and Acceleration with xformers
 
33
  ```
34
  For more information, refer to [installing-xformers](https://github.com/facebookresearch/xformers?tab=readme-ov-file#installing-xformers).
35
 
36
+ Then, when loading the model, set `unpad_inputs` and `use_memory_efficient_attention` to `true`, and set `torch_dtype` to `torch.float16` (or `torch.bfloat16`) to get the acceleration.
37
 
38
  ```python
39
  import torch
 
47
  trust_remote_code=True,
48
  unpad_inputs=True,
49
  use_memory_efficient_attention=True,
50
+ torch_dtype=torch.float16
51
  ).to(device)
52
 
53
+ inputs = tokenizer(['test input'], truncation=True, max_length=8192, padding=True, return_tensors='pt')
54
+
55
+ with torch.inference_mode():
56
+     outputs = model(**inputs.to(device))
57
 
58
  ```
59
  Alternatively, you can set `unpad_inputs` and `use_memory_efficient_attention` to `true` directly in the model's `config.json`, which removes the need to set them in code.
 
75
 
76
  ---
77
 
78
+
79
+ ## Citation
80
+ ```
81
+ @misc{zhang2024mgte,
82
+ title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
83
+ author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
84
+ year={2024},
85
+ eprint={2407.19669},
86
+ archivePrefix={arXiv},
87
+ primaryClass={cs.CL},
88
+ url={https://arxiv.org/abs/2407.19669},
89
+ }
90
+ ```
91
+
92
+
93
  [^1]: Su, Jianlin, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. "Roformer: Enhanced transformer with rotary position embedding." Neurocomputing 568 (2024): 127063.
94
 
95
  [^2]: Shazeer, Noam. "Glu variants improve transformer." arXiv preprint arXiv:2002.05202 (2020).