Update README.md
README.md
CHANGED
@@ -1057,9 +1057,26 @@ model-index:

## stella model

**Training data:**
@@ -1074,12 +1091,23 @@ stella is a general-purpose Chinese text-encoding model; there are currently two versions: base

4. cosent loss[5]
5. One iterator for each type of data, computing the loss and updating the model separately for each

**Initial weights:**\
stella-base-zh and stella-large-zh use piccolo-base-zh[6] and piccolo-large-zh respectively as their base models, and the position embeddings from 512 to 1024 are initialized with hierarchically decomposed position encoding[7].\
Many thanks to SenseTime Research for open-sourcing the [piccolo models](https://huggingface.co/sensenova).

stella is a general-purpose Chinese text-encoding model; there are currently two versions: base and large.

The training data mainly includes:

@@ -1101,21 +1129,72 @@ stella-base-zh and stella-large-zh use piccolo-base-zh and piccolo-large-zh as t

Training strategy:\
One iterator for each type of data, separately calculating the loss.
1103 |
|
|
|
|
|
1104 |
## Metric
|
1105 |
|
1106 |
-
#### C-MTEB leaderboard
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1107 |
|
1108 |
-
|
|
|
|
|
|
|
1109 |
|
1110 |
-
|
1111 |
-
|
1112 |
-
|
1113 |
-
|
1114 |
-
|
1115 |
-
|
1116 |
-
|
1117 |
-
|
1118 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1119 |
|
1120 |
#### Evaluation for long text

@@ -1159,29 +1238,31 @@ The passage is more than 800 characters long, longer than 512, but for this question only

| Multifieldqa_zh | 81.41 | 83.92 | 83.92 | 83.42 | 79.9 | 80.4 |
| **Average** | 74.98 | 74.83 | 74.76 | 76.15 | **78.96** | **78.24** |

**Note:** Because the amount of long-text evaluation data is small, the train split was also used when constructing it. If you evaluate on your own, pay attention to each model's training data to avoid data leakage.

## Usage

Usage with the sentence-transformers library:

```python
# For short-to-short datasets, the following is the general usage.
from sentence_transformers import SentenceTransformer

sentences = ["数据1", "数据2"]
model = SentenceTransformer('infgrad/stella-base-zh')
print(model.max_seq_length)
embeddings_1 = model.encode(sentences, normalize_embeddings=True)
embeddings_2 = model.encode(sentences, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# For short-to-long (retrieval) datasets, adding an instruction prefix is recommended to help the model retrieve better.
# Note that the colon in the instruction is an English (ASCII) colon.
```

Using the transformers library directly:

@@ -1190,8 +1271,8 @@ print(similarity)

```python
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

model = AutoModel.from_pretrained('infgrad/stella-base-zh')
tokenizer = AutoTokenizer.from_pretrained('infgrad/stella-base-zh')
sentences = ["数据1", "数据ABCDEFGH"]
batch_data = tokenizer(
    batch_text_or_text_pairs=sentences,
```

@@ -1208,6 +1289,46 @@ vectors = normalize(vectors, norm="l2", axis=1, )

```python
print(vectors.shape)  # 2,768
```
## Training Detail

**Hardware:** a single A100-80GB GPU

@@ -1218,13 +1339,12 @@ print(vectors.shape) # 2,768

**batch_size:** 1024 for the base model, plus an extra 20% hard negatives; 768 for the large model, plus an extra 20% hard negatives

**Data volume:**

## ToDoList

**Stability of the evaluation:**
During evaluation, the Clustering task results were found to be inconsistent with the official ones, by roughly ±0.0x.
It is still hard to explain why they are not exactly identical; the bge and piccolo model series show the same issue, and my guess is that it is related to the environment, such as the libraries used and the batch_size.

**Higher-quality long-text training and test data:** Most of the training data was constructed with a 13b model, so it is bound to contain noise.
The test data was mostly compiled from MRC datasets, so the questions are all factoid-style and do not match the real-world distribution.

@@ -1245,4 +1365,3 @@ print(vectors.shape) # 2,768

9. https://github.com/THUDM/LongBench

## stella model

**News**

**[2023-10-19]** Released stella-base-en-v2. The model is easy to use and **does not need any prefix text**.\
**[2023-10-12]** Released stella-base-zh-v2 and stella-large-zh-v2. The two models have better performance and **do not need any prefix text**.\
**[2023-09-11]** Released stella-base-zh and stella-large-zh.

stella is a general-purpose text-encoding model; it mainly includes the following models:

| Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
|:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
| stella-base-en-v2 | 0.2 | 768 | 512 | English | No |
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | Chinese | No |
| stella-base-zh-v2 | 0.2 | 768 | 1024 | Chinese | No |
| stella-large-zh | 0.65 | 1024 | 1024 | Chinese | Yes |
| stella-base-zh | 0.2 | 768 | 1024 | Chinese | Yes |

The complete training approach and process are documented in [blog post 1](https://zhuanlan.zhihu.com/p/655322183) and [blog post 2](https://zhuanlan.zhihu.com/p/662209559); you are welcome to read and discuss them.

**Training data:**

4. cosent loss[5] (a minimal sketch follows after this list)
5. One iterator for each type of data, computing the loss and updating the model separately for each
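
The CoSENT loss in item 4 ranks the cosine similarities of sentence pairs rather than regressing them directly. Below is a minimal, self-contained sketch of that loss; the function name, the `scale=20.0` value and the batching are illustrative assumptions, not the actual training code:

```python
import torch
import torch.nn.functional as F


def cosent_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                labels: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    """CoSENT ranking loss sketch.

    emb_a, emb_b: (batch, dim) embeddings of the two sentences in each pair.
    labels: (batch,) similarity labels; only their ordering matters.
    """
    # Scaled cosine similarity of each sentence pair.
    cos = F.cosine_similarity(emb_a, emb_b, dim=-1) * scale
    # For every index pair (i, j) with labels[i] < labels[j], penalize cos[i] - cos[j].
    diff = cos[:, None] - cos[None, :]
    mask = labels[:, None] < labels[None, :]
    diff = diff[mask]
    # loss = log(1 + sum(exp(diff))); the prepended zero supplies the "1 +" term.
    zero = torch.zeros(1, device=diff.device, dtype=diff.dtype)
    return torch.logsumexp(torch.cat([zero, diff]), dim=0)
```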

Building on the stella models, stella-v2 uses more training data and removes the need for a leading instruction (such as piccolo's `查询:` / `结果:` and e5's `query:` / `passage:`) through knowledge distillation and related methods.

**Initial weights:**\
stella-base-zh and stella-large-zh use piccolo-base-zh[6] and piccolo-large-zh respectively as their base models, and the position embeddings from 512 to 1024 are initialized with hierarchically decomposed position encoding[7].\
Many thanks to SenseTime Research for open-sourcing the [piccolo models](https://huggingface.co/sensenova).
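
Reference [7] describes hierarchically decomposed position encoding, which extends a learned position-embedding table beyond its trained length. The sketch below shows one common way to implement the idea; the `alpha = 0.4` value and the helper name are assumptions for illustration, not necessarily what was used for stella:

```python
import torch


def extend_position_embeddings(pos_emb: torch.Tensor, new_len: int,
                               alpha: float = 0.4) -> torch.Tensor:
    """Hierarchical decomposition sketch: position i is split as i = u * n + v,
    and its embedding is built from the trained embeddings of positions u and v.

    pos_emb: (n, dim) trained position embeddings (e.g. n = 512).
    new_len: target length (e.g. 1024); must satisfy new_len <= n * n.
    """
    n, dim = pos_emb.shape
    idx = torch.arange(new_len)
    u, v = idx // n, idx % n
    new_emb = alpha * pos_emb[u] + (1 - alpha) * pos_emb[v]
    # Keep the first n positions exactly as trained.
    new_emb[:n] = pos_emb
    return new_emb
```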

stella is a general-purpose text encoder, which mainly includes the following models:

| Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
|:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
| stella-base-en-v2 | 0.2 | 768 | 512 | English | No |
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | Chinese | No |
| stella-base-zh-v2 | 0.2 | 768 | 1024 | Chinese | No |
| stella-large-zh | 0.65 | 1024 | 1024 | Chinese | Yes |
| stella-base-zh | 0.2 | 768 | 1024 | Chinese | Yes |

The training data mainly includes:

Training strategy:\
One iterator for each type of data, separately calculating the loss.

Based on the stella models, stella-v2 uses more training data and removes the instruction prefixes through knowledge distillation, as sketched below.

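The README does not spell out the distillation setup, so the following is only a plausible sketch of how an instruction-free student could be distilled from an instruction-dependent teacher: the teacher encodes each text with its prefix, the student encodes the raw text, and the student is pushed to reproduce the teacher's vector. The model names, the cosine objective and the toy data are illustrative assumptions, and the snippet only computes the objective for one toy batch rather than running a training loop:

```python
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Teacher: a model that expects a prefix; student: the prefix-free model being trained.
teacher = SentenceTransformer("sensenova/piccolo-base-zh")  # assumption: any prefix-based encoder
student = SentenceTransformer("infgrad/stella-base-zh")     # assumption: student initialization

texts = ["如何缓解感冒症状", "北京的旅游景点有哪些"]  # illustrative queries
with torch.no_grad():
    target = teacher.encode(["查询: " + t for t in texts],
                            convert_to_tensor=True, normalize_embeddings=True)

student_emb = student.encode(texts, convert_to_tensor=True, normalize_embeddings=True)
# Distillation objective: make the prefix-free student embedding match the teacher's.
loss = 1 - F.cosine_similarity(student_emb, target, dim=-1).mean()
print(loss)
```
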
## Metric

#### C-MTEB leaderboard (Chinese)

| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (35) | Classification (9) | Clustering (4) | Pair Classification (2) | Reranking (4) | Retrieval (8) | STS (8) |
|:------------------:|:---------------:|:---------:|:---------------:|:------------:|:------------------:|:--------------:|:-----------------------:|:-------------:|:-------------:|:-------:|
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | 65.13 | 69.05 | 49.16 | 82.68 | 66.41 | 70.14 | 58.66 |
| stella-base-zh-v2 | 0.2 | 768 | 1024 | 64.36 | 68.29 | 49.4 | 79.95 | 66.1 | 70.08 | 56.92 |
| stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
| stella-base-zh | 0.2 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |

#### MTEB leaderboard (English)

| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) |
|:-----------------:|:---------------:|:---------:|:---------------:|:------------:|:-------------------:|:---------------:|:-----------------------:|:-------------:|:--------------:|:--------:|:------------------:|
| stella-base-en-v2 | 0.2 | 768 | 512 | 62.61 | 75.28 | 44.9 | 86.45 | 58.77 | 50.1 | 83.02 | 32.52 |

#### Reproduce our results

**C-MTEB:**

```python
import torch
import numpy as np
from typing import List
from mteb import MTEB
from sentence_transformers import SentenceTransformer


class FastTextEncoder():
    def __init__(self, model_name):
        self.model = SentenceTransformer(model_name).cuda().half().eval()
        self.model.max_seq_length = 512

    def encode(
            self,
            input_texts: List[str],
            *args,
            **kwargs
    ):
        # Deduplicate and sort by length (longest first) to speed up batched encoding.
        new_sens = list(set(input_texts))
        new_sens.sort(key=lambda x: len(x), reverse=True)
        vecs = self.model.encode(
            new_sens, normalize_embeddings=True, convert_to_numpy=True, batch_size=256
        ).astype(np.float32)
        # Restore the original input order before returning.
        sen2arrid = {sen: idx for idx, sen in enumerate(new_sens)}
        vecs = vecs[[sen2arrid[sen] for sen in input_texts]]
        torch.cuda.empty_cache()
        return vecs


if __name__ == '__main__':
    model_name = "infgrad/stella-base-zh-v2"
    output_folder = "zh_mteb_results/stella-base-zh-v2"
    task_names = [t.description["name"] for t in MTEB(task_langs=['zh', 'zh-CN']).tasks]
    model = FastTextEncoder(model_name)
    for task in task_names:
        MTEB(tasks=[task], task_langs=['zh', 'zh-CN']).run(model, output_folder=output_folder)
```

**MTEB:**

You can use the official script to reproduce our results: [scripts/run_mteb_english.py](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_english.py)

#### Evaluation for long text

| Multifieldqa_zh | 81.41 | 83.92 | 83.92 | 83.42 | 79.9 | 80.4 |
| **Average** | 74.98 | 74.83 | 74.76 | 76.15 | **78.96** | **78.24** |

**Note:** Because the amount of long-text evaluation data is small, the train split was also used when constructing it. If you evaluate on your own, pay attention to each model's training data to avoid data leakage.

## Usage

#### stella Chinese models

stella-base-zh and stella-large-zh: these models were trained on top of piccolo, so **their usage is exactly the same as piccolo's**: for retrieval and reranking tasks, prepend `查询: ` to the query and `结果: ` to the passage; for short-to-short matching, no prefix is needed (a retrieval example with these prefixes follows the snippet below).

stella-base-zh-v2 and stella-large-zh-v2: these models are simple to use and **do not need any prefix text in any scenario**.

All stella Chinese models use mean pooling to produce the text embedding.

Usage with the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer

sentences = ["数据1", "数据2"]
model = SentenceTransformer('infgrad/stella-base-zh-v2')
print(model.max_seq_length)
embeddings_1 = model.encode(sentences, normalize_embeddings=True)
embeddings_2 = model.encode(sentences, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
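
For the v1 Chinese models (stella-base-zh / stella-large-zh) on retrieval tasks, the `查询: ` and `结果: ` prefixes described above are simply prepended to the texts before encoding. A minimal sketch, where the query and passage strings are made-up examples:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('infgrad/stella-base-zh')
# Note the English (ASCII) colon and trailing space in both prefixes.
queries = ["查询: " + q for q in ["北京有哪些好玩的地方"]]
passages = ["结果: " + p for p in ["北京的旅游景点很多,比如故宫和长城。"]]
query_emb = model.encode(queries, normalize_embeddings=True)
passage_emb = model.encode(passages, normalize_embeddings=True)
print(query_emb @ passage_emb.T)  # cosine similarities, since the embeddings are normalized
```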

Using the transformers library directly:

```python
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

model = AutoModel.from_pretrained('infgrad/stella-base-zh-v2')
tokenizer = AutoTokenizer.from_pretrained('infgrad/stella-base-zh-v2')
sentences = ["数据1", "数据ABCDEFGH"]
batch_data = tokenizer(
    batch_text_or_text_pairs=sentences,
```

*(The unchanged middle of this snippet is not shown in the diff; the complete pattern appears in the English example below.)*

```python
print(vectors.shape)  # 2,768
```

#### stella models for English

**Using Sentence-Transformers:**

```python
from sentence_transformers import SentenceTransformer

sentences = ["one car come", "one car go"]
model = SentenceTransformer('infgrad/stella-base-en-v2')
print(model.max_seq_length)
embeddings_1 = model.encode(sentences, normalize_embeddings=True)
embeddings_2 = model.encode(sentences, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```

**Using HuggingFace Transformers:**

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

model = AutoModel.from_pretrained('infgrad/stella-base-en-v2')
tokenizer = AutoTokenizer.from_pretrained('infgrad/stella-base-en-v2')
sentences = ["one car come", "one car go"]
batch_data = tokenizer(
    batch_text_or_text_pairs=sentences,
    padding="longest",
    return_tensors="pt",
    max_length=512,
    truncation=True,
)
attention_mask = batch_data["attention_mask"]
with torch.no_grad():  # disable gradient tracking so the vectors can be converted to NumPy
    model_output = model(**batch_data)
# Mean pooling over non-padding tokens, then L2 normalization.
last_hidden = model_output.last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
vectors = normalize(vectors, norm="l2", axis=1)
print(vectors.shape)  # 2,768
```

## Training Detail

**Hardware:** a single A100-80GB GPU

**batch_size:** 1024 for the base model, plus an extra 20% hard negatives; 768 for the large model, plus an extra 20% hard negatives

**Data volume:** roughly 1 million samples for the first-version models, of which about 200K were constructed with an LLM (a 13b model); the v2 models were trained on 20 million samples.

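The batch description above (query-passage pairs plus roughly 20% extra hard negatives) can be pictured with a small sketch. Everything below — the tensor names, the InfoNCE-style objective and the temperature — is an assumption used only to illustrate how appended hard negatives enlarge the candidate set; it is not the actual training code:

```python
import torch
import torch.nn.functional as F

batch = 1024              # query-passage pairs in one batch (base model)
extra = int(0.2 * batch)  # ~20% additional mined hard-negative passages
dim = 768

query_emb = F.normalize(torch.randn(batch, dim), dim=-1)
passage_emb = F.normalize(torch.randn(batch + extra, dim), dim=-1)  # positives + hard negatives

# Similarity of every query against every passage: the diagonal entries are the
# positives; all other columns, including the appended hard negatives, act as negatives.
logits = query_emb @ passage_emb.T / 0.05  # 0.05 temperature is an assumption
labels = torch.arange(batch)
loss = F.cross_entropy(logits, labels)
print(loss)
```
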
## ToDoList

**Stability of the evaluation:**
During evaluation, the Clustering task results differ slightly from the official ones, by roughly ±0.0x. The reason is that the clustering code does not set a random_seed; the gap is negligible and does not affect the conclusions.

**Higher-quality long-text training and test data:** Most of the training data was constructed with a 13b model, so it is bound to contain noise.
The test data was mostly compiled from MRC datasets, so the questions are all factoid-style and do not match the real-world distribution.

9. https://github.com/THUDM/LongBench