<h1 align="center">
Long Bert Chinese
<br>
</h1>

<h4 align="center">
<p>
<b>简体中文</b> |
<a href="https://github.com/OctopusMind/long-bert-chinese/blob/main/README_EN.md">English</a>
</p>
</h4>

<p>
<br>
</p>
**Long Bert**: a long-text similarity model supporting sequence lengths of up to 8192 tokens.

Built on bert-base-chinese, it replaces BERT's original positional encoding with ALiBi positional encoding, allowing BERT to handle sequences up to 8192 tokens long.
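The ALiBi swap described above replaces learned absolute position embeddings with a linear distance penalty added directly to the attention logits, which is what lets the model extrapolate beyond its training length. A minimal sketch of the bias computation (a generic bidirectional, symmetric variant using the geometric head slopes from the ALiBi paper; not this repository's actual implementation):

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Symmetric ALiBi bias added to attention logits (bidirectional variant)."""
    # Geometric head slopes from the ALiBi paper (exact for power-of-two head counts)
    slopes = np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    pos = np.arange(seq_len)
    dist = np.abs(pos[None, :] - pos[:, None])  # |i - j| distance matrix
    # Each head penalizes distant positions linearly, with its own slope
    return -slopes[:, None, None] * dist[None, :, :]

bias = alibi_bias(seq_len=8, num_heads=12)
print(bias.shape)  # (12, 8, 8)
```

Because the penalty depends only on relative distance, the same bias formula applies unchanged at any sequence length, including lengths never seen during training.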
### News

* Supports `CoSENT` fine-tuning
* The model has been uploaded to [Huggingface](https://huggingface.co/OctopusMind/LongBert)
### Usage

```python
from numpy.linalg import norm
from transformers import AutoModel

model_path = "OctopusMind/longbert-8k-zh"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

sentences = ["我是问蚂蚁借呗为什么不能提前结清欠款", "为什么借呗不能选择提前还款"]
embeddings = model.encode(sentences)

# Cosine similarity between the two sentence embeddings
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
print(cos_sim(embeddings[0], embeddings[1]))
```
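The `cos_sim` helper above scores one pair at a time; for more sentences it is convenient to L2-normalize the embedding matrix once and obtain all pairwise similarities from a single matrix product. A sketch with stand-in vectors (in practice the input would be the output of `model.encode(sentences)`):

```python
import numpy as np
from numpy.linalg import norm

def cos_sim_matrix(embeddings):
    # L2-normalize rows; pairwise cosine similarity is then one matmul
    e = np.asarray(embeddings, dtype=float)
    e = e / norm(e, axis=1, keepdims=True)
    return e @ e.T

# Stand-in embeddings; replace with model.encode(sentences)
emb = np.random.default_rng(0).normal(size=(3, 768))
print(cos_sim_matrix(emb).shape)  # (3, 3); diagonal entries are 1.0
```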
## Fine-tuning

### Data format

```json
[
    {
        "sentence1": "一个男人在吹一支大笛子。",
        "sentence2": "一个人在吹长笛。",
        "label": 3
    },
    {
        "sentence1": "三个人在下棋。",
        "sentence2": "两个人在下棋。",
        "label": 2
    },
    {
        "sentence1": "一个女人在写作。",
        "sentence2": "一个女人在游泳。",
        "label": 0
    }
]
```
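Before launching a fine-tuning run, it can be worth checking that every record carries the three fields shown above. A small hypothetical helper (`validate_pairs` is not part of this repository):

```python
import json

def validate_pairs(records):
    """Check each record against the expected {sentence1, sentence2, label} format."""
    for i, row in enumerate(records):
        missing = {"sentence1", "sentence2", "label"} - set(row)
        if missing:
            raise ValueError(f"record {i} is missing fields: {sorted(missing)}")
        if not isinstance(row["label"], int):
            raise TypeError(f"record {i}: label must be an integer score")
    return True

# Example: validate the training file before passing it to the trainer
# with open("../data/train_data.json", encoding="utf-8") as f:
#     validate_pairs(json.load(f))
```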
### CoSENT fine-tuning

Change to the `train/` directory:

```bash
cd train/
```

Run CoSENT fine-tuning:

```bash
python cosent_finetune.py \
    --data_dir ../data/train_data.json \
    --output_dir ./outputs/my-model \
    --max_seq_length 1024 \
    --num_epochs 10 \
    --batch_size 64 \
    --learning_rate 2e-5
```
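`cosent_finetune.py` is not reproduced here, but the CoSENT objective it is named after is a ranking loss: for every pair of examples whose labels differ, the higher-labeled pair's cosine similarity is pushed above the lower-labeled one's. A minimal numpy sketch with the commonly used scale factor of 20 (illustrative only, not the training script's code):

```python
import numpy as np

def cosent_loss(cos_sims, labels, scale=20.0):
    """CoSENT ranking loss: log(1 + sum of exp-margin terms over ordered pairs)."""
    cos_sims = np.asarray(cos_sims, dtype=float)
    labels = np.asarray(labels)
    # For each (i, j) with labels[i] > labels[j], penalize cos_sims[j] >= cos_sims[i]
    terms = [scale * (cos_sims[j] - cos_sims[i])
             for i in range(len(labels)) for j in range(len(labels))
             if labels[i] > labels[j]]
    terms.append(0.0)  # the "1 +" inside the log
    m = max(terms)     # stabilized log-sum-exp
    return m + np.log(np.sum(np.exp(np.array(terms) - m)))

# Well-ordered similarities (higher label, higher similarity) give a near-zero loss
print(cosent_loss([0.9, 0.2], [1, 0]))
```

Because the loss only compares similarities across pairs, it optimizes the ordering of cosine scores rather than their absolute values, which suits similarity data with ordinal labels like the 0-3 scores above.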
## Contributing

Contributions to this project are welcome via pull requests or issues in the repository.

## License

This project is released under the [Apache-2.0 license](./LICENSE).