Update README.md
---
license: gpl-3.0
tags:
- text2text-generation
pipeline_tag: text2text-generation
language:
- zh
- en
---

Considering LLaMA's license constraints, this model is for research and learning purposes only.
Please strictly respect LLaMA's usage policy. We are not allowed to publish the LLaMA weights, of course, even finetuned ones, but there is no problem publishing the difference: a patch that we suggest applying to the original files.
The encryption is a simple XOR between files, ensuring that only people who have access to the original weights (from completely legal sources, of course) can transform them into the finetuned weights.
You can find the decryption code at https://github.com/LianjiaTech/BELLE/tree/main/models.
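For intuition only, the sketch below shows what such an XOR patch could look like. The real conversion must use the official `decrypt.py` from the repository above; the byte-wise, key-cycling scheme and the `xor_restore` helper here are illustrative assumptions, not the script's actual implementation.

```python
# Conceptual sketch only -- use the official decrypt.py from the BELLE repo for real conversion.
# Assumption: the patch is a byte-wise XOR against the original weight file, cycling the key
# bytes if the lengths differ. Reads whole files into memory, so the real script likely streams.
from itertools import cycle
from pathlib import Path

def xor_restore(patch_path: str, key_path: str, out_path: str) -> None:
    patch = Path(patch_path).read_bytes()   # the downloaded *.enc file
    key = Path(key_path).read_bytes()       # the original LLaMA weight file
    # XOR every patch byte with the corresponding (cycled) key byte.
    restored = bytes(p ^ k for p, k in zip(patch, cycle(key)))
    Path(out_path).write_bytes(restored)
```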


# Model Card for Model ID

## Welcome
If you find this model helpful, please *like* this model and star us on https://github.com/LianjiaTech/BELLE !

## Model description
This model comes from a two-phase training on the original LLaMA 13B:

1. Extending the vocabulary with an additional 50K Chinese-specific tokens and further pretraining these word embeddings on a Chinese corpus.
2. Full-parameter finetuning of the model with 4M high-quality instruction-following examples.


## Download, Convert & Check
1. After you git clone this model, check the MD5 checksums of the encrypted files:
```
md5sum ./*
211b6252c73e638cb87e04edef1c91c6 config.json.7b4504868ddce248768954077a76ffe29a34c6cc2b4510426b4da77d1e9afb4c.enc
f9b33d359f17a437f6c24b4de6f2272e generation_config.json.fd7ff399e5568cc21a0a8414f43df88ef7c424995b9b97a90563165d2cf79efd.enc
07efffcfb738722f00c9b7ac81044bb9 pytorch_model-00001-of-00003.bin.1a523c0d01807d7fcde8d73537f09e346ff303a4769b8a6659114358621fc838.enc
fe66f8672c07e9e5bdfec4dd45e1e093 pytorch_model-00002-of-00003.bin.98e48fb6812bb87843c7276a85ed34124f67df5654d8cf0b6bb9302ecfe3a37f.enc
b3b4a0f1d6b399543d3d7ac50f9ce936 pytorch_model-00003-of-00003.bin.79921900f30a9ec501177fca2f593f90cb9f5ab235c05863cc4d384450cf3f6f.enc
7aef01bb265647be2a9acd1c7ea69bd8 pytorch_model.bin.index.json.af10ab40cc0368fba37018148447e3dcd9b72829a38e26c9eaf3eda3a7850b56.enc
34696bfce7b27548cfc2410e2b55762e special_tokens_map.json.96bdbb8504d9967606e5f661ccc7cbbac44a3661af863a7a58614670a0ccab33.enc
24e4f14cc3330576dcd1fd12760d35f3 tokenizer_config.json.2e333c3e1c77e7e9c6ceb573b02355deaf303ca8180bbac40f1d0405209ee457.enc
56724a79091f3d1877cca65c6412d646 tokenizer.model.0b716a618c9e7c45648f91d997431eba3b0ff111b17ce7b777280ed771a49f95.enc
```

2. Decrypt the files using the scripts in https://github.com/LianjiaTech/BELLE/tree/main/models

You can use the following command in Bash.
Please replace "/path/to_encrypted" with the path where you stored the encrypted files,
replace "/path/to_original_llama_13B" with the path where you stored the original LLaMA 13B files,
and replace "/path/to_finetuned_model" with the path where you want to save the final finetuned model.

```bash
mkdir /path/to_finetuned_model
for f in "/path/to_encrypted"/*; do
  if [ -f "$f" ]; then
    # decrypt each encrypted file, using the original checkpoint as the key
    python3 decrypt.py "$f" "/path/to_original_llama_13B/consolidated.00.pth" "/path/to_finetuned_model/"
  fi
done
```

After executing the aforementioned command, you will obtain the following files.

```
./config.json
./generation_config.json
./pytorch_model-00001-of-00003.bin
./pytorch_model-00002-of-00003.bin
./pytorch_model-00003-of-00003.bin
./pytorch_model.bin.index.json
./special_tokens_map.json
./tokenizer_config.json
./tokenizer.model
```

3. Check md5sum

You can verify the integrity of these files by checking their MD5 checksums to ensure they were completely recovered.
Here are the MD5 checksums for the relevant files:
```
md5sum ./*
1e28fe60969b1d4dcc3f97586082c5e5 config.json
2917a1cafb895cf57e746cfd7696bfe5 generation_config.json
2a8deacda3e22be63fe854da92006203 pytorch_model-00001-of-00003.bin
1bab042c86403f440517c8ae958716ed pytorch_model-00002-of-00003.bin
6fbd17996033fb5ec0263cdb07131de7 pytorch_model-00003-of-00003.bin
5762c0c9a1ca9366500390d0d335b2b6 pytorch_model.bin.index.json
15f7a943faa91a794f38dd81a212cb01 special_tokens_map.json
b87fab00f218c984135af5a0db353f22 tokenizer_config.json
6ffe559392973a92ea28032add2a8494 tokenizer.model
```
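
If you prefer to verify from Python instead of with `md5sum`, a minimal sketch along these lines works; `MODEL_DIR` and the `EXPECTED` table are placeholders that you fill in with your decrypted-model path and the checksums listed above.

```python
# Minimal MD5 verification sketch; MODEL_DIR and EXPECTED are placeholders to fill in.
import hashlib
from pathlib import Path

MODEL_DIR = Path("/path/to_finetuned_model")
EXPECTED = {
    "config.json": "1e28fe60969b1d4dcc3f97586082c5e5",
    "tokenizer.model": "6ffe559392973a92ea28032add2a8494",
    # ... add the remaining files from the list above
}

for name, expected_md5 in EXPECTED.items():
    h = hashlib.md5()
    with open(MODEL_DIR / name, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            h.update(chunk)
    status = "OK" if h.hexdigest() == expected_md5 else "MISMATCH"
    print(f"{name}: {status}")
```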

## Use model
Please note that the input should be formatted as follows in both **training** and **inference**.
```
Human: {input} \n\nBelle:
```
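
To keep this template identical in data preparation and inference, a tiny helper like the following can be used; `build_prompt` is only an illustrative name, not part of the released code.

```python
# Illustrative helper (not part of the released code) for applying the required template.
def build_prompt(user_input: str) -> str:
    # The trailing space after "Belle:" matches the prompt used in the example below.
    return f"Human: {user_input} \n\nBelle: "

# "Write a Chinese song in praise of nature"
print(build_prompt("写一首中文歌曲,赞美大自然"))
```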


After you decrypt the files, BELLE-LLaMA-EXT-13B can be easily loaded with LlamaForCausalLM.
```python
from transformers import LlamaForCausalLM, AutoTokenizer
import torch

ckpt = '/path/to_finetuned_model/'
device = torch.device('cuda')
model = LlamaForCausalLM.from_pretrained(ckpt, device_map='auto', low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
# "Write a Chinese song in praise of nature"
prompt = "Human: 写一首中文歌曲,赞美大自然 \n\nBelle: "
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generate_ids = model.generate(
    input_ids, max_new_tokens=300, do_sample=True, top_k=30, top_p=0.85,
    temperature=0.5, repetition_penalty=1.2, eos_token_id=2, bos_token_id=1, pad_token_id=0,
)
output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
# Strip the prompt to keep only the model's reply.
response = output[len(prompt):]
print(response)
```
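
Loading a 13B model in full precision needs considerable GPU memory; if that is an issue, one common option (an assumption on our part, not something this card prescribes) is to load the weights in half precision via `torch_dtype`:

```python
# Optional: load the weights in half precision to reduce GPU memory usage.
# torch_dtype is a standard transformers from_pretrained argument; the exact savings
# depend on your hardware.
import torch
from transformers import LlamaForCausalLM

model = LlamaForCausalLM.from_pretrained(
    '/path/to_finetuned_model/',
    device_map='auto',
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
)
```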


## Limitations
There still exist a few issues with a model trained on the current base model and data:

1. The model may produce factual errors when asked to follow instructions involving factual knowledge.

2. It occasionally generates harmful responses, since it still struggles to identify potentially harmful instructions.

3. It still needs improvement in reasoning and coding.

Since the model still has these limitations, we require that developers use the open-sourced code, data, model, and any other artifacts generated via this project for research purposes only. Commercial use and other potentially harmful use cases are not allowed.


## Citation

Please cite our paper and GitHub repository when using our code, data, or model.

```
@misc{ji2023better,
  title={Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation},
  author={Yunjie Ji and Yan Gong and Yong Deng and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
  year={2023},
  eprint={2304.07854},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@misc{BELLE,
  author = {Yunjie Ji and Yong Deng and Yan Gong and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
  title = {BELLE: Be Everyone's Large Language model Engine},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/LianjiaTech/BELLE}},
}
```