---
license: gpl-3.0
tags:
- text2text-generation
pipeline_tag: text2text-generation
language:
- zh
- en
---

Considering LLaMA's license constraints, this model is released for research and learning purposes only.
Please strictly respect LLaMA's usage policy. We are not allowed to publish the LLaMA weights themselves, even finetuned ones, but there is no problem publishing the difference: a patch that we suggest you apply to the original files.
The encryption is a simple XOR between the finetuned and the original weight files, which ensures that only people who already have access to the original weights (from completely legal sources, of course) can transform them into the finetuned weights.
You can find the decryption code at https://github.com/LianjiaTech/BELLE/tree/main/models .
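
As a rough illustration of why this works (this is **not** the official `decrypt.py`; the real script in the BELLE repository also handles chunked reading, file naming, and integrity checks), the core of an XOR patch is just:

```python
# Hypothetical illustration only -- run the official decrypt.py from the BELLE repo on real files.
# Core idea of the patch: finetuned_bytes = encrypted_bytes XOR original_bytes,
# so the published .enc files are useless without the original LLaMA weights.
def xor_bytes(encrypted: bytes, original: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(encrypted, original))

# The patch itself was produced the same way: encrypted_bytes = finetuned_bytes XOR original_bytes.
```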

# Model Card for Model ID

## Welcome
If you find this model helpful, please *like* this model and star us on https://github.com/LianjiaTech/BELLE !

## Model description
This model comes from two-phase training on the original LLaMA 13B:
1. Extending the vocabulary with an additional 50K Chinese-specific tokens and further pretraining these word embeddings on a Chinese corpus (see the sketch after this list).
2. Full-parameter finetuning of the model on 4M high-quality instruction-following examples.
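
The vocabulary-extension step can be pictured with the standard Hugging Face APIs. This is a simplified sketch, not the actual BELLE training code; the checkpoint path and token list are placeholders:

```python
from transformers import LlamaForCausalLM, LlamaTokenizer

# Simplified sketch of phase 1 (not the actual BELLE training code).
tokenizer = LlamaTokenizer.from_pretrained("/path/to_original_llama_hf")  # placeholder path
model = LlamaForCausalLM.from_pretrained("/path/to_original_llama_hf")

# Placeholder token list; the real ~50K tokens are derived from a Chinese corpus.
chinese_tokens = ["中文", "你好"]
num_added = tokenizer.add_tokens(chinese_tokens)

# Grow the embedding matrix so the new tokens get (initially random) embeddings,
# which are then pretrained on Chinese text before full-parameter finetuning.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens, new vocab size = {len(tokenizer)}")
```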

## Download, Convert & Check
1. After you git clone this model, verify the MD5 checksums of the encrypted files:
```
md5sum ./*
211b6252c73e638cb87e04edef1c91c6 config.json.7b4504868ddce248768954077a76ffe29a34c6cc2b4510426b4da77d1e9afb4c.enc
f9b33d359f17a437f6c24b4de6f2272e generation_config.json.fd7ff399e5568cc21a0a8414f43df88ef7c424995b9b97a90563165d2cf79efd.enc
07efffcfb738722f00c9b7ac81044bb9 pytorch_model-00001-of-00003.bin.1a523c0d01807d7fcde8d73537f09e346ff303a4769b8a6659114358621fc838.enc
fe66f8672c07e9e5bdfec4dd45e1e093 pytorch_model-00002-of-00003.bin.98e48fb6812bb87843c7276a85ed34124f67df5654d8cf0b6bb9302ecfe3a37f.enc
b3b4a0f1d6b399543d3d7ac50f9ce936 pytorch_model-00003-of-00003.bin.79921900f30a9ec501177fca2f593f90cb9f5ab235c05863cc4d384450cf3f6f.enc
7aef01bb265647be2a9acd1c7ea69bd8 pytorch_model.bin.index.json.af10ab40cc0368fba37018148447e3dcd9b72829a38e26c9eaf3eda3a7850b56.enc
34696bfce7b27548cfc2410e2b55762e special_tokens_map.json.96bdbb8504d9967606e5f661ccc7cbbac44a3661af863a7a58614670a0ccab33.enc
24e4f14cc3330576dcd1fd12760d35f3 tokenizer_config.json.2e333c3e1c77e7e9c6ceb573b02355deaf303ca8180bbac40f1d0405209ee457.enc
56724a79091f3d1877cca65c6412d646 tokenizer.model.0b716a618c9e7c45648f91d997431eba3b0ff111b17ce7b777280ed771a49f95.enc
```

2. Decrypt the files using the scripts in https://github.com/LianjiaTech/BELLE/tree/main/models .

You can use the following command in Bash.
Please replace "/path/to_encrypted" with the path where you stored the encrypted files,
replace "/path/to_original_llama_7B" with the path where you stored the original LLaMA weights,
and replace "/path/to_finetuned_model" with the path where you want to save the final finetuned model.

```bash
mkdir /path/to_finetuned_model
for f in "/path/to_encrypted"/*; do
  if [ -f "$f" ]; then
    python3 decrypt.py "$f" "/path/to_original_llama_7B/consolidated.00.pth" "/path/to_finetuned_model/"
  fi
done
```

After running the command above, you will obtain the following files.

```
./config.json
./generation_config.json
./pytorch_model-00001-of-00003.bin
./pytorch_model-00002-of-00003.bin
./pytorch_model-00003-of-00003.bin
./pytorch_model.bin.index.json
./special_tokens_map.json
./tokenizer_config.json
./tokenizer.model
```

3. Check md5sum

Verify the integrity of the decrypted files with an MD5 checksum to make sure they were recovered correctly.
Here are the MD5 checksums for the relevant files:
```
md5sum ./*
1e28fe60969b1d4dcc3f97586082c5e5 config.json
2917a1cafb895cf57e746cfd7696bfe5 generation_config.json
2a8deacda3e22be63fe854da92006203 pytorch_model-00001-of-00003.bin
1bab042c86403f440517c8ae958716ed pytorch_model-00002-of-00003.bin
6fbd17996033fb5ec0263cdb07131de7 pytorch_model-00003-of-00003.bin
5762c0c9a1ca9366500390d0d335b2b6 pytorch_model.bin.index.json
15f7a943faa91a794f38dd81a212cb01 special_tokens_map.json
b87fab00f218c984135af5a0db353f22 tokenizer_config.json
6ffe559392973a92ea28032add2a8494 tokenizer.model
```
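
If you prefer to check the hashes programmatically, a small helper like the following works (a hypothetical convenience script, not part of the BELLE repository; compare the output against the list above):

```python
import hashlib
import sys

# Hypothetical helper (not from the BELLE repo).
# Usage: python3 check_md5.py <file> <expected_md5>
def md5_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    path, expected = sys.argv[1], sys.argv[2]
    actual = md5_of(path)
    status = "OK" if actual == expected else f"MISMATCH, expected {expected}"
    print(f"{actual}  {path}  ({status})")
```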

## Use model
Please note that the input should be formatted as follows in both **training** and **inference**.
```
Human: {input} \n\nBelle:
```
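
For convenience, you can wrap this format in a tiny helper (hypothetical, not part of the BELLE repo):

```python
# Hypothetical helper: wrap a user query in the prompt format expected by the model.
def make_prompt(user_input: str) -> str:
    return f"Human: {user_input} \n\nBelle: "

print(make_prompt("你好"))
```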

After you decrypt the files, BELLE-LLaMA-EXT-13B can be easily loaded with LlamaForCausalLM.
```python
from transformers import LlamaForCausalLM, AutoTokenizer
import torch

ckpt = '/path/to_finetuned_model/'
device = torch.device('cuda')
# device_map='auto' shards the 13B weights across the available GPUs.
model = LlamaForCausalLM.from_pretrained(ckpt, device_map='auto', low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
# The prompt must follow the "Human: {input} \n\nBelle: " format used during training.
prompt = "Human: 写一首中文歌曲,赞美大自然 \n\nBelle: "  # "Write a Chinese song in praise of nature"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
generate_ids = model.generate(input_ids, max_new_tokens=300, do_sample=True, top_k=30, top_p=0.85, temperature=0.5, repetition_penalty=1.2, eos_token_id=2, bos_token_id=1, pad_token_id=0)
output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
# Strip the echoed prompt to keep only the model's reply.
response = output[len(prompt):]
print(response)
```


## Limitations
There are still a few issues with the model trained on the current base model and data:

1. The model may generate factual errors when asked to follow instructions involving factual knowledge.

2. It occasionally generates harmful responses, since the model still struggles to identify potentially harmful instructions.

3. It still needs improvement on reasoning and coding tasks.

Since the model still has these limitations, we require that developers use the open-sourced code, data, model, and any other artifacts generated via this project for research purposes only. Commercial use and other potentially harmful use cases are not allowed.


## Citation

Please cite our paper and GitHub repository when using our code, data, or model.

```
@misc{ji2023better,
  title={Towards Better Instruction Following Language Models for Chinese: Investigating the Impact of Training Data and Evaluation},
  author={Yunjie Ji and Yan Gong and Yong Deng and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
  year={2023},
  eprint={2304.07854},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
@misc{BELLE,
  author = {Yunjie Ji and Yong Deng and Yan Gong and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
  title = {BELLE: Be Everyone's Large Language model Engine},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/LianjiaTech/BELLE}},
}
```