jay68 committed on
Commit
a542f32
·
1 Parent(s): 07b98e9

Create README.md

  1. README.md +150 -0
README.md ADDED
@@ -0,0 +1,150 @@
+ ---
+ license: gpl-3.0
+ tags:
+ - text2text-generation
+ pipeline_tag: text2text-generation
+ language:
+ - zh
+ - en
+ ---
+
+ Because of LLaMA's license constraints, this model is for research and learning only.
+ Please strictly respect LLaMA's usage policy. We are not allowed to publish LLaMA weights, even finetuned ones, but we can publish the difference: a patch to apply to the original files.
+ The encryption is a simple XOR between files, which ensures that only people who have access to the original weights (from completely legal sources, of course) can transform them into the finetuned weights.
+ The decryption code is available at https://github.com/LianjiaTech/BELLE/tree/main/models
+
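To illustrate the XOR idea, here is a minimal sketch. It is not the project's `decrypt.py`: the function name `xor_bytes`, the sample byte strings, and the choice to repeat the key when it is shorter than the data are all assumptions made for illustration. It only shows why someone holding the original file can rebuild the finetuned one from the published patch.

```python
from itertools import cycle


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR each byte of `data` with the matching byte of `key`.

    The key is repeated if it is shorter than the data (an assumption
    of this sketch, not necessarily what the real script does).
    """
    if not key:
        raise ValueError("key must not be empty")
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


# Hypothetical stand-ins for the original and finetuned weight files.
original = b"original-weights"
finetuned = b"finetuned-weights!"

patch = xor_bytes(finetuned, original)   # what would be published
recovered = xor_bytes(patch, original)   # what a holder of the originals rebuilds
assert recovered == finetuned
```

Because XOR is its own inverse, the same function both produces the patch and undoes it; without the original bytes the patch alone does not reveal the finetuned file.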
+
+ # Model Card for BELLE-LLAMA-7B-2M-enc
+
+ ## Welcome
+ If you find this model helpful, please *like* it and star our repository: https://github.com/LianjiaTech/BELLE
+
+ ## Model description
+ For details, see [Towards Better Instruction Following Language Models for Chinese](https://github.com/LianjiaTech/BELLE/blob/main/docs/Towards%20Better%20Instruction%20Following%20Language%20Models%20for%20Chinese.pdf).
+
+ ## Training hyper-parameters
+ | Parameter | Value |
+ | ------ | ------ |
+ | Batch size | 16 |
+ | Learning rate | 5e-6 |
+ | Epochs | 3 |
+ | Weight decay | 0.0 |
+ | Warmup rate | 0.03 |
+ | LR scheduler | cosine |
+
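One way to read the learning rate, warmup rate, and scheduler rows of the table is sketched below. This is an interpretation under stated assumptions, not the project's training code: it assumes linear warmup from zero over the first 3% of steps followed by cosine decay to zero, and `total_steps = 1000` is a hypothetical value chosen only for the demonstration.

```python
import math


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 5e-6, warmup_rate: float = 0.03) -> float:
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    warmup_steps = round(total_steps * warmup_rate)
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr during warmup.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from peak_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 15, 30, 500, 1000):
    print(step, lr_at_step(step, total_steps))
```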
+ ## Download, Convert & Check
+ 1. After you git clone this model, check the MD5 checksums of the encrypted files:
+ ```
+ md5sum ./*
+ 45afa71e3067de5119233a57ef9d093d ./config.json.99a4ef2a26cb38c7f684cb83ed9343f660c561dd5a02a97d1b34b47419324dc5.enc
+ f9b33d359f17a437f6c24b4de6f2272e ./generation_config.json.fd7ff399e5568cc21a0a8414f43df88ef7c424995b9b97a90563165d2cf79efd.enc
+ 172013287b452114abf5c0e64936f45b ./pytorch_model-00001-of-00002.bin.166879223b7504f1632d72b1577d57bceaa8fdeee1857c61119e575c50a4aae5.enc
+ 384f8dc3b6da063c5f7554c52c531c44 ./pytorch_model-00002-of-00002.bin.2319db050dc286cb22c6e08a51a4ec0d9377017a7182a20a12c39eb658f39c80.enc
+ 2ac1e5262eefd012918724d68813d03e ./pytorch_model.bin.index.json.f56e69fedde5d28e4f37f2b62f74e8522bbfa13395a6d696d1ef99222a431ab7.enc
+ c066b68b4139328e87a694020fc3a6c3 ./special_tokens_map.json.ca3d163bab055381827226140568f3bef7eaac187cebd76878e0b63e9e442356.enc
+ 2d5d4156fd237fceae85f28d06751020 ./tokenizer_config.json.a672113277a674d753b5cdcfa6bfc860dc69bfcc5511bdccb0c6af3ed08873a0.enc
+ 39ec1b33fbf9a0934a8ae0f9a24c7163 ./tokenizer.model.9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347.enc
+ ```
+
+ 2. Decrypt the files using the scripts in https://github.com/LianjiaTech/BELLE/tree/main/models
+
+ You can use the following command in Bash.
+ Replace "/path/to_encrypted" with the directory where you stored the encrypted files,
+ replace "/path/to_original_llama_7B" with the directory where you stored the original LLaMA 7B files,
+ and replace "/path/to_finetuned_model" with the directory where you want to save the final trained model.
+
+ ```bash
+ # -p avoids an error if the output directory already exists
+ mkdir -p /path/to_finetuned_model
+ for f in "/path/to_encrypted"/*; do
+   if [ -f "$f" ]; then
+     python3 decrypt.py "$f" "/path/to_original_llama_7B/consolidated.00.pth" "/path/to_finetuned_model/"
+   fi
+ done
+ ```
+
+ After running this command, you will have the following files.
+
+ ```
+ ./config.json
+ ./generation_config.json
+ ./pytorch_model-00001-of-00002.bin
+ ./pytorch_model-00002-of-00002.bin
+ ./pytorch_model.bin.index.json
+ ./special_tokens_map.json
+ ./tokenizer_config.json
+ ./tokenizer.model
+ ```
+
+ 3. Check md5sum
+
+ Verify that the files were recovered completely by comparing their MD5 checksums against the following values:
+ ```
+ md5sum ./*
+ a57bf2d0d7ec2590740bc4175262610b ./config.json
+ 2917a1cafb895cf57e746cfd7696bfe5 ./generation_config.json
+ 252143e5ed0f0073dc5c04159a0f78c2 ./pytorch_model-00001-of-00002.bin
+ 3f71478bd783685f0a45fc742af85042 ./pytorch_model-00002-of-00002.bin
+ d5230ae5fb3bfd12df98af123be53cf5 ./pytorch_model.bin.index.json
+ 8a80554c91d9fca8acb82f023de02f11 ./special_tokens_map.json
+ 414f52220807d1300ad700283141de69 ./tokenizer_config.json
+ eeec4125e9c7560836b4873b6f8e3025 ./tokenizer.model
+ ```
+
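If `md5sum` is not available (for example on Windows), the same check can be scripted. The sketch below is one possible approach, not part of the BELLE tooling; the helper names `md5_of_file` and `find_mismatches` are hypothetical, and only two of the checksums listed above are copied in to keep it short.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(directory, expected):
    """Return names of files that are missing or whose MD5 differs."""
    bad = []
    for name, want in expected.items():
        path = Path(directory) / name
        if not path.is_file() or md5_of_file(path) != want:
            bad.append(name)
    return bad


# Checksums copied from the list above (extend with the remaining files).
EXPECTED = {
    "config.json": "a57bf2d0d7ec2590740bc4175262610b",
    "generation_config.json": "2917a1cafb895cf57e746cfd7696bfe5",
}

if __name__ == "__main__":
    # An empty list means every listed file matched.
    print(find_mismatches("/path/to_finetuned_model", EXPECTED))
```

Reading in chunks matters here because the model shards are several gigabytes each and should not be loaded into memory at once.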
+ ## Use model
+ Please note that the input must be formatted as follows for both **training** and **inference**.
+ ```
+ Human: {input} \n\nAssistant:
+ ```
+
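A small helper can keep the prompt format consistent. This is only a sketch: `build_prompt` is a hypothetical name, and it assumes that `\n\n` in the template stands for two real newline characters and that the space before them is intentional. Note that the loading example later in this README also puts a trailing space after `Assistant:`, which this helper does not add.

```python
def build_prompt(instruction: str) -> str:
    """Wrap a user instruction in the 'Human: ... Assistant:' template."""
    return f"Human: {instruction} \n\nAssistant:"


prompt = build_prompt("Write a short poem about spring")
print(repr(prompt))
```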
+ To load BELLE-LLAMA-7B-2M-enc with Hugging Face Transformers, install the library from its main branch, because the latest stable release does not support LLaMA (as of March 26, 2023).
+ ```bash
+ pip install git+https://github.com/huggingface/transformers
+ ```
+
+ After you decrypt the files, BELLE-LLAMA-7B-2M can be loaded with LlamaForCausalLM.
+ ``` python
+ from transformers import LlamaForCausalLM, AutoTokenizer
+ import torch
+
+ ckpt = '/path/to_finetuned_model/'
+ device = torch.device('cuda')
+ # device_map='auto' places the weights on the available devices automatically
+ model = LlamaForCausalLM.from_pretrained(ckpt, device_map='auto', low_cpu_mem_usage=True)
+ tokenizer = AutoTokenizer.from_pretrained(ckpt)
+ # Prompt in the required format ("Write a Chinese song praising nature")
+ prompt = "Human: 写一首中文歌曲,赞美大自然 \n\nAssistant: "
+ input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
+ generate_ids = model.generate(input_ids, max_new_tokens=500, do_sample=True, top_k=30, top_p=0.85, temperature=0.5, repetition_penalty=1., eos_token_id=2, bos_token_id=1, pad_token_id=0)
+ output = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+ # Drop the echoed prompt so that only the model's reply remains
+ response = output[len(prompt):]
+ print(response)
+ ```
+
+ ## Limitations
+ A few issues remain in the model trained on the current base model and data:
+
+ 1. The model might generate factual errors when asked to follow instructions related to facts.
+
+ 2. The model occasionally generates harmful responses, since it still struggles to identify potentially harmful instructions.
+
+ 3. The model needs improvement in reasoning and coding.
+
+ Because the model still has these limitations, we require that developers use the open-sourced code, data, model, and any other artifacts generated by this project for research purposes only. Commercial use and other potentially harmful uses are not allowed.
+
+ ## Citation
+
+ Please cite us when using our code, data or model.
+
+ ```
+ @misc{BELLE,
+   author = {Yunjie Ji and Yong Deng and Yan Gong and Yiping Peng and Qiang Niu and Baochang Ma and Xiangang Li},
+   title = {BELLE: Be Everyone's Large Language model Engine},
+   year = {2023},
+   publisher = {GitHub},
+   journal = {GitHub repository},
+   howpublished = {\url{https://github.com/LianjiaTech/BELLE}},
+ }
+ ```