update
- README.md +207 -0
- config.json +26 -0
- pytorch_model.bin +3 -0
- rinna.png +0 -0
- spiece.model +3 -0
- spiece.vocab +0 -0
- tokenizer_config.json +1 -0
README.md
CHANGED
@@ -1,3 +1,210 @@
---
thumbnail: https://github.com/rinnakk/japanese-pretrained-models/blob/master/rinna.png
license: mit
datasets:
- mc4
- cc100
- wikipedia
- EleutherAI/pile
- togethercomputer/RedPajama-Data-1T
language:
- ja
- en
inference: false
---

# bilingual-gpt-neox-4b

![rinna-icon](./rinna.png)

# Overview
This repository provides an English-Japanese bilingual GPT-NeoX model of 3.8 billion parameters.

* **Library**

    The model was trained using code based on [EleutherAI/gpt-neox](https://github.com/EleutherAI/gpt-neox).

* **Model architecture**

    A 36-layer, 2816-hidden-size transformer-based language model.

* **Pre-training**

    The model was trained on around **524B** tokens from a mixture of the following corpora:

    - [Japanese CC-100](http://data.statmt.org/cc-100/ja.txt.xz)
    - [Japanese C4](https://huggingface.co/datasets/mc4)
    - [The Pile](https://huggingface.co/datasets/EleutherAI/pile)
    - [RedPajama](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T)
    - [Wikipedia](https://dumps.wikimedia.org/other/cirrussearch)

* **Model Series**

    | Variant | Link |
    | :-- | :-- |
    | Bilingual 4B MiniGPT4 | https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4 |
    | Bilingual 4B SFT | https://huggingface.co/rinna/bilingual-gpt-neox-4b-instruction-sft |
    | Bilingual 4B 8K | https://huggingface.co/rinna/bilingual-gpt-neox-4b-8k |
    | Bilingual 4B | https://huggingface.co/rinna/bilingual-gpt-neox-4b |
    | Japanese 3.6B PPO | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-ppo |
    | Japanese 3.6B SFT-v2 | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft-v2 |
    | Japanese 3.6B SFT | https://huggingface.co/rinna/japanese-gpt-neox-3.6b-instruction-sft |
    | Japanese 3.6B | https://huggingface.co/rinna/japanese-gpt-neox-3.6b |

* **Authors**

    [Tianyu Zhao](https://huggingface.co/tianyuz) and [Kei Sawada](https://huggingface.co/keisawada)

---

# Benchmarking

* **Japanese benchmark**

    Our evaluation experiments suggest that the bilingual-gpt-neox-4b model performs slightly better than the previous [Japanese GPT-NeoX 3.6B](https://huggingface.co/rinna/japanese-gpt-neox-3.6b) on Japanese tasks.
    - *The 4-task average accuracy is based on the results of JCommonsenseQA, JNLI, MARC-ja, and JSQuAD.*
    - *The 6-task average accuracy is based on the results of JCommonsenseQA, JNLI, MARC-ja, JSQuAD, XWinograd, and JAQKET-v2.*

    | Model | 4-task average accuracy | 6-task average accuracy |
    | :-- | :-- | :-- |
    | bilingual-gpt-neox-4b-instruction-sft | 59.25 | 60.59 |
    | **bilingual-gpt-neox-4b** | **56.12** | **51.83** |
    | japanese-gpt-neox-3.6b-instruction-ppo | 59.86 | 60.07 |
    | japanese-gpt-neox-3.6b | 55.07 | 50.32 |

* **English benchmark**

    Using the [EleutherAI Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/master), we found that bilingual-gpt-neox-4b performs comparably to English/multilingual models of similar size.
    - *The average accuracy is based on the results of ARC-Challenge, ARC-Easy, BoolQ, COPA, HellaSwag, OpenBookQA, PIQA, PROST, SWAG, and WinoGrande.*

    | Model | Average accuracy |
    | :-- | :-- |
    | mpt-7b | 59.30 |
    | llama-7b | 57.35 |
    | bloom-7b | 51.51 |
    | xglm-7.5b | 50.96 |
    | xglm-4.5b | 50.15 |
    | **bilingual-gpt-neox-4b** | **49.49** |
    | bloom-3b | 48.56 |
    | xglm-2.9b | 47.44 |
    | bloom-1.7b | 46.54 |

---

# How to use the model

**Notice:** Since the model is **sensitive to decoding hyper-parameters** (e.g. `temperature`, `top_p`, `top_k`, `repetition_penalty`), it is suggested to explore the best setting for your task.
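
To make the notice above more concrete, here is a minimal, dependency-free sketch of what `temperature` and `top_p` do to a next-token distribution. It is a simplified illustration and not the `transformers` implementation; the function names (`softmax_with_temperature`, `top_p_filter`, `sample`) and the toy logits are assumptions made up for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token index after temperature scaling and nucleus filtering."""
    probs = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(toy_logits, t)])
print(top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.7))
```

A low `top_p` (such as the 0.5 used for the code prompt further down) restricts sampling to a few high-probability tokens, while a higher value (such as 0.95) leaves more room for variety.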

~~~~python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# use_fast=False is required for the tokenizer features described under "Tokenization".
tokenizer = AutoTokenizer.from_pretrained("rinna/bilingual-gpt-neox-4b", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("rinna/bilingual-gpt-neox-4b")

# Move the model to the GPU when one is available.
if torch.cuda.is_available():
    model = model.to("cuda")

# Japanese prompt to be continued by the model.
text = "西田幾多郎は、"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

# Sample exactly 100 new tokens (min_new_tokens == max_new_tokens).
with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=100,
        min_new_tokens=100,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
"""
西田幾多郎は、その著書「自覚の哲学」の中で、次のように書きました。
「知識を、自分のものと考えることに満足していると、自己の限界に目覚めることを忘れてしまう。しかし、他者との協同なしには、自己の本当の理解に達することはできないのだ。知識は他者と相互の、協同の力によってこそ、得られるのである。」(引用終わり)
この一節を、私たちは今から学び直すべきです。そして、これからの社会をリードする子どもたちに、その能力を伸ばすべく、
"""
~~~~

~~~~python
# Reuses `tokenizer` and `model` from the first example; English prompt.
text = "Socrates says"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=100,
        min_new_tokens=100,
        do_sample=True,
        temperature=1.0,
        top_p=0.95,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)

"""
Socrates says: he thinks that philosophy, as opposed to myth, can be demonstrated; as opposed to poetry, that it is not possible to have knowledge of the unknowable (that is, neither by reason nor by any art of divination). So in this case he is in agreement with Socrates in not thinking that we could prove the existence of the gods or of fate. Now, I do not know the content of Xenophon's _Symposium_, but he must have made a point of this passage that has ex
"""
~~~~

~~~~python
# Reuses `tokenizer` and `model` from the first example; code prompt with a lower top_p.
text = "def bubble_sort(array):"
token_ids = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(
        token_ids.to(model.device),
        max_new_tokens=200,
        min_new_tokens=200,
        do_sample=True,
        temperature=1.0,
        top_p=0.5,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

output = tokenizer.decode(output_ids.tolist()[0])
print(output)
"""
def bubble_sort(array):
    for i in range(len(array)):
        for j in range(len(array)-1):
            if array[j] > array[j+1]:
                array[j], array[j+1] = array[j+1], array[j]
    return array

print(bubble_sort([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))

The code above will sort the array from 1 to 10 in the following order:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10

However, I am not sure how to do
"""
~~~~

---

# Tokenization
The model uses a [sentencepiece](https://github.com/google/sentencepiece)-based tokenizer.
* The tokenizer has a vocabulary size of 65,536.
* It uses *byte fallback* to decompose unknown text pieces into UTF-8 byte pieces to avoid producing `<UNK>` tokens.
* It can recognize *consecutive whitespaces*, *newlines*, and *tabs* to handle structured texts better.
* We turned off the default behaviour of prepending leading whitespace because it is not beneficial for processing Japanese.
* Specifically, a single whitespace is always processed as one token, so an English word never carries a preceding whitespace marker as it does in many other tokenizers (e.g. `_Hello`).
  * This decision trades English processing efficiency for a unified way to treat whitespaces.
  * It leads to a significantly lower next-token prediction loss on English data because whitespaces are easy to predict.
* **Don't forget to set `use_fast=False` to make the above features function correctly.**
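
The byte-fallback behaviour listed above can be illustrated with a toy sketch. This is not the sentencepiece implementation: it works character by character on a tiny made-up vocabulary, and the helper names (`byte_fallback_pieces`, `pieces_to_text`) and `toy_vocab` are hypothetical. It only shows the idea that any out-of-vocabulary character can be expressed as UTF-8 byte pieces of the form `<0xNN>` and recovered without loss.

```python
def byte_fallback_pieces(text, vocab):
    """Split text into characters; out-of-vocabulary characters become UTF-8 byte pieces."""
    pieces = []
    for ch in text:
        if ch in vocab:
            pieces.append(ch)
        else:
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces


def pieces_to_text(pieces):
    """Invert byte_fallback_pieces by reassembling the raw UTF-8 bytes."""
    buffer = bytearray()
    for piece in pieces:
        if len(piece) == 6 and piece.startswith("<0x") and piece.endswith(">"):
            buffer.append(int(piece[3:5], 16))
        else:
            buffer.extend(piece.encode("utf-8"))
    return buffer.decode("utf-8")


toy_vocab = {"a", "b", " "}  # hypothetical vocabulary that lacks the character "あ"
pieces = byte_fallback_pieces("a あ", toy_vocab)
print(pieces)                  # ['a', ' ', '<0xE3>', '<0x81>', '<0x82>']
print(pieces_to_text(pieces))  # a あ
```

Note that the single space is emitted as its own piece here, which mirrors the whitespace handling described in the list, although the real tokenizer operates on learned subword pieces rather than single characters.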

---

# License
[The MIT license](https://opensource.org/licenses/MIT)
config.json
ADDED
@@ -0,0 +1,26 @@
{
  "architectures": [
    "GPTNeoXForCausalLM"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": 2,
  "classifier_dropout": 0.1,
  "eos_token_id": 3,
  "hidden_act": "gelu",
  "hidden_dropout": 0.1,
  "hidden_size": 2816,
  "initializer_range": 0.02,
  "intermediate_size": 11264,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "gpt_neox",
  "num_attention_heads": 22,
  "num_hidden_layers": 36,
  "rotary_emb_base": 10000,
  "rotary_pct": 1.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "use_cache": true,
  "use_parallel_residual": false,
  "vocab_size": 65536
}
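
As a sanity check on the "3.8 billion parameters" figure in the README, the following sketch estimates the parameter count from the values in this config. It assumes the usual GPT-NeoX block layout (a fused query/key/value projection and an output projection, both with biases, a two-layer MLP with biases, two LayerNorms per block, a final LayerNorm, and a bias-free output embedding that is not tied to the input embedding). The function name `estimate_gpt_neox_params` is hypothetical.

```python
# Values copied from config.json above.
config = {
    "hidden_size": 2816,
    "intermediate_size": 11264,
    "num_hidden_layers": 36,
    "vocab_size": 65536,
    "tie_word_embeddings": False,
}


def estimate_gpt_neox_params(cfg):
    """Estimate the parameter count under the assumed GPT-NeoX layout."""
    h = cfg["hidden_size"]
    ff = cfg["intermediate_size"]
    attention = (3 * h * h + 3 * h) + (h * h + h)   # fused QKV + output projection
    mlp = (h * ff + ff) + (ff * h + h)              # up and down projections
    layer_norms = 2 * (2 * h)                       # weight and bias, two norms per block
    per_layer = attention + mlp + layer_norms
    embed_in = cfg["vocab_size"] * h
    embed_out = 0 if cfg["tie_word_embeddings"] else cfg["vocab_size"] * h
    final_norm = 2 * h
    return cfg["num_hidden_layers"] * per_layer + embed_in + embed_out + final_norm


total = estimate_gpt_neox_params(config)
print(f"{total:,} parameters")                     # 3,796,120,064 parameters
print(f"{total * 2 / 2**30:.2f} GiB in float16")   # two bytes per parameter
```

Under these assumptions the estimate rounds to 3.8B, and the float16 footprint is of the same order as the size recorded for `pytorch_model.bin` in this commit. Any remaining difference is presumably due to non-parameter buffers and serialization overhead, which this sketch does not model.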
pytorch_model.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7a132c0f26a5ef44f6d794d7d55965b36817213e2974af402c5a9d12104f39c6
size 7743419069
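
The three lines above form a Git LFS pointer: they record the hash algorithm, the digest, and the byte size of the real file. The sketch below shows one way to check a downloaded checkpoint against such a pointer. The helper names (`parse_lfs_pointer`, `matches_pointer`) and the local path `pytorch_model.bin` are assumptions for illustration and not part of any Git LFS tooling.

```python
import hashlib
import os

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:7a132c0f26a5ef44f6d794d7d55965b36817213e2974af402c5a9d12104f39c6
size 7743419069
"""


def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Stream the file and compare its size and digest with the pointer."""
    hasher = hashlib.new(pointer["algo"])
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer(POINTER)
print(pointer["algo"], pointer["size"])

# Hypothetical local path; the check only runs if the file has been downloaded.
if os.path.exists("pytorch_model.bin"):
    print("checkpoint intact:", matches_pointer("pytorch_model.bin", pointer))
```

Reading the file in chunks keeps memory use flat, which matters for a checkpoint of several gigabytes.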
rinna.png
ADDED
spiece.model
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:85a0205d37a98bb3b97cf4ca3f507c78873cf8f6cefa3b51d8d6a15006dc889d
size 1341798
spiece.vocab
ADDED
The diff for this file is too large to render.
See raw diff
tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"eos_token": "</s>", "unk_token": "[UNK]", "pad_token": "[PAD]", "extra_ids": 0, "additional_special_tokens": [], "sp_model_kwargs": {}, "bos_token": "<s>", "cls_token": "[CLS]", "sep_token": "[SEP]", "mask_token": "[MASK]", "do_lower_case": false, "tokenizer_class": "T5Tokenizer"}