Syoyo Fujita committed
Commit 660f34c
1 Parent(s): 71f0521
Initial.
- README.md +17 -0
- tokenizer-cc100-ja.json +0 -0
- train_jp_tokenizer.py +25 -0
README.md
CHANGED
@@ -1,3 +1,20 @@
---
license: mit
---

A tokenizer trained on a Japanese dataset.

It is not intended to be used on its own; the intended use is to merge it into another tokenizer such as the LLaMa Tokenizer.
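The merge step itself is not part of this repo. As a rough illustration only, the sketch below follows the sentencepiece-proto approach used by projects such as Chinese-LLaMA: parse the LLaMa SentencePiece model, append any pieces from this tokenizer's vocab that are not already present, and write out a merged model. The LLaMa path, the output filename, and the `transformers`/`sentencepiece` usage are assumptions, and BPE pieces from a `tokenizers` JSON may still need whitespace/byte-level adjustments to match SentencePiece conventions.

```python
# Hypothetical merge sketch (not part of this repo); paths are placeholders.
from tokenizers import Tokenizer
from transformers import LlamaTokenizer
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

llama_tok = LlamaTokenizer.from_pretrained("path/to/llama")  # placeholder path
ja_tok = Tokenizer.from_file("tokenizer-cc100-ja.json")

# Parse the LLaMa SentencePiece model into a mutable proto.
sp_model = sp_pb2.ModelProto()
sp_model.ParseFromString(llama_tok.sp_model.serialized_model_proto())

# Append pieces from the Japanese tokenizer that LLaMa does not already have.
existing = {p.piece for p in sp_model.pieces}
for token in ja_tok.get_vocab():
    if token not in existing:
        piece = sp_pb2.ModelProto.SentencePiece()
        piece.piece = token
        piece.score = 0.0
        sp_model.pieces.append(piece)

with open("merged_llama_ja.model", "wb") as f:  # placeholder output name
    f.write(sp_model.SerializeToString())
```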

## Training script

See `train_jp_tokenizer.py`.

## Trained tokenizer

* `tokenizer-cc100-ja.json`
  Trained on the cc100 ja dataset as-is (without applying normalization, etc.). Vocab size: 30000. (A quick loading sketch follows.)
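A minimal way to sanity-check the trained file with the `tokenizers` library (the sample sentence is arbitrary):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer-cc100-ja.json")
enc = tok.encode("日本語のテキストをトークン化します。")
print(enc.tokens)  # subword pieces
print(enc.ids)     # corresponding vocab ids
```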

## TODO

* [ ] Train on normalized Japanese text (see the sketch below)
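One possible shape for this TODO, assuming Unicode NFKC normalization via the built-in `tokenizers` normalizers (a sketch, not committed code):

```python
from tokenizers import Tokenizer, normalizers
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
# Normalize the Japanese text (NFKC) before pre-tokenization and training.
tokenizer.normalizer = normalizers.NFKC()
# ...then continue exactly as in train_jp_tokenizer.py
```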
tokenizer-cc100-ja.json
ADDED
The diff for this file is too large to render. See the raw file.
train_jp_tokenizer.py
ADDED
@@ -0,0 +1,25 @@
# NOTE: 128 GB CPU mem is required.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from datasets import load_dataset

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# TODO: Use [BOS], [EOS] instead of [CLS], [SEP]?
# NOTE: Chinese LLaMa uses vocab_size=20000
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=30000)

dataset = load_dataset('range3/cc100-ja')

def dataset_iter():
    # Yield every `skip`-th example (roughly 700 MB of text in total).
    # Reducing `skip` will cause OOM if you have less than 128 GB CPU mem.
    skip = 100
    for i in range(0, len(dataset['train']), skip):
        yield dataset['train'][i]['text']

tokenizer.train_from_iterator(dataset_iter(), trainer)
tokenizer.save('data/tokenizer-cc100-ja.json')