Japanese
Syoyo Fujita committed on
Commit 660f34c
1 Parent(s): 71f0521
Files changed (3)
  1. README.md +17 -0
  2. tokenizer-cc100-ja.json +0 -0
  3. train_jp_tokenizer.py +25 -0
README.md CHANGED
@@ -1,3 +1,20 @@
  ---
  license: mit
  ---
+
+ A tokenizer trained on a Japanese dataset.
+
+ It is not intended for standalone use; the intended usage is to merge it into the LLaMa Tokenizer or a similar tokenizer.
+
+ ## Training script
+
+ See `train_jp_tokenizer.py`.
+
+ ## Trained tokenizer
+
+ * `tokenizer-cc100-ja.json`
+ Trained on the cc100 ja dataset as-is (without applying normalization, etc.). Vocab size 30000.
+
+ ## TODO
+
+ * [ ] Train on normalized Japanese text
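As a quick, minimal usage sketch (not part of this commit; the sample sentence is arbitrary), the trained tokenizer can be loaded and inspected with the `tokenizers` library:

```python
# Minimal usage sketch: load the trained BPE tokenizer and see how it splits
# a Japanese sentence. The sample text below is an arbitrary illustration.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer-cc100-ja.json")

enc = tok.encode("日本語のテキストをトークン化します。")
print(enc.tokens)  # subword pieces produced by the cc100-ja BPE model
print(enc.ids)     # corresponding vocabulary ids
```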
tokenizer-cc100-ja.json ADDED
The diff for this file is too large to render. See raw diff
 
train_jp_tokenizer.py ADDED
@@ -0,0 +1,25 @@
+ # NOTE: 128 GB CPU mem is required.
+ from tokenizers import Tokenizer
+ from tokenizers.models import BPE
+ from tokenizers.trainers import BpeTrainer
+ from tokenizers.pre_tokenizers import Whitespace
+ from datasets import load_dataset
+
+ tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
+ tokenizer.pre_tokenizer = Whitespace()
+
+ # TODO: Use [BOS], [EOS] instead of [CLS], [SEP]?
+ # NOTE: Chinese LLaMa uses vocab_size=20000
+ trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"], vocab_size=30000)
+
+ dataset = load_dataset('range3/cc100-ja')
+
+ def dataset_iter():
+     # roughly 700MB
+     # reducing `skip` will cause OOM if you have less than 128 GB CPU mem.
+     skip = 100
+     for i in range(0, len(dataset['train']), skip):
+         yield dataset['train'][i]['text']
+
+ tokenizer.train_from_iterator(dataset_iter(), trainer)
+ tokenizer.save('data/tokenizer-cc100-ja.json')
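Since the README describes merging into the LLaMa Tokenizer as the intended use, here is a rough, hypothetical sketch of one way to do that, loosely following the Chinese-LLaMa vocabulary-merge approach referenced in the script's comments. The LLaMA checkpoint path and the output file name are placeholders, and appending the HF-BPE vocabulary to a SentencePiece model is only an approximation of a full merge:

```python
# Hypothetical merge sketch (not part of this repo): append tokens learned in
# tokenizer-cc100-ja.json to LLaMA's SentencePiece vocabulary.
import json

import sentencepiece.sentencepiece_model_pb2 as sp_pb2_model
from transformers import LlamaTokenizer

# Placeholder path to a local LLaMA tokenizer.
llama_tokenizer = LlamaTokenizer.from_pretrained("path/to/llama")

# Parse LLaMA's SentencePiece model so new pieces can be appended.
llama_spm = sp_pb2_model.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())

# Vocabulary of the BPE tokenizer trained by train_jp_tokenizer.py.
with open("tokenizer-cc100-ja.json", encoding="utf-8") as f:
    ja_vocab = json.load(f)["model"]["vocab"]

existing = {p.piece for p in llama_spm.pieces}
for token in ja_vocab:
    if token not in existing:
        new_piece = sp_pb2_model.ModelProto.SentencePiece()
        new_piece.piece = token
        new_piece.score = 0.0
        llama_spm.pieces.append(new_piece)

# Placeholder output file; load it back with sentencepiece or LlamaTokenizer.
with open("merged_llama_ja.model", "wb") as f:
    f.write(llama_spm.SerializeToString())
```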