geniacllm/ja-en-tokenizer-unigram-v5


miya-99999 committed on Apr 27

Commit 885b04f • 1 Parent(s): 17a122a

Update README.md

Files changed (1)

README.md +89 -0


@@ -1,3 +1,92 @@
 ---
 license: cc-by-sa-4.0
 ---

+---
+license: cc-by-sa-4.0
+---
+# Changes from v4
+All numbers are now split into single digits.
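To make the effect of single-digit splitting concrete, the sketch below mimics it with a regular expression. It only illustrates the resulting segmentation and is not how sentencepiece implements `split_digits`; the helper name `split_single_digits` is hypothetical.

```python
import re


def split_single_digits(text: str) -> list[str]:
    """Split text so that every ASCII digit becomes its own segment."""
    # [0-9] matches exactly one digit; [^0-9]+ matches a run of anything else.
    return re.findall(r"[0-9]|[^0-9]+", text)


print(split_single_digits("costs 1250 yen"))
# ['costs ', '1', '2', '5', '0', ' yen']
```

With this behaviour a number such as 1250 is never a single vocabulary entry, so every number is built from the same ten digit tokens.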
+# Description
+A tokenizer trained on Wikipedia, mbpp, and grade-school-math.
+## Training data
+- English: 1.33GB (wiki40b)<br>
+- Japanese: 1.78GB (wiki40b). Note: the text was pre-split into morphemes with "||||", and pretokenization_delimiter was set when training sentencepiece.<br>
+- Code: 172KB (mbpp) <br>
+- Math: 2.1MB (grade-school-math)
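The Japanese preprocessing step described above can be sketched as follows. The README does not say which morphological analyzer was used, so the morphemes are hard-coded here; `join_morphemes` and the sample sentence are hypothetical.

```python
DELIMITER = "||||"  # delimiter named in the README


def join_morphemes(morphemes: list[str], delimiter: str = DELIMITER) -> str:
    """Join pre-segmented morphemes into one training line."""
    for morpheme in morphemes:
        # A morpheme that contains the delimiter would create a false boundary.
        if delimiter in morpheme:
            raise ValueError(f"morpheme contains the delimiter: {morpheme!r}")
    return delimiter.join(morphemes)


# Hand-segmented sample; a real pipeline would get this from a morphological analyzer.
line = join_morphemes(["今日", "は", "良い", "天気", "です", "。"])
print(line)  # 今日||||は||||良い||||天気||||です||||。
```

Lines in this form are what the trainer would read when `pretokenization_delimiter="||||"` is set. The apparent intent is that no learned piece spans a morpheme boundary.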
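+## Added vocabulary
+Japanese vocabulary was added with reference to the following sources.
+- Wiktionary category index (a qualitative selection of words thought to be in common use, drawn from nouns, adjectives, adjectival nouns, adverbs, suffixes, particles, verbs, and others)
+- Wiktionary: 1000 basic Japanese words
+- A sample of the example words in the Agency for Cultural Affairs' "常用漢字一覧表" (jōyō kanji table)
+- Words for time, seasons, and directions
+- Prefectures, tourist destinations, and the 23 wards of Tokyo
+- Common Japanese surnames
+- Set phrases (such as 「こんにちは」, 「よろしく」, and 「ございます」)
+The following vocabulary was also added.
+- Symbols
+- Digits (kanji numerals, half-width digits 0~9, full-width digits ０〜９, superscript digits 0〜9)
+- Greek letters used in mathematics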
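+## Vocabulary breakdown
+As a rough estimate, alphabetic tokens make up about 60% of the vocabulary and Japanese tokens (hiragana, katakana, kanji) about 40%. (Other symbols and digits account for about 1~2%.)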
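One way to reproduce an estimate like this is to classify each vocabulary piece by script and count. The README does not describe how the figures were obtained, so the rules below (Unicode ranges, treating bracketed pieces such as `<s>` or `<0xE3>` as "other") are assumptions, and `classify_piece` and `vocab_breakdown` are hypothetical names.

```python
import re
from collections import Counter

# Hiragana, katakana, and the main CJK ideograph block (assumed ranges).
JAPANESE = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]")
LATIN = re.compile(r"[A-Za-z]")
# Special tokens and byte-fallback pieces look like <s>, <unk>, <0xE3>.
BRACKETED = re.compile(r"^<[^<>]+>$")


def classify_piece(piece: str) -> str:
    """Assign one coarse script label to a vocabulary piece."""
    if BRACKETED.match(piece):
        return "other"
    if JAPANESE.search(piece):
        return "japanese"
    if LATIN.search(piece):
        return "alphabet"
    return "other"


def vocab_breakdown(pieces) -> dict:
    """Return the share of each script label among the given pieces."""
    counts = Counter(classify_piece(p) for p in pieces)
    total = sum(counts.values())
    return {
        label: (counts[label] / total if total else 0.0)
        for label in ("alphabet", "japanese", "other")
    }


print(vocab_breakdown(["▁the", "ing", "東京", "こんにちは", "1", "<0xE3>"]))
```

To apply it to this tokenizer, pass `test_tokenizer.get_vocab().keys()` once the tokenizer from the usage section is loaded.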
+## References
+- https://aclanthology.org/2020.lrec-1.297.pdf
+- https://www.tensorflow.org/datasets/catalog/wiki40b
+- https://github.com/openai/grade-school-math
+- https://github.com/google-research/google-research/tree/master/mbpp
+- https://www.bunka.go.jp/kokugo_nihongo/sisaku/joho/joho/kakuki/14/pdf/jyouyou_kanjihyou.pdf
+- https://ja.m.wiktionary.org/wiki/%E3%82%AB%E3%83%86%E3%82%B4%E3%83%AA:%E6%97%A5%E6%9C%AC%E8%AA%9E
+## Settings
+vocab_size=56,320 (vocabulary size)<br>
+character_coverage=0.9995 (99.95% character coverage)<br>
+model_type="unigram" (algorithm)<br>
+normalization="identity" (no normalization)<br>
+byte_fallback=True (byte fallback enabled)<br>
+split_digits=True (digits are split)<br>
+allow_whitespace_only_pieces=True (whitespace-only tokens are allowed)<br>
+remove_extra_whitespaces=True (extra whitespace is removed)<br>
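For orientation, these settings could be passed to the sentencepiece Python trainer roughly as below. This is a sketch and not the author's training script: the corpus path and model prefix are placeholders, `normalization` is mapped to sentencepiece's `normalization_rule_name` flag as an assumption, and the README does not say how the extra vocabulary was injected, so that step is left out.

```python
def build_training_args(corpus_path: str, model_prefix: str) -> dict:
    """Collect the README's settings as keyword arguments for the trainer."""
    return {
        "input": corpus_path,            # placeholder path, not from the README
        "model_prefix": model_prefix,    # placeholder prefix, not from the README
        "vocab_size": 56320,
        "character_coverage": 0.9995,
        "model_type": "unigram",
        "normalization_rule_name": "identity",  # assumed mapping of normalization="identity"
        "byte_fallback": True,
        "split_digits": True,
        "allow_whitespace_only_pieces": True,
        "remove_extra_whitespaces": True,
        "pretokenization_delimiter": "||||",    # from the training data notes
    }


def train(corpus_path: str, model_prefix: str) -> None:
    """Run training; needs the sentencepiece package and a real corpus file."""
    import sentencepiece as spm  # imported lazily so the config can be inspected without it

    spm.SentencePieceTrainer.train(**build_training_args(corpus_path, model_prefix))


print(build_training_args("corpus.txt", "ja_en_unigram"))
```

Keeping the arguments in a plain dictionary makes it easy to log or diff the configuration between tokenizer versions.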
+## Format
+LlamaTokenizer<br>
+Note: when encoding, the bos_token "\<s\>" is prepended to the sequence.
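When the leading BOS ID is not wanted, for example when concatenating several encoded segments, it can be dropped after the fact. The helper below is a hypothetical sketch, and the IDs in the example are illustrative and not taken from this tokenizer.

```python
def strip_leading_bos(token_ids: list[int], bos_token_id: int) -> list[int]:
    """Return a copy of token_ids without a leading BOS ID, if one is present."""
    if token_ids and token_ids[0] == bos_token_id:
        return token_ids[1:]
    return list(token_ids)


# Illustrative IDs only; read the real value from tokenizer.bos_token_id.
print(strip_leading_bos([1, 306, 4966], bos_token_id=1))  # [306, 4966]
```

In transformers, calling `encode(text, add_special_tokens=False)` is the usual way to avoid adding special tokens in the first place.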
+# Usage
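+```python
+# Notebook cell: quote the requirement so the shell does not treat ">=" as a redirect.
+# The slow LlamaTokenizer (use_fast=False) also needs the sentencepiece package.
+!pip install "transformers>=4.34.0" sentencepiece
+from transformers import AutoTokenizer
+test_tokenizer = AutoTokenizer.from_pretrained("geniacllm/ja-en-tokenizer-unigram-v5", use_fast=False)
+```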
+```python
+# text
+text = "This is tokenizer test."
+# tokenize
+tokenized = test_tokenizer.tokenize(text)
+print(tokenized)
+# encode
+encoded = test_tokenizer.encode(text)
+print(encoded)
+# decode
+decoded = test_tokenizer.decode(encoded)
+print(decoded)
+# special_token
+print(test_tokenizer.special_tokens_map)
+# vocab size
+print(len(test_tokenizer))
+# all subwords in vocab
+print(test_tokenizer.get_vocab())
+```