lighttransport/japanese-tokenizer-cc100


A tokenizer trained on a Japanese dataset.

It is not intended for standalone use. It is meant to be merged into another tokenizer, such as the LLaMa tokenizer.
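As a rough illustration of what merging involves at the vocabulary level, the sketch below appends tokens that a base vocabulary lacks and assigns them fresh ids. The function name `merge_vocab` and the toy vocabularies are hypothetical. Merging into a real LLaMa tokenizer also requires handling scores or merge rules, which this sketch does not attempt.

```python
def merge_vocab(base_vocab: dict[str, int], extra_tokens: list[str]) -> dict[str, int]:
    """Return a copy of base_vocab extended with tokens it does not contain yet.

    Hypothetical helper for illustration only: it shows id assignment,
    not a full tokenizer merge.
    """
    merged = dict(base_vocab)  # leave the caller's vocabulary untouched
    next_id = max(merged.values(), default=-1) + 1
    for token in extra_tokens:
        if token not in merged:  # skip tokens the base vocabulary already has
            merged[token] = next_id
            next_id += 1
    return merged


# Toy data, not taken from any real tokenizer.
base = {"<s>": 0, "a": 1}
merged = merge_vocab(base, ["a", "日本", "語"])
```

Existing ids are preserved, so embeddings already trained for the base vocabulary stay aligned. Only the appended tokens would need new embedding rows.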

Training script

See train_jp_tokenizer.py.

Trained tokenizer

tokenizer-cc100-ja.json: trained on the cc100 ja dataset as-is (without applying normalization or other preprocessing). Vocab size 30000.
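To pull the token list out of the file before merging, something like the following may help. It is a minimal sketch that assumes the file follows the Hugging Face `tokenizers` JSON layout, where `model.vocab` is either a token-to-id mapping or a list of `[token, score]` pairs. This card does not state which model type is used, so both shapes are handled. `read_vocab_tokens` is a hypothetical helper name.

```python
import json


def read_vocab_tokens(path: str) -> list[str]:
    """Return the tokens stored under model.vocab in a tokenizers-style JSON file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    vocab = data["model"]["vocab"]
    if isinstance(vocab, dict):
        # Mapping layout: token -> id. Return tokens ordered by id.
        return sorted(vocab, key=vocab.get)
    # List layout: [token, score] pairs, already in id order.
    return [entry[0] for entry in vocab]
```

Calling `read_vocab_tokens("tokenizer-cc100-ja.json")` on a local copy of the file should give the list of tokens to feed into a merge step.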

TODO

Train on normalized Japanese text
Upload the merged tokenizer
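The first TODO item does not say which normalization is planned. As one possibility, Unicode NFKC is a common choice for Japanese text because it folds half-width katakana and full-width alphanumerics into their canonical forms. The sketch below assumes NFKC. `normalize_ja` is a hypothetical name and not part of the training script.

```python
import unicodedata


def normalize_ja(text: str) -> str:
    """Apply Unicode NFKC normalization (an assumed choice, not the author's stated plan)."""
    return unicodedata.normalize("NFKC", text)


# Half-width katakana and full-width alphanumerics are folded to standard forms.
sample = normalize_ja("ﾃｽﾄ ＡＢＣ１２３")
print(sample)  # テスト ABC123
```

Whatever normalization is applied to the training text would also have to be applied at tokenization time, otherwise the inputs will not match what the tokenizer was trained on.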
