What’s this?
This is a DeBERTa V3 model pre-trained on Japanese resources.
The model has the following features:
- Based on the well-known DeBERTa V3 model
- Specialized for the Japanese language
- Does not use a morphological analyzer during inference
- Respects word boundaries to some extent (does not produce tokens that span multiple words, such as `の都合上` or `の判定負けを喫し`)
How to use
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = 'globis-university/deberta-v3-japanese-xsmall'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
```
Tokenizer
The tokenizer is trained using the method introduced by Kudo.
Key points include:
- No morphological analyzer needed during inference
- Tokens do not cross word boundaries (dictionary: `unidic-cwj-202302`)
- Easy to use with Hugging Face
- Smaller vocabulary size
The original DeBERTa V3 is characterized by a large vocabulary, but this makes the embedding layer very large (for the microsoft/deberta-v3-base model, the embedding layer accounts for 54% of all parameters). This model therefore adopts a smaller vocabulary.
Note that, of the three models (xsmall, base, and large), the first two were trained with the unigram algorithm, while only the large model was trained with the BPE algorithm. There is no deep reason for this: training for the large model was done separately in order to increase its vocabulary size, and for some reason training with the unigram algorithm did not succeed. We prioritized completing the model over investigating the cause and switched to the BPE algorithm.
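To make the vocabulary trade-off concrete, here is a minimal sketch of how the token-embedding parameter count scales with vocabulary size. The hidden size of 768 and the 128,100-token vocabulary are assumptions taken from the published microsoft/deberta-v3-base configuration, not from this card. The 32,000 figure is the vocabulary size listed under "Training parameters". The function name `embedding_params` is hypothetical.

```python
def embedding_params(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a token-embedding matrix of shape (vocab_size, hidden_size)."""
    return vocab_size * hidden_size


# Assumed values (not stated in this card): base-sized hidden dimension and
# the vocabulary size of microsoft/deberta-v3-base.
HIDDEN_SIZE = 768
LARGE_VOCAB = 128_100
# Vocabulary size listed under "Training parameters" in this card.
SMALL_VOCAB = 32_000

large = embedding_params(LARGE_VOCAB, HIDDEN_SIZE)
small = embedding_params(SMALL_VOCAB, HIDDEN_SIZE)
saved = large - small

print(f"large vocabulary: {large:,} embedding parameters")
print(f"small vocabulary: {small:,} embedding parameters")
print(f"difference:       {saved:,} parameters")
```

The count ignores position embeddings and any output head, so it is only a rough estimate of the embedding layer's share.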
Data
Dataset Name | Notes | File Size (with metadata) | Factor |
---|---|---|---|
Wikipedia | 2023/07; WikiExtractor | 3.5GB | x2 |
Wikipedia | 2023/07; cl-tohoku's method | 4.8GB | x2 |
WikiBooks | 2023/07; cl-tohoku's method | 43MB | x2 |
Aozora Bunko | 2023/07; globis-university/aozorabunko-clean | 496MB | x4 |
CC-100 | ja | 90GB | x1 |
mC4 | ja; extracted 10%, with Wikipedia-like focus via DSIR | 91GB | x1 |
OSCAR 2023 | ja; extracted 10%, with Wikipedia-like focus via DSIR | 26GB | x1 |
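The sketch below estimates how much each corpus contributes once the factor is applied. It assumes that "Factor" is an upsampling multiplier (how many times a dataset is repeated in the training mix), which this card does not state explicitly. Sizes are copied from the table and include metadata, so the shares are approximate.

```python
# (name, file size in GB, factor) copied from the table above.
corpora = [
    ("Wikipedia (WikiExtractor)", 3.5, 2),
    ("Wikipedia (cl-tohoku's method)", 4.8, 2),
    ("WikiBooks", 0.043, 2),
    ("Aozora Bunko", 0.496, 4),
    ("CC-100", 90.0, 1),
    ("mC4", 91.0, 1),
    ("OSCAR 2023", 26.0, 1),
]


def effective_sizes(rows):
    """Size of each corpus after repeating it `factor` times (assumed meaning of Factor)."""
    return {name: size_gb * factor for name, size_gb, factor in rows}


sizes = effective_sizes(corpora)
total = sum(sizes.values())

for name, gb in sizes.items():
    print(f"{name}: {gb:.2f} GB ({gb / total:.1%} of the mix)")
print(f"total: {total:.2f} GB")
```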
Training parameters
- Number of devices: 8
- Batch size: 48 x 8
- Learning rate: 3.84e-4
- Maximum sequence length: 512
- Optimizer: AdamW
- Learning rate scheduler: Linear schedule with warmup
- Training steps: 1,000,000
- Warmup steps: 100,000
- Precision: Mixed (fp16)
- Vocabulary size: 32,000
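As an illustration of the schedule these parameters imply, the following sketch computes the learning rate at a given step. It assumes the usual shape of a linear schedule with warmup (a linear ramp from 0 to the peak, then a linear decay to 0 at the final step, as in `get_linear_schedule_with_warmup` from `transformers`). It also assumes that "48 x 8" means 48 samples per device across 8 devices. The function name `linear_warmup_lr` is hypothetical.

```python
PEAK_LR = 3.84e-4
WARMUP_STEPS = 100_000
TOTAL_STEPS = 1_000_000


def linear_warmup_lr(step, peak_lr=PEAK_LR, warmup_steps=WARMUP_STEPS,
                     total_steps=TOTAL_STEPS):
    """Learning rate at `step` for a linear warmup followed by a linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to the peak learning rate.
        return peak_lr * step / warmup_steps
    # Decay phase: fall linearly from the peak to 0 at the final step.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Assumed reading of "48 x 8": per-device batch size times number of devices.
global_batch = 48 * 8

for step in (0, 50_000, 100_000, 550_000, 1_000_000):
    print(f"step {step:>9,}: lr = {linear_warmup_lr(step):.3e}")
print(f"global batch size: {global_batch}")
```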
Evaluation
Model | #params | JSTS (Pearson/Spearman) | JNLI (acc) | JSQuAD (EM/F1) | JCQA (acc) |
---|---|---|---|---|---|
≤ small | | | | | |
izumi-lab/deberta-v2-small-japanese | 17.8M | 0.890/0.846 | 0.880 | - | 0.737 |
globis-university/deberta-v3-japanese-xsmall | 33.7M | 0.916/0.880 | 0.913 | 0.869/0.938 | 0.821 |
base | | | | | |
cl-tohoku/bert-base-japanese-v3 | 111M | 0.919/0.881 | 0.907 | 0.880/0.946 | 0.848 |
nlp-waseda/roberta-base-japanese | 111M | 0.913/0.873 | 0.895 | 0.864/0.927 | 0.840 |
izumi-lab/deberta-v2-base-japanese | 110M | 0.919/0.882 | 0.912 | - | 0.859 |
ku-nlp/deberta-v2-base-japanese | 112M | 0.922/0.886 | 0.922 | 0.899/0.951 | - |
ku-nlp/deberta-v3-base-japanese | 160M | 0.927/0.891 | 0.927 | 0.896/- | - |
globis-university/deberta-v3-japanese-base | 110M | 0.925/0.895 | 0.921 | 0.890/0.950 | 0.886 |
large | | | | | |
cl-tohoku/bert-large-japanese-v2 | 337M | 0.926/0.893 | 0.929 | 0.893/0.956 | 0.893 |
nlp-waseda/roberta-large-japanese | 337M | 0.930/0.896 | 0.924 | 0.884/0.940 | 0.907 |
nlp-waseda/roberta-large-japanese-seq512 | 337M | 0.926/0.892 | 0.926 | 0.918/0.963 | 0.891 |
ku-nlp/deberta-v2-large-japanese | 339M | 0.925/0.892 | 0.924 | 0.912/0.959 | - |
globis-university/deberta-v3-japanese-large | 352M | 0.928/0.896 | 0.924 | 0.896/0.956 | 0.900 |
License
CC BY-SA 4.0
Acknowledgement
We used ABCI for computing resources. Thank you.