Commit bcce228
Parent(s): 015ee18
Update README.md

README.md CHANGED
@@ -20,17 +20,9 @@ The original foundation model was originally pretrained on 5.6 billion words [YA
 
 ## Model architecture
 
-The original model was pretrained using ELECTRA Small model settings can be found here:
+The original model was pretrained using ELECTRA Small model settings and can be found here:
 [https://huggingface.co/ptaszynski/yacis-electra-small-japanese](https://huggingface.co/ptaszynski/yacis-electra-small-japanese)
 
-## Training data and libraries
-
-YACIS-ELECTRA is trained on the whole [YACIS](https://github.com/ptaszynski/yacis-corpus) blog corpus, a Japanese blog corpus containing 5.6 billion words in 354 million sentences.
-
-The corpus was originally split into sentences using custom rules, and each sentence was tokenized using [MeCab](https://taku910.github.io/mecab/). Subword tokenization for pretraining was done with WordPiece.
-
-We used the original [ELECTRA](https://github.com/google-research/electra) repository for pretraining. The pretraining process took 7 days and 6 hours under the following environment: CPU: Intel Core i9-7920X, RAM: 132 GB, GPU: GeForce GTX 1080 Ti x1.
-
 
 ## Licenses
 
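As an aside to the architecture hunk above, here is a minimal sketch of loading the linked foundation checkpoint with the Hugging Face transformers library. It assumes the repository ships standard auto-class configuration files and that transformers and torch are installed; since the README mentions MeCab word segmentation, the tokenizer may additionally require a MeCab binding such as fugashi.

```
# Minimal sketch (illustration only, not from the original README):
# load the foundation model linked above, assuming standard auto-class configs.
from transformers import AutoTokenizer, AutoModel

model_id = "ptaszynski/yacis-electra-small-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode one Japanese sentence and inspect the encoder output.
inputs = tokenizer("夜空に星が瞬いている。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```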
@@ -40,7 +32,7 @@ The pretrained model with all attached files is distributed under the terms of t
 
 ## Citations
 
-Please, cite the model using the following citation.
+Please, cite this model using the following citation.
 
 ```
 @inproceedings{shibata2022yacis-electra,
@@ -55,9 +47,36 @@ Please, cite the model using the following citation.
 }
 ```
 
+The two datasets used for finetuning should be cited using the following references.
 
-
+- Harmful BBS Japanese comments dataset:
+```
+@book{ptaszynski2018automatic,
+title={Automatic Cyberbullying Detection: Emerging Research and Opportunities},
+author={Ptaszynski, Michal E and Masui, Fumito},
+year={2018},
+publisher={IGI Global}
+}
+```
+```
+@article{松葉達明2009学校非公式サイトにおける有害情報検出,
+title={学校非公式サイトにおける有害情報検出},
+author={松葉達明 and 里見尚宏 and 桝井文人 and 河合敦夫 and 井須尚紀},
+journal={電子情報通信学会技術研究報告. NLC, 言語理解とコミュニケーション},
+volume={109},
+number={142},
+pages={93--98},
+year={2009},
+publisher={一般社団法人電子情報通信学会}
+}
+```
+
+- Twitter Japanese cyberbullying dataset:
+```
+TBA
+```
 
+The pretraining was done using the YACIS corpus, which should be cited using at least one of the following references.
 ```
 @inproceedings{ptaszynski2012yacis,
 title={YACIS: A five-billion-word corpus of Japanese blogs fully annotated with syntactic and affective information},
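The citations added in the last hunk indicate that this checkpoint was finetuned on two cyberbullying datasets for text classification. The snippet below is an illustrative sketch only, not confirmed by this README: the model ID is a placeholder to be replaced with this repository's actual name, and the label set depends on how the classification head was configured.

```
# Illustrative sketch with a placeholder model ID; substitute the actual
# Hugging Face repository name of this finetuned checkpoint before running.
from transformers import pipeline

classifier = pipeline("text-classification", model="<this-model-repo-id>")
print(classifier("今日はいい天気ですね。"))  # e.g. [{'label': '...', 'score': 0.99}]
```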