Commit bcce228
Parent(s): 015ee18
Update README.md

README.md CHANGED
@@ -20,17 +20,9 @@ The original foundation model was originally pretrained on 5.6 billion words [YA
 
 ## Model architecture
 
-The original model was pretrained using ELECTRA Small model settings can be found here:
+The original model was pretrained using ELECTRA Small model settings and can be found here:
 [https://huggingface.co/ptaszynski/yacis-electra-small-japanese](https://huggingface.co/ptaszynski/yacis-electra-small-japanese)
 
-## Training data and libraries
-
-YACIS-ELECTRA is trained on the whole [YACIS](https://github.com/ptaszynski/yacis-corpus) blog corpus, a Japanese blog corpus containing 5.6 billion words in 354 million sentences.
-
-The corpus was originally split into sentences using custom rules, and each sentence was tokenized using [MeCab](https://taku910.github.io/mecab/). Subword tokenization for pretraining was done with WordPiece.
-
-We used the original [ELECTRA](https://github.com/google-research/electra) repository for pretraining. The pretraining process took 7 days and 6 hours under the following environment: CPU: Intel Core i9-7920X, RAM: 132 GB, GPU: GeForce GTX 1080 Ti x1.
-
 
 ## Licenses
 
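As an aside to the architecture hunk above, here is a minimal sketch of loading the linked foundation checkpoint with the Hugging Face transformers library. It assumes the repository ships standard auto-class configuration files and that transformers and torch are installed; since the README mentions MeCab word segmentation, the tokenizer may additionally require a MeCab binding such as fugashi.

```
# Minimal sketch (illustration only, not from the original README):
# load the foundation model linked above, assuming standard auto-class configs.
from transformers import AutoTokenizer, AutoModel

model_id = "ptaszynski/yacis-electra-small-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode one Japanese sentence and inspect the encoder output.
inputs = tokenizer("夜空に星が瞬いている。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```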
@@ -40,7 +32,7 @@ The pretrained model with all attached files is distributed under the terms of t
 
 ## Citations
 
-Please, cite the model using the following citation.
+Please, cite this model using the following citation.
 
 ```
 @inproceedings{shibata2022yacis-electra,
@@ -55,9 +47,36 @@ Please, cite the model using the following citation.
 }
 ```
 
+The two datasets used for finetuning should be cited using the following references.
 
-
+- Harmful BBS Japanese comments dataset:
+```
+@book{ptaszynski2018automatic,
+title={Automatic Cyberbullying Detection: Emerging Research and Opportunities},
+author={Ptaszynski, Michal E and Masui, Fumito},
+year={2018},
+publisher={IGI Global}
+}
+```
+```
+@article{松葉達明2009学校非公式サイトにおける有害情報検出,
+title={学校非公式サイトにおける有害情報検出},
+author={松葉達明 and 里見尚宏 and 桝井文人 and 河合敦夫 and 井須尚紀},
+journal={電子情報通信学会技術研究報告. NLC, 言語理解とコミュニケーション},
+volume={109},
+number={142},
+pages={93--98},
+year={2009},
+publisher={一般社団法人電子情報通信学会}
+}
+```
+
+- Twitter Japanese cyberbullying dataset:
+```
+TBA
+```
 
+The pretraining was done using the YACIS corpus, which should be cited using at least one of the following references.
 ```
 @inproceedings{ptaszynski2012yacis,
 title={YACIS: A five-billion-word corpus of Japanese blogs fully annotated with syntactic and affective information},
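The citations added in the last hunk indicate that this checkpoint was finetuned on two cyberbullying datasets for text classification. The snippet below is an illustrative sketch only, not confirmed by this README: the model ID is a placeholder to be replaced with this repository's actual name, and the label set depends on how the classification head was configured.

```
# Illustrative sketch with a placeholder model ID; substitute the actual
# Hugging Face repository name of this finetuned checkpoint before running.
from transformers import pipeline

classifier = pipeline("text-classification", model="<this-model-repo-id>")
print(classifier("今日はいい天気ですね。"))  # e.g. [{'label': '...', 'score': 0.99}]
```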