ptaszynski committed
Commit bcce228 · 1 Parent(s): 015ee18

Update README.md

Files changed (1)
  1. README.md +30 -11
README.md CHANGED
@@ -20,17 +20,9 @@ The original foundation model was originally pretrained on 5.6 billion words [YA

## Model architecture

- The original model was pretrained using ELECTRA Small model settings can be found here:
+ The original model was pretrained using ELECTRA Small model settings and can be found here:
[https://huggingface.co/ptaszynski/yacis-electra-small-japanese](https://huggingface.co/ptaszynski/yacis-electra-small-japanese)

- ## Training data and libraries
-
- YACIS-ELECTRA is trained on the whole of [YACIS](https://github.com/ptaszynski/yacis-corpus) blog corpus, which is a Japanese blog corpus containing 5.6 billion words in 354 million sentences.
-
- The corpus was originally split into sentences using custom rules, and each sentence was tokenized using [MeCab](https://taku910.github.io/mecab/). Subword tokenization for pretraining was done with WordPiece.
-
- We used original [ELECTRA](https://github.com/google-research/electra) repository for pretraining. The pretrainig process took 7 days and 6 hours under the following environment: CPU: Intel Core i9-7920X, RAM: 132 GB, GPU: GeForce GTX 1080 Ti x1.
-

## Licenses

@@ -40,7 +32,7 @@ The pretrained model with all attached files is distributed under the terms of t

## Citations

- Please, cite the model using the following citation.
+ Please, cite this model using the following citation.

```
@inproceedings{shibata2022yacis-electra,
@@ -55,9 +47,36 @@ Please, cite the model using the following citation.
}
```

+ The two datasets used for finetuning should be cited using the following references.

- The model was build using sentences from YACIS corpus, which should be cited using at least one of the following refrences.
+ - Harmful BBS Japanese comments dataset:
+ ```
+ @book{ptaszynski2018automatic,
+ title={Automatic Cyberbullying Detection: Emerging Research and Opportunities},
+ author={Ptaszynski, Michal E and Masui, Fumito},
+ year={2018},
+ publisher={IGI Global}
+ }
+ ```
+ ```
+ @article{松葉達明2009学校非公式サイトにおける有害情報検出,
+ title={学校非公式サイトにおける有害情報検出},
+ author={松葉達明 and 里見尚宏 and 桝井文人 and 河合敦夫 and 井須尚紀},
+ journal={電子情報通信学会技術研究報告. NLC, 言語理解とコミュニケーション},
+ volume={109},
+ number={142},
+ pages={93--98},
+ year={2009},
+ publisher={一般社団法人電子情報通信学会}
+ }
+ ```
+
+ - Twitter Japanese cyberbullying dataset:
+ ```
+ TBA
+ ```

+ The pretraining was done using YACIS corpus, which should be cited using at least one of the following references.
```
@inproceedings{ptaszynski2012yacis,
title={YACIS: A five-billion-word corpus of Japanese blogs fully annotated with syntactic and affective information},
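
For context, the hunks above point readers to the pretrained checkpoint at [https://huggingface.co/ptaszynski/yacis-electra-small-japanese](https://huggingface.co/ptaszynski/yacis-electra-small-japanese). Below is a minimal sketch (not part of the commit) of loading that checkpoint with the `transformers` library, assuming the repository hosts standard ELECTRA weights with a tokenizer config; MeCab-based Japanese tokenizers typically also need `fugashi` and a UniDic dictionary installed.

```python
# Sketch only: load the ELECTRA checkpoint referenced in the README diff above.
# Assumes `transformers` is installed and the Hugging Face repo provides standard
# ELECTRA weights plus tokenizer files (a MeCab-based Japanese tokenizer may
# additionally require `fugashi` and `unidic-lite`).
from transformers import AutoTokenizer, AutoModel

repo_id = "ptaszynski/yacis-electra-small-japanese"  # URL linked in the diff
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)

# Encode one Japanese sentence and inspect the encoder output shape.
inputs = tokenizer("これはテスト用の文です。", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
```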