Add preprint pdf
- README.md +2 -2
- README_ja.md +2 -2
- swallow-corpus-v2.pdf +0 -0
README.md
CHANGED
@@ -22,7 +22,7 @@ This repository contains fastText classifiers for judging the educational value
 
 The Wiki-based classifier is distributed under the [CC BY-SA 4.0](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/CC_BY-SA_4.0.md) license, while the LLM-based classifier is distributed under the license of the LLM used for annotation ([Llama 3.1 Community License Agreement](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/LLAMA_3.1_COMMUNITY_LICENSE_AGREEMENT.md) or [Gemma Terms of Use](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/GEMMA_TERMS_OF_USE.md)).
 
-These classifiers were
+These classifiers were employed for the quality-filtering process in the Swallow Corpus Version 2\*, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. Our experiments demonstrated that applying filtering based on the classifier’s scores enabled more effective improvements in the LLM’s Japanese knowledge, even with the same computational resources.
 
 \* A large Japanese web corpus extracted from Common Crawl
 
@@ -85,7 +85,7 @@ This research is based on results obtained from a project, JPNP18002, commission
 
 ## Citation
 
-(Japanese only)
+The preprint is available [here](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/swallow-corpus-v2.pdf) (Japanese only)
 
 ```bibtex
 @inproceedings{hattori-2025-swallow-v2,
README_ja.md
CHANGED
@@ -45,7 +45,7 @@ edu_score = sum([int(label[-1]) * prob for label, prob in zip(res[0], res[1])])
 
 ### Best practices
 
-Our research confirmed that both classifiers are effective, but we recommend the LLM-based classifier when you want to assign appropriate scores to a wide variety of documents. Because the Wiki-based classifier measures how Wikipedia-like a document is, the range of documents it judges useful is limited, and it tends to assign scores near 0 to most documents. In contrast, the LLM-based classifier
+Our research confirmed that both classifiers are effective, but we recommend the LLM-based classifier when you want to assign appropriate scores to a wide variety of documents. Because the Wiki-based classifier measures how Wikipedia-like a document is, the range of documents it judges useful is limited, and it tends to assign scores near 0 to most documents. In contrast, the LLM-based classifier is based on a general definition of educational value and can score a broader range of documents.
 
 ## Training
 
@@ -69,7 +69,7 @@ Treating Wikipedia articles as positive examples of educational documents, we build a classifier
 
 ## Citation
 
-(Japanese only)
+The preprint of the manuscript is available [here](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/swallow-corpus-v2.pdf) (Japanese only)
 
 ```bibtex
 @inproceedings{hattori-2025-swallow-v2,
swallow-corpus-v2.pdf
ADDED
Binary file (498 kB).
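
For readers following the classifier-score discussion in this commit: the README_ja.md hunk header quotes the expression `edu_score = sum([int(label[-1]) * prob for label, prob in zip(res[0], res[1])])`. The sketch below illustrates that calculation on its own, without loading a model. The function name `expected_edu_score`, the label names `__label__0` to `__label__3`, and the probabilities are illustrative assumptions; only the expression itself comes from the diff.

```python
from typing import Sequence, Tuple


def expected_edu_score(res: Tuple[Sequence[str], Sequence[float]]) -> float:
    """Probability-weighted sum of the class index read from each label's last character.

    `res` is assumed to be a (labels, probabilities) pair, which matches how
    the quoted expression indexes `res[0]` and `res[1]`.
    """
    labels, probs = res
    return sum(int(label[-1]) * prob for label, prob in zip(labels, probs))


# Hypothetical prediction: label names and probabilities are made up for illustration.
res = (
    ("__label__0", "__label__1", "__label__2", "__label__3"),
    (0.1, 0.2, 0.3, 0.4),
)
score = expected_edu_score(res)
print(f"edu_score = {score:.2f}")
```

With these made-up probabilities the score is the weighted sum 0·0.1 + 1·0.2 + 2·0.3 + 3·0.4, so a document whose probability mass sits on higher labels receives a higher score. Note that `int(label[-1])` only works when every label ends in a single digit.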