Add preprint pdf

Files changed (3)

README.md CHANGED

@@ -22,7 +22,7 @@ This repository contains fastText classifiers for judging the educational value
 The Wiki-based classifier is distributed under the [CC BY-SA 4.0](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/CC_BY-SA_4.0.md) license, while the LLM-based classifier is distributed under the license of the LLM used for annotation ([Llama 3.1 Community License Agreement](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/LLAMA_3.1_COMMUNITY_LICENSE_AGREEMENT.md) or [Gemma Terms of Use](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/GEMMA_TERMS_OF_USE.md)).
-These classifiers were developed as part of a quality-filtering process for the \*Swallow Corpus Version 2, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. Our ablation experiments have shown that applying a filter based on the classifier’s scores improved the LLM’s ability related to Japanese knowledge.
+These classifiers were employed for quality-filtering process in the Swallow Corpus Version 2\*, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. Our experiments demonstrated that applying filtering based on the classifier’s scores enabled more effective improvements in the LLM’s Japanese knowledge, even with the same computational resources.
 \* A large Japanese web corpus extracted from Common Crawl
@@ -85,7 +85,7 @@ This research is based on results obtained from a project, JPNP18002, commission
 ## Citation
-(Japanese only)
+The preprint is available [here](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/swallow-corpus-v2.pdf) (Japanese only)
 ```bibtex
 @inproceedings{hattori-2025-swallow-v2,

README_ja.md CHANGED

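@@ -45,7 +45,7 @@ edu_score = sum([int(label[-1]) * prob for label, prob in zip(res[0], res[1])])
 ### Best practices
-Our research confirmed that both classifiers are effective, but we recommend the LLM-based classifier when you want to assign appropriate scores to a diverse range of documents. Because the Wiki-based classifier measures how Wikipedia-like a document is, the range of documents it judges useful is limited, and it tends to assign scores near 0 to most documents. In contrast, the LLM-based classifier is based more on a definition of educational value and can score a broader range of documents.
+Our research confirmed that both classifiers are effective, but we recommend the LLM-based classifier when you want to assign appropriate scores to a diverse range of documents. Because the Wiki-based classifier measures how Wikipedia-like a document is, the range of documents it judges useful is limited, and it tends to assign scores near 0 to most documents. In contrast, the LLM-based classifier is based on a general definition of educational value and can score a broader range of documents.
 ## Training
@@ -69,7 +69,7 @@ Wikipedia articles are regarded as positive examples of educational documents, and a classifier is built
 ## Citation
-(Japanese only)
+A preprint of the manuscript is available [here](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/swallow-corpus-v2.pdf) (Japanese only)
 ```bibtex
 @inproceedings{hattori-2025-swallow-v2,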

swallow-corpus-v2.pdf ADDED

Binary file (498 kB).