aya-se committed on
Commit a2556c1 · 1 Parent(s): bbe8711

Add preprint pdf

Files changed (3)
  1. README.md +2 -2
  2. README_ja.md +2 -2
  3. swallow-corpus-v2.pdf +0 -0
README.md CHANGED
@@ -22,7 +22,7 @@ This repository contains fastText classifiers for judging the educational value
 
  The Wiki-based classifier is distributed under the [CC BY-SA 4.0](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/CC_BY-SA_4.0.md) license, while the LLM-based classifier is distributed under the license of the LLM used for annotation ([Llama 3.1 Community License Agreement](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/LLAMA_3.1_COMMUNITY_LICENSE_AGREEMENT.md) or [Gemma Terms of Use](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/GEMMA_TERMS_OF_USE.md)).
 
- These classifiers were developed as part of a quality-filtering process for the \*Swallow Corpus Version 2, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. Our ablation experiments have shown that applying a filter based on the classifier’s scores improved the LLM’s ability related to Japanese knowledge.
+ These classifiers were employed for quality-filtering process in the Swallow Corpus Version 2\*, which was used to train the [Llama 3.1 Swallow](https://huggingface.co/collections/tokyotech-llm/llama-31-swallow-66fd4f7da32705cadd1d5bc6) series. Our experiments demonstrated that applying filtering based on the classifier’s scores enabled more effective improvements in the LLM’s Japanese knowledge, even with the same computational resources.
 
  \* A large Japanese web corpus extracted from Common Crawl
 
@@ -85,7 +85,7 @@ This research is based on results obtained from a project, JPNP18002, commission
 
  ## Citation
 
- (Japanese only)
+ The preprint is available [here](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/swallow-corpus-v2.pdf) (Japanese only)
 
  ```bibtex
  @inproceedings{hattori-2025-swallow-v2,
README_ja.md CHANGED
@@ -45,7 +45,7 @@ edu_score = sum([int(label[-1]) * prob for label, prob in zip(res[0], res[1])])
 
  ### Best practices
 
- Our research confirmed that both classifiers are effective, but we recommend the LLM-based classifier when you want to assign appropriate scores to a diverse range of documents. Because the Wiki-based classifier measures how Wikipedia-like a document is, the range of documents it judges useful is limited, and it tends to assign scores near 0 to most documents. The LLM-based classifier, by contrast, is based more on a definition of educational value and can score a wider range of documents.
+ Our research confirmed that both classifiers are effective, but we recommend the LLM-based classifier when you want to assign appropriate scores to a diverse range of documents. Because the Wiki-based classifier measures how Wikipedia-like a document is, the range of documents it judges useful is limited, and it tends to assign scores near 0 to most documents. The LLM-based classifier, by contrast, is based on a general definition of educational value and can score a wider range of documents.
 
  ## Training
 
@@ -69,7 +69,7 @@ Treating Wikipedia articles as positive examples of educational documents, a classifier is built
 
  ## Citation
 
- (Japanese only)
+ The preprint of the manuscript is available [here](https://huggingface.co/tokyotech-llm/edu-classifier/blob/main/swallow-corpus-v2.pdf) (Japanese only)
 
  ```bibtex
  @inproceedings{hattori-2025-swallow-v2,
swallow-corpus-v2.pdf ADDED
Binary file (498 kB).
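
Note on the score expression quoted in the first README_ja.md hunk header: it appears to compute a probability-weighted average of the digit that ends each predicted label. The sketch below reproduces that arithmetic on a hand-written stand-in for the prediction result. The label names, the probabilities, and the helper name `expected_score` are assumptions made for illustration and are not taken from the repository; in real use `res` would presumably come from the classifier's prediction call.

```python
# Hypothetical stand-in for a prediction result: a pair of
# (labels, probabilities). Names and values are illustrative only.
res = (("__label__0", "__label__1", "__label__2"), (0.5, 0.25, 0.25))


def expected_score(res):
    """Weight the trailing digit of each label by its probability and sum."""
    labels, probs = res
    return sum(int(label[-1]) * prob for label, prob in zip(labels, probs))


edu_score = expected_score(res)
print(edu_score)  # 0*0.5 + 1*0.25 + 2*0.25 = 0.75
```

Because only the last character of each label is read, this form of the expression works only when every label ends in a single digit.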