ksshumab committed on
Commit 9254b54 · verified · 1 Parent(s): ad447eb

Update README.md

Files changed (1)
  1. README.md +18 -0
README.md CHANGED
@@ -1,6 +1,12 @@
  ---
  license: mit
  ---
+ <p align="center">
+ 📑 <a href="https://arxiv.org/abs/2503.00808" target="_blank">Paper</a> &nbsp&nbsp | &nbsp&nbsp 🔨 <a href="https://huggingface.co/hkust-nlp/preselect-fasttext-classifier" target="_blank">fastText Classifier</a> &nbsp&nbsp | &nbsp&nbsp 🤗 <a href="https://huggingface.co/datasets/hkust-nlp/PreSelect-100B" target="_blank">Released Dataset</a> &nbsp&nbsp | &nbsp&nbsp 📦 <a href="https://github.com/hkust-nlp/PreSelect" target="_blank">Repo</a>
+ <br>
+ </p>
+
+
  ## Model Summary
  This is a fastText-based binary classifier for identifying high-quality data in the pretraining corpus introduced in paper: [Predictive Data Selection: The Data That Predicts Is the Data That Teaches
  ](). And this is also the classifier we used to build [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset with a selection threshold of 10%.
@@ -40,3 +46,15 @@ dist_executor.run()
  ## Training
  For more training details, you can refer to the paper and the training code is available on GitHub
  [PreSelect](https://github.com/hkust-nlp/preselect).
+
+ ## Citation
+ If you find this work helpful, please kindly cite as:
+ ```
+ @article{shum2025predictivedataselectiondata,
+ title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches},
+ author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
+ journal={arXiv preprint arXiv:2503.00808},
+ year={2025},
+ eprint={2503.00808},
+ }
+ ```