Commit 95d1d67 (verified), committed by ksshumab
1 Parent(s): 8f9c96c

Update README.md

Files changed (1)
  1. README.md +42 -3
README.md CHANGED
@@ -1,3 +1,42 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ ---
+ ## Model Summary
+ This is a fastText-based binary classifier for identifying high-quality data in a pretraining corpus, introduced in the paper [Predictive Data Selection: The Data That Predicts Is the Data That Teaches]().
+ It is also the classifier we used to build the [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset with a selection threshold of 10%.
+ The positive and negative label names are `__label__1` and `__label__0`, respectively.
+
+ ## How to use
+ You can run the filtering with any fastText model using the code repo of the paper, or simply use the following script:
+
+ ```python
+ import argparse
+ from pathlib import Path
+
+ from datatrove.executor import LocalPipelineExecutor
+ from datatrove.pipeline.filters import FastTextClassifierFilter
+ from datatrove.pipeline.readers import JsonlReader
+ from datatrove.pipeline.writers.jsonl import JsonlWriter
+
+ parser = argparse.ArgumentParser("Filter")
+ parser.add_argument("--input_path", type=str, required=True, help="input path name")
+ parser.add_argument("--output_path", type=str, required=True, help="output name")
+ args = parser.parse_args()
+ Path(args.output_path).mkdir(parents=True, exist_ok=True)  # create the output directory
+
+ dist_executor = LocalPipelineExecutor(
+     skip_completed=False,
+     pipeline=[
+         JsonlReader(args.input_path, text_key="text", default_metadata={}),
+         # Keep documents scored as "__label__1" with a 0.5 score threshold.
+         FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
+         JsonlWriter(args.output_path, compression=None),
+     ],
+     tasks=100,
+ )
+ dist_executor.run()
+ ```
+
+ ## Training
+ For more training details, please refer to the paper. The training code is available on GitHub:
+ [PreSelect](https://github.com/hkust-nlp/preselect).
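
For spot-checking individual documents outside a datatrove pipeline, the classifier can presumably also be queried directly through the `fasttext` Python package. The sketch below is not part of this commit or of the PreSelect repo. It assumes the model file is saved locally as `PreSelect-classifier.bin` (the name used in the README script), and the helper names `positive_score`, `score_text`, and `load_and_score` are hypothetical.

```python
POSITIVE_LABEL = "__label__1"  # positive label name stated in the README


def positive_score(labels, probs):
    """Return the probability assigned to POSITIVE_LABEL, or 0.0 if absent."""
    for label, prob in zip(labels, probs):
        if label == POSITIVE_LABEL:
            return float(prob)
    return 0.0


def score_text(model, text):
    """Score one document with an already loaded fastText model."""
    # fastText's predict() processes one line at a time, so flatten newlines.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return positive_score(labels, probs)


def load_and_score(model_path, text):
    """Load the classifier from model_path (assumed local file) and score text."""
    import fasttext  # imported lazily; requires `pip install fasttext`

    model = fasttext.load_model(model_path)
    return score_text(model, text)
```

Calling `load_and_score("PreSelect-classifier.bin", "some document text")` would return the `__label__1` probability for that text. Comparing it against 0.5 mirrors the threshold passed to `FastTextClassifierFilter` in the README script, though datatrove's own newline handling may differ from the simple replacement used here.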