hkust-nlp/preselect-fasttext-classifier

Text Classification

ksshumab committed 17 days ago

Commit 95d1d67 · verified · 1 Parent(s): 8f9c96c

Update README.md

Files changed (1)

README.md +42 -3


@@ -1,3 +1,42 @@
----
-license: mit
----

+---
+license: mit
+---
+## Model Summary
+This is a fastText-based binary classifier for identifying high-quality data in a pretraining corpus, introduced in the paper [Predictive Data Selection: The Data That Predicts Is the Data That Teaches]().
+It is also the classifier we used to build the [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset with a selection threshold of 10%.
+The positive and negative label names are `__label__1` and `__label__0`, respectively.
+## How to use
+You can run the filtering with any fastText model using the code repo of the paper, or simply use the following script:
+```python
+import argparse
+from pathlib import Path
+
+from datatrove.executor import LocalPipelineExecutor
+from datatrove.pipeline.filters import FastTextClassifierFilter
+from datatrove.pipeline.readers import JsonlReader
+from datatrove.pipeline.writers.jsonl import JsonlWriter
+
+parser = argparse.ArgumentParser("Filter")
+parser.add_argument("--input_path", type=str, required=True, help="input path name")
+parser.add_argument("--output_path", type=str, required=True, help="output name")
+args = parser.parse_args()
+
+# Create the output directory if it does not exist yet.
+Path(args.output_path).mkdir(parents=True, exist_ok=True)
+
+dist_executor = LocalPipelineExecutor(
+    skip_completed=False,
+    pipeline=[
+        # Read JSONL documents; the text is expected under the "text" key.
+        JsonlReader(args.input_path, text_key="text", default_metadata={}),
+        # Keep documents whose "__label__1" score is at least 0.5.
+        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
+        JsonlWriter(args.output_path, compression=None),
+    ],
+    tasks=100,
+)
+dist_executor.run()
+```
+## Training
+For more training details, refer to the paper. The training code is available on GitHub:
+[PreSelect](https://github.com/hkust-nlp/preselect).
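
To score individual documents without a datatrove pipeline, the label convention stated in the README can be applied directly to fastText predictions. The following is a minimal sketch, not the authors' code. It assumes the `fasttext` Python package and a local copy of `PreSelect-classifier.bin`. The helper names (`positive_score`, `keep_document`, `load_model`) are hypothetical, and collapsing whitespace before prediction is an assumption that may differ from how the datatrove filter preprocesses text.

```python
POSITIVE_LABEL = "__label__1"  # positive label name stated in the README


def positive_score(model, text):
    """Return the probability the model assigns to the positive label."""
    # fastText's predict() handles one line at a time and rejects newlines,
    # so collapse all whitespace runs (including newlines) to single spaces.
    cleaned = " ".join(text.split())
    # k=2 requests both labels of the binary classifier.
    labels, probs = model.predict(cleaned, k=2)
    scores = {label: float(prob) for label, prob in zip(labels, probs)}
    return scores.get(POSITIVE_LABEL, 0.0)


def keep_document(model, text, min_score=0.5):
    """Mirror keep_labels=[("1", 0.5)]: keep when the positive score >= min_score."""
    return positive_score(model, text) >= min_score


def load_model(model_path="PreSelect-classifier.bin"):
    """Load the classifier; fasttext is imported lazily so the helpers
    above can be used and tested without the package installed."""
    import fasttext

    return fasttext.load_model(model_path)
```

With the model file in place, `keep_document(load_model(), "some document text")` returns `True` when the document would pass the same 0.5 cutoff used in the script above.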