---
license: mit
---

## Model Summary

This is a fastText-based binary classifier for identifying high-quality data in a pretraining corpus, introduced in the paper [Predictive Data Selection: The Data That Predicts Is the Data That Teaches](). It is also the classifier we used to build the [PreSelect-100B](https://huggingface.co/datasets/hkust-nlp/PreSelect-100B) dataset with a selection threshold of 10%.

The positive and negative label names are `__label__1` and `__label__0`, respectively.
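
As a rough illustration of how these labels can be read, the sketch below scores a single document with the `fasttext` Python package. It assumes the package is installed in a build compatible with your NumPy version, and that the model file is saved locally as `PreSelect-classifier.bin` (the file name used in the How to use section). The helper names `positive_score` and `score_text` are hypothetical and are not part of the paper's code.

```python
def positive_score(labels, probs, positive_label="__label__1"):
    """Return the probability assigned to the positive label (0.0 if absent)."""
    for label, prob in zip(labels, probs):
        if label == positive_label:
            return float(prob)
    return 0.0


def score_text(text, model_path="PreSelect-classifier.bin"):
    """Score one document; the model path is an assumption, adjust as needed."""
    import fasttext  # imported lazily so positive_score works without it

    # Note: this reloads the model on every call; cache it for bulk scoring.
    model = fasttext.load_model(model_path)
    # fastText predicts on a single line, so newlines are replaced first.
    labels, probs = model.predict(text.replace("\n", " "), k=-1)
    return positive_score(labels, probs)
```

Calling `score_text("some document text")` would then return the probability of `__label__1` for that document, with higher values indicating text the classifier considers higher quality.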

## How to use

You can use the paper's code repository to run the filtering with any fastText model, or simply run the following script:

```python
import argparse
from pathlib import Path

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

parser = argparse.ArgumentParser("Filter")
parser.add_argument("--input_path", type=str, required=True, help="input path name")
parser.add_argument("--output_path", type=str, required=True, help="output name")
args = parser.parse_args()

# Create the output directory if it does not exist yet.
Path(args.output_path).mkdir(parents=True, exist_ok=True)

dist_executor = LocalPipelineExecutor(
    skip_completed=False,
    pipeline=[
        # Read JSONL documents, taking the document text from the "text" field.
        JsonlReader(args.input_path, text_key="text", default_metadata={}),
        # Keep documents whose "__label__1" score is at least 0.5.
        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
        # Write the kept documents as uncompressed JSONL.
        JsonlWriter(args.output_path, compression=None),
    ],
    tasks=100,
)
dist_executor.run()
```

## Training

For more training details, refer to the paper. The training code is available on GitHub: [PreSelect](https://github.com/hkust-nlp/preselect).