---
license: mit
pipeline_tag: text-classification
library_name: fasttext
---

📑 Paper    |    🔨 fastText Classifier    |    🤗 Released Dataset    |    📦 Repo

Model Summary

This is a fastText-based binary classifier for identifying high-quality data in a pretraining corpus, introduced in the paper Predictive Data Selection: The Data That Predicts Is the Data That Teaches. It is also the classifier we used to build the PreSelect-100B dataset, with a selection threshold of 10%. The positive and negative label names are "__label__1" and "__label__0", respectively.
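
For quick, standalone scoring you can load the classifier with the fasttext library directly. A minimal sketch (the filename PreSelect-classifier.bin matches the filtering example below; the sample text is purely illustrative):

import fasttext

# Load the released classifier weights
model = fasttext.load_model("PreSelect-classifier.bin")

# fastText scores one line at a time, so strip newlines first
text = "An example pretraining document to score.".replace("\n", " ")
labels, scores = model.predict(text)
print(labels, scores)  # e.g. (('__label__1',), array([0.87]))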

How to use

You can refer to the paper's code repo to run the filtering directly with any fastText model, or simply use the datatrove pipeline below:

import argparse
from pathlib import Path

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

parser = argparse.ArgumentParser("Filter")
parser.add_argument("--input_path", type=str, help="path to the input JSONL data")
parser.add_argument("--output_path", type=str, help="path for the filtered output")
args = parser.parse_args()

Path(args.output_path).mkdir(parents=True, exist_ok=True)

dist_executor = LocalPipelineExecutor(
    skip_completed=False,
    pipeline=[
        # Read JSONL documents; the text to classify is under the "text" key
        JsonlReader(args.input_path, text_key="text", default_metadata={}),
        # Keep documents whose "__label__1" score is at least 0.5
        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
        # Write the surviving documents as uncompressed JSONL
        JsonlWriter(args.output_path, compression=None),
    ],
    tasks=100,
)
dist_executor.run()
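
Assuming the script above is saved as, say, filter.py, run it with --input_path pointing at a folder of JSONL files and --output_path for the filtered output. The keep_labels=[("1", 0.5)] argument keeps only documents whose "__label__1" score is at least 0.5, and tasks=100 shards the input files across 100 local tasks.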

Training

For more training details, refer to the paper; the training code is available in the PreSelect GitHub repository.
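
For reference, supervised fastText training itself looks like the following minimal sketch. The hyperparameters are illustrative placeholders, not the paper's recipe (see the PreSelect repo for that), and train.txt is a hypothetical file with one document per line, prefixed with __label__1 (high quality) or __label__0:

import fasttext

# Train a binary quality classifier; all hyperparameters are illustrative
model = fasttext.train_supervised(
    input="train.txt",  # hypothetical file: "__label__1 <document text>" per line
    lr=0.1,
    epoch=3,
    wordNgrams=2,
)
model.save_model("PreSelect-classifier.bin")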

Citation

If you find this work helpful, please cite it as:

@article{shum2025predictivedataselectiondata,
  title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches},
  author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
  journal={arXiv preprint arXiv:2503.00808},
  year={2025},
  eprint={2503.00808},
}