license: mit
pipeline_tag: TEXT_CLASSIFICATION
library_name: fasttext
Paper | fastText Classifier | Released Dataset | Repo
Model Summary
This is a fastText-based binary classifier for identifying high-quality data in a pretraining corpus, introduced in the paper Predictive Data Selection: The Data That Predicts Is the Data That Teaches. It is also the classifier we used to build the PreSelect-100B dataset with a selection threshold of 10%. The positive and negative label names are "__label__1" and "__label__0", respectively.
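As a minimal sketch of how the two labels can be read back from the model, the snippet below extracts the probability assigned to "__label__1" for a single document. It assumes the `fasttext` Python package is installed and that the classifier has been downloaded locally as `PreSelect-classifier.bin`; the helper names `positive_score`, `score_text`, and `load_and_score` are hypothetical and not part of the released code.

```python
from typing import Sequence

POSITIVE_LABEL = "__label__1"  # positive label name stated in the model summary


def positive_score(labels: Sequence[str], probs: Sequence[float]) -> float:
    """Return the probability attached to the positive label, or 0.0 if it is absent."""
    for label, prob in zip(labels, probs):
        if label == POSITIVE_LABEL:
            return float(prob)
    return 0.0


def score_text(model, text: str) -> float:
    """Score one document with an already loaded fastText model."""
    # fastText's predict() processes one line at a time, so newlines are replaced.
    labels, probs = model.predict(text.replace("\n", " "), k=2)
    return positive_score(labels, probs)


def load_and_score(model_path: str, text: str) -> float:
    """Load the classifier from disk and score one document (hypothetical helper)."""
    import fasttext  # imported lazily; requires `pip install fasttext`

    model = fasttext.load_model(model_path)
    return score_text(model, text)
```

For example, `load_and_score("PreSelect-classifier.bin", "Some document text.")` would return a value between 0 and 1, where higher values mean the classifier leans toward "__label__1".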
How to use
To run the filtering with any fastText model, refer to the paper's code repo, or simply use the following datatrove script:
import argparse
from pathlib import Path

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

parser = argparse.ArgumentParser("Filter")
parser.add_argument("--input_path", type=str, help="input path name")
parser.add_argument("--output_path", type=str, help="output name")
args = parser.parse_args()

# Make sure the output directory exists before the pipeline runs.
Path(args.output_path).mkdir(parents=True, exist_ok=True)

dist_executor = LocalPipelineExecutor(
    skip_completed=False,
    pipeline=[
        # Read JSONL documents, taking the document body from the "text" field.
        JsonlReader(args.input_path, text_key="text", default_metadata={}),
        # Keep documents whose label "1" (i.e. "__label__1") score meets the 0.5 minimum.
        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
        # Write the surviving documents as uncompressed JSONL.
        JsonlWriter(args.output_path, compression=None),
    ],
    tasks=100,
)
dist_executor.run()
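The script above filters with a fixed score cutoff of 0.5, while the model summary mentions a selection threshold of 10%. If that threshold is read as keeping the highest-scoring 10% of documents (an assumption made here, not something this card spells out), a rank-based selection could be sketched as follows. The function name `top_fraction_indices` is hypothetical, and the scores are assumed to be "__label__1" probabilities computed beforehand.

```python
import math
from typing import List, Sequence


def top_fraction_indices(scores: Sequence[float], fraction: float = 0.10) -> List[int]:
    """Return the indices of the highest-scoring `fraction` of documents.

    The returned indices are sorted by original position so that the
    selected documents keep their input order.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Round up so that a non-empty input always keeps at least one document.
    keep = math.ceil(len(scores) * fraction)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

Unlike a fixed probability cutoff, this approach needs the scores for the whole corpus (or a representative sample) before any document can be selected, since the cutoff depends on the score distribution.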
Training
For more training details, refer to the paper. The training code is available on GitHub in the PreSelect repository.
Citation
If you find this work helpful, please cite it as:
@article{shum2025predictivedataselectiondata,
title={Predictive Data Selection: The Data That Predicts Is the Data That Teaches},
author={Kashun Shum and Yuzhen Huang and Hongjian Zou and Ding Qi and Yixuan Liao and Xiaoxin Chen and Qian Liu and Junxian He},
journal={arXiv preprint arXiv:2503.00808},
year={2025},
eprint={2503.00808},
}