---
license: mit
---

Model Summary

This is a fastText-based binary classifier for identifying high-quality data in pretraining corpora, introduced in the paper Predictive Data Selection: The Data That Predicts Is the Data That Teaches. It is also the classifier used to build the PreSelect-100B dataset, with a selection threshold of 10%. The positive and negative label names are "__label__1" and "__label__0", respectively.
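
To see what these labels look like in practice, you can score individual documents directly with the fasttext library. This is a minimal sketch, assuming the model file has been downloaded locally as PreSelect-classifier.bin (e.g. via huggingface_hub); the example text is illustrative:

import fasttext

# Assumes PreSelect-classifier.bin has already been downloaded from this repository
model = fasttext.load_model("PreSelect-classifier.bin")

text = "Example document text to score."
# fastText expects a single line of input, so strip newlines before predicting
labels, scores = model.predict(text.replace("\n", " "))
print(labels[0], scores[0])  # e.g. "__label__1" with its probability for a high-quality document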

How to use

You can refer to the paper's code repository to run the filtering directly with any fastText model, or simply use the following script:

import argparse
from pathlib import Path

from datatrove.executor import LocalPipelineExecutor
from datatrove.pipeline.filters import FastTextClassifierFilter
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

parser = argparse.ArgumentParser("Filter")
parser.add_argument("--input_path", type=str, help="input path name")
parser.add_argument("--output_path", type=str, help="output path name")
args = parser.parse_args()

Path(args.output_path).mkdir(parents=True, exist_ok=True)

dist_executor = LocalPipelineExecutor(
    skip_completed=False,
    pipeline=[
        # Read input JSONL documents, taking the document body from the "text" field
        JsonlReader(args.input_path, text_key="text", default_metadata={}),
        # Keep documents whose score for label "1" (high quality) is at least 0.5
        FastTextClassifierFilter("PreSelect-classifier.bin", keep_labels=[("1", 0.5)]),
        # Write the kept documents back out as uncompressed JSONL
        JsonlWriter(args.output_path, compression=None),
    ],
    tasks=100,
)
dist_executor.run()
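
A hypothetical invocation of the script above, assuming it is saved as filter.py and that PreSelect-classifier.bin from this repository sits in the working directory (the file names and paths are illustrative):

python filter.py --input_path /path/to/input.jsonl --output_path /path/to/output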

Training

For more training details, refer to the paper; the training code is available in the PreSelect repository on GitHub.
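
For reference, classifiers in this fastText label format are typically trained with fasttext.train_supervised. The sketch below only illustrates the general recipe under assumed settings; the file name and hyperparameters are not the ones used in the paper:

import fasttext

# train.txt: one document per line, prefixed with "__label__1 " or "__label__0 "
# Hyperparameters here are illustrative, not the paper's settings.
model = fasttext.train_supervised(input="train.txt", lr=0.1, epoch=3, wordNgrams=2)
model.save_model("PreSelect-classifier.bin")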