import gradio as gr

with gr.Blocks() as app:
    with gr.Row():
        with gr.Column(scale=1):
            pass
        with gr.Column(scale=3):
            gr.HTML(
                """
Distilabel Synthetic Data Generator is an experimental tool that allows you to easily create high-quality datasets for training and fine-tuning language models. It leverages the power of distilabel and advanced language models to generate synthetic data tailored to your specific needs.
This tool simplifies the process of creating custom datasets, enabling you to:
By using Distilabel Synthetic Data Generator, you can rapidly prototype and create datasets, accelerating your AI development process.
The current implementation is based on Free Serverless Hugging Face Inference Endpoints. They are rate-limited but free for anyone on the Hugging Face Hub to use. You can reuse the underlying pipeline to generate data with other distilabel LLM integrations.
Yes, you can run this locally by cloning the Space, installing the requirements with `pip install -r requirements.txt`, and running `python app.py`. Alternatively, you can install the distilabel library with `pip install distilabel[hf-inference-endpoints]` and use the pipeline code at the bottom of each application tab. Distilabel also supports running the pipeline with other LLMs. Make sure you have a valid Hugging Face token that allows calling serverless inference endpoints and creating datasets on the Hugging Face Hub.
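When running locally, it can help to fail early if no token is configured. The following is a minimal sketch, not part of the Space itself: it assumes the token is exposed through the `HF_TOKEN` environment variable, and the helper name `require_hf_token` is hypothetical. It only checks that a token is present; it does not verify the token's permissions.

```python
import os


def require_hf_token(env=None):
    # Hypothetical helper: return the Hugging Face token or raise early.
    # Assumption: the token is stored in the HF_TOKEN environment variable.
    if env is None:
        env = os.environ
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Create a token on the Hugging Face Hub "
            "and export it before starting the app."
        )
    return token
```

Calling `require_hf_token()` before launching `python app.py` surfaces a missing token immediately rather than at the first inference request.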
Distilabel is the framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verified research papers.
Synthetic data is data generated by an AI model, instead of being collected from the real world.
AI feedback is feedback provided by an AI model, instead of being provided by a human.
Distilabel can be used for generating synthetic data and AI feedback for a wide variety of projects, including traditional predictive NLP (classification, extraction, etc.) and generative and large language model scenarios (instruction following, dialogue generation, judging, etc.). Distilabel's programmatic approach allows you to build scalable pipelines for data generation and AI feedback. The goal of distilabel is to accelerate your AI development by quickly generating high-quality, diverse datasets based on verified research methodologies for generating and judging with AI feedback. In short, distilabel is designed specifically for scalable and reliable synthetic data generation.
The Argilla community uses distilabel to create amazing datasets and models.