Commit 08b43db · update title
Parent(s): d75924a
examples/fine-tune-modernbert-classifier.ipynb
CHANGED
@@ -4,13 +4,12 @@
     "cell_type": "markdown",
     "metadata": {},
     "source": [
-    "# Fine
+    "# Fine-tune ModernBERT for text classification using synthetic data\n",
     "\n",
     "LLMs are great general purpose models, but they are not always the best choice for a specific task. Therefore, smaller and more specialized models are important for sustainable, efficient, and cheaper AI.\n",
+    "A lack of domain specific datasets is a common problem for smaller and more specialized models. This is because it is difficult to find a dataset that is both representative and diverse enough for a specific task. We solve this problem by generating a synthetic dataset from an LLM using the `synthetic-data-generator`, which is available as a [Hugging Face Space](https://huggingface.co/spaces/argilla/synthetic-data-generator) or on [GitHub](https://github.com/argilla-io/synthetic-data-generator).\n",
     "\n",
-    "
-    "\n",
-    "In this example, we will finetune a ModernBERT model on a synthetic dataset generated from the synthetic-data-generator. Showing the effectiveness of synthetic data and the novel ModernBERT model, which is new and improved version of BERT models, with 8192 token context length, significantly better downstream performance, and much faster processing speeds.\n",
+    "In this example, we will fine-tune a ModernBERT model on a synthetic dataset generated from the synthetic-data-generator. This demonstrates the effectiveness of synthetic data and the novel ModernBERT model, which is a new and improved version of BERT models, with an 8192 token context length, significantly better downstream performance, and much faster processing speeds.\n",
     "\n",
     "## Install the dependencies"
    ]
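For context, the workflow this updated intro cell describes roughly corresponds to the sketch below. It is a minimal illustration, not code from the notebook: the dataset repo id and label count are hypothetical placeholders, `answerdotai/ModernBERT-base` is assumed as the base checkpoint, and a recent `transformers` release with ModernBERT support is assumed to be installed.

```python
# Hedged sketch of fine-tuning ModernBERT on a synthetic text-classification
# dataset; names marked as placeholders are assumptions, not from the notebook.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_ID = "answerdotai/ModernBERT-base"  # assumed base checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=2  # num_labels is a placeholder for the real label set
)

# Hypothetical synthetic dataset with "text" and "label" columns, e.g. one
# produced by the synthetic-data-generator and pushed to the Hub.
dataset = load_dataset("username/my-synthetic-dataset")

def tokenize(batch):
    # ModernBERT supports up to 8192 tokens; truncate shorter to keep memory modest.
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="modernbert-classifier", num_train_epochs=1),
    train_dataset=dataset["train"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```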