I created a Capybara-inspired Italian dataset by translating the initial instructions and running them through a pipeline to generate the conversations. I used Claude Sonnet for translation and instruction generation, and Claude Opus for generating the answers.
I hope this dataset proves useful for people working on 🇮🇹 language models.
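A minimal sketch of the two-stage pipeline described above, using the Anthropic Python SDK. The model snapshot IDs, prompts, and seed instruction are illustrative assumptions, not the actual pipeline code:

```python
# Two-stage sketch: Sonnet translates the seed instruction, Opus answers it.
# Model IDs and prompts are assumptions for illustration only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def translate_instruction(instruction_en: str) -> str:
    """Stage 1: translate the seed instruction into Italian with Claude Sonnet."""
    response = client.messages.create(
        model="claude-3-sonnet-20240229",  # assumed Sonnet snapshot
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Translate the following instruction into Italian, "
                       f"preserving its intent:\n\n{instruction_en}",
        }],
    )
    return response.content[0].text

def generate_answer(instruction_it: str) -> str:
    """Stage 2: generate the Italian answer with Claude Opus."""
    response = client.messages.create(
        model="claude-3-opus-20240229",  # assumed Opus snapshot
        max_tokens=2048,
        messages=[{"role": "user", "content": instruction_it}],
    )
    return response.content[0].text

seed = "Explain the difference between a list and a tuple in Python."
print(generate_answer(translate_instruction(seed)))
```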
@mik3ml just released ReDiX/wikipediaQA-ita, an interesting synthetic dataset generated from Wikipedia using a version of Mistral-7B fine-tuned specifically for the Italian language 🇮🇹.
While evaluating fine-tuned 7B Italian open-source LLMs I have collected many data points and put together a very simple exploratory analysis (a toy sketch follows the list below). My hypotheses, based on the data, are:
- MMLU is hard to improve when fine-tuning a base model on a different language.
- Fine-tuning, even on a single GPU, can improve the base model by 5% to 10% on common tasks, and by much more on specific cases given the right training time and data.
- Fine-tuning can specialize a model well, but at the cost of losing some foundational knowledge.
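As a toy version of that exploratory analysis, here is how the relative gains could be computed with pandas. The scores below are entirely hypothetical placeholders, not my actual data points:

```python
# Toy exploratory analysis: relative gain of fine-tuned models over the base.
# All score values are hypothetical, used only to show the computation.
import pandas as pd

scores = pd.DataFrame({
    "model": ["base-7b", "ft-7b-general", "ft-7b-specialized"],
    "mmlu_it": [0.42, 0.43, 0.40],        # hypothetical: MMLU barely moves
    "task_specific": [0.35, 0.39, 0.52],  # hypothetical: large gain when specialized
})

base = scores.iloc[0]  # the base model is the reference row
for col in ["mmlu_it", "task_specific"]:
    scores[f"{col}_gain_%"] = 100 * (scores[col] - base[col]) / base[col]

print(scores.to_string(index=False))
```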
Based on the work of @mrinaldi and @ruggsea, we just released the biggest ready-for-training conversational dataset based on Usenet data in the Italian language 🇮🇹. It contains about 9 million conversations between real humans.
It is based on lm-evaluation-harness and at the moment focuses mainly on 7-billion-parameter models. In the coming weeks we will add more models. If you have suggestions or need explanations, join our community Discord: https://discord.gg/a26cRkBCNH
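For reference, a hedged sketch of how a score like those above could be produced with the lm-evaluation-harness Python API (v0.4+). The model ID and the Italian task names are assumptions; check the task list shipped with your install for what is actually available:

```python
# Sketch of an evaluation run with lm-evaluation-harness (v0.4+ Python API).
# Model ID and task names are assumptions, not the leaderboard's exact config.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1",  # assumed 7B model
    tasks=["xcopa_it", "belebele_ita_Latn"],            # assumed Italian tasks
    batch_size=8,
)
print(results["results"])
```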
The dataset contributes to the https://huggingface.co/mii-community project, aimed at advancing the creation of Italian open-source Large Language Models (LLMs). 🇮🇹 🤗 At about 10-20 billion tokens, it is probably the best conversational open-source dataset in the Italian language. 🇮🇹
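A dataset of this size is best consumed in streaming mode. A minimal sketch with the `datasets` library follows; the repository ID is an assumption, so browse the mii-community organization on the Hub for the actual dataset name:

```python
# Streaming sketch: inspect a few Usenet conversations without a full download.
# The repo ID below is a guess; replace it with the real dataset name.
from datasets import load_dataset

ds = load_dataset("mii-community/UsenetArchiveIT", split="train", streaming=True)
for conversation in ds.take(3):
    print(conversation)
```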