bigscience/bloom · Why does the ROOTS Corpus not include German language?

Mar 27, 2023

BLOOM was trained on the ROOTS Corpus. Why does this corpus not contain German? Is there a specific reason for that? English and French are included, so one would expect German as well.

yjernite

BigScience Workshop org Mar 27, 2023

Hi @akratz! Each language intentionally included in ROOTS was the outcome of significant human curation to identify good sources and language-specific pre-processing steps. Since the project was volunteer-driven, this meant that languages beyond the starting set (made up of languages with the most speakers around the world) were selected based on participants' interest and bandwidth. We did not manage to create a working group for German in time for training the model. We hope that future data curation efforts can re-use some of the tools and methodology we proposed to address this limitation, though!

You can find more details in the following paper:
https://huggingface.co/papers/2303.03915

yjernite

BigScience Workshop org Mar 27, 2023

Additionally, while BLOOM was not initially trained on German, there has been some really amazing work on post-hoc adaptation and language transfer. Check out this one!
https://opengptx.dfki.de/