New Dataset
Might this new dataset be useful for V10 or beyond?
https://twitter.com/togethercompute/status/1647917989264519174
https://www.together.xyz/blog/redpajama
It seems to be an open dataset of high quality, with a size comparable to the numbers in the LLaMA paper! Sounds exciting!
I already downloaded the dataset and I'm in the middle of cleaning it up :-). After cleaning all the data up and getting nearly 100% good sentences, my plan is to run every sentence from the dataset through NLLB and create the same dataset for 200 languages. Then we'd probably have a very good base for multilingual LLMs. But it will take a huge amount of time, and I'm already thinking about how I'll store the 200+ GB of data :-( Any ideas on that, @Raspbfox?
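(Not my actual pipeline yet, just a minimal sketch of what that NLLB step could look like with the Hugging Face `transformers` library; the checkpoint name and target language code are example assumptions:)

```python
# Minimal sketch: translate English sentences with NLLB via Hugging Face transformers.
# Assumes: pip install transformers torch. The checkpoint below is the smallest
# NLLB-200 variant; any NLLB-200 checkpoint should work the same way.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def translate(sentences, target_lang="deu_Latn"):
    """Translate a batch of English sentences into one of the 200 NLLB languages."""
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    # NLLB needs the target-language token forced as the first generated token.
    target_id = tokenizer.convert_tokens_to_ids(target_lang)
    generated = model.generate(**inputs, forced_bos_token_id=target_id, max_length=512)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["The quick brown fox jumps over the lazy dog."]))
```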
Hah, I know it's obvious, but: compression? Text compresses really, really well!
Some compression algos even store metadata so the archive stays navigable!
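(As an illustration, a quick sketch with the `zstandard` Python package, streaming so the full 200+ GB never has to sit in memory; the file names are placeholders:)

```python
# Sketch: compress a plain-text corpus with Zstandard; natural-language text
# compresses very well. Assumes: pip install zstandard. File names are placeholders.
import io
import zstandard as zstd

# Compress, streaming so the whole corpus never has to fit in memory.
cctx = zstd.ZstdCompressor(level=19)  # higher level = better ratio, slower
with open("corpus.txt", "rb") as src, open("corpus.txt.zst", "wb") as dst:
    cctx.copy_stream(src, dst)

# Read it back line by line, again streaming.
dctx = zstd.ZstdDecompressor()
with open("corpus.txt.zst", "rb") as src, dctx.stream_reader(src) as reader:
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        pass  # process one sentence at a time
```

There is also a seekable variant of the zstd format (in zstd's contrib tools) if random access into the compressed archive is needed.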
From a first look, the data doesn't look that great... that's why I have to filter it and keep only the English sentences. After that I have to filter out all the code, so the model would learn language only. To learn code, I would create a separate dataset from GitHub data under Apache- or GPL-licensed code, because that's the only code corporations are free to use. I'll keep you posted on my filtering progress :-) (never worked with this much data before). For the code to also have translations, I have to filter out the comments and translate them without the corresponding variable and function names (that will be kind of difficult). Once all that is done, the instructions can be learned with fine-tuning plus either manual translation or automatic translation with NLLB (translation with the bigger NLLB models is very good!). I need at least a month before I can even estimate how big the resulting 200-language dataset will be :-)
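(For the comment part: if the source files are Python, the standard-library `tokenize` module can already separate comments from identifiers; a rough sketch of that idea, with a made-up example snippet:)

```python
# Sketch: pull out only the comments from a Python source file, so they can be
# translated separately from variable/function names. Standard library only.
import io
import tokenize

def extract_comments(source: str) -> list[tuple[int, str]]:
    """Return (line_number, comment_text) pairs for every comment in `source`."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append((tok.start[0], tok.string.lstrip("# ").rstrip()))
    return comments

code = "x = 1  # counter for processed sentences\n# TODO: handle unicode\n"
print(extract_comments(code))
# [(1, 'counter for processed sentences'), (2, 'TODO: handle unicode')]
```

Splicing the translated comments back in by line number would then be the next step; other languages would need their own comment grammars.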
Just as a simplification: I'm not sure translating code comments would be a good use of time, as it's considered a "bad smell" to not have your comments in English anyway :D
Unless, of course, it's beneficial for the model to later generalize the knowledge about languages and their translations.
No idea, tbh, will need to Google that :D