New Dataset
Might this new dataset be useful for V10 or beyond?
https://twitter.com/togethercompute/status/1647917989264519174
https://www.together.xyz/blog/redpajama
It seems to be an open dataset of high quality, with a size comparable to the numbers in the LLaMA paper! Sounds exciting!
I already downloaded the dataset and I'm in the middle of cleaning it up :-). After cleaning all the data up and getting nearly 100% good sentences, my plan is to run every sentence from the dataset through NLLB and create the same dataset for 200 languages. Then we'd probably have a very good base for multilingual LLMs. But it will take a huge amount of time, and I'm already thinking about how I'll store the 200+ GB of data :-( Any ideas on that, @Raspbfox?
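(Not my actual pipeline yet, just a minimal sketch of what that NLLB step could look like with the Hugging Face `transformers` library; the checkpoint name and target language code are example assumptions:)

```python
# Minimal sketch: translate English sentences with NLLB via Hugging Face transformers.
# Assumes: pip install transformers torch. The checkpoint below is the smallest
# NLLB-200 variant; any NLLB-200 checkpoint should work the same way.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def translate(sentences, target_lang="deu_Latn"):
    """Translate a batch of English sentences into one of the 200 NLLB languages."""
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    # NLLB needs the target-language token forced as the first generated token.
    target_id = tokenizer.convert_tokens_to_ids(target_lang)
    generated = model.generate(**inputs, forced_bos_token_id=target_id, max_length=512)
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate(["The quick brown fox jumps over the lazy dog."]))
```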
Hah, I know it's obvious, but: compression? Text compresses really, really well!
Some compression algos even store metadata so the archive stays navigable!
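(As an illustration, a quick sketch with the `zstandard` Python package, streaming so the full 200+ GB never has to sit in memory; the file names are placeholders:)

```python
# Sketch: compress a plain-text corpus with Zstandard; natural-language text
# compresses very well. Assumes: pip install zstandard. File names are placeholders.
import io
import zstandard as zstd

# Compress, streaming so the whole corpus never has to fit in memory.
cctx = zstd.ZstdCompressor(level=19)  # higher level = better ratio, slower
with open("corpus.txt", "rb") as src, open("corpus.txt.zst", "wb") as dst:
    cctx.copy_stream(src, dst)

# Read it back line by line, again streaming.
dctx = zstd.ZstdDecompressor()
with open("corpus.txt.zst", "rb") as src, dctx.stream_reader(src) as reader:
    for line in io.TextIOWrapper(reader, encoding="utf-8"):
        pass  # process one sentence at a time
```

There is also a seekable variant of the zstd format (in zstd's contrib tools) if random access into the compressed archive is needed.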
From a first look, the data doesn't look that great... that's why I have to filter it and keep only the English sentences. After that I have to filter out all the code, so the model would learn language only. To learn code, I would create a separate dataset from GitHub data under Apache- or GPL-licensed code, because that's the only code corporations are free to use. I'll keep you posted on my filtering progress :-) (never worked with this much data before). For the code to also have translations, I have to filter out the comments and translate them without the corresponding variable and function names (that will be kind of difficult). Once all that is done, the instructions can be learned with fine-tuning plus either manual translation or automatic translation with NLLB (translation with the bigger NLLB models is very good!). I need at least a month before I can even estimate how big the resulting 200-language dataset will be :-)
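(For the comment part: if the source files are Python, the standard-library `tokenize` module can already separate comments from identifiers; a rough sketch of that idea, with a made-up example snippet:)

```python
# Sketch: pull out only the comments from a Python source file, so they can be
# translated separately from variable/function names. Standard library only.
import io
import tokenize

def extract_comments(source: str) -> list[tuple[int, str]]:
    """Return (line_number, comment_text) pairs for every comment in `source`."""
    comments = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT:
            comments.append((tok.start[0], tok.string.lstrip("# ").rstrip()))
    return comments

code = "x = 1  # counter for processed sentences\n# TODO: handle unicode\n"
print(extract_comments(code))
# [(1, 'counter for processed sentences'), (2, 'TODO: handle unicode')]
```

Splicing the translated comments back in by line number would then be the next step; other languages would need their own comment grammars.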
Just as a simplification: I'm not sure translating code comments would be a good use of time, as it's considered a "bad smell" to not have your comments in English anyway :D
Unless, of course, it's beneficial for the model to later generalize the knowledge about languages and their translations.
No idea, tbh, will need to Google that :D