What is meant by "new language support"?

#209
by huchiahsi - opened

Could anyone help me out with this? Thank you in advance.

My understanding is that UTF-8 can encode any character of any language in the world, using at most 4 bytes per character. When a language model like Llama 3.1 was trained on a huge corpus, that corpus must have included almost every language that can be found in the world. I checked the Llama 3.1 vocab file and found that all 256 byte values are present in it. So the model should be able to represent every language, since any character can be written as a sequence of at most 4 of those 256 byte tokens. And if a language is in the training corpus, the LLM should support it. So why is "language support" listed for Llama 3.1 or any other LLM?
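A minimal sketch of the UTF-8 premise above, using only Python's built-in `str.encode`. The helper name `utf8_bytes` and the sample characters are hypothetical choices for illustration; nothing here is taken from the Llama 3.1 vocab file or tokenizer.

```python
# Sample characters chosen only for illustration: they cover the
# 1-, 2-, 3- and 4-byte UTF-8 cases.
samples = ["A", "é", "中", "😀"]


def utf8_bytes(ch: str) -> list[int]:
    """Return the UTF-8 byte values (each in 0-255) that encode one character."""
    return list(ch.encode("utf-8"))


for ch in samples:
    encoded = utf8_bytes(ch)
    # Every character maps to between 1 and 4 byte values.
    print(f"{ch!r}: {len(encoded)} byte(s) -> {encoded}")
```

Each character comes out as 1 to 4 values in the range 0-255, which is the sense in which 256 byte tokens are enough to write down any text.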

Thanks.
