What is going on with your vocab?
"vocab_size": 128288
"128009": "<|eot_id>"
"128256": "<|eot_id|>"
"128257": "<|reserved_special_token_251|>"
...
"128287": "<|reserved_special_token_281|>"
Why increase the vocab size, making merges more difficult, when you still have 243 reserved tokens you can work with? Why add a second <|eot_id|> token and rename the original to <|eot_id> when you only use ChatML format?
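For anyone who wants to reproduce this, here's a quick sketch that lists everything added past the official vocab (the repo ID below is a placeholder, not this model's actual Hub path):

```python
from transformers import AutoTokenizer

# Placeholder repo ID; substitute this model's actual Hub path.
tok = AutoTokenizer.from_pretrained("org/this-model")

# The official Llama 3 vocab ends at ID 128255, so anything
# above that is an addition made in this repo.
print(len(tok))  # 128288 per config.json
for i in range(128256, len(tok)):
    print(i, tok.convert_ids_to_tokens(i))
```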
@xzuyn
That's actually the official Llama 3's fault. For some reason, it has lots of extra reserved tokens and that <|eot_id|> issue.
It's not though. The official one has one <|eot_id|> at ID 128009, but Nous renamed it to <|eot_id> and added <|eot_id|> here. The official one only has reserved tokens up to <|reserved_special_token_250|> at ID 128255, while Nous adds more, from here to here. That's why I made this discussion page.
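You can confirm this against the official tokenizer with a minimal sketch like the one below (assumes access to the gated meta-llama repo):

```python
from transformers import AutoTokenizer

# Official Llama 3 tokenizer (gated repo; requires an accepted license).
official = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

print(len(official))                              # 128256
print(official.convert_ids_to_tokens(128009))     # '<|eot_id|>'
print(official.convert_ids_to_tokens(128255))     # '<|reserved_special_token_250|>'
```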
Yeah, that was a mistake, but the only mistake.