Which 10K tokens?
Hi! I'm trying to figure out which of the token embeddings actually get trained and which correspond to tokens that never appear in the training data. It doesn't seem to be the first 10K by ID, and (if I understand correctly) the dataset itself will produce more than 10K distinct tokens if I run the standard GPT-Neo tokenizer on it.
Is it the 10K most frequent GPT-Neo tokens in the training split (or in the entire dataset)? Do you have a list of the IDs somewhere?
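For concreteness, this is the kind of count I have in mind; a rough sketch assuming the Hugging Face transformers API and the GPT-Neo 125M tokenizer, with the training texts supplied as any iterable of strings:

```python
from collections import Counter

from transformers import AutoTokenizer


def top_k_token_ids(texts, k=10_000, model_name="EleutherAI/gpt-neo-125M"):
    """Count GPT-Neo token frequencies over an iterable of strings and
    return the IDs of the k most frequent tokens."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    counts = Counter()
    for text in texts:
        counts.update(tokenizer(text)["input_ids"])
    return [token_id for token_id, _ in counts.most_common(k)]


# e.g. top_k_token_ids(example["text"] for example in training_split)
```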
Looking at the embeddings (PCA components, etc.), there don't seem to be 10K of them that separate linearly from the rest, although the distribution does look bimodal, as expected.
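In case it's useful, here is roughly how I looked at them; a minimal sketch assuming the checkpoint loads via AutoModelForCausalLM (the path is a placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM

# Placeholder path: whichever checkpoint from this repo is being inspected.
model = AutoModelForCausalLM.from_pretrained("path/to/checkpoint")
emb = model.get_input_embeddings().weight.detach().cpu().numpy()  # (vocab_size, d_model)

# First two principal components of the embedding rows.
coords = PCA(n_components=2).fit_transform(emb)

# Per-token embedding norms: a bimodal histogram here is what made me expect
# a trained-vs-untouched split, even without clean linear separability in PCA space.
norms = np.linalg.norm(emb, axis=1)
print(coords.shape, np.histogram(norms, bins=20)[0])
```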
Also, the tokenizer included in this repo (with 50k tokens) is just the GPT-Neo one, correct?
Thanks!
I am also wondering how this worked for the authors. Ideally, code for their implementation would answer our questions... I am new to tokenization, but when I asked Grok for instructions on getting the 10K most-used tokens (it suggests creating a new tokenizer from the GPT-Neo tokens), it warned:
Doing this “by hand” can be tricky because GPT-2–style merges define how tokens get combined. If you strip out tokens but keep merges that produce them, you can end up in an inconsistent state.
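Given that warning, the workaround I'd try (not necessarily what the authors did) is to leave the GPT-Neo tokenizer untouched and just remap the selected 10K IDs to a contiguous range after encoding, so the merges stay consistent. A rough sketch:

```python
from transformers import AutoTokenizer


def make_restricted_encoder(keep_ids, model_name="EleutherAI/gpt-neo-125M"):
    """Encode with the unmodified GPT-Neo tokenizer (so its merges stay
    consistent), then remap the kept token IDs to a contiguous range;
    anything outside keep_ids collapses to a single UNK index at the end."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    id_map = {old: new for new, old in enumerate(keep_ids)}
    unk = len(id_map)

    def encode(text):
        return [id_map.get(i, unk) for i in tokenizer(text)["input_ids"]]

    return encode


# encode = make_restricted_encoder(top_10k_ids)  # top_10k_ids from a frequency count like the one above
```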