Which 10K tokens?
Hi! I'm trying to figure out which of the token embeddings actually get trained and which correspond to tokens that never appear in the training data. It doesn't seem to be the first 10K by ID, and (if I understand correctly) the dataset itself will produce more than 10K distinct tokens if I run the standard GPT-Neo tokenizer on it.
Is it the 10K most frequent GPT-Neo tokens in the training split (or in the entire dataset)? Do you have a list of the IDs somewhere?
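For concreteness, this is the kind of count I have in mind; a rough sketch assuming the Hugging Face transformers API and the GPT-Neo 125M tokenizer, with the training texts supplied as any iterable of strings:

```python
from collections import Counter

from transformers import AutoTokenizer


def top_k_token_ids(texts, k=10_000, model_name="EleutherAI/gpt-neo-125M"):
    """Count GPT-Neo token frequencies over an iterable of strings and
    return the IDs of the k most frequent tokens."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    counts = Counter()
    for text in texts:
        counts.update(tokenizer(text)["input_ids"])
    return [token_id for token_id, _ in counts.most_common(k)]


# e.g. top_k_token_ids(example["text"] for example in training_split)
```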
Looking at the embeddings (PCA components, etc.), there don't seem to be 10K of them that separate linearly from the rest, although the distribution does look bimodal, as expected.
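In case it's useful, here is roughly how I looked at them; a minimal sketch assuming the checkpoint loads via AutoModelForCausalLM (the path is a placeholder):

```python
import numpy as np
from sklearn.decomposition import PCA
from transformers import AutoModelForCausalLM

# Placeholder path: whichever checkpoint from this repo is being inspected.
model = AutoModelForCausalLM.from_pretrained("path/to/checkpoint")
emb = model.get_input_embeddings().weight.detach().cpu().numpy()  # (vocab_size, d_model)

# First two principal components of the embedding rows.
coords = PCA(n_components=2).fit_transform(emb)

# Per-token embedding norms: a bimodal histogram here is what made me expect
# a trained-vs-untouched split, even without clean linear separability in PCA space.
norms = np.linalg.norm(emb, axis=1)
print(coords.shape, np.histogram(norms, bins=20)[0])
```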
Also, the tokenizer included in this repo (with 50k tokens) is just the GPT-Neo one, correct?
Thanks!
I am also wondering how this worked for the authors. Ideally, code for their implementation would answer our questions... I am new to tokenization, but when I asked Grok for instructions on getting the 10K most-used tokens (it suggests creating a new tokenizer from the GPT-Neo tokens), it warned:
Doing this “by hand” can be tricky because GPT-2–style merges define how tokens get combined. If you strip out tokens but keep merges that produce them, you can end up in an inconsistent state.
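Given that warning, the workaround I'd try (not necessarily what the authors did) is to leave the GPT-Neo tokenizer untouched and just remap the selected 10K IDs to a contiguous range after encoding, so the merges stay consistent. A rough sketch:

```python
from transformers import AutoTokenizer


def make_restricted_encoder(keep_ids, model_name="EleutherAI/gpt-neo-125M"):
    """Encode with the unmodified GPT-Neo tokenizer (so its merges stay
    consistent), then remap the kept token IDs to a contiguous range;
    anything outside keep_ids collapses to a single UNK index at the end."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    id_map = {old: new for new, old in enumerate(keep_ids)}
    unk = len(id_map)

    def encode(text):
        return [id_map.get(i, unk) for i in tokenizer(text)["input_ids"]]

    return encode


# encode = make_restricted_encoder(top_10k_ids)  # top_10k_ids from a frequency count like the one above
```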