Model token size is bigger than tokenizer size?
#97 · opened by fahadh4ilyas
The tokenizer vocab size is 50295, but the embedding and LM head size is 51200. Is this intentional?
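For reference, a quick way to reproduce the mismatch (assuming the standard transformers `AutoTokenizer` / `AutoModelForCausalLM` APIs):

```python
# Minimal sketch: compare the tokenizer vocabulary with the model's embedding rows.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

print(len(tokenizer))                                # tokenizer vocab: 50295
print(model.get_input_embeddings().weight.shape[0])  # embedding rows: 51200
print(model.config.vocab_size)                       # model config: 51200
```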
This is a good reference: https://huggingface.co/microsoft/phi-2/discussions/22#659d8ba950c1bbee5be6f179
We ended up setting 51200 as the vocabulary size just to accommodate any new tokens that we might need in the future. You can follow @Deepakvictor's answer and it should fix the issue.
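As an illustrative sketch only (not necessarily what the linked answer does), the spare embedding rows mean new tokens can be registered without resizing the model, as long as the total stays within 51200; `<my_new_token>` here is a made-up placeholder:

```python
# Hypothetical sketch: the reserved rows (IDs 50295..51199) leave room for new
# tokens, so no call to resize_token_embeddings() is needed while the total
# vocabulary stays <= 51200.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

tokenizer.add_tokens(["<my_new_token>"])           # placeholder token name
assert len(tokenizer) <= model.config.vocab_size   # still fits in the 51200 rows
```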
As far as I know, tokens with IDs of 50295 and above should not be generated, because those embeddings were never trained. Depending on the generation parameters, though, they could still appear (with low probability).
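If you want to rule that out entirely, one hedged sketch is to mask the untrained IDs at generation time with the standard `bad_words_ids` argument of `generate()`:

```python
# Sketch: block the untrained IDs (len(tokenizer)..51199) during generation so
# they cannot be sampled even with aggressive sampling settings.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")

untrained_ids = [[i] for i in range(len(tokenizer), model.config.vocab_size)]
inputs = tokenizer("def fibonacci(n):", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True,
                         bad_words_ids=untrained_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```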
gugarosa changed discussion status to closed