Correct maximum positional embeddings

#17, opened by nixgd

The model appears to have been trained with a context window of 512, not 2048 as claimed here. This can be seen by looking at the average loss by sequence position on the GPT-4 TinyStories dataset (packed into inputs of length 2048):

[Figure: average loss by sequence position over the 2048-token packed inputs]
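For reference, a rough sketch of how this can be checked (the model name and placeholder text below are assumptions; swap in real packed stories):

```python
# Minimal sketch: pack text into one long input, compute token-level
# cross-entropy, and compare the loss at early vs. late positions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "roneneldan/TinyStories-1M"  # assumption: any TinyStories checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Placeholder text; in practice, pack real stories from the dataset.
stories = ["Once upon a time there was a little girl named Lucy."] * 300
ids = tok(tok.eos_token.join(stories), return_tensors="pt").input_ids[:, :2048]

with torch.no_grad():
    logits = model(ids).logits

# Loss at position t is the cross-entropy of predicting token t+1.
losses = torch.nn.functional.cross_entropy(
    logits[0, :-1], ids[0, 1:], reduction="none"
)
print("mean loss, positions < 512 :", losses[:512].mean().item())
print("mean loss, positions >= 512:", losses[512:].mean().item())
```

If the model was only trained with a 512-token context, the loss for positions beyond 512 should be noticeably higher.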

It would be great to get this changed (for all TinyStories models), as the current config is misleading.

You are quite correct; not sure what is up with the Hugging Face models, as the paper states:
From the paper: "Our models are available on Huggingface named TinyStories-1M/3M/9M/28M/33M/1Layer/2Layer and TinyStories-Instruct-∗. We use GPT-Neo architecture with window size 256 and context length 512. We use GPT-Neo tokenizer but only keep the top 10K most common tokens."

You're right. Our paper does indicate that we use 512 seq len in training, but the model's config should be updated...
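For anyone wanting to patch this locally in the meantime, a minimal sketch of the config change being requested (the repo name is an assumption; if the released checkpoint stores a larger position-embedding table, reloading the weights against the smaller config may also require `ignore_mismatched_sizes=True`):

```python
# Sketch of the config fix: max_position_embeddings should be 512
# (and window_size 256) per the paper. Repo name below is an assumption.
from transformers import GPTNeoConfig

cfg = GPTNeoConfig.from_pretrained("roneneldan/TinyStories-1M")
cfg.max_position_embeddings = 512  # context length reported in the paper
cfg.window_size = 256              # attention window size reported in the paper
cfg.save_pretrained("./TinyStories-1M-config-fix")  # writes the corrected config.json
```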

@roneneldan Do you plan to merge this? I was also planning to contribute a version with the 10K vocab, would you consider merging that too or do you prefer the current format?

