
GPT-α

A pretrained GPT model with 124M parameters trained on 40B tokens of educational content. The full implementation of the model can be found on GitHub. The model was trained for 4 epochs on the 10B-token subset of fineweb-edu, a large-scale dataset of educational content.

Here are some example completions from the model after training on 40B tokens. The context is 'Once upon a time,'. The completions are generated using the top-k sampling strategy with a maximum length of 64 tokens, a temperature of 1.0, and a k value of 50.

Once upon a time, people were going to buy the “cork” that was used to wrap and hang the wine.
However, what began to be called “cork” as soon as the time rolled around was probably an artificial wine. This is how we know cork as the “cork”

Once upon a time, there was a time in the history of India when the great religion of India was worshipped by only two people… the Hindus and the Jains. This is the story of how the story of India was created.
India’s story begins with a very ancient Vedic religion. They were the ancient Indus valley

Once upon a time, the King of Italy, who was to govern what would become the world, thought that it would be a great and noble undertaking to introduce the Roman Senate into the country in order to defend Rome — to defend her own capital in a very civilized manner, to promote the arts and promote the Roman religion. Accordingly, Rome,

Training

The exact model architecture and training script can be found on GitHub. GPT-α uses the GPT-2 tokeniser. The model was trained on 40B tokens over 76,296 iterations using a cosine learning rate schedule with a linear warmup over 375M tokens. A maximum learning rate of 18e-4 (3× that of GPT-3) was used, decaying over the training period. Overall, training lasted a continuous 11.5 hours on 8× A100-SXM4 40GB GPUs, processing 1.07M tokens per second with a batch size of 16. The model surpassed GPT-3 124M on HellaSwag after just 38B tokens, a 7.8× improvement in token efficiency over GPT-3, which was trained on 300B tokens. The final model at 40B tokens achieved a HellaSwag score of 0.339.
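
For reference, the schedule described above is straightforward to reproduce. The sketch below is a minimal implementation assuming a linear warmup and a decay floor of 10% of the maximum learning rate (the floor is an assumption, not stated on this card); the iteration counts follow from the figures above.

import math

# Figures from the card: 18e-4 max LR, 375M-token warmup, 76,296 iterations for 40B tokens.
max_lr = 18e-4
min_lr = max_lr * 0.1                 # assumed decay floor, not stated on the card
total_iters = 76_296
tokens_per_iter = 40e9 / total_iters  # ~524k tokens per iteration
warmup_iters = round(375e6 / tokens_per_iter)  # ~715 iterations of linear warmup

def lr_at(it: int) -> float:
    """Learning rate at iteration `it`: linear warmup followed by cosine decay."""
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters
    progress = (it - warmup_iters) / (total_iters - warmup_iters)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))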

Inference

The model can be directly used with a pipeline for text generation:

>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='fraserlove/gpt-alpha')
>>> set_seed(0)
>>> generator('Once upon a time,', max_length=30, num_return_sequences=5, do_sample=True)

[{'generated_text': 'Once upon a time, my father had some way that would help him win his first war. There was a man named John. He was the husband'},
 {'generated_text': 'Once upon a time, this particular breed would be considered a “chicken fan”; today, the breed is classified as a chicken.'},
 {'generated_text': 'Once upon a time, there was a famous English nobleman named King Arthur (in the Middle Ages, it was called ‘the Arthur’'},
 {'generated_text': "Once upon a time, the Christian God created the world in the manner which, under different circumstances, was true of the world's existence. The universe"},
 {'generated_text': 'Once upon a time, I wrote all of the letters of an alphabets in a single document. Then I read each letter of that alphabet'}]

The model can also be used directly for inference:

from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda' # for GPU usage or 'cpu' for CPU usage
tokeniser = AutoTokenizer.from_pretrained('fraserlove/gpt-alpha')
# For multi-GPU, install accelerate and use `model = AutoModelForCausalLM.from_pretrained('fraserlove/gpt-alpha', device_map='auto')`
model = AutoModelForCausalLM.from_pretrained('fraserlove/gpt-alpha').to(device)

# Encode the prompt and sample a continuation
context = tokeniser.encode('Once upon a time,', return_tensors='pt').to(device)
samples = model.generate(context, do_sample=True)
print(tokeniser.decode(samples[0]))
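
The example completions shown earlier used top-k sampling with k = 50, a temperature of 1.0 and a maximum length of 64 tokens. Continuing from the snippet above, the same setup can be passed to `generate` explicitly; whether the originals capped total length or newly generated tokens is not stated, so `max_new_tokens` is used here as an assumption.

samples = model.generate(
    context,
    do_sample=True,
    top_k=50,          # sample only from the 50 most likely tokens
    temperature=1.0,   # no temperature scaling
    max_new_tokens=64, # assumed cap on newly generated tokens
)
print(tokeniser.decode(samples[0]))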

To get the features of a given text:

from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda' # for GPU usage or 'cpu' for CPU usage
tokeniser = AutoTokenizer.from_pretrained('fraserlove/gpt-alpha')
model = AutoModelForCausalLM.from_pretrained('fraserlove/gpt-alpha').to(device)

# Forward pass; output.logits holds the next-token logits for each position
encoded = tokeniser('Once upon a time,', return_tensors='pt').to(device)
output = model(**encoded)
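
If token-level features (hidden states) are wanted rather than logits, the forward pass can request them explicitly. A minimal sketch, continuing from the snippet above:

import torch

with torch.no_grad():
    output = model(**encoded, output_hidden_states=True)

# Last-layer hidden states: one feature vector per input token,
# with shape (batch, sequence_length, hidden_size)
features = output.hidden_states[-1]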

Evaluation

| Benchmark     | GPT-α 124M | GPT-2 124M | GPT-Neo 125M | OPT 125M | Pythia 160M |
|---------------|------------|------------|--------------|----------|-------------|
| PIQA          | 63.06%     | 62.51%     | 62.46%       | 62.08%   | 61.26%      |
| SIQA          | 38.18%     | 36.59%     | 37.21%       | 37.21%   | 36.69%      |
| OpenBookQA    | 29.80%     | 27.20%     | 26.20%       | 28.00%   | 27.00%      |
| TriviaQA      | 1.31%      | 0.30%      | 0.66%        | 1.18%    | 0.41%       |
| TruthfulQA    | 33.13%     | 31.73%     | 35.70%       | 33.50%   | 34.75%      |
| MMLU          | 23.30%     | 25.90%     | 25.58%       | 25.94%   | 25.10%      |
| WinoGrande    | 50.20%     | 50.04%     | 51.70%       | 51.07%   | 48.78%      |
| ARC Challenge | 29.18%     | 22.95%     | 22.87%       | 22.10%   | 22.10%      |
| HellaSwag     | 35.74%     | 31.64%     | 30.58%       | 31.69%   | 30.15%      |
| GSM-8K        | 2.27%      | 0.68%      | 1.74%        | 1.74%    | 2.20%       |
| Average Score | 30.62%     | 28.95%     | 29.47%       | 29.45%   | 28.84%      |
