metadata

tags:
  - generated_from_trainer
datasets:
  - RaiBP/openwebtext2-first-30-chunks-english-only-examples
model-index:
  - name: training_nen
    results: []

training_nen

This model was trained from scratch on the RaiBP/openwebtext2-first-30-chunks-english-only-examples dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.005
train_batch_size: 8
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 2
gradient_accumulation_steps: 4
total_train_batch_size: 64
total_eval_batch_size: 16
optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 1000
num_epochs: 1.0
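
As a sanity check on the list above, the total batch sizes follow from the per-device sizes, the device count, and the gradient accumulation steps. The sketch below reproduces that arithmetic and illustrates a linear schedule with warmup of the kind these settings describe. `total_steps` is a hypothetical value chosen for illustration, since the card does not state the number of optimizer steps, and `linear_lr` is a plain-Python approximation, not the trainer's own code.

```python
# Values taken from the hyperparameter list above.
learning_rate = 0.005
train_batch_size = 8  # per device
eval_batch_size = 8   # per device
num_devices = 2
gradient_accumulation_steps = 4
warmup_steps = 1000

# Effective batch sizes across devices (and accumulation steps for training).
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)


def linear_lr(step, total_steps, base_lr=learning_rate, warmup=warmup_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps."""
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup))


# ASSUMPTION: total_steps is hypothetical; the card does not report it.
total_steps = 10_000
for step in (0, 500, 1000, 5500, 10_000):
    print(step, linear_lr(step, total_steps))
```

Under these assumptions the learning rate climbs to 0.005 at step 1000 and then falls linearly to zero at the final step.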

Training results

Evaluation results

Perplexity on 2000 randomly sampled examples from the target language's Wikipedia dataset, computed with the code provided in the perplexity docs using a stride of 512 tokens. Baseline is the result of evaluating OpenAI's GPT-2 on the same examples.

Target language	PPL	Baseline PPL
en	42.175106048583984	26.562532424926758
de	225.5620574951172	56.907039642333984
es	184.9262237548828	55.592445373535156
fr	170.0771026611328
it	238.36192321777344
pt	203.595947265625
nl	225.9720001220703
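
For reference, the perplexity reported in the table is the exponential of the average per-token negative log-likelihood, so a lower value means the model assigns higher probability to the text. The following minimal sketch shows that relationship on made-up token probabilities. The numbers are illustrative only and are not taken from the evaluation.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) over the tokens."""
    nlls = [-math.log(p) for p in token_probs]
    return math.exp(sum(nlls) / len(nlls))


# Hypothetical probabilities a model assigns to each correct next token.
confident = [0.5, 0.5, 0.5, 0.5]
uncertain = [0.01, 0.01, 0.01, 0.01]

print(perplexity(confident))
print(perplexity(uncertain))
```

A model that assigns probability 0.5 to every correct token has a perplexity of about 2, while one that assigns 0.01 has a perplexity of about 100.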

The following script was used for evaluation:

import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from tqdm import tqdm
import random

# Seed Python's `random` module for reproducibility
# (note: the sampling below uses np.random, which this call does not seed)
random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
model_name = "RaiBP/gpt2-openwebtext2-first-30-chunks-ablation-non-english"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

target_language_dataset = "20231101.de" # change here for other languages

dataset = load_dataset("wikimedia/wikipedia", target_language_dataset, split="train")
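# Draw 2000 article indices uniformly at random (with replacement)
num_examples = 2000
random_numbers = list(np.random.randint(0, len(dataset), num_examples))
examples = []
for i in tqdm(random_numbers):
    examples.append(dataset[int(i)]["text"])
# Join the sampled articles and tokenize them as one long sequence
encodings = tokenizer("\n\n".join(examples), return_tensors="pt")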

max_length = model.config.n_positions  # maximum context length of the model
stride = 512  # number of tokens the window advances on each step
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())

print("Perplexity: ", ppl.item())
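
To make the windowing in the script easier to follow, the sketch below replays only the index arithmetic of the loop on a toy sequence, without loading a model. The function name `sliding_windows` and the values of `seq_len`, `max_length`, and `stride` are illustrative assumptions, not the ones used in the evaluation.

```python
def sliding_windows(seq_len, max_length, stride):
    """Yield (begin_loc, end_loc, trg_len) as the evaluation loop computes them."""
    prev_end_loc = 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        trg_len = end_loc - prev_end_loc  # tokens not covered by any earlier window
        yield begin_loc, end_loc, trg_len
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break


# ASSUMPTION: toy sizes chosen for readability.
windows = list(sliding_windows(seq_len=10, max_length=4, stride=2))
for begin_loc, end_loc, trg_len in windows:
    print(f"tokens [{begin_loc}:{end_loc}) -> last {trg_len} token(s) kept as targets")
```

Each window after the first reuses its leading tokens as context only, so in this example every token is kept as a target in exactly one window.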

Framework versions

Transformers 4.37.0.dev0
Pytorch 1.13.0
Datasets 2.16.0
Tokenizers 0.15.0