tags:
  - generated_from_trainer
datasets:
  - RaiBP/openwebtext2-first-30-chunks-english-only-examples
model-index:
  - name: training_nen
    results: []

training_nen

This model was trained from scratch on the RaiBP/openwebtext2-first-30-chunks-english-only-examples dataset.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.005
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 64
  • total_eval_batch_size: 16
  • optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 1000
  • num_epochs: 1.0
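The totals in the list follow from the per-device values, and the linear scheduler can be written out by hand. The sketch below reproduces that arithmetic and a linear warmup followed by linear decay. The function name `linear_schedule_lr` and the `total_steps` value are hypothetical, since the card does not state the number of optimizer steps.

```python
# Values copied from the hyperparameter list above.
train_batch_size = 8
eval_batch_size = 8
num_devices = 2
gradient_accumulation_steps = 4

# Effective batch sizes across devices and accumulation steps.
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)  # 64 16


def linear_schedule_lr(step, total_steps, peak_lr=0.005, warmup_steps=1000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return peak_lr * remaining


# total_steps is an assumed value, used only for illustration.
for step in (0, 500, 1000, 5500, 10000):
    print(step, linear_schedule_lr(step, total_steps=10000))
```

With these settings the learning rate rises from zero to 0.005 over the first 1000 steps and then falls linearly to zero at the final step.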

Training results

Evaluation results

Perplexity on 2000 random examples from the target language's Wikipedia dataset, computed with the code provided in the perplexity docs, using a stride of 512 tokens. The baseline is the result of evaluating OpenAI's GPT-2 on the same examples.

| Target language | PPL | Baseline PPL |
|---|---|---|
| en | 42.175106048583984 | 26.562532424926758 |
| de | 225.5620574951172 | 56.907039642333984 |
| es | 184.9262237548828 | 55.592445373535156 |
| fr | 170.0771026611328 | |
| it | 238.36192321777344 | |
| pt | 203.595947265625 | |
| nl | 225.9720001220703 | |
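For reference, the figures above use the standard definition of perplexity for a tokenized sequence $x_1, \dots, x_N$ under a model $p_\theta$. The symbols are introduced here for illustration and do not appear in the original card:

$$
\mathrm{PPL}(x) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta\left(x_i \mid x_{<i}\right)\right)
$$

Lower values mean the model assigns higher probability to the text. In a strided evaluation, the context $x_{<i}$ is truncated to what fits in the model's window, so the reported number approximates this quantity.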

The following script was used for evaluation:

import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from tqdm import tqdm
import random

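# Set the seed for reproducibility.
# Note: this seeds Python's `random` module only. The example indices below are
# drawn with np.random, so also call np.random.seed(42) for a reproducible sample.
random.seed(42)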

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
model_name = "RaiBP/gpt2-openwebtext2-first-30-chunks-ablation-non-english"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

target_language_dataset = "20231101.de" # change here for other languages

dataset = load_dataset("wikimedia/wikipedia", target_language_dataset, split="train")
num_examples = 2000
random_numbers = list(np.random.randint(0, len(dataset), num_examples))
examples = []
for i in tqdm(random_numbers):
    examples.append(dataset[int(i)]["text"])
encodings = tokenizer("\n\n".join(examples), return_tensors="pt")

max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())

print("Perplexity: ", ppl.item())
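To make the window bookkeeping in the loop above easier to follow, here is a minimal sketch that reproduces only the index arithmetic, with no model or tokenizer. The helper name `sliding_windows` and the toy sequence length are hypothetical, and the 1024-token context assumes a GPT-2-sized `n_positions`.

```python
def sliding_windows(seq_len, max_length, stride):
    """Return (begin_loc, end_loc, trg_len) for each evaluation window."""
    windows = []
    prev_end_loc = 0
    for begin_loc in range(0, seq_len, stride):
        end_loc = min(begin_loc + max_length, seq_len)
        # Only the tokens not scored by a previous window count as targets.
        trg_len = end_loc - prev_end_loc
        windows.append((begin_loc, end_loc, trg_len))
        prev_end_loc = end_loc
        if end_loc == seq_len:
            break
    return windows


# Toy sequence length, chosen only for illustration.
windows = sliding_windows(seq_len=2500, max_length=1024, stride=512)
for begin_loc, end_loc, trg_len in windows:
    print(begin_loc, end_loc, trg_len)
```

Every token is a target in exactly one window, so the `trg_len` values sum to `seq_len`. The script averages the per-window mean losses with equal weight, so a shorter final window counts as much as a full one. Weighting each loss by its `trg_len` would give a token-level average instead.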

Framework versions

  • Transformers 4.37.0.dev0
  • Pytorch 1.13.0
  • Datasets 2.16.0
  • Tokenizers 0.15.0