|
--- |
|
tags: |
|
- generated_from_trainer |
|
datasets: |
|
- RaiBP/openwebtext2-first-30-chunks-ablation-translation |
|
model-index: |
|
- name: training_translation |
|
results: [] |
|
--- |
|
|
|
<!-- This model card has been generated automatically according to the information the Trainer had access to. You |
|
should probably proofread and complete it, then remove this comment. --> |
|
|
|
# training_translation |
|
|
|
This model was trained from scratch on the RaiBP/openwebtext2-first-30-chunks-ablation-translation dataset. |
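
As a minimal usage sketch (the repo id below is the one used in the evaluation script further down in this card; adjust it if the checkpoint lives elsewhere), the model loads like any other GPT-2-style causal language model:

```python
# Minimal usage sketch; the repo id is taken from the evaluation script in this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RaiBP/gpt2-openwebtext2-first-30-chunks-ablation-translation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The weather today is"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```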
|
|
|
## Model description |
|
|
|
More information needed |
|
|
|
## Intended uses & limitations |
|
|
|
More information needed |
|
|
|
## Training and evaluation data |
|
|
|
More information needed |
|
|
|
## Training procedure |
|
The [`run_clm.py` script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py) from the Transformers library was used. Training was distributed across two NVIDIA Quadro RTX 6000 GPUs:
|
```bash
TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 nohup python -m torch.distributed.launch \
    --nproc_per_node=2 run_clm.py --output_dir="./training_translation" \
    --model_type="gpt2" \
    --config_name="./training" \
    --tokenizer_name="./training" \
    --dataset_name="RaiBP/openwebtext2-first-30-chunks-ablation-translation" \
    --do_train \
    --per_device_train_batch_size 8 \
    --block_size="1024" \
    --learning_rate="5e-3" --warmup_steps="1000" \
    --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
    --overwrite_output_dir \
    --num_train_epochs="1" \
    --logging_steps="500" \
    --save_steps="5000" --preprocessing_num_workers="16" \
    --gradient_accumulation_steps="4" --report_to="tensorboard" \
    --logging_dir="./log_translation" > command_translation_log.log 2>&1 &
```
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (see the sketch after the list for how the effective batch sizes are derived):
|
- learning_rate: 0.005 |
|
- train_batch_size: 8 |
|
- eval_batch_size: 8 |
|
- seed: 42 |
|
- distributed_type: multi-GPU |
|
- num_devices: 2 |
|
- gradient_accumulation_steps: 4 |
|
- total_train_batch_size: 64 |
|
- total_eval_batch_size: 16 |
|
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- lr_scheduler_warmup_steps: 1000 |
|
- num_epochs: 1.0 |
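
For reference, the effective batch sizes reported above follow from the per-device settings; a minimal sketch of the arithmetic, with values copied from the list:

```python
# Effective train batch size = per-device batch size x number of GPUs x gradient accumulation
per_device_train_batch_size = 8
num_devices = 2
gradient_accumulation_steps = 4

total_train_batch_size = per_device_train_batch_size * num_devices * gradient_accumulation_steps
assert total_train_batch_size == 64

# Evaluation runs without gradient accumulation, so:
total_eval_batch_size = 8 * num_devices
assert total_eval_batch_size == 16
```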
|
|
|
### Training results |
|
### Evaluation results |
|
Perplexity was computed on 2000 random examples from the target language's [Wikipedia dataset](https://huggingface.co/datasets/wikimedia/wikipedia), using the code provided in the [perplexity docs](https://huggingface.co/docs/transformers/perplexity) with a stride of 512 tokens.
|
The baseline is the result of evaluating [OpenAI's GPT-2](https://huggingface.co/gpt2) on the same examples.
|
| Target language | PPL | Baseline PPL |
|-----------------|--------------------|--------------------|
| en | 39.97170639038086 | 26.562532424926758 |
| de | 25.49677848815918 | 56.907039642333984 |
| es | 21.964618682861328 | 55.592445373535156 |
| fr | 25.343358993530273 | 49.69472885131836 |
| it | 25.46650505065918 | 75.95120239257812 |
| pt | 19.93419075012207 | |
| nl | 32.07345199584961 | |
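
The per-window losses are aggregated as in the linked docs; restated as a formula, this mirrors what the evaluation script below computes:

$$\mathrm{PPL} = \exp\left(\frac{1}{W}\sum_{w=1}^{W} \mathrm{NLL}_w\right)$$

where $\mathrm{NLL}_w$ is the mean negative log-likelihood over the target tokens of sliding window $w$.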
|
|
|
The following script was used for evaluation:
|
|
|
|
|
```python
import numpy as np
import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM

# Set the NumPy seed for reproducibility (the examples are sampled with np.random below)
np.random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and tokenizer
model_name = "RaiBP/gpt2-openwebtext2-first-30-chunks-ablation-translation"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

target_language_dataset = "20231101.de"  # change here for other languages

# Sample the evaluation examples and concatenate them into one long sequence
dataset = load_dataset("wikimedia/wikipedia", target_language_dataset, split="train")
num_examples = 2000
random_numbers = list(np.random.randint(0, len(dataset), num_examples))
examples = []
for i in tqdm(random_numbers):
    examples.append(dataset[int(i)]["text"])
encodings = tokenizer("\n\n".join(examples), return_tensors="pt")

max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

# Sliding-window perplexity, as in the Transformers perplexity docs
nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it
        # internally shifts the labels to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())

print("Perplexity: ", ppl.item())
```
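
The baseline column was presumably obtained the same way; a hedged sketch of the only change that would be needed (this is an assumption, the card only states that GPT-2 was evaluated on the same examples):

```python
# Hypothetical baseline run: identical script, only the checkpoint changes.
# With the NumPy seed fixed, the same 2000 Wikipedia examples are drawn.
model_name = "gpt2"
```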
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.37.0.dev0 |
|
- Pytorch 1.13.0 |
|
- Datasets 2.16.0 |
|
- Tokenizers 0.15.0 |
|
|