---
tags:
- generated_from_trainer
datasets:
- RaiBP/openwebtext2-first-30-chunks-ablation-translation
model-index:
- name: training_translation
  results: []
---

# training_translation

This model was trained from scratch on the RaiBP/openwebtext2-first-30-chunks-ablation-translation dataset.

## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

The [`run_clm.py` script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py) from the transformers library was used. Training was distributed across two NVIDIA Quadro RTX 6000 GPUs:

```bash
TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 nohup python -m torch.distributed.launch \
  --nproc_per_node=2 run_clm.py --output_dir="./training_translation" \
  --model_type="gpt2" \
  --config_name="./training" \
  --tokenizer_name="./training" \
  --dataset_name="RaiBP/openwebtext2-first-30-chunks-ablation-translation" \
  --do_train \
  --per_device_train_batch_size 8 \
  --block_size="1024" \
  --learning_rate="5e-3" --warmup_steps="1000" \
  --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
  --overwrite_output_dir \
  --num_train_epochs="1" \
  --logging_steps="500" \
  --save_steps="5000" --preprocessing_num_workers="16" \
  --gradient_accumulation_steps="4" --report_to="tensorboard" \
  --logging_dir="./log_translation" > command_translation_log.log 2>&1 &
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.005
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- gradient_accumulation_steps: 4
- total_train_batch_size: 64 (see the sketch below)
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 1.0
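The total train batch size is not an independent setting; it follows from the per-device batch size, the number of GPUs, and the gradient accumulation steps listed above. A minimal sanity check (variable names are illustrative only):

```python
# Effective (total) train batch size =
#   per-device batch size * number of GPUs * gradient accumulation steps
per_device_train_batch_size = 8
num_devices = 2
gradient_accumulation_steps = 4

total_train_batch_size = per_device_train_batch_size * num_devices * gradient_accumulation_steps
print(total_train_batch_size)  # 64, matching the value reported above
```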
### Training results

### Evaluation results

Perplexity was computed on 2000 randomly sampled examples from the target language's [Wikipedia dataset](https://huggingface.co/datasets/wikimedia/wikipedia), using the code provided in the [perplexity docs](https://huggingface.co/docs/transformers/perplexity) with a stride of 512 tokens. The baseline is the result of evaluating [OpenAI's GPT-2](https://huggingface.co/gpt2) on the same examples.

| Target language | PPL                | Baseline PPL       |
|-----------------|--------------------|--------------------|
| en              | 39.97170639038086  | 26.562532424926758 |
| de              | 25.49677848815918  | 56.907039642333984 |
| es              | 21.964618682861328 | 55.592445373535156 |
| fr              | 25.343358993530273 | 49.69472885131836  |
| it              | 25.46650505065918  | 75.95120239257812  |
| pt              | 19.93419075012207  |                    |
| nl              | 32.07345199584961  |                    |

The following script was used for evaluation:

```python
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
from tqdm import tqdm
import random

# Set the seeds for reproducibility; NumPy is seeded as well because the
# example indices below are drawn with np.random.randint
random.seed(42)
np.random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model
model_name = "RaiBP/gpt2-openwebtext2-first-30-chunks-ablation-translation"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

target_language_dataset = "20231101.de"  # change here for other languages
dataset = load_dataset("wikimedia/wikipedia", target_language_dataset, split="train")

# Sample 2000 random articles and concatenate their text
num_examples = 2000
random_numbers = list(np.random.randint(0, len(dataset), num_examples))
examples = []
for i in tqdm(random_numbers):
    examples.append(dataset[int(i)]["text"])

encodings = tokenizer("\n\n".join(examples), return_tensors="pt")

max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it internally shifts the labels
        # to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())
print("Perplexity: ", ppl.item())
```

### Framework versions

- Transformers 4.37.0.dev0
- Pytorch 1.13.0
- Datasets 2.16.0
- Tokenizers 0.15.0
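A note on the baseline column in the evaluation table: the description above states that it comes from evaluating OpenAI's GPT-2 on the same examples, presumably by re-running the same script with a different checkpoint. A minimal sketch of that variant, assuming only the model name and the dataset config change:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumption: the "Baseline PPL" column was obtained with the same evaluation
# script, pointing it at OpenAI's GPT-2 checkpoint instead of this model.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model_name = "gpt2"  # baseline checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# The target language is switched per table row via the dataset config,
# e.g. "20231101.fr" for French, following the pattern shown in the script above.
target_language_dataset = "20231101.fr"
```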