|
--- |
|
tags: |
|
- generated_from_trainer |
|
datasets: |
|
- RaiBP/openwebtext2-first-30-chunks-ablation-translation |
|
model-index: |
|
- name: training_translation |
|
results: [] |
|
--- |
|
|
|
<!-- This model card has been generated automatically according to the information the Trainer had access to. You |
|
should probably proofread and complete it, then remove this comment. --> |
|
|
|
# training_translation |
|
|
|
This model was trained from scratch on the RaiBP/openwebtext2-first-30-chunks-ablation-translation dataset. |
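
As a minimal usage sketch (the repo id below is the one used in the evaluation script further down in this card; adjust it if the checkpoint lives elsewhere), the model loads like any other GPT-2-style causal language model:

```python
# Minimal usage sketch; the repo id is taken from the evaluation script in this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RaiBP/gpt2-openwebtext2-first-30-chunks-ablation-translation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The weather today is"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```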
|
|
|
## Model description |
|
|
|
More information needed |
|
|
|
## Intended uses & limitations |
|
|
|
More information needed |
|
|
|
## Training and evaluation data |
|
|
|
More information needed |
|
|
|
## Training procedure |
|
The [`run_clm.py` script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py) from the Transformers library was used. Training was distributed across two NVIDIA Quadro RTX 6000 GPUs:
|
```bash
TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 nohup python -m torch.distributed.launch \
    --nproc_per_node=2 run_clm.py --output_dir="./training_translation" \
    --model_type="gpt2" \
    --config_name="./training" \
    --tokenizer_name="./training" \
    --dataset_name="RaiBP/openwebtext2-first-30-chunks-ablation-translation" \
    --do_train \
    --per_device_train_batch_size 8 \
    --block_size="1024" \
    --learning_rate="5e-3" --warmup_steps="1000" \
    --adam_beta1="0.9" --adam_beta2="0.98" --weight_decay="0.01" \
    --overwrite_output_dir \
    --num_train_epochs="1" \
    --logging_steps="500" \
    --save_steps="5000" --preprocessing_num_workers="16" \
    --gradient_accumulation_steps="4" --report_to="tensorboard" \
    --logging_dir="./log_translation" > command_translation_log.log 2>&1 &
```
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (see the sketch after the list for how the effective batch sizes are derived):
|
- learning_rate: 0.005 |
|
- train_batch_size: 8 |
|
- eval_batch_size: 8 |
|
- seed: 42 |
|
- distributed_type: multi-GPU |
|
- num_devices: 2 |
|
- gradient_accumulation_steps: 4 |
|
- total_train_batch_size: 64 |
|
- total_eval_batch_size: 16 |
|
- optimizer: Adam with betas=(0.9,0.98) and epsilon=1e-08 |
|
- lr_scheduler_type: linear |
|
- lr_scheduler_warmup_steps: 1000 |
|
- num_epochs: 1.0 |
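
For reference, the effective batch sizes reported above follow from the per-device settings; a minimal sketch of the arithmetic, with values copied from the list:

```python
# Effective train batch size = per-device batch size x number of GPUs x gradient accumulation
per_device_train_batch_size = 8
num_devices = 2
gradient_accumulation_steps = 4

total_train_batch_size = per_device_train_batch_size * num_devices * gradient_accumulation_steps
assert total_train_batch_size == 64

# Evaluation runs without gradient accumulation, so:
total_eval_batch_size = 8 * num_devices
assert total_eval_batch_size == 16
```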
|
|
|
### Training results |
|
### Evaluation results |
|
Perplexity was computed on 2000 random examples from the target language's [Wikipedia dataset](https://huggingface.co/datasets/wikimedia/wikipedia), using the code provided in the [perplexity docs](https://huggingface.co/docs/transformers/perplexity) with a stride of 512 tokens.
|
The baseline is the result of evaluating [OpenAI's GPT-2](https://huggingface.co/gpt2) on the same examples.
|
| Target language | PPL | Baseline PPL |
|-----------------|--------------------|--------------------|
| en | 39.97170639038086 | 26.562532424926758 |
| de | 25.49677848815918 | 56.907039642333984 |
| es | 21.964618682861328 | 55.592445373535156 |
| fr | 25.343358993530273 | 49.69472885131836 |
| it | 25.46650505065918 | 75.95120239257812 |
| pt | 19.93419075012207 | |
| nl | 32.07345199584961 | |
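
The per-window losses are aggregated as in the linked docs; restated as a formula, this mirrors what the evaluation script below computes:

$$\mathrm{PPL} = \exp\left(\frac{1}{W}\sum_{w=1}^{W} \mathrm{NLL}_w\right)$$

where $\mathrm{NLL}_w$ is the mean negative log-likelihood over the target tokens of sliding window $w$.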
|
|
|
The following script was used for evaluation:
|
|
|
|
|
```python
import numpy as np
import torch
from datasets import load_dataset
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM

# Set the NumPy seed for reproducibility (the examples are sampled with np.random below)
np.random.seed(42)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the model and tokenizer
model_name = "RaiBP/gpt2-openwebtext2-first-30-chunks-ablation-translation"
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_name)

target_language_dataset = "20231101.de"  # change here for other languages

# Sample the evaluation examples and concatenate them into one long sequence
dataset = load_dataset("wikimedia/wikipedia", target_language_dataset, split="train")
num_examples = 2000
random_numbers = list(np.random.randint(0, len(dataset), num_examples))
examples = []
for i in tqdm(random_numbers):
    examples.append(dataset[int(i)]["text"])
encodings = tokenizer("\n\n".join(examples), return_tensors="pt")

max_length = model.config.n_positions
stride = 512
seq_len = encodings.input_ids.size(1)

# Sliding-window perplexity, as in the Transformers perplexity docs
nlls = []
prev_end_loc = 0
for begin_loc in tqdm(range(0, seq_len, stride)):
    end_loc = min(begin_loc + max_length, seq_len)
    trg_len = end_loc - prev_end_loc  # may be different from stride on last loop
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to(device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100

    with torch.no_grad():
        outputs = model(input_ids, labels=target_ids)

        # loss is calculated using CrossEntropyLoss which averages over valid labels
        # N.B. the model only calculates loss over trg_len - 1 labels, because it
        # internally shifts the labels to the left by 1.
        neg_log_likelihood = outputs.loss

    nlls.append(neg_log_likelihood)

    prev_end_loc = end_loc
    if end_loc == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).mean())

print("Perplexity: ", ppl.item())
```
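
The baseline column was presumably obtained the same way; a hedged sketch of the only change that would be needed (this is an assumption, the card only states that GPT-2 was evaluated on the same examples):

```python
# Hypothetical baseline run: identical script, only the checkpoint changes.
# With the NumPy seed fixed, the same 2000 Wikipedia examples are drawn.
model_name = "gpt2"
```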
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.37.0.dev0 |
|
- Pytorch 1.13.0 |
|
- Datasets 2.16.0 |
|
- Tokenizers 0.15.0 |
|
|