metadata

library_name: transformers
license: apache-2.0
base_model: alignment-handbook/zephyr-7b-sft-full
tags:
  - alignment-handbook
  - trl
  - dpo
  - generated_from_trainer
datasets:
  - data/zephyr_uf_rlced_conifer_ref
model-index:
  - name: zephyr-7b-uf-rlced-conifer-group-dpo-2e-alr-0.01
    results: []

zephyr-7b-uf-rlced-conifer-group-dpo-2e-alr-0.01

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-full, trained with DPO on the data/zephyr_uf_rlced_conifer_ref dataset. At the final checkpoint (step 1440), it achieves the following results on the evaluation set:

Loss: 0.2395
Rewards/chosen: -2.8511
Rewards/rejected: -8.5888
Rewards/accuracies: 0.8778
Rewards/margins: 5.7377
Logps/rejected: -1262.6172
Logps/chosen: -677.5837
Logits/rejected: 3.8778
Logits/chosen: 1.9376
Excess Loss: 0.0374
Alpha 0 Uf: 0.5116
Alpha 1 Rlced Conifer: 0.4884
Rewards/chosen 1 Rlced Conifer: -3.0535
Rewards/rejected 1 Rlced Conifer: -10.0348
Rewards/accuracies 1 Rlced Conifer: 0.9097
Rewards/margins 1 Rlced Conifer: 6.9812
Logps/rejected 1 Rlced Conifer: -1451.0132
Logps/chosen 1 Rlced Conifer: -728.9337
Logits/rejected 1 Rlced Conifer: 3.5676
Logits/chosen 1 Rlced Conifer: 1.5730
Task Loss 1 Rlced Conifer: 0.1787
Task Excess Loss 1 Rlced Conifer: 0.0427
Rewards/chosen 0 Uf: -2.0820
Rewards/rejected 0 Uf: -3.4336
Rewards/accuracies 0 Uf: 0.7633
Rewards/margins 0 Uf: 1.3516
Logps/rejected 0 Uf: -584.9677
Logps/chosen 0 Uf: -497.4562
Logits/rejected 0 Uf: 5.1753
Logits/chosen 0 Uf: 3.1000
Task Loss 0 Uf: 0.5185
Task Excess Loss 0 Uf: 0.0724
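
The reward metrics above can be sanity-checked against each other. The sketch below assumes the usual DPO logging convention in which the margin is the chosen reward minus the rejected reward, and it allows a small tolerance because the card reports rounded values. This card does not define the two Alpha values; the check only confirms that they sum to 1, which would be consistent with normalized per-dataset weights. The names `metrics`, `alphas`, and `margin_gap` are illustrative and not part of any library.

```python
# Evaluation metrics copied from the list above (final checkpoint).
metrics = {
    "overall": {"chosen": -2.8511, "rejected": -8.5888, "margin": 5.7377},
    "rlced_conifer": {"chosen": -3.0535, "rejected": -10.0348, "margin": 6.9812},
    "uf": {"chosen": -2.0820, "rejected": -3.4336, "margin": 1.3516},
}
alphas = {"uf": 0.5116, "rlced_conifer": 0.4884}


def margin_gap(entry):
    """Absolute difference between (chosen - rejected) and the reported margin."""
    return abs((entry["chosen"] - entry["rejected"]) - entry["margin"])


# Assumption: values are rounded to 4 decimals, so allow a small tolerance.
TOL = 5e-4
margins_consistent = all(margin_gap(e) <= TOL for e in metrics.values())
alpha_sum = sum(alphas.values())

print("margins consistent:", margins_consistent)
print("alpha sum:", round(alpha_sum, 4))
```

On these numbers the margin on the Rlced Conifer subset is roughly five times the margin on the Uf subset, which matches the gap in reward accuracy between the two (0.9097 versus 0.7633).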

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-07
train_batch_size: 8
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
num_devices: 8
gradient_accumulation_steps: 4
total_train_batch_size: 256
total_eval_batch_size: 64
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 2
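
As a rough illustration of how these values fit together, the sketch below recomputes the total batch sizes and evaluates a cosine schedule with linear warmup in plain Python. It is an approximation under stated assumptions, not the training code: `total_steps = 1440` is taken from the last logged step in the results table, the warmup length is assumed to be the ceiling of `total_steps * warmup_ratio`, and the schedule is assumed to decay to zero over half a cosine cycle. The function name `lr_at` is hypothetical.

```python
import math

# Values copied from the hyperparameter list above.
train_batch_size = 8            # per device
eval_batch_size = 8             # per device
num_devices = 8
gradient_accumulation_steps = 4
peak_lr = 5e-07
warmup_ratio = 0.1

# Assumption: total optimizer steps equals the last logged step.
total_steps = 1440

# Effective batch sizes across devices (and accumulation, for training).
total_train_batch_size = train_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices


def lr_at(step, total=total_steps, peak=peak_lr, ratio=warmup_ratio):
    """Learning rate at a given step: linear warmup, then cosine decay to 0."""
    warmup = math.ceil(total * ratio)  # assumed warmup length
    if step < warmup:
        return peak * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


print(total_train_batch_size, total_eval_batch_size)
for step in (0, 72, 144, 792, 1440):
    print(step, lr_at(step))
```

Under these assumptions the learning rate rises linearly for the first 144 steps, peaks at 5e-07, and falls back to zero by step 1440.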

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen	Excess Loss	Alpha 0 Uf	Alpha 1 Rlced Conifer	Rewards/chosen 1 Rlced Conifer	Rewards/rejected 1 Rlced Conifer	Rewards/accuracies 1 Rlced Conifer	Rewards/margins 1 Rlced Conifer	Logps/rejected 1 Rlced Conifer	Logps/chosen 1 Rlced Conifer	Logits/rejected 1 Rlced Conifer	Logits/chosen 1 Rlced Conifer	Task Loss 1 Rlced Conifer	Task Excess Loss 1 Rlced Conifer	Rewards/chosen 0 Uf	Rewards/rejected 0 Uf	Rewards/accuracies 0 Uf	Rewards/margins 0 Uf	Logps/rejected 0 Uf	Logps/chosen 0 Uf	Logits/rejected 0 Uf	Logits/chosen 0 Uf	Task Loss 0 Uf	Task Excess Loss 0 Uf
0.1689	0.4997	360	0.2674	-2.2066	-5.7976	0.8656	3.5910	-983.4942	-613.1316	1.9639	0.4895	0.0642	0.5765	0.4235	-2.3017	-6.6520	0.8965	4.3503	-1112.7397	-653.7553	1.7066	0.1879	0.2091	0.0748	-1.8461	-2.7792	0.7426	0.9330	-519.5245	-473.8738	3.0556	1.4702	0.5392	0.0891
0.1413	0.9993	720	0.2485	-2.0138	-6.1196	0.8741	4.1059	-1015.6987	-593.8471	2.5252	1.3345	0.0465	0.6417	0.3583	-2.0972	-7.0507	0.9047	4.9535	-1152.6036	-633.2974	2.1536	1.0120	0.1925	0.0584	-1.6822	-2.7943	0.7670	1.1121	-521.0374	-457.4840	4.0168	2.3771	0.4989	0.0595
0.0671	1.4990	1080	0.2408	-2.5432	-7.7524	0.8741	5.2092	-1178.9786	-646.7894	3.9871	2.3348	0.0389	0.5284	0.4716	-2.6717	-8.9931	0.9071	6.3215	-1346.8500	-690.7497	3.5948	1.9516	0.1822	0.0462	-2.0401	-3.3250	0.7500	1.2849	-574.1076	-493.2740	5.5773	3.5557	0.5197	0.0655
0.0649	1.9986	1440	0.2395	-2.8511	-8.5888	0.8778	5.7377	-1262.6172	-677.5837	3.8778	1.9376	0.0374	0.5116	0.4884	-3.0535	-10.0348	0.9097	6.9812	-1451.0132	-728.9337	3.5676	1.5730	0.1787	0.0427	-2.0820	-3.4336	0.7633	1.3516	-584.9677	-497.4562	5.1753	3.1000	0.5185	0.0724
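
The first four columns of the table are enough for a couple of quick checks. The sketch below confirms that the validation loss falls at every logged evaluation and estimates the number of optimizer steps per epoch from the Step and Epoch columns. The examples-per-epoch figure is only an estimate: it assumes every optimizer step consumes exactly total_train_batch_size = 256 preference pairs, and the Epoch column is rounded. The variable names are illustrative.

```python
# (training loss, epoch, step, validation loss) copied from the table above.
rows = [
    (0.1689, 0.4997, 360, 0.2674),
    (0.1413, 0.9993, 720, 0.2485),
    (0.0671, 1.4990, 1080, 0.2408),
    (0.0649, 1.9986, 1440, 0.2395),
]
TOTAL_TRAIN_BATCH_SIZE = 256  # from the hyperparameter list

val_losses = [val for _, _, _, val in rows]
# True when each evaluation improves on the previous one.
val_loss_decreasing = all(a > b for a, b in zip(val_losses, val_losses[1:]))

# Steps per epoch implied by each row (step / epoch).
steps_per_epoch = [step / epoch for _, epoch, step, _ in rows]
# Estimate only: assumes each step uses a full batch of 256 pairs.
approx_examples_per_epoch = [s * TOTAL_TRAIN_BATCH_SIZE for s in steps_per_epoch]

print("validation loss decreasing:", val_loss_decreasing)
print("steps per epoch:", [round(s, 2) for s in steps_per_epoch])
print("approx. examples per epoch:", [round(e) for e in approx_examples_per_epoch])
```

Each row implies between 720 and 721 steps per epoch, so the four evaluations fall at roughly every half epoch.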

Framework versions

Transformers 4.44.2
PyTorch 2.2.0a0+81ea7a4
Datasets 2.21.0
Tokenizers 0.19.1