zephyr-7b-dpo-qlora

This model is a fine-tuned version of alignment-handbook/zephyr-7b-sft-qlora on the HuggingFaceH4/ultrafeedback_binarized dataset. It achieves the following results on the evaluation set:

Loss: 0.4788
Rewards/chosen: -2.6215
Rewards/rejected: -3.9187
Rewards/accuracies: 0.7465
Rewards/margins: 1.2972
Logps/rejected: -636.4379
Logps/chosen: -526.7527
Logits/rejected: -1.0290
Logits/chosen: -1.1652
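
The reward figures above can be cross-checked against each other. The sketch below assumes the common DPO logging convention (used, for example, by TRL's DPOTrainer, which this card does not name explicitly) in which the margin is the chosen reward minus the rejected reward. The variable names are illustrative only.

```python
# Evaluation metrics copied from the summary above.
rewards_chosen = -2.6215
rewards_rejected = -3.9187
reported_margin = 1.2972

# Assumption: margin = chosen reward - rejected reward (common DPO logging convention).
computed_margin = round(rewards_chosen - rewards_rejected, 4)

print(computed_margin, reported_margin)
```

Under the same assumed convention, Rewards/accuracies (0.7465) would be the fraction of evaluation pairs in which the chosen response receives the higher reward.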

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

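The model was fine-tuned on the HuggingFaceH4/ultrafeedback_binarized dataset, and the results above are reported on the evaluation set. More information needed on splits and preprocessing.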

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-06
train_batch_size: 4
eval_batch_size: 8
seed: 42
distributed_type: multi-GPU
gradient_accumulation_steps: 4
total_train_batch_size: 16
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: cosine
lr_scheduler_warmup_ratio: 0.1
num_epochs: 1
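
The batch-size and scheduler settings above can be tied together with a little arithmetic. This is a minimal sketch under stated assumptions, not the training code: it assumes the usual Hugging Face Trainer convention that the total batch size is the per-device batch size times the accumulation steps times the device count, and a linear warmup followed by a half-cosine decay to zero. The total step count is an estimate taken from the results table (step 3800 at epoch 0.9945), not a value stated in this card, and `lr_at` is a hypothetical helper.

```python
import math

# Values from the hyperparameter list above.
learning_rate = 5e-6
train_batch_size = 4              # per-device batch size
gradient_accumulation_steps = 4
total_train_batch_size = 16
warmup_ratio = 0.1

# Assumption: total = per-device batch * accumulation steps * number of devices.
implied_devices = total_train_batch_size // (train_batch_size * gradient_accumulation_steps)

# Assumption: steps per epoch estimated from the results table (step 3800 is epoch 0.9945).
total_steps = round(3800 / 0.9945)
warmup_steps = math.ceil(total_steps * warmup_ratio)


def lr_at(step: int) -> float:
    """Linear warmup to the peak rate, then half-cosine decay to zero."""
    if step < warmup_steps:
        return learning_rate * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))


print(implied_devices, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, lr_at(step))
```

Under that convention the listed sizes imply a single device, even though distributed_type is multi-GPU. The card does not list a device count, so treat this as an inference rather than a documented fact.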

Training results

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6807	0.0262	100	0.6809	0.0514	0.0256	0.6555	0.0258	-242.0131	-259.4604	-2.0551	-2.1482
0.6438	0.0523	200	0.6356	-0.1881	-0.3389	0.6760	0.1508	-278.4615	-283.4154	-2.0113	-2.1000
0.6073	0.0785	300	0.6054	-0.6866	-0.9744	0.6815	0.2878	-342.0091	-333.2583	-1.9949	-2.0782
0.5956	0.1047	400	0.5824	-1.4485	-1.9599	0.6830	0.5114	-440.5653	-409.4522	-1.5844	-1.6758
0.5643	0.1309	500	0.5726	-1.1458	-1.7589	0.6915	0.6131	-420.4636	-379.1804	-1.5624	-1.6658
0.5373	0.1570	600	0.5631	-1.1286	-1.8164	0.7030	0.6878	-426.2121	-377.4605	-1.6945	-1.7955
0.5394	0.1832	700	0.5474	-2.2700	-3.0663	0.7040	0.7963	-551.1992	-491.6012	-1.1628	-1.2719
0.4983	0.2094	800	0.5323	-1.5616	-2.2966	0.7225	0.7349	-474.2269	-420.7654	-1.5104	-1.5996
0.4763	0.2355	900	0.5386	-1.6130	-2.4122	0.7160	0.7992	-485.7890	-425.9030	-1.4156	-1.4989
0.5266	0.2617	1000	0.5234	-2.1788	-3.0546	0.7280	0.8758	-550.0311	-482.4831	-1.2043	-1.3050
0.5900	0.2879	1100	0.5278	-1.6937	-2.3427	0.7300	0.6490	-478.8385	-433.9710	-0.9899	-1.1100
0.5724	0.3141	1200	0.5071	-1.5548	-2.4072	0.7380	0.8523	-485.2895	-420.0863	-1.1349	-1.2473
0.5457	0.3402	1300	0.5013	-1.7544	-2.6264	0.7435	0.8721	-507.2138	-440.0385	-1.2424	-1.3403
0.5423	0.3664	1400	0.5132	-1.6381	-2.6114	0.7210	0.9733	-505.7077	-428.4097	-1.5063	-1.5869
0.4492	0.3926	1500	0.5122	-1.5882	-2.5891	0.7260	1.0010	-503.4828	-423.4175	-1.4972	-1.5950
0.5491	0.4187	1600	0.4956	-1.6959	-2.7056	0.7395	1.0098	-515.1351	-434.1913	-1.1293	-1.2525
0.5408	0.4449	1700	0.5111	-3.0361	-4.2392	0.7305	1.2030	-668.4869	-568.2142	-1.0520	-1.1774
0.4705	0.4711	1800	0.4949	-2.1236	-3.1894	0.7435	1.0658	-563.5121	-476.9663	-1.3479	-1.4508
0.4447	0.4973	1900	0.4984	-2.0350	-3.1505	0.7420	1.1155	-559.6229	-468.1011	-1.1711	-1.2951
0.4561	0.5234	2000	0.4929	-1.9668	-2.9588	0.7420	0.9919	-540.4462	-461.2839	-1.3557	-1.4696
0.5068	0.5496	2100	0.4969	-3.1452	-4.3633	0.7350	1.2180	-680.8954	-579.1231	-1.1150	-1.2426
0.4839	0.5758	2200	0.4927	-2.3797	-3.4376	0.7405	1.0579	-588.3315	-502.5681	-1.2706	-1.3886
0.4729	0.6019	2300	0.4924	-2.8461	-4.1210	0.7405	1.2749	-656.6667	-549.2124	-1.0868	-1.2145
0.4501	0.6281	2400	0.4900	-2.9743	-4.2366	0.7430	1.2623	-668.2346	-562.0333	-0.9978	-1.1257
0.4982	0.6543	2500	0.4872	-2.4585	-3.6758	0.7420	1.2173	-612.1486	-510.4511	-1.0532	-1.1862
0.4649	0.6805	2600	0.4881	-2.5759	-3.8831	0.7450	1.3072	-632.8793	-522.1908	-1.0793	-1.2115
0.5560	0.7066	2700	0.4841	-2.3432	-3.5113	0.7460	1.1680	-595.6959	-498.9265	-1.1004	-1.2295
0.4617	0.7328	2800	0.4832	-2.3495	-3.6183	0.7460	1.2689	-606.4033	-499.5496	-1.0627	-1.1960
0.4916	0.7590	2900	0.4800	-2.6711	-3.9165	0.7455	1.2454	-636.2195	-531.7142	-1.0032	-1.1418
0.4708	0.7851	3000	0.4797	-2.6166	-3.7883	0.7475	1.1717	-623.4008	-526.2621	-0.9962	-1.1355
0.4804	0.8113	3100	0.4807	-2.8224	-4.1220	0.7475	1.2996	-656.7728	-546.8435	-0.9953	-1.1341
0.4866	0.8375	3200	0.4777	-2.5496	-3.7894	0.7475	1.2398	-623.5103	-519.5614	-1.0276	-1.1641
0.4967	0.8636	3300	0.4786	-2.5578	-3.8108	0.7480	1.2530	-625.6535	-520.3804	-1.0241	-1.1608
0.4272	0.8898	3400	0.4797	-2.7223	-4.0287	0.7460	1.3065	-647.4435	-536.8282	-1.0071	-1.1445
0.5272	0.9160	3500	0.4797	-2.7144	-4.0320	0.7470	1.3176	-647.7730	-536.0449	-1.0233	-1.1601
0.4441	0.9422	3600	0.4790	-2.6459	-3.9513	0.7470	1.3054	-639.7043	-529.1944	-1.0278	-1.1641
0.4823	0.9683	3700	0.4789	-2.6279	-3.9262	0.7480	1.2982	-637.1880	-527.3952	-1.0329	-1.1687
0.4996	0.9945	3800	0.4788	-2.6215	-3.9183	0.7475	1.2968	-636.4029	-526.7561	-1.0296	-1.1658
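
To find the evaluation step with the lowest validation loss, the Validation Loss column can be scanned programmatically. The sketch below copies that column from the table above and assumes only what the Step column shows, namely one evaluation every 100 steps.

```python
# Validation Loss column from the table above, in step order (one evaluation per 100 steps).
validation_loss = [
    0.6809, 0.6356, 0.6054, 0.5824, 0.5726, 0.5631, 0.5474, 0.5323, 0.5386, 0.5234,
    0.5278, 0.5071, 0.5013, 0.5132, 0.5122, 0.4956, 0.5111, 0.4949, 0.4984, 0.4929,
    0.4969, 0.4927, 0.4924, 0.4900, 0.4872, 0.4881, 0.4841, 0.4832, 0.4800, 0.4797,
    0.4807, 0.4777, 0.4786, 0.4797, 0.4797, 0.4790, 0.4789, 0.4788,
]
steps = [100 * (i + 1) for i in range(len(validation_loss))]

# min() on (loss, step) pairs returns the lowest loss and the step where it occurred.
best_loss, best_step = min(zip(validation_loss, steps))
print(best_step, best_loss)
```

In this table the minimum falls at step 3200, and the final logged value at step 3800 is within about 0.001 of it.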

Framework versions

PEFT 0.13.2
Transformers 4.45.2
Pytorch 2.1.2+cu121
Datasets 3.0.1
Tokenizers 0.20.1
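
To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch only: `find_mismatches` and `EXPECTED` are hypothetical names, and the package names are assumed to be the usual PyPI distribution names (for example `torch` for PyTorch).

```python
from importlib import metadata

# Versions listed above; "torch" is assumed to be the distribution name for PyTorch.
EXPECTED = {
    "peft": "0.13.2",
    "transformers": "4.45.2",
    "torch": "2.1.2",  # the "+cu121" suffix names the CUDA build and is ignored here
    "datasets": "3.0.1",
    "tokenizers": "0.20.1",
}


def find_mismatches(expected, get_version=metadata.version):
    """Return {package: (expected, installed)} for packages that differ or are missing."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name).split("+")[0]  # drop any local build tag
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    print(find_mismatches(EXPECTED))
```

An empty dictionary means every listed package matches. Other versions may well work, since the card records only what was used for training.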
