Llama-2-7b-hf-DPO-LookAhead-5_Q2_TTree1.4_TT0.9_TP0.7_TE0.2_V5
This model is a fine-tuned version of meta-llama/Llama-2-7b-hf on an unspecified dataset. It achieves the following results on the evaluation set:
- Loss: 0.8926
- Rewards/chosen: -1.4220
- Rewards/rejected: -1.3388
- Rewards/accuracies: 0.4000
- Rewards/margins: -0.0832
- Logps/rejected: -160.5441
- Logps/chosen: -150.5625
- Logits/rejected: -0.0984
- Logits/chosen: -0.0941
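The reward entries above follow the usual DPO convention (an assumption, since the training script is not included in this card): the implicit reward is the β-scaled log-probability ratio between the policy and the frozen reference model, and the margin is the chosen reward minus the rejected reward:

$$
r_\theta(x, y) = \beta \left( \log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x) \right),
\qquad
\text{margin} = r_\theta(x, y_{\text{chosen}}) - r_\theta(x, y_{\text{rejected}})
$$

With the values above the margin is consistent: -1.4220 - (-1.3388) = -0.0832, i.e. the final checkpoint slightly prefers the rejected completions on the evaluation set.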
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 4
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3
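The hyperparameters above map directly onto TRL's `DPOTrainer`. The sketch below is a minimal, hypothetical reconstruction only: the actual training script, dataset, and LoRA configuration are not part of this card, so the dataset contents and `LoraConfig` values are placeholders.

```python
# Hedged sketch: reproducing the listed hyperparameters with TRL's DPOTrainer.
# Dataset contents and LoRA settings are illustrative assumptions, not taken from this card.
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)
tokenizer.pad_token = tokenizer.eos_token

# Tiny placeholder preference dataset; the card does not name the real training data.
train_dataset = Dataset.from_dict({
    "prompt": ["What is the capital of France?"],
    "chosen": ["Paris."],
    "rejected": ["London."],
})
eval_dataset = train_dataset

# Hyperparameters taken from the list above; the default AdamW optimizer already uses
# betas=(0.9, 0.999) and epsilon=1e-08.
args = DPOConfig(
    output_dir="Llama-2-7b-hf-DPO-LookAhead-5_Q2_TTree1.4_TT0.9_TP0.7_TE0.2_V5",
    learning_rate=5e-5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,  # total train batch size: 2 * 2 = 4
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    seed=42,
)

# Illustrative LoRA config; the actual PEFT setup is not documented here.
peft_config = LoraConfig(task_type="CAUSAL_LM", r=16, lora_alpha=32, lora_dropout=0.05)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,  # newer TRL releases use processing_class= instead
    peft_config=peft_config,
)
trainer.train()
```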
Training results
| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6642 | 0.2998 | 64 | 0.6861 | 0.1096 | 0.0712 | 0.7000 | 0.0384 | -146.4445 | -135.2473 | 0.3892 | 0.3857 |
| 0.7679 | 0.5995 | 128 | 0.6702 | 0.2151 | 0.1325 | 0.5000 | 0.0826 | -145.8312 | -134.1918 | 0.3616 | 0.3586 |
| 0.6956 | 0.8993 | 192 | 0.7032 | 0.0993 | 0.1090 | 0.5000 | -0.0098 | -146.0662 | -135.3502 | 0.3503 | 0.3473 |
| 0.4280 | 1.1991 | 256 | 0.7001 | 0.0275 | -0.0647 | 0.5000 | 0.0922 | -147.8036 | -136.0676 | 0.2753 | 0.2734 |
| 0.3326 | 1.4988 | 320 | 0.7460 | -0.5011 | -0.5860 | 0.6000 | 0.0849 | -153.0164 | -141.3538 | 0.1433 | 0.1439 |
| 0.4980 | 1.7986 | 384 | 0.7965 | -0.6044 | -0.6122 | 0.5000 | 0.0078 | -153.2779 | -142.3867 | 0.0688 | 0.0703 |
| 0.3640 | 2.0984 | 448 | 0.8243 | -0.7682 | -0.6945 | 0.4000 | -0.0737 | -154.1017 | -144.0248 | 0.0654 | 0.0667 |
| 0.2876 | 2.3981 | 512 | 0.8566 | -1.3864 | -1.3678 | 0.4000 | -0.0186 | -160.8344 | -150.2071 | -0.0854 | -0.0815 |
| 0.0473 | 2.6979 | 576 | 0.8926 | -1.4220 | -1.3388 | 0.4000 | -0.0832 | -160.5441 | -150.5625 | -0.0984 | -0.0941 |
Framework versions
- PEFT 0.12.0
- Transformers 4.45.2
- Pytorch 2.4.0+cu121
- Datasets 3.2.0
- Tokenizers 0.20.3
Model tree for LBK95/Llama-2-7b-hf-DPO-LookAhead-5_Q2_TTree1.4_TT0.9_TP0.7_TE0.2_V5
- Base model: meta-llama/Llama-2-7b-hf
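The PEFT framework version listed above suggests this repository hosts a LoRA adapter rather than full model weights. Assuming that, it can be loaded on top of the base model roughly as follows (a usage sketch, not an official snippet from this card):

```python
# Hedged sketch: loading the adapter on top of meta-llama/Llama-2-7b-hf.
# Assumes the repo contains a PEFT/LoRA adapter, as the PEFT 0.12.0 entry suggests.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "LBK95/Llama-2-7b-hf-DPO-LookAhead-5_Q2_TTree1.4_TT0.9_TP0.7_TE0.2_V5"

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(base, adapter_id)

prompt = "Explain direct preference optimization in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```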