llama-7b-SFT-qlora-eli5-wiki_DPO_ds_RM_top_2_1024_r_64_alpha_16

This model is a fine-tuned version of dhmeltzer/llama-7b-SFT_eli5_wiki65k_1024_r_64_alpha_16_merged on an unknown dataset. Its per-step results on the evaluation set are listed in the table below.

Model description

More information needed


The following hyperparameters were used during training:

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6828 | 0.2 | 37 | 0.6867 | -0.3470 | -0.4719 | 0.5792 | 0.1249 | -201.8816 | -206.4072 | 0.7977 | 0.8213 |
| 0.6666 | 0.41 | 74 | 0.6731 | -0.1233 | -0.2593 | 0.5855 | 0.1361 | -200.8187 | -205.2885 | 0.8159 | 0.8381 |
| 0.6713 | 0.61 | 111 | 0.6645 | 0.0492 | -0.1110 | 0.6019 | 0.1602 | -200.0772 | -204.4260 | 0.8299 | 0.8526 |
| 0.6749 | 0.82 | 148 | 0.6593 | 0.2291 | 0.0917 | 0.5912 | 0.1374 | -199.0636 | -203.5266 | 0.8189 | 0.8414 |
| 0.6688 | 1.02 | 185 | 0.6538 | 0.1408 | -0.0291 | 0.6248 | 0.1699 | -199.6676 | -203.9681 | 0.8159 | 0.8393 |
| 0.3721 | 1.23 | 222 | 0.6911 | -0.3548 | -0.6171 | 0.6007 | 0.2623 | -202.6077 | -206.4462 | 0.8193 | 0.8406 |
| 0.2845 | 1.43 | 259 | 0.6989 | -0.3528 | -0.5968 | 0.5984 | 0.2441 | -202.5062 | -206.4359 | 0.7886 | 0.8059 |
| 0.2646 | 1.64 | 296 | 0.6991 | -0.4016 | -0.6359 | 0.5880 | 0.2343 | -202.7015 | -206.6800 | 0.7696 | 0.7875 |
| 0.2263 | 1.84 | 333 | 0.7063 | -0.4773 | -0.7137 | 0.5925 | 0.2365 | -203.0908 | -207.0584 | 0.7653 | 0.7833 |
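
The reward columns above can be sanity-checked with a few lines of Python. The sketch below is illustrative and not part of the original training code: it assumes that Rewards/margins is the difference Rewards/chosen minus Rewards/rejected (which the reported numbers appear to satisfy up to rounding), and the names `ROWS`, `margin_mismatches`, and `best_step_by_val_loss` are hypothetical helpers introduced here.

```python
# Each tuple: (step, train_loss, val_loss, reward_chosen, reward_rejected, reward_margin)
# Values are copied from the training results table above.
ROWS = [
    (37, 0.6828, 0.6867, -0.3470, -0.4719, 0.1249),
    (74, 0.6666, 0.6731, -0.1233, -0.2593, 0.1361),
    (111, 0.6713, 0.6645, 0.0492, -0.1110, 0.1602),
    (148, 0.6749, 0.6593, 0.2291, 0.0917, 0.1374),
    (185, 0.6688, 0.6538, 0.1408, -0.0291, 0.1699),
    (222, 0.3721, 0.6911, -0.3548, -0.6171, 0.2623),
    (259, 0.2845, 0.6989, -0.3528, -0.5968, 0.2441),
    (296, 0.2646, 0.6991, -0.4016, -0.6359, 0.2343),
    (333, 0.2263, 0.7063, -0.4773, -0.7137, 0.2365),
]


def margin_mismatches(rows, tol=2e-4):
    """Return the steps where (chosen - rejected) differs from the
    reported margin by more than tol (tol allows for 4-decimal rounding)."""
    return [
        step
        for step, _, _, chosen, rejected, margin in rows
        if abs((chosen - rejected) - margin) > tol
    ]


def best_step_by_val_loss(rows):
    """Return the step with the lowest validation loss."""
    return min(rows, key=lambda row: row[2])[0]


if __name__ == "__main__":
    print(margin_mismatches(ROWS))      # [] : every margin matches chosen - rejected
    print(best_step_by_val_loss(ROWS))  # 185
```

In this table the validation loss is lowest at step 185 (0.6538) and rises at every later evaluation, while the training loss drops from 0.6688 to 0.2263. That pattern may indicate overfitting during the second epoch, although the card does not say which checkpoint was published.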
