gpt2-dpo-from_base_gpt2

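This model is a fine-tuned version of openai-community/gpt2 on an unspecified dataset. At the final evaluation (step 6680, the last row of the training results table below), it achieves the following results on the evaluation set:

Loss: 0.6406
Rewards/chosen: 1.1312
Rewards/rejected: 0.9208
Rewards/accuracies: 0.6373
Rewards/margins: 0.2103
Logps/rejected: -429.5498
Logps/chosen: -508.5024
Logits/rejected: -96.1598
Logits/chosen: -94.9073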

Model description

More information needed

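The hyperparameters used during training are not listed in this card. The table below reports training and evaluation metrics at roughly one-epoch intervals: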

Training Loss	Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
0.6679	0.9993	668	0.6728	0.2747	0.2209	0.625	0.0538	-436.5490	-517.0669	-96.0258	-94.8005
0.6697	2.0	1337	0.6545	0.6507	0.5283	0.6295	0.1224	-433.4745	-513.3065	-96.0560	-94.8147
0.6516	2.9993	2005	0.6467	0.8424	0.6867	0.6336	0.1557	-431.8912	-511.3903	-96.1361	-94.8919
0.6264	4.0	2674	0.6436	0.9803	0.7989	0.6336	0.1814	-430.7686	-510.0109	-96.1278	-94.8762
0.6114	4.9993	3342	0.6420	1.0453	0.8518	0.6377	0.1935	-430.2403	-509.3612	-96.1435	-94.8917
0.6016	6.0	4011	0.6412	1.0870	0.8859	0.6377	0.2011	-429.8991	-508.9442	-96.1471	-94.8941
0.6115	6.9993	4679	0.6408	1.1137	0.9071	0.6384	0.2066	-429.6871	-508.6768	-96.1587	-94.9064
0.6079	8.0	5348	0.6406	1.1274	0.9178	0.6388	0.2096	-429.5802	-508.5403	-96.1573	-94.9046
0.6066	8.9993	6016	0.6406	1.1310	0.9207	0.6373	0.2103	-429.5507	-508.5036	-96.1593	-94.9068
0.5968	9.9925	6680	0.6406	1.1312	0.9208	0.6373	0.2103	-429.5498	-508.5024	-96.1598	-94.9073
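
The reward columns in the table can be cross-checked against each other. The sketch below assumes the convention commonly used by DPO trainers, which this card does not confirm: Rewards/margins is the mean of (chosen reward minus rejected reward), and the per-example loss is -log(sigmoid(margin)). The names ROWS, margin, and sigmoid_dpo_loss are hypothetical helpers introduced for illustration; the numbers are copied from the table above.

```python
import math

# (epoch, rewards_chosen, rewards_rejected, reported_margin), copied from the table above
ROWS = [
    (0.9993, 0.2747, 0.2209, 0.0538),
    (2.0, 0.6507, 0.5283, 0.1224),
    (2.9993, 0.8424, 0.6867, 0.1557),
    (4.0, 0.9803, 0.7989, 0.1814),
    (4.9993, 1.0453, 0.8518, 0.1935),
    (6.0, 1.0870, 0.8859, 0.2011),
    (6.9993, 1.1137, 0.9071, 0.2066),
    (8.0, 1.1274, 0.9178, 0.2096),
    (8.9993, 1.1310, 0.9207, 0.2103),
    (9.9925, 1.1312, 0.9208, 0.2103),
]


def margin(chosen: float, rejected: float) -> float:
    """Reward margin, assumed to be chosen reward minus rejected reward."""
    return chosen - rejected


def sigmoid_dpo_loss(margin_value: float) -> float:
    """Assumed per-example loss: -log(sigmoid(m)) = log(1 + exp(-m))."""
    return math.log1p(math.exp(-margin_value))


if __name__ == "__main__":
    for epoch, chosen, rejected, reported in ROWS:
        computed = margin(chosen, rejected)
        print(f"epoch {epoch:>6}: computed {computed:.4f}, reported {reported:.4f}")
    # Loss evaluated at the final mean margin, for comparison with the
    # reported validation loss of 0.6406.
    print(f"loss at mean margin: {sigmoid_dpo_loss(ROWS[-1][3]):.4f}")
```

Under these assumptions, the computed margins agree with the reported column to within rounding. The loss evaluated at the mean margin is lower than the reported validation loss, which is expected: the loss is convex in the margin, so the mean of per-example losses is at least the loss at the mean margin.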