model_hh_usp4_200

This model is a fine-tuned version of meta-llama/Llama-2-7b-chat-hf on an unknown dataset. Its results on the evaluation set, logged every 100 training steps, are listed in the table at the end of this card.

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training hyperparameters

More information needed

Training results

Epoch	Step	Validation Loss	Rewards/chosen	Rewards/rejected	Rewards/accuracies	Rewards/margins	Logps/rejected	Logps/chosen	Logits/rejected	Logits/chosen
8.0	100	2.1928	-10.7974	-11.7956	0.6200	0.9982	-127.1522	-122.5288	-0.0977	-0.0806
16.0	200	2.2073	-10.6145	-11.6045	0.6200	0.9900	-126.9398	-122.3256	-0.0957	-0.0785
24.0	300	2.1834	-10.5576	-11.5387	0.6100	0.9810	-126.8667	-122.2624	-0.0926	-0.0766
32.0	400	2.2206	-10.5218	-11.4563	0.6000	0.9345	-126.7752	-122.2226	-0.0924	-0.0768
40.0	500	2.1989	-10.4576	-11.4408	0.6100	0.9832	-126.7580	-122.1513	-0.0920	-0.0763
48.0	600	2.1897	-10.4344	-11.3970	0.6000	0.9626	-126.7093	-122.1255	-0.0915	-0.0758
56.0	700	2.1723	-10.3994	-11.3863	0.6000	0.9869	-126.6974	-122.0866	-0.0916	-0.0760
64.0	800	2.1910	-10.4312	-11.3832	0.6100	0.9520	-126.6939	-122.1220	-0.0918	-0.0760
72.0	900	2.1762	-10.4083	-11.3782	0.6100	0.9699	-126.6885	-122.0965	-0.0916	-0.0762
80.0	1000	2.2047	-10.4203	-11.3883	0.6100	0.9680	-126.6996	-122.1098	-0.0920	-0.0765
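
The table can be sanity-checked with a few lines of Python. The sketch below is illustrative only: it assumes that Rewards/margins is the mean of (chosen reward minus rejected reward), which the card itself does not state, and the names `Row`, `margin_gap`, and `best_by_loss` are hypothetical helpers, not part of any library. The values are copied from the table above.

```python
from typing import NamedTuple


class Row(NamedTuple):
    step: int
    val_loss: float
    chosen: float    # Rewards/chosen
    rejected: float  # Rewards/rejected
    accuracy: float  # Rewards/accuracies
    margin: float    # Rewards/margins


# Values copied from the results table in this card.
ROWS = [
    Row(100, 2.1928, -10.7974, -11.7956, 0.6200, 0.9982),
    Row(200, 2.2073, -10.6145, -11.6045, 0.6200, 0.9900),
    Row(300, 2.1834, -10.5576, -11.5387, 0.6100, 0.9810),
    Row(400, 2.2206, -10.5218, -11.4563, 0.6000, 0.9345),
    Row(500, 2.1989, -10.4576, -11.4408, 0.6100, 0.9832),
    Row(600, 2.1897, -10.4344, -11.3970, 0.6000, 0.9626),
    Row(700, 2.1723, -10.3994, -11.3863, 0.6000, 0.9869),
    Row(800, 2.1910, -10.4312, -11.3832, 0.6100, 0.9520),
    Row(900, 2.1762, -10.4083, -11.3782, 0.6100, 0.9699),
    Row(1000, 2.2047, -10.4203, -11.3883, 0.6100, 0.9680),
]


def margin_gap(row: Row) -> float:
    """Absolute difference between the reported margin and chosen - rejected.

    Assumption: the margin column is defined as chosen minus rejected reward.
    """
    return abs((row.chosen - row.rejected) - row.margin)


def best_by_loss(rows: list[Row]) -> Row:
    """Return the checkpoint row with the lowest validation loss."""
    return min(rows, key=lambda r: r.val_loss)


if __name__ == "__main__":
    best = best_by_loss(ROWS)
    print(f"lowest validation loss: step {best.step} ({best.val_loss})")
    print(f"largest margin mismatch: {max(margin_gap(r) for r in ROWS):.4f}")
```

Under that assumption, every reported margin agrees with the difference of the two reward columns to within rounding, and validation loss is lowest at step 700 while reward accuracy stays between 0.60 and 0.62 throughout.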