Tags: Safetensors · llama · alignment-handbook · trl · dpo · Generated from Trainer


ds_chat_sppo_hard_cosine_iter0_2024-09-17-09.48

This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat on the self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, the self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized and the self-generate/ds_chat_original_cn_rl_oj_iter0-binarized datasets. It achieves the following results on the evaluation set:

  • Loss: 4613.5840
  • Rewards/chosen: 0.0069
  • Rewards/rejected: -0.0357
  • Rewards/accuracies: 0.6053
  • Rewards/margins: 0.0425
  • Logps/rejected: -263.2242
  • Logps/chosen: -252.2310
  • Logits/rejected: 1.4371
  • Logits/chosen: 1.3941
  • Debug/policy Chosen Logits: 1.3941
  • Debug/policy Rejected Logits: 1.4371
  • Debug/policy Chosen Logps: -252.2310
  • Debug/policy Rejected Logps: -263.2242
  • Debug/reference Chosen Logps: -252.9185
  • Debug/reference Rejected Logps: -259.6586
  • Debug/sppo Chosen Reward In Loss: 0.6874
  • Debug/sppo Rej Reward In Loss: -3.5656
  • Debug/sppo Chosen Loss: 2507.3259
  • Debug/sppo Reject Loss: 2312.9116
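
The reward columns above appear to follow the TRL DPO-style convention: each reward is β times the gap between the policy and reference log-probabilities of the same completion, and the margin is the chosen reward minus the rejected reward. The reported values are consistent with β = 0.01, which is inferred from the numbers rather than documented on this card. A minimal sketch:

```python
# Sketch of how the reward metrics relate to the logged log-probabilities,
# assuming the TRL DPO-style definition reward = beta * (logp_policy - logp_ref).
# beta = 0.01 is inferred from the reported values, not a documented setting.
beta = 0.01

policy_chosen_logps = -252.2310
policy_rejected_logps = -263.2242
reference_chosen_logps = -252.9185
reference_rejected_logps = -259.6586

rewards_chosen = beta * (policy_chosen_logps - reference_chosen_logps)        # ~ 0.0069
rewards_rejected = beta * (policy_rejected_logps - reference_rejected_logps)  # ~ -0.0357
rewards_margin = rewards_chosen - rewards_rejected                            # ~ 0.0425

print(rewards_chosen, rewards_rejected, rewards_margin)
```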

Model description

More information needed

Intended uses & limitations

More information needed
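
As a chat fine-tune of deepseek-ai/deepseek-llm-7b-chat, the model should load through the standard transformers causal-LM interface and the tokenizer's chat template. The snippet below is a minimal, untested sketch rather than an official usage example; the prompt is arbitrary.

```python
# Minimal inference sketch (assumes the standard transformers API and the
# tokenizer's built-in chat template; not an official usage example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yiran-wang3/ds_chat_sppo_hard_cosine_iter0_2024-09-17-09.48"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```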

Training and evaluation data

More information needed
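
Beyond the dataset names listed in the model description, the data is not documented here. If those Hub datasets are public, they could presumably be pulled and combined with the datasets library along these lines (a sketch; the "train" split name and preference-style columns are assumptions):

```python
# Sketch: load and concatenate the three preference datasets named above.
# The "train" split and the column layout are assumptions, not documented facts.
from datasets import concatenate_datasets, load_dataset

names = [
    "self-generate/ds_chat_original_cn_mining_oj_iter0-binarized",
    "self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized",
    "self-generate/ds_chat_original_cn_rl_oj_iter0-binarized",
]

train = concatenate_datasets([load_dataset(name, split="train") for name in names])
print(train)
```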

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a config-style transcription follows the list):

  • learning_rate: 1e-07
  • train_batch_size: 8
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 8.0
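
The effective batch sizes follow from the per-device values and the 8 GPUs: 8 × 8 = 64 for training and 4 × 8 = 32 for evaluation, i.e. no gradient accumulation. Transcribed into transformers.TrainingArguments (which trl-style configs extend), the settings would look roughly like the sketch below; anything not listed above, such as output_dir or bf16, is an assumption.

```python
# Rough transcription of the reported hyperparameters; values not listed in the
# card (output_dir, bf16, optim name) are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ds_chat_sppo_hard_cosine_iter0",  # assumption
    learning_rate=1e-07,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,                # 8 GPUs x 8 per device = 64 total
    seed=42,
    num_train_epochs=8.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    warmup_steps=100,                             # takes precedence over warmup_ratio
    optim="adamw_torch",                          # Adam with betas=(0.9, 0.999), eps=1e-08
    bf16=True,                                    # assumption, matches the BF16 weights
)
```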

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Debug/policy Chosen Logits | Debug/policy Rejected Logits | Debug/policy Chosen Logps | Debug/policy Rejected Logps | Debug/reference Chosen Logps | Debug/reference Rejected Logps | Debug/sppo Chosen Reward In Loss | Debug/sppo Rej Reward In Loss | Debug/sppo Chosen Loss | Debug/sppo Reject Loss |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 4970.1539 | 0.3623 | 100 | 4979.0801 | -0.0031 | -0.0046 | 0.5658 | 0.0014 | -260.1172 | -253.2325 | 1.6973 | 1.6355 | 1.6355 | 1.6973 | -253.2325 | -260.1172 | -252.9185 | -259.6586 | -0.3140 | -0.4586 | 2532.3372 | 2455.3159 |
| 4913.6875 | 0.7246 | 200 | 4922.2964 | -0.0067 | -0.0090 | 0.5395 | 0.0023 | -260.5605 | -253.5932 | 1.6658 | 1.6047 | 1.6047 | 1.6658 | -253.5932 | -260.5605 | -252.9185 | -259.6586 | -0.6748 | -0.9019 | 2570.3391 | 2415.1426 |
| 4852.6547 | 1.0870 | 300 | 4861.8960 | -0.0090 | -0.0170 | 0.4605 | 0.0079 | -261.3568 | -253.8218 | 1.6477 | 1.5895 | 1.5895 | 1.6477 | -253.8218 | -261.3568 | -252.9185 | -259.6586 | -0.9033 | -1.6982 | 2599.3752 | 2346.0071 |
| 4810.0602 | 1.4493 | 400 | 4799.1152 | -0.0065 | -0.0219 | 0.5395 | 0.0154 | -261.8465 | -253.5692 | 1.6033 | 1.5489 | 1.5489 | 1.6033 | -253.5692 | -261.8465 | -252.9185 | -259.6586 | -0.6507 | -2.1879 | 2584.1985 | 2322.5535 |
| 4686.3855 | 1.8116 | 500 | 4767.9019 | -0.0146 | -0.0351 | 0.5132 | 0.0205 | -263.1680 | -254.3759 | 1.5899 | 1.5348 | 1.5348 | 1.5899 | -254.3759 | -263.1680 | -252.9185 | -259.6586 | -1.4575 | -3.5093 | 2678.0864 | 2224.3416 |
| 4647.1707 | 2.1739 | 600 | 4725.6548 | -0.0031 | -0.0264 | 0.5395 | 0.0233 | -262.3003 | -253.2256 | 1.5586 | 1.5054 | 1.5054 | 1.5586 | -253.2256 | -262.3003 | -252.9185 | -259.6586 | -0.3071 | -2.6417 | 2562.3191 | 2304.5745 |
| 4590.507 | 2.5362 | 700 | 4709.8721 | -0.0028 | -0.0317 | 0.5658 | 0.0289 | -262.8335 | -253.2023 | 1.5311 | 1.4802 | 1.4802 | 1.5311 | -253.2023 | -262.8335 | -252.9185 | -259.6586 | -0.2839 | -3.1748 | 2563.1602 | 2266.7019 |
| 4624.6344 | 2.8986 | 800 | 4685.7876 | -0.0021 | -0.0328 | 0.6316 | 0.0307 | -262.9392 | -253.1265 | 1.5168 | 1.4660 | 1.4660 | 1.5168 | -253.1265 | -262.9392 | -252.9185 | -259.6586 | -0.2080 | -3.2806 | 2564.3735 | 2277.4634 |
| 4526.798 | 3.2609 | 900 | 4673.5791 | -0.0010 | -0.0339 | 0.5921 | 0.0329 | -263.0450 | -253.0172 | 1.5044 | 1.4543 | 1.4543 | 1.5044 | -253.0172 | -263.0450 | -252.9185 | -259.6586 | -0.0987 | -3.3863 | 2560.7192 | 2277.5515 |
| 4599.7109 | 3.6232 | 1000 | 4664.8169 | 0.0018 | -0.0326 | 0.5658 | 0.0344 | -262.9172 | -252.7381 | 1.4973 | 1.4480 | 1.4480 | 1.4973 | -252.7381 | -262.9172 | -252.9185 | -259.6586 | 0.1804 | -3.2586 | 2535.9368 | 2302.0969 |
| 4598.4699 | 3.9855 | 1100 | 4659.8091 | 0.0225 | -0.0149 | 0.6579 | 0.0374 | -261.1521 | -250.6732 | 1.4704 | 1.4246 | 1.4246 | 1.4704 | -250.6732 | -261.1521 | -252.9185 | -259.6586 | 2.2452 | -1.4935 | 2330.4351 | 2454.2285 |
| 4434.3441 | 4.3478 | 1200 | 4652.3701 | -0.0064 | -0.0448 | 0.5789 | 0.0383 | -264.1339 | -253.5595 | 1.4648 | 1.4176 | 1.4176 | 1.4648 | -253.5595 | -264.1339 | -252.9185 | -259.6586 | -0.6410 | -4.4752 | 2633.1008 | 2222.5164 |
| 4673.5336 | 4.7101 | 1300 | 4629.2358 | 0.0059 | -0.0337 | 0.6053 | 0.0396 | -263.0263 | -252.3293 | 1.4597 | 1.4137 | 1.4137 | 1.4597 | -252.3293 | -263.0263 | -252.9185 | -259.6586 | 0.5892 | -3.3676 | 2506.5920 | 2317.5457 |
| 4551.7766 | 5.0725 | 1400 | 4636.1592 | 0.0046 | -0.0350 | 0.6053 | 0.0396 | -263.1627 | -252.4586 | 1.4595 | 1.4144 | 1.4144 | 1.4595 | -252.4586 | -263.1627 | -252.9185 | -259.6586 | 0.4598 | -3.5041 | 2524.4553 | 2311.0466 |
| 4481.4781 | 5.4348 | 1500 | 4616.7266 | 0.0125 | -0.0289 | 0.5921 | 0.0413 | -262.5467 | -251.6734 | 1.4468 | 1.4029 | 1.4029 | 1.4468 | -251.6734 | -262.5467 | -252.9185 | -259.6586 | 1.2451 | -2.8881 | 2446.6792 | 2368.6218 |
| 4557.7566 | 5.7971 | 1600 | 4618.0537 | 0.0014 | -0.0416 | 0.5921 | 0.0430 | -263.8221 | -252.7794 | 1.4428 | 1.3976 | 1.3976 | 1.4428 | -252.7794 | -263.8221 | -252.9185 | -259.6586 | 0.1390 | -4.1635 | 2564.9141 | 2269.4070 |
| 4507.4234 | 6.1594 | 1700 | 4618.0 | 0.0009 | -0.0413 | 0.5921 | 0.0422 | -263.7893 | -252.8316 | 1.4382 | 1.3934 | 1.3934 | 1.4382 | -252.8316 | -263.7893 | -252.9185 | -259.6586 | 0.0869 | -4.1307 | 2573.3213 | 2274.9512 |
| 4566.6648 | 6.5217 | 1800 | 4619.3325 | 0.0061 | -0.0369 | 0.5921 | 0.0430 | -263.3517 | -252.3105 | 1.4413 | 1.3975 | 1.3975 | 1.4413 | -252.3105 | -263.3517 | -252.9185 | -259.6586 | 0.6080 | -3.6930 | 2512.9187 | 2304.7549 |
| 4682.7492 | 6.8841 | 1900 | 4616.8687 | 0.0066 | -0.0366 | 0.5921 | 0.0432 | -263.3144 | -252.2579 | 1.4407 | 1.3967 | 1.3967 | 1.4407 | -252.2579 | -263.3144 | -252.9185 | -259.6586 | 0.6606 | -3.6557 | 2507.0054 | 2307.5239 |
| 4486.1707 | 7.2464 | 2000 | 4616.3892 | 0.0062 | -0.0377 | 0.5789 | 0.0439 | -263.4255 | -252.2975 | 1.4378 | 1.3932 | 1.3932 | 1.4378 | -252.2975 | -263.4255 | -252.9185 | -259.6586 | 0.6210 | -3.7668 | 2509.9634 | 2298.5259 |
| 4477.8289 | 7.6087 | 2100 | 4617.2290 | 0.0069 | -0.0354 | 0.5789 | 0.0423 | -263.1952 | -252.2293 | 1.4363 | 1.3925 | 1.3925 | 1.4363 | -252.2293 | -263.1952 | -252.9185 | -259.6586 | 0.6892 | -3.5365 | 2506.2578 | 2318.2375 |
| 4520.1934 | 7.9710 | 2200 | 4613.5840 | 0.0069 | -0.0357 | 0.6053 | 0.0425 | -263.2242 | -252.2310 | 1.4371 | 1.3941 | 1.3941 | 1.4371 | -252.2310 | -263.2242 | -252.9185 | -259.6586 | 0.6874 | -3.5656 | 2507.3259 | 2312.9116 |
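
These per-evaluation rows come from the Trainer's log history. If the repository or checkpoints include the saved trainer state, the same table can be rebuilt programmatically instead of copied by hand; a sketch assuming a trainer_state.json file is available:

```python
# Sketch: reconstruct the evaluation table from the Trainer's saved state.
# Assumes a trainer_state.json (standard Trainer output) is available locally.
import json

with open("trainer_state.json") as f:
    state = json.load(f)

eval_rows = [entry for entry in state["log_history"] if "eval_loss" in entry]
for row in eval_rows:
    print(row["step"], row["epoch"], row["eval_loss"], row.get("eval_rewards/margins"))
```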

Framework versions

  • Transformers 4.42.0
  • Pytorch 2.3.0+cu121
  • Datasets 2.14.6
  • Tokenizers 0.19.1
Model size: 6.91B params · Tensor type: BF16 (Safetensors)
