Tags: Safetensors · llama · alignment-handbook · trl · dpo · Generated from Trainer


ds_chat_sppo_hard_cosine_iter0_2024-09-17-09.48

This model is a fine-tuned version of deepseek-ai/deepseek-llm-7b-chat on the self-generate/ds_chat_original_cn_mining_oj_iter0-binarized, the self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized and the self-generate/ds_chat_original_cn_rl_oj_iter0-binarized datasets. It achieves the following results on the evaluation set:

  • Loss: 4613.5840
  • Rewards/chosen: 0.0069
  • Rewards/rejected: -0.0357
  • Rewards/accuracies: 0.6053
  • Rewards/margins: 0.0425
  • Logps/rejected: -263.2242
  • Logps/chosen: -252.2310
  • Logits/rejected: 1.4371
  • Logits/chosen: 1.3941
  • Debug/policy Chosen Logits: 1.3941
  • Debug/policy Rejected Logits: 1.4371
  • Debug/policy Chosen Logps: -252.2310
  • Debug/policy Rejected Logps: -263.2242
  • Debug/reference Chosen Logps: -252.9185
  • Debug/reference Rejected Logps: -259.6586
  • Debug/sppo Chosen Reward In Loss: 0.6874
  • Debug/sppo Rej Reward In Loss: -3.5656
  • Debug/sppo Chosen Loss: 2507.3259
  • Debug/sppo Reject Loss: 2312.9116
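
The reward columns above appear to follow the TRL DPO-style convention: each reward is β times the gap between the policy and reference log-probabilities of the same completion, and the margin is the chosen reward minus the rejected reward. The reported values are consistent with β = 0.01, which is inferred from the numbers rather than documented on this card. A minimal sketch:

```python
# Sketch of how the reward metrics relate to the logged log-probabilities,
# assuming the TRL DPO-style definition reward = beta * (logp_policy - logp_ref).
# beta = 0.01 is inferred from the reported values, not a documented setting.
beta = 0.01

policy_chosen_logps = -252.2310
policy_rejected_logps = -263.2242
reference_chosen_logps = -252.9185
reference_rejected_logps = -259.6586

rewards_chosen = beta * (policy_chosen_logps - reference_chosen_logps)        # ~ 0.0069
rewards_rejected = beta * (policy_rejected_logps - reference_rejected_logps)  # ~ -0.0357
rewards_margin = rewards_chosen - rewards_rejected                            # ~ 0.0425

print(rewards_chosen, rewards_rejected, rewards_margin)
```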

Model description

More information needed

Intended uses & limitations

More information needed
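
As a chat fine-tune of deepseek-ai/deepseek-llm-7b-chat, the model should load through the standard transformers causal-LM interface and the tokenizer's chat template. The snippet below is a minimal, untested sketch rather than an official usage example; the prompt is arbitrary.

```python
# Minimal inference sketch (assumes the standard transformers API and the
# tokenizer's built-in chat template; not an official usage example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "yiran-wang3/ds_chat_sppo_hard_cosine_iter0_2024-09-17-09.48"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Write a Python function that reverses a string."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```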

Training and evaluation data

More information needed
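
Beyond the dataset names listed in the model description, the data is not documented here. If those Hub datasets are public, they could presumably be pulled and combined with the datasets library along these lines (a sketch; the "train" split name and preference-style columns are assumptions):

```python
# Sketch: load and concatenate the three preference datasets named above.
# The "train" split and the column layout are assumptions, not documented facts.
from datasets import concatenate_datasets, load_dataset

names = [
    "self-generate/ds_chat_original_cn_mining_oj_iter0-binarized",
    "self-generate/ds_chat_original_cn_mining_sandbox_iter0-binarized",
    "self-generate/ds_chat_original_cn_rl_oj_iter0-binarized",
]

train = concatenate_datasets([load_dataset(name, split="train") for name in names])
print(train)
```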

Training procedure

Training hyperparameters

The following hyperparameters were used during training (a config-style transcription follows the list):

  • learning_rate: 1e-07
  • train_batch_size: 8
  • eval_batch_size: 4
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • total_train_batch_size: 64
  • total_eval_batch_size: 32
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.1
  • lr_scheduler_warmup_steps: 100
  • num_epochs: 8.0
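
The effective batch sizes follow from the per-device values and the 8 GPUs: 8 × 8 = 64 for training and 4 × 8 = 32 for evaluation, i.e. no gradient accumulation. Transcribed into transformers.TrainingArguments (which trl-style configs extend), the settings would look roughly like the sketch below; anything not listed above, such as output_dir or bf16, is an assumption.

```python
# Rough transcription of the reported hyperparameters; values not listed in the
# card (output_dir, bf16, optim name) are assumptions.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="ds_chat_sppo_hard_cosine_iter0",  # assumption
    learning_rate=1e-07,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=1,                # 8 GPUs x 8 per device = 64 total
    seed=42,
    num_train_epochs=8.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    warmup_steps=100,                             # takes precedence over warmup_ratio
    optim="adamw_torch",                          # Adam with betas=(0.9, 0.999), eps=1e-08
    bf16=True,                                    # assumption, matches the BF16 weights
)
```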

Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen | Debug/policy Chosen Logits | Debug/policy Rejected Logits | Debug/policy Chosen Logps | Debug/policy Rejected Logps | Debug/reference Chosen Logps | Debug/reference Rejected Logps | Debug/sppo Chosen Reward In Loss | Debug/sppo Rej Reward In Loss | Debug/sppo Chosen Loss | Debug/sppo Reject Loss |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| 4970.1539 | 0.3623 | 100 | 4979.0801 | -0.0031 | -0.0046 | 0.5658 | 0.0014 | -260.1172 | -253.2325 | 1.6973 | 1.6355 | 1.6355 | 1.6973 | -253.2325 | -260.1172 | -252.9185 | -259.6586 | -0.3140 | -0.4586 | 2532.3372 | 2455.3159 |
| 4913.6875 | 0.7246 | 200 | 4922.2964 | -0.0067 | -0.0090 | 0.5395 | 0.0023 | -260.5605 | -253.5932 | 1.6658 | 1.6047 | 1.6047 | 1.6658 | -253.5932 | -260.5605 | -252.9185 | -259.6586 | -0.6748 | -0.9019 | 2570.3391 | 2415.1426 |
| 4852.6547 | 1.0870 | 300 | 4861.8960 | -0.0090 | -0.0170 | 0.4605 | 0.0079 | -261.3568 | -253.8218 | 1.6477 | 1.5895 | 1.5895 | 1.6477 | -253.8218 | -261.3568 | -252.9185 | -259.6586 | -0.9033 | -1.6982 | 2599.3752 | 2346.0071 |
| 4810.0602 | 1.4493 | 400 | 4799.1152 | -0.0065 | -0.0219 | 0.5395 | 0.0154 | -261.8465 | -253.5692 | 1.6033 | 1.5489 | 1.5489 | 1.6033 | -253.5692 | -261.8465 | -252.9185 | -259.6586 | -0.6507 | -2.1879 | 2584.1985 | 2322.5535 |
| 4686.3855 | 1.8116 | 500 | 4767.9019 | -0.0146 | -0.0351 | 0.5132 | 0.0205 | -263.1680 | -254.3759 | 1.5899 | 1.5348 | 1.5348 | 1.5899 | -254.3759 | -263.1680 | -252.9185 | -259.6586 | -1.4575 | -3.5093 | 2678.0864 | 2224.3416 |
| 4647.1707 | 2.1739 | 600 | 4725.6548 | -0.0031 | -0.0264 | 0.5395 | 0.0233 | -262.3003 | -253.2256 | 1.5586 | 1.5054 | 1.5054 | 1.5586 | -253.2256 | -262.3003 | -252.9185 | -259.6586 | -0.3071 | -2.6417 | 2562.3191 | 2304.5745 |
| 4590.507 | 2.5362 | 700 | 4709.8721 | -0.0028 | -0.0317 | 0.5658 | 0.0289 | -262.8335 | -253.2023 | 1.5311 | 1.4802 | 1.4802 | 1.5311 | -253.2023 | -262.8335 | -252.9185 | -259.6586 | -0.2839 | -3.1748 | 2563.1602 | 2266.7019 |
| 4624.6344 | 2.8986 | 800 | 4685.7876 | -0.0021 | -0.0328 | 0.6316 | 0.0307 | -262.9392 | -253.1265 | 1.5168 | 1.4660 | 1.4660 | 1.5168 | -253.1265 | -262.9392 | -252.9185 | -259.6586 | -0.2080 | -3.2806 | 2564.3735 | 2277.4634 |
| 4526.798 | 3.2609 | 900 | 4673.5791 | -0.0010 | -0.0339 | 0.5921 | 0.0329 | -263.0450 | -253.0172 | 1.5044 | 1.4543 | 1.4543 | 1.5044 | -253.0172 | -263.0450 | -252.9185 | -259.6586 | -0.0987 | -3.3863 | 2560.7192 | 2277.5515 |
| 4599.7109 | 3.6232 | 1000 | 4664.8169 | 0.0018 | -0.0326 | 0.5658 | 0.0344 | -262.9172 | -252.7381 | 1.4973 | 1.4480 | 1.4480 | 1.4973 | -252.7381 | -262.9172 | -252.9185 | -259.6586 | 0.1804 | -3.2586 | 2535.9368 | 2302.0969 |
| 4598.4699 | 3.9855 | 1100 | 4659.8091 | 0.0225 | -0.0149 | 0.6579 | 0.0374 | -261.1521 | -250.6732 | 1.4704 | 1.4246 | 1.4246 | 1.4704 | -250.6732 | -261.1521 | -252.9185 | -259.6586 | 2.2452 | -1.4935 | 2330.4351 | 2454.2285 |
| 4434.3441 | 4.3478 | 1200 | 4652.3701 | -0.0064 | -0.0448 | 0.5789 | 0.0383 | -264.1339 | -253.5595 | 1.4648 | 1.4176 | 1.4176 | 1.4648 | -253.5595 | -264.1339 | -252.9185 | -259.6586 | -0.6410 | -4.4752 | 2633.1008 | 2222.5164 |
| 4673.5336 | 4.7101 | 1300 | 4629.2358 | 0.0059 | -0.0337 | 0.6053 | 0.0396 | -263.0263 | -252.3293 | 1.4597 | 1.4137 | 1.4137 | 1.4597 | -252.3293 | -263.0263 | -252.9185 | -259.6586 | 0.5892 | -3.3676 | 2506.5920 | 2317.5457 |
| 4551.7766 | 5.0725 | 1400 | 4636.1592 | 0.0046 | -0.0350 | 0.6053 | 0.0396 | -263.1627 | -252.4586 | 1.4595 | 1.4144 | 1.4144 | 1.4595 | -252.4586 | -263.1627 | -252.9185 | -259.6586 | 0.4598 | -3.5041 | 2524.4553 | 2311.0466 |
| 4481.4781 | 5.4348 | 1500 | 4616.7266 | 0.0125 | -0.0289 | 0.5921 | 0.0413 | -262.5467 | -251.6734 | 1.4468 | 1.4029 | 1.4029 | 1.4468 | -251.6734 | -262.5467 | -252.9185 | -259.6586 | 1.2451 | -2.8881 | 2446.6792 | 2368.6218 |
| 4557.7566 | 5.7971 | 1600 | 4618.0537 | 0.0014 | -0.0416 | 0.5921 | 0.0430 | -263.8221 | -252.7794 | 1.4428 | 1.3976 | 1.3976 | 1.4428 | -252.7794 | -263.8221 | -252.9185 | -259.6586 | 0.1390 | -4.1635 | 2564.9141 | 2269.4070 |
| 4507.4234 | 6.1594 | 1700 | 4618.0 | 0.0009 | -0.0413 | 0.5921 | 0.0422 | -263.7893 | -252.8316 | 1.4382 | 1.3934 | 1.3934 | 1.4382 | -252.8316 | -263.7893 | -252.9185 | -259.6586 | 0.0869 | -4.1307 | 2573.3213 | 2274.9512 |
| 4566.6648 | 6.5217 | 1800 | 4619.3325 | 0.0061 | -0.0369 | 0.5921 | 0.0430 | -263.3517 | -252.3105 | 1.4413 | 1.3975 | 1.3975 | 1.4413 | -252.3105 | -263.3517 | -252.9185 | -259.6586 | 0.6080 | -3.6930 | 2512.9187 | 2304.7549 |
| 4682.7492 | 6.8841 | 1900 | 4616.8687 | 0.0066 | -0.0366 | 0.5921 | 0.0432 | -263.3144 | -252.2579 | 1.4407 | 1.3967 | 1.3967 | 1.4407 | -252.2579 | -263.3144 | -252.9185 | -259.6586 | 0.6606 | -3.6557 | 2507.0054 | 2307.5239 |
| 4486.1707 | 7.2464 | 2000 | 4616.3892 | 0.0062 | -0.0377 | 0.5789 | 0.0439 | -263.4255 | -252.2975 | 1.4378 | 1.3932 | 1.3932 | 1.4378 | -252.2975 | -263.4255 | -252.9185 | -259.6586 | 0.6210 | -3.7668 | 2509.9634 | 2298.5259 |
| 4477.8289 | 7.6087 | 2100 | 4617.2290 | 0.0069 | -0.0354 | 0.5789 | 0.0423 | -263.1952 | -252.2293 | 1.4363 | 1.3925 | 1.3925 | 1.4363 | -252.2293 | -263.1952 | -252.9185 | -259.6586 | 0.6892 | -3.5365 | 2506.2578 | 2318.2375 |
| 4520.1934 | 7.9710 | 2200 | 4613.5840 | 0.0069 | -0.0357 | 0.6053 | 0.0425 | -263.2242 | -252.2310 | 1.4371 | 1.3941 | 1.3941 | 1.4371 | -252.2310 | -263.2242 | -252.9185 | -259.6586 | 0.6874 | -3.5656 | 2507.3259 | 2312.9116 |
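
These per-evaluation rows come from the Trainer's log history. If the repository or checkpoints include the saved trainer state, the same table can be rebuilt programmatically instead of copied by hand; a sketch assuming a trainer_state.json file is available:

```python
# Sketch: reconstruct the evaluation table from the Trainer's saved state.
# Assumes a trainer_state.json (standard Trainer output) is available locally.
import json

with open("trainer_state.json") as f:
    state = json.load(f)

eval_rows = [entry for entry in state["log_history"] if "eval_loss" in entry]
for row in eval_rows:
    print(row["step"], row["epoch"], row["eval_loss"], row.get("eval_rewards/margins"))
```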

Framework versions

  • Transformers 4.42.0
  • Pytorch 2.3.0+cu121
  • Datasets 2.14.6
  • Tokenizers 0.19.1
Model size: 6.91B params · Tensor type: BF16 (Safetensors)
