Commit 7300a6c by just1nseo (1 parent: 56e7681)

Model save
README.md ADDED
@@ -0,0 +1,77 @@
+ ---
+ library_name: peft
+ tags:
+ - trl
+ - dpo
+ - generated_from_trainer
+ base_model: allenai/tulu-2-13b
+ model-index:
+ - name: tulu2-13b-cost-UF-5e-7
+   results: []
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+ # tulu2-13b-cost-UF-5e-7
+
+ This model is a fine-tuned version of [allenai/tulu-2-13b](https://huggingface.co/allenai/tulu-2-13b) on the None dataset.
+ It achieves the following results on the evaluation set:
+ - Loss: 0.6933
+ - Rewards/chosen: 0.0268
+ - Rewards/rejected: 0.0265
+ - Rewards/accuracies: 0.5360
+ - Rewards/margins: 0.0002
+ - Rewards/margins Max: 0.0601
+ - Rewards/margins Min: -0.0619
+ - Rewards/margins Std: 0.0406
+ - Logps/rejected: -327.5517
+ - Logps/chosen: -331.2336
+ - Logits/rejected: -0.8962
+ - Logits/chosen: -1.0222
+
+ ## Model description
+
+ More information needed
+
+ ## Intended uses & limitations
+
+ More information needed
+
+ ## Training and evaluation data
+
+ More information needed
+
+ ## Training procedure
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 5e-07
+ - train_batch_size: 2
+ - eval_batch_size: 8
+ - seed: 42
+ - distributed_type: multi-GPU
+ - num_devices: 4
+ - gradient_accumulation_steps: 2
+ - total_train_batch_size: 16
+ - total_eval_batch_size: 32
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: cosine
+ - lr_scheduler_warmup_ratio: 0.1
+ - num_epochs: 1
+
+ ### Training results
+
+ | Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Rewards/margins Max | Rewards/margins Min | Rewards/margins Std | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
+ |:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:-------------------:|:-------------------:|:-------------------:|:--------------:|:------------:|:---------------:|:-------------:|
+ | 0.6665 | 1.0 | 927 | 0.6933 | 0.0268 | 0.0265 | 0.5360 | 0.0002 | 0.0601 | -0.0619 | 0.0406 | -327.5517 | -331.2336 | -0.8962 | -1.0222 |
+
+
+ ### Framework versions
+
+ - PEFT 0.7.1
+ - Transformers 4.39.0.dev0
+ - Pytorch 2.1.2+cu121
+ - Datasets 2.14.6
+ - Tokenizers 0.15.2
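The reward metrics in the model card follow DPO's implicit-reward definitions. Each response's reward is beta times the difference in log-probability between the trained policy and a frozen reference model. The margin is chosen reward minus rejected reward.

Below is a minimal sketch of how the core metrics relate, assuming the standard sigmoid DPO loss, -log sigmoid(margin). Some inputs are not recorded in the card:
- The card records neither the loss variant nor beta, so `beta=0.1` is a placeholder.
- The log-probabilities in the example are illustrative.
- The margin max/min/std statistics, which appear to be custom to this run, are not reproduced.

`dpo_metrics` is a hypothetical helper, not a TRL function.

```python
import math
from statistics import mean


def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Batch-level DPO metrics from per-example summed log-probabilities.

    Each argument is a list with one sequence log-probability per example.
    beta=0.1 is a placeholder: the model card does not record the value used.
    """
    # Implicit reward of a response: beta * (log pi_theta(y|x) - log pi_ref(y|x)).
    chosen = [beta * (p - r) for p, r in zip(policy_chosen, ref_chosen)]
    rejected = [beta * (p - r) for p, r in zip(policy_rejected, ref_rejected)]
    margins = [c - j for c, j in zip(chosen, rejected)]
    # Sigmoid DPO loss: -log(sigmoid(m)) = softplus(-m), written in an overflow-safe form.
    losses = [max(-m, 0.0) + math.log1p(math.exp(-abs(m))) for m in margins]
    return {
        "loss": mean(losses),
        "rewards/chosen": mean(chosen),
        "rewards/rejected": mean(rejected),
        "rewards/margins": mean(margins),
        # A pair counts as correct only if the chosen reward is strictly higher.
        "rewards/accuracies": mean(1.0 if m > 0 else 0.0 for m in margins),
    }


# Illustrative log-probabilities: before any update the policy equals the reference.
init = dpo_metrics([-227.8, -250.1], [-244.7, -230.4], [-227.8, -250.1], [-244.7, -230.4])
print(init["loss"], init["rewards/accuracies"])
```

With policy and reference identical, every margin is 0 and the loss is log 2 ≈ 0.6931. This matches the step-1 entry (loss 0.6931, all rewards 0.0) in the trainer_state.json log history.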
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:1afe0993d246181c5974dbe138eebe76314d3cd0ca8d684a6aed5851e8203a9f
+ oid sha256:76eecdc522c84a35e4e686020cb6afb523d2d22fce65fcef5530fc0b6afa4ccf
  size 1001466944
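Only the Git LFS pointer for the adapter weights changes in this commit. The SHA-256 object ID is new, while the size stays at 1001466944 bytes (about 0.93 GiB). The sketch below parses such a pointer, assuming the one-`key value`-pair-per-line layout shown above. `parse_lfs_pointer` is a hypothetical helper, not part of Git LFS or Hugging Face tooling, and it only handles the `version`, `oid` and `size` keys.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file (one "key value" pair per line) into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The version key is what identifies the text as an LFS pointer.
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        raise ValueError("not a Git LFS pointer")
    algo, _, digest = fields["oid"].partition(":")
    return {"version": fields["version"], "oid_algo": algo, "oid": digest, "size": int(fields["size"])}


new_pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:76eecdc522c84a35e4e686020cb6afb523d2d22fce65fcef5530fc0b6afa4ccf
size 1001466944
"""
info = parse_lfs_pointer(new_pointer)
print(f"{info['oid_algo']} {info['oid'][:12]}... {info['size'] / 2**30:.2f} GiB")
# prints: sha256 76eecdc522c8... 0.93 GiB
```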
all_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "epoch": 1.0,
+ "train_loss": 0.6720605961327414,
+ "train_runtime": 8514.5547,
+ "train_samples": 14828,
+ "train_samples_per_second": 1.741,
+ "train_steps_per_second": 0.109
+ }
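These totals agree with the hyperparameters in the model card. Each optimizer step sees 16 examples: 2 per device, on 4 GPUs, with 2 gradient-accumulation steps. So 14828 training samples yield 927 steps, the last on a partial batch.

The arithmetic below reproduces the reported throughput. It assumes the partial final batch is kept rather than dropped, which matches the `global_step` of 927 recorded in trainer_state.json.

```python
import math

# Values copied from the model card hyperparameters and all_results.json.
per_device_batch = 2
num_devices = 4
grad_accum_steps = 2
train_samples = 14828
train_runtime_s = 8514.5547

effective_batch = per_device_batch * num_devices * grad_accum_steps
steps = math.ceil(train_samples / effective_batch)  # keep the partial last batch
samples_per_s = train_samples / train_runtime_s
steps_per_s = steps / train_runtime_s

print(effective_batch, steps, round(samples_per_s, 3), round(steps_per_s, 3))
# prints: 16 927 1.741 0.109
```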
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+ "epoch": 1.0,
+ "train_loss": 0.6720605961327414,
+ "train_runtime": 8514.5547,
+ "train_samples": 14828,
+ "train_samples_per_second": 1.741,
+ "train_steps_per_second": 0.109
+ }
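The `learning_rate` values in the trainer_state.json log history follow the model card's schedule: cosine decay with a 10% warmup ratio and a peak of 5e-07. The sketch below re-implements the multiplier used by `transformers.get_cosine_schedule_with_warmup` with its default `num_cycles=0.5`. It assumes 927 total steps and `ceil(0.1 * 927) = 93` warmup steps, and it is checked against the logged rates at steps 1, 10, 100 and 200. `cosine_with_warmup_lr` is a hypothetical name.

```python
import math

BASE_LR = 5e-7
TOTAL_STEPS = 927
WARMUP_STEPS = math.ceil(0.1 * TOTAL_STEPS)  # 93


def cosine_with_warmup_lr(step, base_lr=BASE_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Learning rate after `step` optimizer steps: linear warmup, then half-cosine decay to 0."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for step in (1, 10, 100, 200):
    print(step, cosine_with_warmup_lr(step))
```

The first logged rate (step 1) is 5e-07 / 93, the first warmup increment. The peak of 5e-07 is reached at step 93, after which the rate decays toward 0 at step 927.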
trainer_state.json ADDED
@@ -0,0 +1,1723 @@
+ {
+ "best_metric": null,
+ "best_model_checkpoint": null,
+ "epoch": 1.0,
+ "eval_steps": 100,
+ "global_step": 927,
+ "is_hyper_param_search": false,
+ "is_local_process_zero": true,
+ "is_world_process_zero": true,
+ "log_history": [
+ {
+ "epoch": 0.0,
+ "grad_norm": 0.333984375,
+ "learning_rate": 5.3763440860215056e-09,
+ "logits/chosen": -1.7726776599884033,
+ "logits/rejected": -1.019553542137146,
+ "logps/chosen": -227.8472900390625,
+ "logps/rejected": -244.70220947265625,
+ "loss": 0.6931,
+ "rewards/accuracies": 0.0,
+ "rewards/chosen": 0.0,
+ "rewards/margins": 0.0,
+ "rewards/margins_max": 0.0,
+ "rewards/margins_min": 0.0,
+ "rewards/margins_std": 0.0,
+ "rewards/rejected": 0.0,
+ "step": 1
+ },
+ {
+ "epoch": 0.01,
+ "grad_norm": 0.33203125,
+ "learning_rate": 5.3763440860215054e-08,
+ "logits/chosen": -1.2758857011795044,
+ "logits/rejected": -0.7481470108032227,
+ "logps/chosen": -294.4023132324219,
+ "logps/rejected": -209.6774139404297,
+ "loss": 0.6931,
+ "rewards/accuracies": 0.4722222089767456,
+ "rewards/chosen": 0.00042969180503860116,
+ "rewards/margins": 0.00026647368213161826,
+ "rewards/margins_max": 0.002202093368396163,
+ "rewards/margins_min": -0.0016691461205482483,
+ "rewards/margins_std": 0.002737379400059581,
+ "rewards/rejected": 0.00016321813745889813,
+ "step": 10
+ },
+ {
+ "epoch": 0.02,
+ "grad_norm": 0.3515625,
+ "learning_rate": 1.0752688172043011e-07,
+ "logits/chosen": -1.4951783418655396,
+ "logits/rejected": -1.0168498754501343,
+ "logps/chosen": -280.9791259765625,
+ "logps/rejected": -274.51055908203125,
+ "loss": 0.6928,
+ "rewards/accuracies": 0.574999988079071,
+ "rewards/chosen": 0.0005079669645056129,
+ "rewards/margins": 0.0006722012185491621,
+ "rewards/margins_max": 0.003316085785627365,
+ "rewards/margins_min": -0.001971683232113719,
+ "rewards/margins_std": 0.003739017527550459,
+ "rewards/rejected": -0.00016423416673205793,
+ "step": 20
+ },
+ {
+ "epoch": 0.03,
+ "grad_norm": 0.2734375,
+ "learning_rate": 1.6129032258064515e-07,
+ "logits/chosen": -1.1558444499969482,
+ "logits/rejected": -0.7869107723236084,
+ "logps/chosen": -235.8018035888672,
+ "logps/rejected": -239.435791015625,
+ "loss": 0.6926,
+ "rewards/accuracies": 0.625,
+ "rewards/chosen": 0.0022618253715336323,
+ "rewards/margins": 0.001511804643087089,
+ "rewards/margins_max": 0.004373847972601652,
+ "rewards/margins_min": -0.001350238686427474,
+ "rewards/margins_std": 0.004047540482133627,
+ "rewards/rejected": 0.000750020903069526,
+ "step": 30
+ },
+ {
+ "epoch": 0.04,
+ "grad_norm": 0.2421875,
+ "learning_rate": 2.1505376344086022e-07,
+ "logits/chosen": -1.4029566049575806,
+ "logits/rejected": -0.8758159875869751,
+ "logps/chosen": -256.91668701171875,
+ "logps/rejected": -245.5706024169922,
+ "loss": 0.6927,
+ "rewards/accuracies": 0.5249999761581421,
+ "rewards/chosen": 0.0011851616436615586,
+ "rewards/margins": 0.0004023304209113121,
+ "rewards/margins_max": 0.002305095549672842,
+ "rewards/margins_min": -0.0015004349406808615,
+ "rewards/margins_std": 0.0026909164153039455,
+ "rewards/rejected": 0.0007828312227502465,
+ "step": 40
+ },
+ {
+ "epoch": 0.05,
+ "grad_norm": 0.25390625,
+ "learning_rate": 2.6881720430107523e-07,
+ "logits/chosen": -1.5180784463882446,
+ "logits/rejected": -1.0010168552398682,
+ "logps/chosen": -202.52719116210938,
+ "logps/rejected": -197.68511962890625,
+ "loss": 0.6926,
+ "rewards/accuracies": 0.574999988079071,
+ "rewards/chosen": 0.0013484725495800376,
+ "rewards/margins": 0.0014005316188558936,
+ "rewards/margins_max": 0.0035043410025537014,
+ "rewards/margins_min": -0.0007032775320112705,
+ "rewards/margins_std": 0.002975235693156719,
+ "rewards/rejected": -5.2059163863305e-05,
+ "step": 50
+ },
+ {
+ "epoch": 0.06,
+ "grad_norm": 0.287109375,
+ "learning_rate": 3.225806451612903e-07,
+ "logits/chosen": -1.5808765888214111,
+ "logits/rejected": -0.893613338470459,
+ "logps/chosen": -291.6551208496094,
+ "logps/rejected": -255.85360717773438,
+ "loss": 0.6922,
+ "rewards/accuracies": 0.574999988079071,
+ "rewards/chosen": 0.0011142367729917169,
+ "rewards/margins": 0.0015284843975678086,
+ "rewards/margins_max": 0.0042257350869476795,
+ "rewards/margins_min": -0.0011687660589814186,
+ "rewards/margins_std": 0.003814487950876355,
+ "rewards/rejected": -0.0004142475372646004,
+ "step": 60
+ },
+ {
+ "epoch": 0.08,
+ "grad_norm": 0.296875,
+ "learning_rate": 3.7634408602150537e-07,
+ "logits/chosen": -1.6458534002304077,
+ "logits/rejected": -0.9159662127494812,
+ "logps/chosen": -353.4042053222656,
+ "logps/rejected": -295.0526428222656,
+ "loss": 0.6916,
+ "rewards/accuracies": 0.75,
+ "rewards/chosen": 0.003586581675335765,
+ "rewards/margins": 0.0029636994004249573,
+ "rewards/margins_max": 0.0056008645333349705,
+ "rewards/margins_min": 0.00032653429661877453,
+ "rewards/margins_std": 0.0037295143119990826,
+ "rewards/rejected": 0.0006228827987797558,
+ "step": 70
+ },
+ {
+ "epoch": 0.09,
+ "grad_norm": 0.283203125,
+ "learning_rate": 4.3010752688172043e-07,
+ "logits/chosen": -1.3944181203842163,
+ "logits/rejected": -0.9768520593643188,
+ "logps/chosen": -247.8656463623047,
+ "logps/rejected": -222.736083984375,
+ "loss": 0.691,
+ "rewards/accuracies": 0.75,
+ "rewards/chosen": 0.004138199612498283,
+ "rewards/margins": 0.0038313076365739107,
+ "rewards/margins_max": 0.007481383625417948,
+ "rewards/margins_min": 0.00018123061454389244,
+ "rewards/margins_std": 0.005161988083273172,
+ "rewards/rejected": 0.00030689238337799907,
+ "step": 80
+ },
+ {
+ "epoch": 0.1,
+ "grad_norm": 0.263671875,
+ "learning_rate": 4.838709677419355e-07,
+ "logits/chosen": -1.4816954135894775,
+ "logits/rejected": -1.0011855363845825,
+ "logps/chosen": -307.66729736328125,
+ "logps/rejected": -218.35110473632812,
+ "loss": 0.6901,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.006525175180286169,
+ "rewards/margins": 0.006856041494756937,
+ "rewards/margins_max": 0.012032730504870415,
+ "rewards/margins_min": 0.0016793517861515284,
+ "rewards/margins_std": 0.007320943288505077,
+ "rewards/rejected": -0.0003308658779133111,
+ "step": 90
+ },
+ {
+ "epoch": 0.11,
+ "grad_norm": 0.275390625,
+ "learning_rate": 4.999130942376231e-07,
+ "logits/chosen": -1.4475181102752686,
+ "logits/rejected": -0.8290830850601196,
+ "logps/chosen": -247.40322875976562,
+ "logps/rejected": -220.42355346679688,
+ "loss": 0.6897,
+ "rewards/accuracies": 0.875,
+ "rewards/chosen": 0.007414130959659815,
+ "rewards/margins": 0.0059156701900064945,
+ "rewards/margins_max": 0.00915153045207262,
+ "rewards/margins_min": 0.0026798094622790813,
+ "rewards/margins_std": 0.004576197825372219,
+ "rewards/rejected": 0.0014984606532379985,
+ "step": 100
+ },
+ {
+ "epoch": 0.12,
+ "grad_norm": 0.328125,
+ "learning_rate": 4.994875788073206e-07,
+ "logits/chosen": -1.3422911167144775,
+ "logits/rejected": -0.9277170300483704,
+ "logps/chosen": -265.04791259765625,
+ "logps/rejected": -291.83612060546875,
+ "loss": 0.6882,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.009832861833274364,
+ "rewards/margins": 0.010079814121127129,
+ "rewards/margins_max": 0.014872267842292786,
+ "rewards/margins_min": 0.005287360865622759,
+ "rewards/margins_std": 0.006777553353458643,
+ "rewards/rejected": -0.00024695199681445956,
+ "step": 110
+ },
+ {
+ "epoch": 0.13,
+ "grad_norm": 0.3515625,
+ "learning_rate": 4.987080943856886e-07,
+ "logits/chosen": -1.4355990886688232,
+ "logits/rejected": -0.9039069414138794,
+ "logps/chosen": -241.0117950439453,
+ "logps/rejected": -261.8793640136719,
+ "loss": 0.6878,
+ "rewards/accuracies": 0.875,
+ "rewards/chosen": 0.010962730273604393,
+ "rewards/margins": 0.011187642812728882,
+ "rewards/margins_max": 0.01832672953605652,
+ "rewards/margins_min": 0.0040485551580786705,
+ "rewards/margins_std": 0.010096193291246891,
+ "rewards/rejected": -0.0002249126264359802,
+ "step": 120
+ },
+ {
+ "epoch": 0.14,
+ "grad_norm": 0.318359375,
+ "learning_rate": 4.975757468927726e-07,
+ "logits/chosen": -1.5994830131530762,
+ "logits/rejected": -0.8840494155883789,
+ "logps/chosen": -262.6464538574219,
+ "logps/rejected": -225.1462860107422,
+ "loss": 0.6878,
+ "rewards/accuracies": 0.875,
+ "rewards/chosen": 0.01504091639071703,
+ "rewards/margins": 0.014881642535328865,
+ "rewards/margins_max": 0.021485231816768646,
+ "rewards/margins_min": 0.008278051391243935,
+ "rewards/margins_std": 0.00933888740837574,
+ "rewards/rejected": 0.00015927411732263863,
+ "step": 130
+ },
+ {
+ "epoch": 0.15,
+ "grad_norm": 0.28125,
+ "learning_rate": 4.960921428851066e-07,
+ "logits/chosen": -1.3022377490997314,
+ "logits/rejected": -0.9370013475418091,
+ "logps/chosen": -238.2805938720703,
+ "logps/rejected": -267.19854736328125,
+ "loss": 0.6864,
+ "rewards/accuracies": 0.8500000238418579,
+ "rewards/chosen": 0.015973640605807304,
+ "rewards/margins": 0.014827728271484375,
+ "rewards/margins_max": 0.02404005080461502,
+ "rewards/margins_min": 0.005615406669676304,
+ "rewards/margins_std": 0.013028192333877087,
+ "rewards/rejected": 0.001145911985076964,
+ "step": 140
+ },
+ {
+ "epoch": 0.16,
+ "grad_norm": 0.37890625,
+ "learning_rate": 4.942593872763566e-07,
+ "logits/chosen": -1.4023020267486572,
+ "logits/rejected": -0.8060510754585266,
+ "logps/chosen": -228.21420288085938,
+ "logps/rejected": -223.0884246826172,
+ "loss": 0.6856,
+ "rewards/accuracies": 0.875,
+ "rewards/chosen": 0.01750522293150425,
+ "rewards/margins": 0.016751740127801895,
+ "rewards/margins_max": 0.026506105437874794,
+ "rewards/margins_min": 0.006997367832809687,
+ "rewards/margins_std": 0.013794762082397938,
+ "rewards/rejected": 0.0007534866454079747,
+ "step": 150
+ },
+ {
+ "epoch": 0.17,
+ "grad_norm": 0.345703125,
+ "learning_rate": 4.920800803509025e-07,
+ "logits/chosen": -1.6911264657974243,
+ "logits/rejected": -0.8840080499649048,
+ "logps/chosen": -303.69427490234375,
+ "logps/rejected": -275.5614318847656,
+ "loss": 0.6842,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.022266970947384834,
+ "rewards/margins": 0.02082091197371483,
+ "rewards/margins_max": 0.02934589982032776,
+ "rewards/margins_min": 0.012295925989747047,
+ "rewards/margins_std": 0.012056154198944569,
+ "rewards/rejected": 0.0014460586244240403,
+ "step": 160
+ },
+ {
+ "epoch": 0.18,
+ "grad_norm": 0.3046875,
+ "learning_rate": 4.895573140745967e-07,
+ "logits/chosen": -1.5039719343185425,
+ "logits/rejected": -1.0316977500915527,
+ "logps/chosen": -345.7312316894531,
+ "logps/rejected": -288.9708251953125,
+ "loss": 0.6836,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.02365526184439659,
+ "rewards/margins": 0.02123275026679039,
+ "rewards/margins_max": 0.03283926099538803,
+ "rewards/margins_min": 0.009626244194805622,
+ "rewards/margins_std": 0.016414081677794456,
+ "rewards/rejected": 0.002422512974590063,
+ "step": 170
+ },
+ {
+ "epoch": 0.19,
+ "grad_norm": 0.265625,
+ "learning_rate": 4.866946677079314e-07,
+ "logits/chosen": -1.5601823329925537,
+ "logits/rejected": -1.0506742000579834,
+ "logps/chosen": -231.99703979492188,
+ "logps/rejected": -237.2670135498047,
+ "loss": 0.6828,
+ "rewards/accuracies": 0.8500000238418579,
+ "rewards/chosen": 0.022809389978647232,
+ "rewards/margins": 0.020736858248710632,
+ "rewards/margins_max": 0.030726289376616478,
+ "rewards/margins_min": 0.010747427120804787,
+ "rewards/margins_std": 0.014127190224826336,
+ "rewards/rejected": 0.002072530798614025,
+ "step": 180
+ },
+ {
+ "epoch": 0.2,
+ "grad_norm": 0.365234375,
+ "learning_rate": 4.834962027278417e-07,
+ "logits/chosen": -1.4942362308502197,
+ "logits/rejected": -0.8470790982246399,
+ "logps/chosen": -291.2981872558594,
+ "logps/rejected": -240.0912628173828,
+ "loss": 0.682,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.017627181485295296,
+ "rewards/margins": 0.022342149168252945,
+ "rewards/margins_max": 0.033914387226104736,
+ "rewards/margins_min": 0.01076990831643343,
+ "rewards/margins_std": 0.016365615651011467,
+ "rewards/rejected": -0.004714967682957649,
+ "step": 190
+ },
+ {
+ "epoch": 0.22,
+ "grad_norm": 0.345703125,
+ "learning_rate": 4.799664570653473e-07,
+ "logits/chosen": -1.4884960651397705,
+ "logits/rejected": -0.7421751022338867,
+ "logps/chosen": -295.2840270996094,
+ "logps/rejected": -230.6145477294922,
+ "loss": 0.6808,
+ "rewards/accuracies": 1.0,
+ "rewards/chosen": 0.02664310857653618,
+ "rewards/margins": 0.031296949833631516,
+ "rewards/margins_max": 0.04525149241089821,
+ "rewards/margins_min": 0.01734241284430027,
+ "rewards/margins_std": 0.0197346992790699,
+ "rewards/rejected": -0.00465384079143405,
+ "step": 200
+ },
+ {
+ "epoch": 0.23,
+ "grad_norm": 0.2578125,
+ "learning_rate": 4.7611043866720737e-07,
+ "logits/chosen": -1.49364173412323,
+ "logits/rejected": -1.031049132347107,
+ "logps/chosen": -258.94512939453125,
+ "logps/rejected": -287.6114196777344,
+ "loss": 0.6808,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.021609986200928688,
+ "rewards/margins": 0.020240817219018936,
+ "rewards/margins_max": 0.033506136387586594,
+ "rewards/margins_min": 0.006975496653467417,
+ "rewards/margins_std": 0.018759997561573982,
+ "rewards/rejected": 0.0013691672356799245,
+ "step": 210
+ },
+ {
+ "epoch": 0.24,
+ "grad_norm": 0.291015625,
+ "learning_rate": 4.719336183907265e-07,
+ "logits/chosen": -1.3044389486312866,
+ "logits/rejected": -0.9332467317581177,
+ "logps/chosen": -223.7823944091797,
+ "logps/rejected": -217.8594207763672,
+ "loss": 0.6799,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.02070688083767891,
+ "rewards/margins": 0.020120607689023018,
+ "rewards/margins_max": 0.03160488232970238,
+ "rewards/margins_min": 0.008636328391730785,
+ "rewards/margins_std": 0.016241220757365227,
+ "rewards/rejected": 0.0005862751277163625,
+ "step": 220
+ },
+ {
+ "epoch": 0.25,
+ "grad_norm": 0.3125,
+ "learning_rate": 4.6744192224178984e-07,
+ "logits/chosen": -1.3842111825942993,
+ "logits/rejected": -0.9888660311698914,
+ "logps/chosen": -236.2473907470703,
+ "logps/rejected": -271.76641845703125,
+ "loss": 0.6802,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.022707760334014893,
+ "rewards/margins": 0.02468196675181389,
+ "rewards/margins_max": 0.03745966777205467,
+ "rewards/margins_min": 0.011904269456863403,
+ "rewards/margins_std": 0.018070396035909653,
+ "rewards/rejected": -0.0019742068834602833,
+ "step": 230
+ },
+ {
+ "epoch": 0.26,
+ "grad_norm": 0.30078125,
+ "learning_rate": 4.6264172296714e-07,
+ "logits/chosen": -1.468925952911377,
+ "logits/rejected": -0.8928337097167969,
+ "logps/chosen": -218.7871551513672,
+ "logps/rejected": -232.1916961669922,
+ "loss": 0.6782,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.029307205229997635,
+ "rewards/margins": 0.03086397610604763,
+ "rewards/margins_max": 0.0454208180308342,
+ "rewards/margins_min": 0.016307134181261063,
+ "rewards/margins_std": 0.02058648318052292,
+ "rewards/rejected": -0.0015567743685096502,
+ "step": 240
+ },
+ {
+ "epoch": 0.27,
+ "grad_norm": 0.263671875,
+ "learning_rate": 4.575398310128262e-07,
+ "logits/chosen": -1.52159583568573,
+ "logits/rejected": -1.062409520149231,
+ "logps/chosen": -205.5279083251953,
+ "logps/rejected": -209.3750457763672,
+ "loss": 0.6784,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.0256162341684103,
+ "rewards/margins": 0.029211264103651047,
+ "rewards/margins_max": 0.04594603180885315,
+ "rewards/margins_min": 0.012476496398448944,
+ "rewards/margins_std": 0.023666534572839737,
+ "rewards/rejected": -0.0035950313322246075,
+ "step": 250
+ },
+ {
+ "epoch": 0.28,
+ "grad_norm": 0.28515625,
+ "learning_rate": 4.5214348486165227e-07,
+ "logits/chosen": -1.4600447416305542,
+ "logits/rejected": -1.0638010501861572,
+ "logps/chosen": -263.8650817871094,
+ "logps/rejected": -257.2422790527344,
+ "loss": 0.6772,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.031198721379041672,
+ "rewards/margins": 0.032171688973903656,
+ "rewards/margins_max": 0.04676546901464462,
+ "rewards/margins_min": 0.01757790893316269,
+ "rewards/margins_std": 0.02063872292637825,
+ "rewards/rejected": -0.0009729691664688289,
+ "step": 260
+ },
+ {
+ "epoch": 0.29,
+ "grad_norm": 0.333984375,
+ "learning_rate": 4.4646034076333254e-07,
+ "logits/chosen": -1.4118484258651733,
+ "logits/rejected": -1.0390139818191528,
+ "logps/chosen": -254.69589233398438,
+ "logps/rejected": -277.36029052734375,
+ "loss": 0.6747,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.03478916734457016,
+ "rewards/margins": 0.03131258860230446,
+ "rewards/margins_max": 0.0460633859038353,
+ "rewards/margins_min": 0.016561787575483322,
+ "rewards/margins_std": 0.020860780030488968,
+ "rewards/rejected": 0.0034765794407576323,
+ "step": 270
+ },
+ {
+ "epoch": 0.3,
+ "grad_norm": 0.337890625,
+ "learning_rate": 4.404984618719274e-07,
+ "logits/chosen": -1.5356388092041016,
+ "logits/rejected": -0.8826289176940918,
+ "logps/chosen": -213.24365234375,
+ "logps/rejected": -205.28201293945312,
+ "loss": 0.674,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.03563285619020462,
+ "rewards/margins": 0.044356830418109894,
+ "rewards/margins_max": 0.05965030938386917,
+ "rewards/margins_min": 0.029063349589705467,
+ "rewards/margins_std": 0.0216282457113266,
+ "rewards/rejected": -0.008723974227905273,
+ "step": 280
+ },
+ {
+ "epoch": 0.31,
+ "grad_norm": 0.322265625,
+ "learning_rate": 4.342663068059689e-07,
+ "logits/chosen": -1.5109997987747192,
+ "logits/rejected": -1.0301908254623413,
+ "logps/chosen": -227.60043334960938,
+ "logps/rejected": -225.4455108642578,
+ "loss": 0.6737,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.033510081470012665,
+ "rewards/margins": 0.03692306950688362,
+ "rewards/margins_max": 0.05596238374710083,
+ "rewards/margins_min": 0.017883744090795517,
+ "rewards/margins_std": 0.026925668120384216,
+ "rewards/rejected": -0.0034129873383790255,
+ "step": 290
+ },
+ {
+ "epoch": 0.32,
+ "grad_norm": 0.361328125,
+ "learning_rate": 4.27772717647508e-07,
+ "logits/chosen": -1.4661680459976196,
+ "logits/rejected": -1.0104472637176514,
+ "logps/chosen": -242.0974578857422,
+ "logps/rejected": -232.9008026123047,
+ "loss": 0.6744,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.030629118904471397,
+ "rewards/margins": 0.03372279554605484,
+ "rewards/margins_max": 0.050439924001693726,
+ "rewards/margins_min": 0.017005670815706253,
+ "rewards/margins_std": 0.023641586303710938,
+ "rewards/rejected": -0.00309367710724473,
+ "step": 300
+ },
+ {
+ "epoch": 0.33,
+ "grad_norm": 0.294921875,
+ "learning_rate": 4.2102690739710975e-07,
+ "logits/chosen": -1.35811448097229,
+ "logits/rejected": -0.8667371869087219,
+ "logps/chosen": -207.32943725585938,
+ "logps/rejected": -239.2294464111328,
+ "loss": 0.67,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.04066639393568039,
+ "rewards/margins": 0.040916211903095245,
+ "rewards/margins_max": 0.05701497197151184,
+ "rewards/margins_min": 0.02481745555996895,
+ "rewards/margins_std": 0.0227670781314373,
+ "rewards/rejected": -0.0002498172107152641,
+ "step": 310
+ },
+ {
+ "epoch": 0.35,
+ "grad_norm": 0.302734375,
+ "learning_rate": 4.140384469025954e-07,
+ "logits/chosen": -1.4642293453216553,
+ "logits/rejected": -0.8783510327339172,
+ "logps/chosen": -269.0649719238281,
+ "logps/rejected": -253.2615966796875,
+ "loss": 0.6722,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.042783185839653015,
+ "rewards/margins": 0.04893990606069565,
+ "rewards/margins_max": 0.07330337911844254,
+ "rewards/margins_min": 0.024576421827077866,
+ "rewards/margins_std": 0.03445516526699066,
+ "rewards/rejected": -0.0061567178927361965,
+ "step": 320
+ },
+ {
+ "epoch": 0.36,
+ "grad_norm": 0.326171875,
+ "learning_rate": 4.068172512800759e-07,
+ "logits/chosen": -1.4688003063201904,
+ "logits/rejected": -0.9481694102287292,
+ "logps/chosen": -263.0448303222656,
+ "logps/rejected": -261.3179931640625,
+ "loss": 0.6708,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.03902563825249672,
+ "rewards/margins": 0.04815050959587097,
+ "rewards/margins_max": 0.06747350096702576,
+ "rewards/margins_min": 0.02882750704884529,
+ "rewards/margins_std": 0.02732684649527073,
+ "rewards/rejected": -0.009124869480729103,
+ "step": 330
+ },
+ {
+ "epoch": 0.37,
+ "grad_norm": 0.271484375,
+ "learning_rate": 3.993735658465446e-07,
+ "logits/chosen": -1.3654890060424805,
+ "logits/rejected": -1.0353434085845947,
+ "logps/chosen": -227.32803344726562,
+ "logps/rejected": -255.8052520751953,
+ "loss": 0.6708,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.035697564482688904,
+ "rewards/margins": 0.03952573984861374,
+ "rewards/margins_max": 0.06120690703392029,
+ "rewards/margins_min": 0.01784456893801689,
+ "rewards/margins_std": 0.030661800876259804,
+ "rewards/rejected": -0.0038281746674329042,
+ "step": 340
+ },
641
+ {
642
+ "epoch": 0.38,
643
+ "grad_norm": 0.2734375,
644
+ "learning_rate": 3.917179515839839e-07,
645
+ "logits/chosen": -1.522472620010376,
646
+ "logits/rejected": -0.8194535970687866,
647
+ "logps/chosen": -272.1158752441406,
648
+ "logps/rejected": -224.29052734375,
649
+ "loss": 0.6722,
650
+ "rewards/accuracies": 0.949999988079071,
651
+ "rewards/chosen": 0.04334510862827301,
652
+ "rewards/margins": 0.045738715678453445,
653
+ "rewards/margins_max": 0.07282572239637375,
654
+ "rewards/margins_min": 0.01865171454846859,
655
+ "rewards/margins_std": 0.03830680996179581,
656
+ "rewards/rejected": -0.002393609844148159,
657
+ "step": 350
658
+ },
659
+ {
660
+ "epoch": 0.39,
661
+ "grad_norm": 0.328125,
662
+ "learning_rate": 3.8386127015561377e-07,
663
+ "logits/chosen": -1.4874508380889893,
664
+ "logits/rejected": -0.9508639574050903,
665
+ "logps/chosen": -264.84063720703125,
666
+ "logps/rejected": -285.2606506347656,
667
+ "loss": 0.6679,
668
+ "rewards/accuracies": 0.8999999761581421,
669
+ "rewards/chosen": 0.038348861038684845,
670
+ "rewards/margins": 0.05200430005788803,
671
+ "rewards/margins_max": 0.07226666063070297,
672
+ "rewards/margins_min": 0.03174193948507309,
+ "rewards/margins_std": 0.028655309230089188,
+ "rewards/rejected": -0.013655440881848335,
+ "step": 360
+ },
+ {
+ "epoch": 0.4,
+ "grad_norm": 0.310546875,
+ "learning_rate": 3.758146684955368e-07,
+ "logits/chosen": -1.4181379079818726,
+ "logits/rejected": -0.9576012492179871,
+ "logps/chosen": -246.5336151123047,
+ "logps/rejected": -277.94232177734375,
+ "loss": 0.6709,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.03755969554185867,
+ "rewards/margins": 0.04255872964859009,
+ "rewards/margins_max": 0.06353293359279633,
+ "rewards/margins_min": 0.021584514528512955,
+ "rewards/margins_std": 0.029662013053894043,
+ "rewards/rejected": -0.00499903317540884,
+ "step": 370
+ },
+ {
+ "epoch": 0.41,
+ "grad_norm": 0.349609375,
+ "learning_rate": 3.6758956299364643e-07,
+ "logits/chosen": -1.6038116216659546,
+ "logits/rejected": -1.0424953699111938,
+ "logps/chosen": -232.23916625976562,
+ "logps/rejected": -258.908203125,
+ "loss": 0.6694,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.044283345341682434,
+ "rewards/margins": 0.04850899800658226,
+ "rewards/margins_max": 0.07648013532161713,
+ "rewards/margins_min": 0.020537864416837692,
+ "rewards/margins_std": 0.039557162672281265,
+ "rewards/rejected": -0.004225648939609528,
+ "step": 380
+ },
+ {
+ "epoch": 0.42,
+ "grad_norm": 0.318359375,
+ "learning_rate": 3.591976232982355e-07,
+ "logits/chosen": -1.4831396341323853,
+ "logits/rejected": -0.7363861203193665,
+ "logps/chosen": -274.75311279296875,
+ "logps/rejected": -217.2397918701172,
+ "loss": 0.6673,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.052159328013658524,
+ "rewards/margins": 0.05904749035835266,
+ "rewards/margins_max": 0.08474520593881607,
+ "rewards/margins_min": 0.03334975615143776,
+ "rewards/margins_std": 0.0363420732319355,
+ "rewards/rejected": -0.006888158619403839,
+ "step": 390
+ },
+ {
+ "epoch": 0.43,
+ "grad_norm": 0.3046875,
+ "learning_rate": 3.506507557592853e-07,
+ "logits/chosen": -1.4179537296295166,
+ "logits/rejected": -0.8838945627212524,
+ "logps/chosen": -329.03863525390625,
+ "logps/rejected": -292.02008056640625,
+ "loss": 0.6655,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.044211916625499725,
+ "rewards/margins": 0.057223010808229446,
+ "rewards/margins_max": 0.08557866513729095,
+ "rewards/margins_min": 0.028867345303297043,
+ "rewards/margins_std": 0.04010096564888954,
+ "rewards/rejected": -0.013011088594794273,
+ "step": 400
+ },
+ {
+ "epoch": 0.44,
+ "grad_norm": 0.4375,
+ "learning_rate": 3.419610865359266e-07,
+ "logits/chosen": -1.4828202724456787,
+ "logits/rejected": -0.8764745593070984,
+ "logps/chosen": -280.54998779296875,
+ "logps/rejected": -276.4558410644531,
+ "loss": 0.6682,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.05212799459695816,
+ "rewards/margins": 0.05141522362828255,
+ "rewards/margins_max": 0.08034422993659973,
+ "rewards/margins_min": 0.022486215457320213,
+ "rewards/margins_std": 0.04091179370880127,
+ "rewards/rejected": 0.0007127688149921596,
+ "step": 410
+ },
+ {
+ "epoch": 0.45,
+ "grad_norm": 0.33984375,
+ "learning_rate": 3.33140944392039e-07,
+ "logits/chosen": -1.2407147884368896,
+ "logits/rejected": -0.8896020650863647,
+ "logps/chosen": -235.588134765625,
+ "logps/rejected": -242.38858032226562,
+ "loss": 0.6687,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.03107466734945774,
+ "rewards/margins": 0.04102586954832077,
+ "rewards/margins_max": 0.0646006390452385,
+ "rewards/margins_min": 0.017451094463467598,
+ "rewards/margins_std": 0.033339761197566986,
+ "rewards/rejected": -0.00995120219886303,
+ "step": 420
+ },
+ {
+ "epoch": 0.46,
+ "grad_norm": 0.3203125,
+ "learning_rate": 3.2420284320439736e-07,
+ "logits/chosen": -1.5982177257537842,
+ "logits/rejected": -0.8901177644729614,
+ "logps/chosen": -235.1398468017578,
+ "logps/rejected": -225.75119018554688,
+ "loss": 0.667,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.05144185945391655,
+ "rewards/margins": 0.0693436712026596,
+ "rewards/margins_max": 0.09611108899116516,
+ "rewards/margins_min": 0.04257623478770256,
+ "rewards/margins_std": 0.03785485774278641,
+ "rewards/rejected": -0.01790180616080761,
+ "step": 430
+ },
+ {
+ "epoch": 0.47,
+ "grad_norm": 0.291015625,
+ "learning_rate": 3.151594642081834e-07,
+ "logits/chosen": -1.5106613636016846,
+ "logits/rejected": -0.9530634880065918,
+ "logps/chosen": -259.29364013671875,
+ "logps/rejected": -263.5410461425781,
+ "loss": 0.6681,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.054123084992170334,
+ "rewards/margins": 0.0667182207107544,
+ "rewards/margins_max": 0.09307887405157089,
+ "rewards/margins_min": 0.0403575673699379,
+ "rewards/margins_std": 0.03727959841489792,
+ "rewards/rejected": -0.012595141306519508,
+ "step": 440
+ },
+ {
+ "epoch": 0.49,
+ "grad_norm": 0.341796875,
+ "learning_rate": 3.060236380050519e-07,
+ "logits/chosen": -1.5215215682983398,
+ "logits/rejected": -0.915096640586853,
+ "logps/chosen": -241.9713897705078,
+ "logps/rejected": -212.6835174560547,
+ "loss": 0.6648,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.041197773069143295,
+ "rewards/margins": 0.06050584465265274,
+ "rewards/margins_max": 0.09923966228961945,
+ "rewards/margins_min": 0.021772030740976334,
+ "rewards/margins_std": 0.05477788299322128,
+ "rewards/rejected": -0.019308075308799744,
+ "step": 450
+ },
+ {
+ "epoch": 0.5,
+ "grad_norm": 0.369140625,
+ "learning_rate": 2.968083263592782e-07,
+ "logits/chosen": -1.429099202156067,
+ "logits/rejected": -0.9603363275527954,
+ "logps/chosen": -226.94985961914062,
+ "logps/rejected": -231.18212890625,
+ "loss": 0.6669,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.039319347590208054,
+ "rewards/margins": 0.0540069118142128,
+ "rewards/margins_max": 0.08096911013126373,
+ "rewards/margins_min": 0.027044707909226418,
+ "rewards/margins_std": 0.03813030570745468,
+ "rewards/rejected": -0.014687557704746723,
+ "step": 460
+ },
+ {
+ "epoch": 0.51,
+ "grad_norm": 0.3125,
+ "learning_rate": 2.875266038078136e-07,
+ "logits/chosen": -1.467827320098877,
+ "logits/rejected": -0.8274309039115906,
+ "logps/chosen": -262.02984619140625,
+ "logps/rejected": -258.05291748046875,
+ "loss": 0.6663,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.047955792397260666,
+ "rewards/margins": 0.05926694720983505,
+ "rewards/margins_max": 0.08313200622797012,
+ "rewards/margins_min": 0.035401880741119385,
+ "rewards/margins_std": 0.033750299364328384,
+ "rewards/rejected": -0.011311152949929237,
+ "step": 470
+ },
+ {
+ "epoch": 0.52,
+ "grad_norm": 0.330078125,
+ "learning_rate": 2.781916391103417e-07,
+ "logits/chosen": -1.4126121997833252,
+ "logits/rejected": -1.062652826309204,
+ "logps/chosen": -312.4531555175781,
+ "logps/rejected": -325.8970642089844,
+ "loss": 0.6669,
+ "rewards/accuracies": 0.8999999761581421,
+ "rewards/chosen": 0.046898700296878815,
+ "rewards/margins": 0.05282552167773247,
+ "rewards/margins_max": 0.07998780906200409,
+ "rewards/margins_min": 0.025663232430815697,
+ "rewards/margins_std": 0.03841327875852585,
+ "rewards/rejected": -0.005926821380853653,
+ "step": 480
+ },
+ {
+ "epoch": 0.53,
+ "grad_norm": 0.3359375,
+ "learning_rate": 2.6881667656565226e-07,
+ "logits/chosen": -1.4687728881835938,
+ "logits/rejected": -0.979697048664093,
+ "logps/chosen": -241.0842742919922,
+ "logps/rejected": -232.9574432373047,
+ "loss": 0.6659,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.054931215941905975,
+ "rewards/margins": 0.0482785627245903,
+ "rewards/margins_max": 0.0750807598233223,
+ "rewards/margins_min": 0.02147636190056801,
+ "rewards/margins_std": 0.037904031574726105,
+ "rewards/rejected": 0.006652662996202707,
+ "step": 490
+ },
+ {
+ "epoch": 0.54,
+ "grad_norm": 0.28515625,
+ "learning_rate": 2.594150172208416e-07,
+ "logits/chosen": -1.5185790061950684,
+ "logits/rejected": -0.9422761797904968,
+ "logps/chosen": -234.38119506835938,
+ "logps/rejected": -252.7394561767578,
+ "loss": 0.669,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.04691288247704506,
+ "rewards/margins": 0.05338384583592415,
+ "rewards/margins_max": 0.08102253079414368,
+ "rewards/margins_min": 0.025745173916220665,
+ "rewards/margins_std": 0.03908699378371239,
+ "rewards/rejected": -0.006470963358879089,
+ "step": 500
+ },
+ {
+ "epoch": 0.55,
+ "grad_norm": 0.302734375,
+ "learning_rate": 2.5e-07,
+ "logits/chosen": -1.4481227397918701,
+ "logits/rejected": -1.0346171855926514,
+ "logps/chosen": -220.05325317382812,
+ "logps/rejected": -236.8842315673828,
+ "loss": 0.6634,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.044769201427698135,
+ "rewards/margins": 0.05242365598678589,
+ "rewards/margins_max": 0.08209005743265152,
+ "rewards/margins_min": 0.022757260128855705,
+ "rewards/margins_std": 0.04195461794734001,
+ "rewards/rejected": -0.007654457353055477,
+ "step": 510
+ },
+ {
+ "epoch": 0.56,
+ "grad_norm": 0.31640625,
+ "learning_rate": 2.405849827791583e-07,
+ "logits/chosen": -1.3833669424057007,
+ "logits/rejected": -0.881574273109436,
+ "logps/chosen": -241.27481079101562,
+ "logps/rejected": -263.91351318359375,
+ "loss": 0.6641,
+ "rewards/accuracies": 1.0,
+ "rewards/chosen": 0.05363880842924118,
+ "rewards/margins": 0.065273717045784,
+ "rewards/margins_max": 0.09218905121088028,
+ "rewards/margins_min": 0.03835836052894592,
+ "rewards/margins_std": 0.03806404396891594,
+ "rewards/rejected": -0.011634895578026772,
+ "step": 520
+ },
+ {
+ "epoch": 0.57,
+ "grad_norm": 0.3671875,
+ "learning_rate": 2.3118332343434777e-07,
+ "logits/chosen": -1.5236294269561768,
+ "logits/rejected": -0.975500762462616,
+ "logps/chosen": -249.5443572998047,
+ "logps/rejected": -252.99618530273438,
+ "loss": 0.6649,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.057201676070690155,
+ "rewards/margins": 0.06602860987186432,
+ "rewards/margins_max": 0.08838556706905365,
+ "rewards/margins_min": 0.04367166385054588,
+ "rewards/margins_std": 0.03161751106381416,
+ "rewards/rejected": -0.00882694311439991,
+ "step": 530
+ },
+ {
+ "epoch": 0.58,
+ "grad_norm": 0.2734375,
+ "learning_rate": 2.218083608896583e-07,
+ "logits/chosen": -1.4163635969161987,
+ "logits/rejected": -1.021444320678711,
+ "logps/chosen": -238.6166534423828,
+ "logps/rejected": -231.279296875,
+ "loss": 0.6678,
+ "rewards/accuracies": 0.8500000238418579,
+ "rewards/chosen": 0.04542135074734688,
+ "rewards/margins": 0.04465199261903763,
+ "rewards/margins_max": 0.07032604515552521,
+ "rewards/margins_min": 0.018977930769324303,
+ "rewards/margins_std": 0.03630860149860382,
+ "rewards/rejected": 0.000769357371609658,
+ "step": 540
+ },
+ {
+ "epoch": 0.59,
+ "grad_norm": 0.298828125,
+ "learning_rate": 2.1247339619218638e-07,
+ "logits/chosen": -1.5570924282073975,
+ "logits/rejected": -0.9521455764770508,
+ "logps/chosen": -244.45877075195312,
+ "logps/rejected": -218.03067016601562,
+ "loss": 0.6633,
+ "rewards/accuracies": 0.875,
+ "rewards/chosen": 0.04918726533651352,
+ "rewards/margins": 0.05797078087925911,
+ "rewards/margins_max": 0.07724090665578842,
+ "rewards/margins_min": 0.0387006476521492,
+ "rewards/margins_std": 0.027252081781625748,
+ "rewards/rejected": -0.008783518336713314,
+ "step": 550
+ },
+ {
+ "epoch": 0.6,
+ "grad_norm": 0.369140625,
+ "learning_rate": 2.031916736407218e-07,
+ "logits/chosen": -1.246797800064087,
+ "logits/rejected": -0.8754084706306458,
+ "logps/chosen": -255.4636688232422,
+ "logps/rejected": -206.0786590576172,
+ "loss": 0.6682,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.04116806760430336,
+ "rewards/margins": 0.05079611390829086,
+ "rewards/margins_max": 0.07820285856723785,
+ "rewards/margins_min": 0.023389369249343872,
+ "rewards/margins_std": 0.03875899314880371,
+ "rewards/rejected": -0.009628048166632652,
+ "step": 560
+ },
+ {
+ "epoch": 0.61,
+ "grad_norm": 0.3125,
+ "learning_rate": 1.9397636199494806e-07,
+ "logits/chosen": -1.3417741060256958,
+ "logits/rejected": -0.9931136965751648,
+ "logps/chosen": -245.11184692382812,
+ "logps/rejected": -274.5023498535156,
+ "loss": 0.667,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.04023710638284683,
+ "rewards/margins": 0.060276590287685394,
+ "rewards/margins_max": 0.08814635127782822,
+ "rewards/margins_min": 0.03240684047341347,
+ "rewards/margins_std": 0.03941378742456436,
+ "rewards/rejected": -0.02003948949277401,
+ "step": 570
+ },
+ {
+ "epoch": 0.63,
+ "grad_norm": 0.310546875,
+ "learning_rate": 1.8484053579181658e-07,
+ "logits/chosen": -1.4128963947296143,
+ "logits/rejected": -0.9094281196594238,
+ "logps/chosen": -241.9584503173828,
+ "logps/rejected": -255.27676391601562,
+ "loss": 0.6669,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.04914793372154236,
+ "rewards/margins": 0.06206917762756348,
+ "rewards/margins_max": 0.09373507648706436,
+ "rewards/margins_min": 0.030403289943933487,
+ "rewards/margins_std": 0.044782333076000214,
+ "rewards/rejected": -0.012921245768666267,
+ "step": 580
+ },
+ {
+ "epoch": 0.64,
+ "grad_norm": 0.3828125,
+ "learning_rate": 1.757971567956027e-07,
+ "logits/chosen": -1.7144603729248047,
+ "logits/rejected": -0.9307141304016113,
+ "logps/chosen": -272.17547607421875,
+ "logps/rejected": -241.65670776367188,
+ "loss": 0.6637,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.05295403674244881,
+ "rewards/margins": 0.05789683386683464,
+ "rewards/margins_max": 0.08770729601383209,
+ "rewards/margins_min": 0.028086364269256592,
+ "rewards/margins_std": 0.04215836524963379,
+ "rewards/rejected": -0.00494279433041811,
+ "step": 590
+ },
+ {
+ "epoch": 0.65,
+ "grad_norm": 0.2890625,
+ "learning_rate": 1.6685905560796098e-07,
+ "logits/chosen": -1.3933067321777344,
+ "logits/rejected": -0.8825523257255554,
+ "logps/chosen": -218.8330078125,
+ "logps/rejected": -243.945556640625,
+ "loss": 0.6669,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.044868241995573044,
+ "rewards/margins": 0.04994089528918266,
+ "rewards/margins_max": 0.07978564500808716,
+ "rewards/margins_min": 0.020096149295568466,
+ "rewards/margins_std": 0.04220684990286827,
+ "rewards/rejected": -0.005072650499641895,
+ "step": 600
+ },
+ {
+ "epoch": 0.66,
+ "grad_norm": 0.384765625,
+ "learning_rate": 1.580389134640734e-07,
+ "logits/chosen": -1.4454293251037598,
+ "logits/rejected": -1.0624114274978638,
+ "logps/chosen": -232.04562377929688,
+ "logps/rejected": -241.92013549804688,
+ "loss": 0.6626,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.048992957919836044,
+ "rewards/margins": 0.06253460794687271,
+ "rewards/margins_max": 0.09021260589361191,
+ "rewards/margins_min": 0.034856610000133514,
+ "rewards/margins_std": 0.03914260491728783,
+ "rewards/rejected": -0.013541650958359241,
+ "step": 610
+ },
+ {
+ "epoch": 0.67,
+ "grad_norm": 0.41796875,
+ "learning_rate": 1.4934924424071475e-07,
+ "logits/chosen": -1.5101556777954102,
+ "logits/rejected": -0.8701263666152954,
+ "logps/chosen": -268.46563720703125,
+ "logps/rejected": -246.90725708007812,
+ "loss": 0.6673,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.047949280589818954,
+ "rewards/margins": 0.05510631948709488,
+ "rewards/margins_max": 0.08622908592224121,
+ "rewards/margins_min": 0.023983558639883995,
+ "rewards/margins_std": 0.044014234095811844,
+ "rewards/rejected": -0.00715703796595335,
+ "step": 620
+ },
+ {
+ "epoch": 0.68,
+ "grad_norm": 0.345703125,
+ "learning_rate": 1.4080237670176453e-07,
+ "logits/chosen": -1.5899397134780884,
+ "logits/rejected": -0.9542080163955688,
+ "logps/chosen": -250.47073364257812,
+ "logps/rejected": -213.67672729492188,
+ "loss": 0.6642,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.048125170171260834,
+ "rewards/margins": 0.058349937200546265,
+ "rewards/margins_max": 0.08438356220722198,
+ "rewards/margins_min": 0.03231631591916084,
+ "rewards/margins_std": 0.03681711107492447,
+ "rewards/rejected": -0.010224771685898304,
+ "step": 630
+ },
+ {
+ "epoch": 0.69,
+ "grad_norm": 0.32421875,
+ "learning_rate": 1.3241043700635352e-07,
+ "logits/chosen": -1.4617611169815063,
+ "logits/rejected": -0.715996265411377,
+ "logps/chosen": -311.3377990722656,
+ "logps/rejected": -235.23257446289062,
+ "loss": 0.6615,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.049124039709568024,
+ "rewards/margins": 0.06973692029714584,
+ "rewards/margins_max": 0.09559544920921326,
+ "rewards/margins_min": 0.04387838765978813,
+ "rewards/margins_std": 0.03656948357820511,
+ "rewards/rejected": -0.02061288245022297,
+ "step": 640
+ },
+ {
+ "epoch": 0.7,
+ "grad_norm": 0.341796875,
+ "learning_rate": 1.2418533150446324e-07,
+ "logits/chosen": -1.5678064823150635,
+ "logits/rejected": -0.8574711680412292,
+ "logps/chosen": -270.28997802734375,
+ "logps/rejected": -229.6758575439453,
+ "loss": 0.6643,
+ "rewards/accuracies": 1.0,
+ "rewards/chosen": 0.049732841551303864,
+ "rewards/margins": 0.06726487725973129,
+ "rewards/margins_max": 0.09402552247047424,
+ "rewards/margins_min": 0.04050421714782715,
+ "rewards/margins_std": 0.03784528002142906,
+ "rewards/rejected": -0.01753203384578228,
+ "step": 650
+ },
+ {
+ "epoch": 0.71,
+ "grad_norm": 0.322265625,
+ "learning_rate": 1.1613872984438628e-07,
+ "logits/chosen": -1.5800104141235352,
+ "logits/rejected": -0.9640630483627319,
+ "logps/chosen": -217.96762084960938,
+ "logps/rejected": -209.03237915039062,
+ "loss": 0.6664,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.05414942651987076,
+ "rewards/margins": 0.05594904348254204,
+ "rewards/margins_max": 0.08529958873987198,
+ "rewards/margins_min": 0.026598507538437843,
+ "rewards/margins_std": 0.04150792956352234,
+ "rewards/rejected": -0.0017996244132518768,
+ "step": 660
+ },
+ {
+ "epoch": 0.72,
+ "grad_norm": 0.412109375,
+ "learning_rate": 1.0828204841601607e-07,
+ "logits/chosen": -1.573081612586975,
+ "logits/rejected": -1.0643428564071655,
+ "logps/chosen": -267.5829772949219,
+ "logps/rejected": -279.0192565917969,
+ "loss": 0.6654,
+ "rewards/accuracies": 1.0,
+ "rewards/chosen": 0.05038607865571976,
+ "rewards/margins": 0.059136874973773956,
+ "rewards/margins_max": 0.08540179580450058,
+ "rewards/margins_min": 0.032871946692466736,
+ "rewards/margins_std": 0.037144217640161514,
+ "rewards/rejected": -0.00875079445540905,
+ "step": 670
+ },
+ {
+ "epoch": 0.73,
+ "grad_norm": 0.3046875,
+ "learning_rate": 1.0062643415345545e-07,
+ "logits/chosen": -1.5550715923309326,
+ "logits/rejected": -0.911180853843689,
+ "logps/chosen": -234.7667999267578,
+ "logps/rejected": -255.4190673828125,
+ "loss": 0.6641,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.051492560654878616,
+ "rewards/margins": 0.06578966230154037,
+ "rewards/margins_max": 0.09979908168315887,
+ "rewards/margins_min": 0.03178024664521217,
+ "rewards/margins_std": 0.04809657856822014,
+ "rewards/rejected": -0.014297107234597206,
+ "step": 680
+ },
+ {
+ "epoch": 0.74,
+ "grad_norm": 0.287109375,
+ "learning_rate": 9.318274871992407e-08,
+ "logits/chosen": -1.5272128582000732,
+ "logits/rejected": -1.013564944267273,
+ "logps/chosen": -241.36782836914062,
+ "logps/rejected": -224.7821044921875,
+ "loss": 0.6665,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.04677369445562363,
+ "rewards/margins": 0.05521191284060478,
+ "rewards/margins_max": 0.08911783993244171,
+ "rewards/margins_min": 0.021305980160832405,
+ "rewards/margins_std": 0.04795023053884506,
+ "rewards/rejected": -0.008438214659690857,
+ "step": 690
+ },
+ {
+ "epoch": 0.76,
+ "grad_norm": 0.314453125,
+ "learning_rate": 8.596155309740469e-08,
+ "logits/chosen": -1.6033340692520142,
+ "logits/rejected": -1.09574294090271,
+ "logps/chosen": -246.16799926757812,
+ "logps/rejected": -263.65142822265625,
+ "loss": 0.6648,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.0439617782831192,
+ "rewards/margins": 0.054900698363780975,
+ "rewards/margins_max": 0.08236662298440933,
+ "rewards/margins_min": 0.02743479050695896,
+ "rewards/margins_std": 0.03884267061948776,
+ "rewards/rejected": -0.010938925668597221,
+ "step": 700
+ },
+ {
+ "epoch": 0.77,
+ "grad_norm": 0.267578125,
+ "learning_rate": 7.897309260289026e-08,
+ "logits/chosen": -1.5482122898101807,
+ "logits/rejected": -1.1510074138641357,
+ "logps/chosen": -249.0963897705078,
+ "logps/rejected": -259.163818359375,
+ "loss": 0.6635,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.04723275452852249,
+ "rewards/margins": 0.05302921682596207,
+ "rewards/margins_max": 0.0721876472234726,
+ "rewards/margins_min": 0.03387077525258064,
+ "rewards/margins_std": 0.027094120159745216,
+ "rewards/rejected": -0.005796459037810564,
+ "step": 710
+ },
+ {
+ "epoch": 0.78,
+ "grad_norm": 0.373046875,
+ "learning_rate": 7.222728235249195e-08,
+ "logits/chosen": -1.3384692668914795,
+ "logits/rejected": -0.7518659830093384,
+ "logps/chosen": -202.97393798828125,
+ "logps/rejected": -189.1539764404297,
+ "loss": 0.6669,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.03645472601056099,
+ "rewards/margins": 0.05072823911905289,
+ "rewards/margins_max": 0.0754196047782898,
+ "rewards/margins_min": 0.026036862283945084,
+ "rewards/margins_std": 0.03491886705160141,
+ "rewards/rejected": -0.014273506589233875,
+ "step": 720
+ },
+ {
+ "epoch": 0.79,
+ "grad_norm": 0.28125,
+ "learning_rate": 6.573369319403108e-08,
+ "logits/chosen": -1.5845191478729248,
+ "logits/rejected": -0.9599083065986633,
+ "logps/chosen": -228.37417602539062,
+ "logps/rejected": -237.9397430419922,
+ "loss": 0.6648,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.05403770133852959,
+ "rewards/margins": 0.06774094700813293,
+ "rewards/margins_max": 0.09628431499004364,
+ "rewards/margins_min": 0.03919757157564163,
+ "rewards/margins_std": 0.04036641865968704,
+ "rewards/rejected": -0.013703237287700176,
+ "step": 730
+ },
+ {
+ "epoch": 0.8,
+ "grad_norm": 0.357421875,
+ "learning_rate": 5.9501538128072597e-08,
+ "logits/chosen": -1.5676336288452148,
+ "logits/rejected": -0.8547343015670776,
+ "logps/chosen": -290.7181701660156,
+ "logps/rejected": -234.4829559326172,
+ "loss": 0.6666,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.046895284205675125,
+ "rewards/margins": 0.059913188219070435,
+ "rewards/margins_max": 0.08948854357004166,
+ "rewards/margins_min": 0.030337834730744362,
+ "rewards/margins_std": 0.04182586818933487,
+ "rewards/rejected": -0.01301790215075016,
+ "step": 740
+ },
+ {
+ "epoch": 0.81,
+ "grad_norm": 0.34765625,
+ "learning_rate": 5.353965923666742e-08,
+ "logits/chosen": -1.3945667743682861,
+ "logits/rejected": -0.8627855181694031,
+ "logps/chosen": -309.19976806640625,
+ "logps/rejected": -313.39263916015625,
+ "loss": 0.6653,
+ "rewards/accuracies": 1.0,
+ "rewards/chosen": 0.052454281598329544,
+ "rewards/margins": 0.06173131614923477,
+ "rewards/margins_max": 0.08643205463886261,
+ "rewards/margins_min": 0.037030577659606934,
+ "rewards/margins_std": 0.03493211418390274,
+ "rewards/rejected": -0.009277036413550377,
+ "step": 750
+ },
+ {
+ "epoch": 0.82,
+ "grad_norm": 0.3046875,
+ "learning_rate": 4.7856515138347735e-08,
+ "logits/chosen": -1.501372218132019,
+ "logits/rejected": -0.781902015209198,
+ "logps/chosen": -265.1309509277344,
+ "logps/rejected": -230.6505126953125,
+ "loss": 0.6627,
+ "rewards/accuracies": 1.0,
+ "rewards/chosen": 0.05485969036817551,
+ "rewards/margins": 0.0646880641579628,
+ "rewards/margins_max": 0.08704294264316559,
+ "rewards/margins_min": 0.04233316332101822,
+ "rewards/margins_std": 0.03161459416151047,
+ "rewards/rejected": -0.009828361682593822,
+ "step": 760
+ },
+ {
+ "epoch": 0.83,
+ "grad_norm": 0.359375,
+ "learning_rate": 4.2460168987173806e-08,
+ "logits/chosen": -1.595336675643921,
+ "logits/rejected": -0.9258917570114136,
+ "logps/chosen": -295.4404602050781,
+ "logps/rejected": -247.3825225830078,
+ "loss": 0.6617,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.056455206125974655,
+ "rewards/margins": 0.0680319294333458,
+ "rewards/margins_max": 0.09642402827739716,
+ "rewards/margins_min": 0.039639830589294434,
+ "rewards/margins_std": 0.040152497589588165,
+ "rewards/rejected": -0.011576727032661438,
+ "step": 770
+ },
+ {
+ "epoch": 0.84,
+ "grad_norm": 0.326171875,
+ "learning_rate": 3.7358277032860016e-08,
+ "logits/chosen": -1.5608649253845215,
+ "logits/rejected": -0.8827853202819824,
+ "logps/chosen": -260.73944091796875,
+ "logps/rejected": -277.09674072265625,
+ "loss": 0.6638,
+ "rewards/accuracies": 1.0,
+ "rewards/chosen": 0.049758292734622955,
+ "rewards/margins": 0.06226016953587532,
+ "rewards/margins_max": 0.08762570470571518,
+ "rewards/margins_min": 0.03689463064074516,
+ "rewards/margins_std": 0.035872288048267365,
+ "rewards/rejected": -0.012501873075962067,
+ "step": 780
+ },
+ {
+ "epoch": 0.85,
+ "grad_norm": 0.251953125,
+ "learning_rate": 3.255807775821015e-08,
+ "logits/chosen": -1.5679352283477783,
+ "logits/rejected": -0.9218677282333374,
+ "logps/chosen": -289.8990783691406,
+ "logps/rejected": -239.71261596679688,
+ "loss": 0.6657,
+ "rewards/accuracies": 1.0,
+ "rewards/chosen": 0.05259973928332329,
+ "rewards/margins": 0.06116511672735214,
+ "rewards/margins_max": 0.08521325886249542,
+ "rewards/margins_min": 0.037116967141628265,
+ "rewards/margins_std": 0.034009214490652084,
+ "rewards/rejected": -0.00856537651270628,
+ "step": 790
+ },
+ {
+ "epoch": 0.86,
+ "grad_norm": 0.306640625,
+ "learning_rate": 2.8066381609273493e-08,
+ "logits/chosen": -1.4278221130371094,
+ "logits/rejected": -0.8260752558708191,
+ "logps/chosen": -247.1492156982422,
+ "logps/rejected": -225.48062133789062,
+ "loss": 0.6659,
+ "rewards/accuracies": 1.0,
+ "rewards/chosen": 0.05115921422839165,
+ "rewards/margins": 0.0625448077917099,
+ "rewards/margins_max": 0.09311069548130035,
+ "rewards/margins_min": 0.031978923827409744,
+ "rewards/margins_std": 0.043226685374975204,
+ "rewards/rejected": -0.011385595425963402,
+ "step": 800
+ },
+ {
+ "epoch": 0.87,
+ "grad_norm": 0.3359375,
+ "learning_rate": 2.3889561332792657e-08,
+ "logits/chosen": -1.5297521352767944,
+ "logits/rejected": -0.9847742319107056,
+ "logps/chosen": -270.2022399902344,
+ "logps/rejected": -246.3189239501953,
+ "loss": 0.6656,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.04832616075873375,
+ "rewards/margins": 0.057206034660339355,
+ "rewards/margins_max": 0.08940082043409348,
+ "rewards/margins_min": 0.025011247023940086,
+ "rewards/margins_std": 0.04553030803799629,
+ "rewards/rejected": -0.008879872970283031,
+ "step": 810
+ },
+ {
+ "epoch": 0.88,
+ "grad_norm": 0.3203125,
+ "learning_rate": 2.0033542934652675e-08,
+ "logits/chosen": -1.5510307550430298,
+ "logits/rejected": -0.856239914894104,
+ "logps/chosen": -256.7835388183594,
+ "logps/rejected": -223.23135375976562,
+ "loss": 0.6654,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.052466489374637604,
+ "rewards/margins": 0.058548521250486374,
+ "rewards/margins_max": 0.08875375986099243,
+ "rewards/margins_min": 0.028343280777335167,
+ "rewards/margins_std": 0.042716652154922485,
+ "rewards/rejected": -0.006082023028284311,
+ "step": 820
+ },
+ {
+ "epoch": 0.9,
+ "grad_norm": 0.333984375,
+ "learning_rate": 1.6503797272158282e-08,
+ "logits/chosen": -1.4320178031921387,
+ "logits/rejected": -1.0073983669281006,
+ "logps/chosen": -241.7576904296875,
+ "logps/rejected": -277.2112731933594,
+ "loss": 0.6654,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.03875169903039932,
+ "rewards/margins": 0.05952931568026543,
+ "rewards/margins_max": 0.09110890328884125,
+ "rewards/margins_min": 0.027949709445238113,
+ "rewards/margins_std": 0.044660307466983795,
+ "rewards/rejected": -0.020777616649866104,
+ "step": 830
+ },
+ {
+ "epoch": 0.91,
+ "grad_norm": 0.291015625,
+ "learning_rate": 1.3305332292068705e-08,
+ "logits/chosen": -1.5706042051315308,
+ "logits/rejected": -1.0899040699005127,
+ "logps/chosen": -276.74017333984375,
+ "logps/rejected": -287.7333068847656,
+ "loss": 0.6645,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.05088504031300545,
+ "rewards/margins": 0.0576903410255909,
+ "rewards/margins_max": 0.0889071449637413,
+ "rewards/margins_min": 0.02647354081273079,
+ "rewards/margins_std": 0.04414721950888634,
+ "rewards/rejected": -0.006805300712585449,
+ "step": 840
+ },
+ {
+ "epoch": 0.92,
+ "grad_norm": 0.33203125,
+ "learning_rate": 1.0442685925403344e-08,
+ "logits/chosen": -1.5475889444351196,
+ "logits/rejected": -1.067118525505066,
+ "logps/chosen": -247.9560089111328,
+ "logps/rejected": -261.21209716796875,
+ "loss": 0.6645,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.054555535316467285,
+ "rewards/margins": 0.06061048060655594,
+ "rewards/margins_max": 0.09440271556377411,
+ "rewards/margins_min": 0.026818236336112022,
+ "rewards/margins_std": 0.04778943955898285,
+ "rewards/rejected": -0.00605494249612093,
+ "step": 850
+ },
+ {
+ "epoch": 0.93,
+ "grad_norm": 0.35546875,
+ "learning_rate": 7.91991964909744e-09,
+ "logits/chosen": -1.5974628925323486,
+ "logits/rejected": -0.9376012682914734,
+ "logps/chosen": -226.9661407470703,
+ "logps/rejected": -215.2276153564453,
+ "loss": 0.6649,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.04701418802142143,
+ "rewards/margins": 0.06498473882675171,
+ "rewards/margins_max": 0.08930108696222305,
+ "rewards/margins_min": 0.040668390691280365,
+ "rewards/margins_std": 0.034388504922389984,
+ "rewards/rejected": -0.017970550805330276,
+ "step": 860
+ },
+ {
+ "epoch": 0.94,
+ "grad_norm": 0.259765625,
+ "learning_rate": 5.740612723643401e-09,
+ "logits/chosen": -1.5247188806533813,
+ "logits/rejected": -1.0012189149856567,
+ "logps/chosen": -215.85086059570312,
+ "logps/rejected": -213.22756958007812,
+ "loss": 0.6651,
+ "rewards/accuracies": 0.925000011920929,
+ "rewards/chosen": 0.03825182095170021,
+ "rewards/margins": 0.05696084350347519,
+ "rewards/margins_max": 0.08913502097129822,
+ "rewards/margins_min": 0.024786660447716713,
+ "rewards/margins_std": 0.045501161366701126,
+ "rewards/rejected": -0.01870902255177498,
+ "step": 870
+ },
+ {
+ "epoch": 0.95,
+ "grad_norm": 0.3203125,
+ "learning_rate": 3.907857114893359e-09,
+ "logits/chosen": -1.526238203048706,
+ "logits/rejected": -1.0228248834609985,
+ "logps/chosen": -250.3122100830078,
+ "logps/rejected": -268.1669616699219,
+ "loss": 0.6645,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.05541209131479263,
+ "rewards/margins": 0.06139000505208969,
+ "rewards/margins_max": 0.08599219471216202,
+ "rewards/margins_min": 0.03678782656788826,
+ "rewards/margins_std": 0.03479274362325668,
+ "rewards/rejected": -0.005977921187877655,
+ "step": 880
+ },
+ {
+ "epoch": 0.96,
+ "grad_norm": 0.390625,
+ "learning_rate": 2.4242531072273255e-09,
+ "logits/chosen": -1.6231883764266968,
+ "logits/rejected": -1.0414937734603882,
+ "logps/chosen": -242.6604766845703,
+ "logps/rejected": -248.1051483154297,
+ "loss": 0.6626,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.04938145726919174,
+ "rewards/margins": 0.05951399728655815,
+ "rewards/margins_max": 0.08650043606758118,
+ "rewards/margins_min": 0.03252756968140602,
+ "rewards/margins_std": 0.03816457465291023,
+ "rewards/rejected": -0.01013254001736641,
+ "step": 890
+ },
+ {
+ "epoch": 0.97,
+ "grad_norm": 0.279296875,
+ "learning_rate": 1.2919056143113061e-09,
+ "logits/chosen": -1.6871249675750732,
+ "logits/rejected": -1.0585591793060303,
+ "logps/chosen": -217.6454620361328,
+ "logps/rejected": -223.8054656982422,
+ "loss": 0.6662,
+ "rewards/accuracies": 0.949999988079071,
+ "rewards/chosen": 0.048433009535074234,
+ "rewards/margins": 0.06717512011528015,
+ "rewards/margins_max": 0.09603999555110931,
+ "rewards/margins_min": 0.03831023350358009,
+ "rewards/margins_std": 0.04082110896706581,
+ "rewards/rejected": -0.01874210312962532,
+ "step": 900
+ },
+ {
+ "epoch": 0.98,
+ "grad_norm": 0.326171875,
+ "learning_rate": 5.124211926793575e-10,
+ "logits/chosen": -1.413845419883728,
+ "logits/rejected": -0.7665145993232727,
+ "logps/chosen": -264.106689453125,
+ "logps/rejected": -229.184326171875,
+ "loss": 0.6668,
+ "rewards/accuracies": 1.0,
+ "rewards/chosen": 0.0459374263882637,
+ "rewards/margins": 0.05880744382739067,
+ "rewards/margins_max": 0.08354458957910538,
+ "rewards/margins_min": 0.034070298075675964,
+ "rewards/margins_std": 0.03498360887169838,
+ "rewards/rejected": -0.012870019301772118,
+ "step": 910
+ },
+ {
+ "epoch": 0.99,
+ "grad_norm": 0.341796875,
+ "learning_rate": 8.690576237688207e-11,
+ "logits/chosen": -1.4011876583099365,
+ "logits/rejected": -0.9543458819389343,
+ "logps/chosen": -281.8406677246094,
+ "logps/rejected": -222.5669708251953,
+ "loss": 0.6665,
+ "rewards/accuracies": 0.9750000238418579,
+ "rewards/chosen": 0.045102305710315704,
+ "rewards/margins": 0.05106315761804581,
+ "rewards/margins_max": 0.07114427536725998,
+ "rewards/margins_min": 0.03098202869296074,
+ "rewards/margins_std": 0.028398994356393814,
+ "rewards/rejected": -0.0059608458541333675,
+ "step": 920
+ },
+ {
+ "epoch": 1.0,
+ "eval_logits/chosen": -1.0221611261367798,
+ "eval_logits/rejected": -0.8962497115135193,
+ "eval_logps/chosen": -331.233642578125,
+ "eval_logps/rejected": -327.55169677734375,
+ "eval_loss": 0.6933034658432007,
+ "eval_rewards/accuracies": 0.5360000133514404,
+ "eval_rewards/chosen": 0.026766540482640266,
+ "eval_rewards/margins": 0.0002464489371050149,
+ "eval_rewards/margins_max": 0.06006291136145592,
+ "eval_rewards/margins_min": -0.061876267194747925,
+ "eval_rewards/margins_std": 0.04059867188334465,
+ "eval_rewards/rejected": 0.02652009204030037,
+ "eval_runtime": 750.1106,
+ "eval_samples_per_second": 5.333,
+ "eval_steps_per_second": 0.167,
+ "step": 927
+ },
+ {
+ "epoch": 1.0,
+ "step": 927,
+ "total_flos": 0.0,
+ "train_loss": 0.6720605961327414,
+ "train_runtime": 8514.5547,
+ "train_samples_per_second": 1.741,
+ "train_steps_per_second": 0.109
+ }
+ ],
+ "logging_steps": 10,
+ "max_steps": 927,
+ "num_input_tokens_seen": 0,
+ "num_train_epochs": 1,
+ "save_steps": 100,
+ "total_flos": 0.0,
+ "train_batch_size": 2,
+ "trial_name": null,
+ "trial_params": null
+ }
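Two regularities in the log above can be checked directly from the logged numbers. First, `rewards/margins` should equal `rewards/chosen` minus `rewards/rejected` up to float32 rounding, assuming (as in TRL's `DPOTrainer`) that all three are means over the same preference pairs. Second, the `learning_rate` column fits a linear warmup followed by a half-cosine decay to zero, the same shape as Transformers' `get_cosine_schedule_with_warmup`, if one assumes 93 warmup steps, i.e. `ceil(0.1 * 927)`. That warmup ratio of 0.1 is inferred from the numbers and is not recorded in this file. The sketch below copies values from the step-810, step-920 and final eval entries and prints logged versus reconstructed values; `implied_margin`, `cosine_lr` and the constants are illustrative names, not part of any library.

```python
import math

# Values copied verbatim from the log entries above.
STEP_810 = {
    "learning_rate": 2.3889561332792657e-08,
    "rewards/chosen": 0.04832616075873375,
    "rewards/rejected": -0.008879872970283031,
    "rewards/margins": 0.057206034660339355,
}
STEP_920 = {
    "learning_rate": 8.690576237688207e-11,
    "rewards/chosen": 0.045102305710315704,
    "rewards/rejected": -0.0059608458541333675,
    "rewards/margins": 0.05106315761804581,
}
EVAL = {
    "rewards/chosen": 0.026766540482640266,
    "rewards/rejected": 0.02652009204030037,
    "rewards/margins": 0.0002464489371050149,
}

PEAK_LR = 5e-07    # learning_rate hyperparameter from the model card
MAX_STEPS = 927    # "max_steps" in this file
WARMUP_STEPS = 93  # ASSUMPTION: ceil(0.1 * 927); the warmup ratio is not recorded here


def implied_margin(entry: dict) -> float:
    """Mean margin implied by the logged mean chosen and rejected rewards."""
    return entry["rewards/chosen"] - entry["rewards/rejected"]


def cosine_lr(step: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay reaching zero at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    progress = (step - WARMUP_STEPS) / max(1, MAX_STEPS - WARMUP_STEPS)
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


for name, entry in (("step 810", STEP_810), ("step 920", STEP_920), ("eval", EVAL)):
    print(f"{name}: logged margin {entry['rewards/margins']:.9f}, "
          f"chosen - rejected {implied_margin(entry):.9f}")

for step, entry in ((810, STEP_810), (920, STEP_920)):
    print(f"step {step}: logged lr {entry['learning_rate']:.6e}, "
          f"reconstructed {cosine_lr(step):.6e}")
```

With 92 or 94 warmup steps instead, the reconstructed step-920 rate moves about 0.25% away from the logged value, while 93 agrees to within about 0.001%, which is why 93 is used as the working assumption.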