Commit 7cad203
1 Parent(s): 35700d2
Update README.md
README.md CHANGED
@@ -12,6 +12,8 @@ In this repo, we present a reward model trained by the framework [LMFlow](https:
 
 ## Model Details
 
+The training curves and some other details can be found in our paper [RAFT (Reward rAnked FineTuning)](https://arxiv.org/pdf/2304.06767.pdf). If you have any questions about this reward model, or about reward modeling in general, feel free to drop me an email at [email protected]. I would be happy to chat!
+
 ### Dataset preprocessing
 
 <!-- Provide a longer summary of what this model is. -->
@@ -34,6 +36,18 @@ We use bf16 and do not use LoRA in both of the stages.
 
 **The resulting model achieves an evaluation loss of 0.5 and an evaluation accuracy of 75.48%.**
 
+**Generalization**
+
+We further test the generalization ability of the reward model with another round of training carried out for a separate research project (using the same hyper-parameters). We test accuracy on the Open Assistant and Chatbot datasets and compare this reward model against reward models trained directly on those two datasets. The results (accuracy, %) are as follows:
+
+| Trained on → tested on | open assistant | chatbot | hh_rlhf |
+| ---------------------- | -------------- | ------- | ------- |
+| open assistant         | 69.5           | 61.1    | 58.7    |
+| chatbot                | 66.5           | 62.7    | 56.0    |
+| hh_rlhf                | 69.4           | 64.2    | 77.6    |
+
+As the table shows, the reward model trained on HH-RLHF achieves matching or even better accuracy on the Open Assistant and Chatbot datasets, even though it was not trained on them directly. The reward model may therefore also be used for these two datasets.
+
 
 
 ## Uses
@@ -67,7 +81,7 @@ We use bf16 and do not use LoRA in both of the stages.
 
 ### RAFT Example
 
-We test the reward model by the
+We test the reward model with RAFT, using EleutherAI/gpt-neo-2.7B as the starting checkpoint.
 
 For each iteration, we sample 2048 prompts from the HH-RLHF dataset; for each prompt, we generate K=8 responses with the current model and pick the response with the highest reward. We then finetune the model on this selected set to obtain the new model. We report the learning curve as follows:
 
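The evaluation accuracy mentioned above (75.48% on HH-RLHF, and the cross-dataset numbers in the generalization table) is most naturally read as pairwise preference accuracy: the fraction of (chosen, rejected) pairs for which the chosen response receives the higher reward. Below is a minimal sketch of that computation, not the repo's own evaluation script, assuming the reward model loads as a single-logit sequence classifier via `transformers`; the checkpoint path is a placeholder.

```python
# Minimal sketch: pairwise accuracy of a reward model on a preference dataset.
# Assumes the reward model is a single-logit sequence classifier and that each
# example provides a "chosen" and a "rejected" text, as in HH-RLHF.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "path/to/reward_model"  # placeholder, not the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
reward_model.eval()


@torch.no_grad()
def reward(text: str) -> float:
    """Scalar reward for a full prompt+response string."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    return reward_model(**inputs).logits[0, 0].item()


def pairwise_accuracy(pairs) -> float:
    """Fraction of (chosen, rejected) pairs where the chosen response scores higher."""
    pairs = list(pairs)
    correct = sum(reward(chosen) > reward(rejected) for chosen, rejected in pairs)
    return correct / len(pairs)


# Toy usage with a single hand-written pair (in practice, pairs would come from
# the HH-RLHF test split).
toy_pairs = [
    (
        "Human: Can you help me write an email? Assistant: Of course, what should it say?",
        "Human: Can you help me write an email? Assistant: No.",
    ),
]
print(pairwise_accuracy(toy_pairs))
```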
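For the RAFT example, the following is a minimal sketch of the best-of-K selection step described in the last hunk, reusing the `reward()` helper from the previous sketch. The sampling settings and the toy prompt list are illustrative assumptions, and the subsequent finetuning on the selected responses (for example with LMFlow) is omitted.

```python
# Minimal sketch of one RAFT data-collection step: for each sampled prompt,
# generate K=8 candidate responses with the current policy and keep the one
# with the highest reward. Reuses reward() from the sketch above.
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

policy_name = "EleutherAI/gpt-neo-2.7B"  # starting checkpoint named in the README
policy_tok = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name)
policy.eval()


@torch.no_grad()
def best_of_k(prompt: str, k: int = 8, max_new_tokens: int = 128) -> str:
    """Sample k continuations of `prompt` and return the highest-reward one."""
    inputs = policy_tok(prompt, return_tensors="pt")
    outputs = policy.generate(
        **inputs,
        do_sample=True,
        top_p=0.95,
        max_new_tokens=max_new_tokens,
        num_return_sequences=k,
        pad_token_id=policy_tok.eos_token_id,
    )
    candidates = [policy_tok.decode(seq, skip_special_tokens=True) for seq in outputs]
    return max(candidates, key=reward)


# One iteration of data collection; `hh_rlhf_prompts` stands in for the list of
# prompts extracted from the HH-RLHF dataset (not loaded here).
hh_rlhf_prompts = ["Human: How do I bake bread? Assistant:"]  # toy placeholder
prompts = random.sample(hh_rlhf_prompts, k=min(2048, len(hh_rlhf_prompts)))
selected = [best_of_k(p) for p in prompts]
# `selected` would then be used to finetune the policy for the next iteration.
```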