|
To use this model, load it with `AutoModelForSequenceClassification`:
|
```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "hendrydong/Mistral-RM-for-RAFT-GSHF-v0", num_labels=1, torch_dtype=torch.bfloat16
)
```
|
and prepare the input as a list of chat messages, for example:
|
```python
SAMPLE = [
    {'role': 'user', 'content': 'Hi!'},
    {'role': 'assistant', 'content': 'How are you?'},
]
```
|
|
|
The chat template is the same as `mistralai/Mistral-7B-Instruct-v0.2`.
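
As a minimal sketch of scoring a conversation, assuming the model repository ships the Mistral-Instruct tokenizer and that `model` and `SAMPLE` are defined as above, the single regression logit is the reward:

```python
import torch
from transformers import AutoTokenizer

# Load the tokenizer shipped with the reward model (Mistral-Instruct chat template assumed).
tokenizer = AutoTokenizer.from_pretrained("hendrydong/Mistral-RM-for-RAFT-GSHF-v0")

# Apply the chat template and tokenize the conversation.
input_ids = tokenizer.apply_chat_template(SAMPLE, return_tensors="pt")

# The sequence-classification head (num_labels=1) returns one scalar reward per sequence.
with torch.no_grad():
    reward = model(input_ids).logits[0].item()
print(reward)
```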
|
|
|
The reward model can be used for iterative SFT/DPO. |
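
For example, one can score several sampled responses to the same prompt and keep the best one as training data for the next round. A minimal sketch, assuming the model and tokenizer loaded in the snippets above; the `score` helper and candidate strings are illustrative only:

```python
def score(messages):
    # Hypothetical helper: reward for one chat-formatted conversation.
    input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
    with torch.no_grad():
        return model(input_ids).logits[0].item()

prompt = {'role': 'user', 'content': 'Hi!'}
# Hypothetical candidates, e.g. sampled from the current policy model.
candidates = ['How are you?', 'Hello! How can I help you today?']

# Best-of-n selection: keep the highest-reward response for the next SFT/DPO round.
best = max(candidates, key=lambda c: score([prompt, {'role': 'assistant', 'content': c}]))
print(best)
```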
|
|
|
Please cite the following papers if you find this reward model helpful:
|
```
@article{dong2023raft,
  title={Raft: Reward ranked finetuning for generative foundation model alignment},
  author={Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
  journal={arXiv preprint arXiv:2304.06767},
  year={2023}
}

@article{xiong2023gibbs,
  title={Gibbs sampling from human feedback: A provable kl-constrained framework for rlhf},
  author={Xiong, Wei and Dong, Hanze and Ye, Chenlu and Zhong, Han and Jiang, Nan and Zhang, Tong},
  journal={arXiv preprint arXiv:2312.11456},
  year={2023}
}
```