REBEL: Reinforcement Learning via Regressing Relative Reward
Collection
10 items
•
Updated
•
1
This is a model released for our paper: REBEL: Reinforcement Learning via Regressing Relative Rewards.
This model is developed with REBEL based on OpenChat-3.5 with Starling-RM-7B-alpha as the reward model and Nectar dataset. The training code is available at https://github.com/ZhaolinGao/REBEL. We collect online generations during each iteration with a batch size of 32.
Model | AlpacaEval 2.0 LC Win Rate |
AlpacaEval 2.0 Win Rate |
MT-Bench Average |
MMLU (5-shot) |
GSM8K (5-shot) |
---|---|---|---|---|---|
REBEL-OpenChat-3.5 | 17.3 | 12.8 | 8.06 | 63.7 | 68.8 |
REBEL-Llama-3 | 30.1 | 32.6 | 8.16 | 65.8 | 75.6 |
REBEL-Llama-3-epoch_2 | 31.3 | 34.2 | 7.83 | 65.4 | 75.4 |
REBEL-Llama-3-Armo-iter_1 | 48.3 | 41.8 | 8.13 | 66.3 | 75.8 |
REBEL-Llama-3-Armo-iter_2 | 50.0 | 48.5 | 8.07 | 65.9 | 75.4 |
REBEL-Llama-3-Armo-iter_3 | 49.7 | 48.1 | 8.01 | 66.0 | 75.7 |
Please cite our paper if you use this model in your own work:
@misc{gao2024rebel,
title={REBEL: Reinforcement Learning via Regressing Relative Rewards},
author={Zhaolin Gao and Jonathan D. Chang and Wenhao Zhan and Owen Oertell and Gokul Swamy and Kianté Brantley and Thorsten Joachims and J. Andrew Bagnell and Jason D. Lee and Wen Sun},
year={2024},
eprint={2404.16767},
archivePrefix={arXiv},
primaryClass={cs.LG}
}