---
license: mit
language:
- en
tags:
- ODIN
- RLHF
- PPO
---
|
|
|
## Model Details
|
This is the official release of the ODIN-ppo-L230-7B model, a chat assistant trained by fine-tuning LLaMA on the Open-Assistant dataset via PPO, with ODIN serving as the reward model during training. The "L230" suffix indicates that the model's average output length on the LIMA test set is around 230.
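
As a minimal usage sketch (not taken from the ODIN repository), the checkpoint can presumably be loaded with the standard `transformers` API like any other LLaMA-based causal LM. The repository id and the Vicuna-style prompt format below are assumptions inferred from this card and may need adjusting:

```python
# Minimal sketch: load the checkpoint with the standard transformers API.
# The repository id is assumed from this card's title; adjust if it differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Lichang-Chen/ODIN-ppo-L230-7B"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Since the model is fine-tuned from Vicuna-7b, a Vicuna-style single-turn
# prompt is presumably expected; this exact template is an assumption.
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "USER: What is reward hacking in RLHF? ASSISTANT:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```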
|
|
|
### Model Description
|
|
|
|
|
|
- **Developed by:** [Lichang-Chen](https://huggingface.co/Lichang-Chen) and [Chen Zhu](https://scholar.google.com/citations?hl=zh-CN&user=m-om5O8AAAAJ)

- **Model type:** RLHF model.

- **Language(s) (NLP):** English

- **Finetuned from model:** [Vicuna-7b](https://huggingface.co/lmsys/vicuna-7b-v1.5)
|
|
|
### Model Sources
|
|
|
|
|
|
- **Repository:** [ODIN](https://github.com/Lichang-Chen/ODIN)

- **Paper:** [ODIN: Disentangled Reward Mitigates Hacking in RLHF](https://huggingface.co/papers/2402.07319)