---
license: mit
language:
- en
tags:
- ODIN
- RLHF
- PPO
---
|
|
|
## Model Details
|
This is the official release of the ODIN-ppo-L230-7B model, a chat assistant trained by fine-tuning LLaMA on the Open-Assistant dataset via PPO, with ODIN serving as the reward model during training. The "L230" suffix indicates that the model's average output length on the LIMA test set is around 230.
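
As a minimal usage sketch (not taken from the ODIN repository), the checkpoint can presumably be loaded with the standard `transformers` API like any other LLaMA-based causal LM. The repository id and the Vicuna-style prompt format below are assumptions inferred from this card and may need adjusting:

```python
# Minimal sketch: load the checkpoint with the standard transformers API.
# The repository id is assumed from this card's title; adjust if it differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Lichang-Chen/ODIN-ppo-L230-7B"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Since the model is fine-tuned from Vicuna-7b, a Vicuna-style single-turn
# prompt is presumably expected; this exact template is an assumption.
prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "USER: What is reward hacking in RLHF? ASSISTANT:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```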
|
|
|
### Model Description
|
|
|
|
|
|
- **Developed by:** [Lichang-Chen](https://huggingface.co/Lichang-Chen) and [Chen Zhu](https://scholar.google.com/citations?hl=zh-CN&user=m-om5O8AAAAJ)

- **Model type:** RLHF model.

- **Language(s) (NLP):** English

- **Finetuned from model:** [Vicuna-7b](https://huggingface.co/lmsys/vicuna-7b-v1.5)
|
|
|
### Model Sources
|
|
|
|
|
|
- **Repository:** [ODIN](https://github.com/Lichang-Chen/ODIN)

- **Paper:** [ODIN: Disentangled Reward Mitigates Hacking in RLHF](https://huggingface.co/papers/2402.07319)