---
license: bigscience-bloom-rail-1.0
datasets:
- OpenAssistant/oasst1
- RyokoAI/ShareGPT52K
- Dahoas/full-hh-rlhf
- liswei/rm-static-m2m100-zh
- fnlp/moss-002-sft-data
language:
- zh
- en
---
This is an attempt to replicate the RLHF (Reinforcement Learning from Human Feedback) pipeline.
## Base Model
We used bloomz-7b1-mt as the base model because of its less restrictive license and its multilingual ability.
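As a minimal sketch, the base checkpoint (`bigscience/bloomz-7b1-mt` on the Hugging Face Hub) can be loaded with the transformers library; the dtype and device settings below are illustrative assumptions, not part of this card's actual setup:

```python
# Minimal loading sketch; requires `pip install transformers torch accelerate`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "bigscience/bloomz-7b1-mt"  # the base model named above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # assumption: half precision to fit a 7B model on one GPU
    device_map="auto",          # assumption: automatic device placement via accelerate
)

prompt = "Translate to English: Je t'aime."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```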
## Supervised Fine-tuning
For SFT we used a combination of several datasets (see the sketch after this list), including:
- RyokoAI/ShareGPT52K
- GPTeacher
- Alpaca-GPT4 en & zh
- A filtered subset of the ShareGPT dataset machine-translated into Chinese
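The card does not spell out the SFT recipe, so the following is only a sketch of standard causal-LM fine-tuning with the transformers Trainer; the conversation template, hyperparameters, and the `to_features` helper are illustrative assumptions:

```python
# Hypothetical SFT sketch: standard causal-LM fine-tuning with transformers' Trainer.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_ID = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Assumed example format: {"prompt": ..., "response": ...} pairs from the datasets above.
examples = [{"prompt": "What is BLOOM?", "response": "BLOOM is a multilingual LLM."}]

def to_features(ex):
    # Assumed conversation template; the card's actual template is not given.
    text = f"User: {ex['prompt']}\nAssistant: {ex['response']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=512)

train_dataset = [to_features(ex) for ex in examples]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-5),
    train_dataset=train_dataset,
    # mlm=False makes the collator copy input_ids into labels for causal-LM training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```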
## Reward Model
For RM we used the code of the reward-modeling repo and the preference datasets listed in the metadata above (Dahoas/full-hh-rlhf and liswei/rm-static-m2m100-zh).
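The reward-modeling code itself is not reproduced here; the sketch below shows the standard pairwise ranking loss that such reward models optimize, where the model should score the chosen response above the rejected one. Initializing the RM from the base checkpoint and the `pairwise_loss` helper are assumptions:

```python
# Sketch of the standard pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "bigscience/bloomz-7b1-mt"  # assumption: RM initialized from the base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=1  # single scalar reward head
)
# Sequence classification needs to know the pad token to find the last real token.
reward_model.config.pad_token_id = tokenizer.pad_token_id

def pairwise_loss(prompt, chosen, rejected):
    """Ranking loss over one (chosen, rejected) pair for the given prompt."""
    batch = tokenizer([prompt + chosen, prompt + rejected],
                      return_tensors="pt", padding=True, truncation=True)
    rewards = reward_model(**batch).logits.squeeze(-1)  # shape: (2,)
    return -F.logsigmoid(rewards[0] - rewards[1])

loss = pairwise_loss("Q: Is water wet?\nA:", " Yes, water is wet.", " Banana.")
loss.backward()  # would feed an optimizer step in a real training loop
```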
## Reinforcement Learning
For RL we used the code of trlx and prompts from OpenAssistant/oasst1 and fnlp/moss-002-sft-data (listed in the metadata above).
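trlx's entry point is `trlx.train`, though its exact signature varies across versions; the sketch below is a hypothetical wiring of the PPO stage, where `score_with_reward_model`, the prompt, and the checkpoint path are placeholders rather than this card's actual configuration:

```python
# Hypothetical sketch of the PPO stage with trlx (https://github.com/CarperAI/trlx).
import trlx

def reward_fn(samples, **kwargs):
    # Assumption: score each generated sample with the reward model trained above.
    # score_with_reward_model is a hypothetical helper, not part of trlx.
    return [score_with_reward_model(s) for s in samples]

# Illustrative placeholder; real prompts come from the datasets named above.
prompts = ["User: How do I brew tea?\nAssistant:"]

trainer = trlx.train(
    model_path="path/to/sft-checkpoint",  # assumption: start PPO from the SFT model
    reward_fn=reward_fn,
    prompts=prompts,
)
trainer.save_pretrained("rlhf-out")
```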