---
license: bigscience-bloom-rail-1.0
datasets:
- OpenAssistant/oasst1
- RyokoAI/ShareGPT52K
- Dahoas/full-hh-rlhf
- liswei/rm-static-m2m100-zh
- fnlp/moss-002-sft-data
language:
- zh
- en
---
This is an attempt to replicate the RLHF (Reinforcement Learning from Human Feedback) pipeline.
## Base Model
We used bloomz-7b1-mt as the base model because of its less restrictive license and its multilingual ability.
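As a minimal sketch, the base checkpoint (`bigscience/bloomz-7b1-mt` on the Hugging Face Hub) can be loaded with the transformers library; the dtype and device settings below are illustrative assumptions, not part of this card's actual setup:

```python
# Minimal loading sketch; requires `pip install transformers torch accelerate`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "bigscience/bloomz-7b1-mt"  # the base model named above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # assumption: half precision to fit a 7B model on one GPU
    device_map="auto",          # assumption: automatic device placement via accelerate
)

prompt = "Translate to English: Je t'aime."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```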
## Supervised Fine-tuning
For SFT we used a combination of several datasets (see the sketch after this list), including:
- RyokoAI/ShareGPT52K
- GPTeacher
- Alpaca-GPT4 en & zh
- A filtered subset of the ShareGPT dataset machine-translated into Chinese
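The card does not spell out the SFT recipe, so the following is only a sketch of standard causal-LM fine-tuning with the transformers Trainer; the conversation template, hyperparameters, and the `to_features` helper are illustrative assumptions:

```python
# Hypothetical SFT sketch: standard causal-LM fine-tuning with transformers' Trainer.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_ID = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Assumed example format: {"prompt": ..., "response": ...} pairs from the datasets above.
examples = [{"prompt": "What is BLOOM?", "response": "BLOOM is a multilingual LLM."}]

def to_features(ex):
    # Assumed conversation template; the card's actual template is not given.
    text = f"User: {ex['prompt']}\nAssistant: {ex['response']}{tokenizer.eos_token}"
    return tokenizer(text, truncation=True, max_length=512)

train_dataset = [to_features(ex) for ex in examples]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", per_device_train_batch_size=1,
                           num_train_epochs=1, learning_rate=2e-5),
    train_dataset=train_dataset,
    # mlm=False makes the collator copy input_ids into labels for causal-LM training.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```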
## Reward Model
For RM we used the code of the reward-modeling repo and the preference datasets listed in the metadata above (Dahoas/full-hh-rlhf and liswei/rm-static-m2m100-zh).
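The reward-modeling code itself is not reproduced here; the sketch below shows the standard pairwise ranking loss that such reward models optimize, where the model should score the chosen response above the rejected one. Initializing the RM from the base checkpoint and the `pairwise_loss` helper are assumptions:

```python
# Sketch of the standard pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected).
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "bigscience/bloomz-7b1-mt"  # assumption: RM initialized from the base model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
reward_model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=1  # single scalar reward head
)
# Sequence classification needs to know the pad token to find the last real token.
reward_model.config.pad_token_id = tokenizer.pad_token_id

def pairwise_loss(prompt, chosen, rejected):
    """Ranking loss over one (chosen, rejected) pair for the given prompt."""
    batch = tokenizer([prompt + chosen, prompt + rejected],
                      return_tensors="pt", padding=True, truncation=True)
    rewards = reward_model(**batch).logits.squeeze(-1)  # shape: (2,)
    return -F.logsigmoid(rewards[0] - rewards[1])

loss = pairwise_loss("Q: Is water wet?\nA:", " Yes, water is wet.", " Banana.")
loss.backward()  # would feed an optimizer step in a real training loop
```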
## Reinforcement Learning
For RL we used the code of trlx and prompts from OpenAssistant/oasst1 and fnlp/moss-002-sft-data (listed in the metadata above).
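trlx's entry point is `trlx.train`, though its exact signature varies across versions; the sketch below is a hypothetical wiring of the PPO stage, where `score_with_reward_model`, the prompt, and the checkpoint path are placeholders rather than this card's actual configuration:

```python
# Hypothetical sketch of the PPO stage with trlx (https://github.com/CarperAI/trlx).
import trlx

def reward_fn(samples, **kwargs):
    # Assumption: score each generated sample with the reward model trained above.
    # score_with_reward_model is a hypothetical helper, not part of trlx.
    return [score_with_reward_model(s) for s in samples]

# Illustrative placeholder; real prompts come from the datasets named above.
prompts = ["User: How do I brew tea?\nAssistant:"]

trainer = trlx.train(
    model_path="path/to/sft-checkpoint",  # assumption: start PPO from the SFT model
    reward_fn=reward_fn,
    prompts=prompts,
)
trainer.save_pretrained("rlhf-out")
```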