---
license: bigscience-bloom-rail-1.0
datasets:
- OpenAssistant/oasst1
- RyokoAI/ShareGPT52K
- Dahoas/full-hh-rlhf
- liswei/rm-static-m2m100-zh
- fnlp/moss-002-sft-data
language:
- zh
- en
---

This model is an attempt to replicate the RLHF (Reinforcement Learning from Human Feedback) pipeline: supervised fine-tuning, reward modeling, and reinforcement learning. Minimal, hedged code sketches of each stage are appended at the end of this card.

### Base Model

We used [bloomz-7b1-mt](https://huggingface.co/bigscience/bloomz-7b1-mt) because of its less restrictive license and multilingual ability.

### Supervised Fine-Tuning

For SFT we used a combination of multiple datasets, including:

- [RyokoAI/ShareGPT52K](https://huggingface.co/datasets/RyokoAI/ShareGPT52K)
- [GPTeacher](https://github.com/teknium1/GPTeacher)
- [Alpaca-GPT4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) (English and Chinese)
- A filtered subset of the ShareGPT dataset machine-translated into Chinese

### Reward Model

For the RM we used the code of the [reward-modeling](https://github.com/Dahoas/reward-modeling) repo and datasets from:

- [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)
- [Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf)
- [liswei/rm-static-m2m100-zh](https://huggingface.co/datasets/liswei/rm-static-m2m100-zh)

### Reinforcement Learning

For RL we used the code of [trlx](https://github.com/CarperAI/trlx) and prompts from:

- [fnlp/moss-002-sft-data](https://huggingface.co/datasets/fnlp/moss-002-sft-data/tree/main)
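
### Illustrative Code Sketches

The sketches below are hypothetical illustrations of the three training stages described above; model paths, data formats, and hyperparameters are placeholders rather than the exact settings used for this model.

**Supervised fine-tuning.** A minimal causal-LM fine-tuning loop over instruction/response text with the `transformers` `Trainer`, assuming each conversation has already been flattened into a single text field:

```python
# Hypothetical SFT sketch: fine-tune bloomz-7b1-mt on instruction/response text.
# The data file, text format, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assume each example has been flattened into a single "text" field,
# e.g. "Human: ...\nAssistant: ...".
dataset = load_dataset("json", data_files="sft_data.json", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sft-bloomz-7b1-mt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=2,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    # Causal-LM collator: labels are the input ids, no masked-LM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```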
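
**Reward modeling.** The standard pairwise ranking objective used in RLHF reward modeling, `-log(sigmoid(r_chosen - r_rejected))`; this is a generic sketch, not the exact implementation in the reward-modeling repo:

```python
# Hypothetical reward-model sketch using the standard pairwise ranking loss.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bigscience/bloomz-7b1-mt"  # in practice, initialized from the SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# A single scalar head on top of the LM serves as the reward score.
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

def ranking_loss(chosen_texts, rejected_texts):
    """Score chosen/rejected responses and push the chosen scores above the rejected ones."""
    chosen = tokenizer(chosen_texts, padding=True, truncation=True, return_tensors="pt")
    rejected = tokenizer(rejected_texts, padding=True, truncation=True, return_tensors="pt")
    r_chosen = reward_model(**chosen).logits.squeeze(-1)     # one scalar reward per sample
    r_rejected = reward_model(**rejected).logits.squeeze(-1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy example pair; real pairs come from oasst1, full-hh-rlhf, and rm-static-m2m100-zh.
loss = ranking_loss(["Human: Hi\nAssistant: Hello! How can I help?"],
                    ["Human: Hi\nAssistant: Go away."])
loss.backward()
```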
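
**Reinforcement learning.** A minimal trlx PPO call that scores sampled responses with the trained reward model. It assumes a trlx version that exposes `default_ppo_config` (older releases load a YAML config instead); the checkpoint paths and prompts are placeholders, and real prompts would be drawn from fnlp/moss-002-sft-data:

```python
# Hypothetical trlx PPO sketch; paths and prompts are placeholders.
import torch
import trlx
from trlx.data.default_configs import default_ppo_config
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path to the reward model trained in the previous step.
rm_path = "reward-model-checkpoint"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_path, num_labels=1).eval()

def reward_fn(samples, **kwargs):
    """Score full prompt+response strings with the reward model."""
    with torch.no_grad():
        batch = rm_tokenizer(samples, padding=True, truncation=True, return_tensors="pt")
        return reward_model(**batch).logits.squeeze(-1).tolist()

config = default_ppo_config()
config.model.model_path = "sft-bloomz-7b1-mt"          # SFT checkpoint as the policy init
config.tokenizer.tokenizer_path = "sft-bloomz-7b1-mt"

# Placeholder prompt list; in practice, prompts are extracted from fnlp/moss-002-sft-data.
prompts = ["Explain what RLHF is.", "用中文介绍一下 BLOOM 模型。"]

trainer = trlx.train(
    reward_fn=reward_fn,
    prompts=prompts,
    eval_prompts=prompts[:2],
    config=config,
)
```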