---
license: bigscience-bloom-rail-1.0
datasets:
- OpenAssistant/oasst1
- RyokoAI/ShareGPT52K
- Dahoas/full-hh-rlhf
- liswei/rm-static-m2m100-zh
- fnlp/moss-002-sft-data
language:
- zh
- en
---
This is an attempt to replicate the RLHF pipeline: supervised fine-tuning, reward modeling, and reinforcement learning.
### Base Model
We used [bloomz-7b1-mt](https://huggingface.co/bigscience/bloomz-7b1-mt) because of its less-restricted license and multilingual ability.
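For reference, the base model can be loaded with Hugging Face Transformers. This is a minimal sketch, not the exact training configuration used in this project:

```python
# Minimal sketch: loading the base model with Hugging Face Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)
```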
### Supervised Fine-tuning
For SFT we used a combination of multiple datasets (a rough training sketch follows the list below), including:
- [RyokoAI/ShareGPT52K](https://huggingface.co/datasets/RyokoAI/ShareGPT52K)
- [GPTeacher](https://github.com/teknium1/GPTeacher)
- [Alpaca-GPT4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) en & zh
- A filtered subset of the ShareGPT dataset machine-translated into Chinese
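A rough sketch of the SFT step with the standard Transformers `Trainer`. The example conversations, output path, and hyperparameters below are illustrative assumptions, not the exact script we ran; each dataset above needs its own conversation-to-text formatting before this point.

```python
# Rough SFT sketch: causal-LM fine-tuning on conversations flattened into text.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

base = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder prompt/response pairs, standing in for the formatted datasets above.
examples = [
    "User: 你好，请介绍一下你自己。\nAssistant: 你好，我是一个多语言助手。",
    "User: What is RLHF?\nAssistant: RLHF stands for reinforcement learning from human feedback.",
]

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

dataset = Dataset.from_dict({"text": examples}).map(tokenize, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bloomz-7b1-mt-sft",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=dataset,
    # mlm=False gives the standard causal-LM objective (labels = input ids).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```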
### Reward Model
For RM we used the code from the [reward-modeling](https://github.com/Dahoas/reward-modeling) repo (a sketch of the pairwise loss follows the list) and datasets from:
- [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)
- [Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf)
- [liswei/rm-static-m2m100-zh](https://huggingface.co/datasets/liswei/rm-static-m2m100-zh)
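At its core, reward-model training uses a pairwise ranking loss over chosen/rejected response pairs. The snippet below is a generic sketch of that idea, not the exact code of the reward-modeling repo:

```python
# Generic pairwise reward-model loss: the preferred ("chosen") response
# should receive a higher scalar score than the "rejected" one.
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(base)
# num_labels=1 turns the classification head into a scalar reward head.
reward_model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=1)

def pairwise_loss(chosen_texts, rejected_texts):
    chosen = tokenizer(chosen_texts, return_tensors="pt", padding=True, truncation=True)
    rejected = tokenizer(rejected_texts, return_tensors="pt", padding=True, truncation=True)
    r_chosen = reward_model(**chosen).logits.squeeze(-1)
    r_rejected = reward_model(**rejected).logits.squeeze(-1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```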
### Reinforcement Learning
For RL we used the [trlx](https://github.com/CarperAI/trlx) codebase and prompts from
- [fnlp/moss-002-sft-data](https://huggingface.co/datasets/fnlp/moss-002-sft-data/tree/main)
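An assumed minimal wiring of the PPO stage through `trlx.train`; the checkpoint path, reward heuristic, and prompts below are placeholders rather than the actual configuration:

```python
# Assumed PPO wiring with trlx; path and reward heuristic are placeholders.
import trlx

# In the real pipeline the prompts are drawn from fnlp/moss-002-sft-data.
prompts = ["请解释什么是强化学习。", "Explain RLHF in one paragraph."]

def reward_fn(samples, **kwargs):
    # Placeholder scoring; the actual setup scores samples with the trained reward model.
    return [float(len(s)) / 100.0 for s in samples]

trlx.train(
    "path/to/sft-checkpoint",  # hypothetical path to the SFT model from the previous step
    reward_fn=reward_fn,
    prompts=prompts,
)
```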