---
license: bigscience-bloom-rail-1.0
datasets:
- OpenAssistant/oasst1
- RyokoAI/ShareGPT52K
- Dahoas/full-hh-rlhf
- liswei/rm-static-m2m100-zh
- fnlp/moss-002-sft-data
language:
- zh
- en
---

This model is an attempt to replicate the RLHF (Reinforcement Learning from Human Feedback) pipeline: supervised fine-tuning, reward modeling, and reinforcement learning. Minimal, hedged code sketches of each stage are appended at the end of this card.

### Base Model

We used [bloomz-7b1-mt](https://huggingface.co/bigscience/bloomz-7b1-mt) because of its less restrictive license and multilingual ability.

### Supervised Fine-Tuning

For SFT we used a combination of multiple datasets, including:

- [RyokoAI/ShareGPT52K](https://huggingface.co/datasets/RyokoAI/ShareGPT52K)
- [GPTeacher](https://github.com/teknium1/GPTeacher)
- [Alpaca-GPT4](https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM) (English and Chinese)
- A filtered subset of the ShareGPT dataset machine-translated into Chinese

### Reward Model

For the RM we used the code of the [reward-modeling](https://github.com/Dahoas/reward-modeling) repo and datasets from:

- [oasst1](https://huggingface.co/datasets/OpenAssistant/oasst1)
- [Dahoas/full-hh-rlhf](https://huggingface.co/datasets/Dahoas/full-hh-rlhf)
- [liswei/rm-static-m2m100-zh](https://huggingface.co/datasets/liswei/rm-static-m2m100-zh)

### Reinforcement Learning

For RL we used the code of [trlx](https://github.com/CarperAI/trlx) and prompts from:

- [fnlp/moss-002-sft-data](https://huggingface.co/datasets/fnlp/moss-002-sft-data/tree/main)
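
### Illustrative Code Sketches

The sketches below are hypothetical illustrations of the three training stages described above; model paths, data formats, and hyperparameters are placeholders rather than the exact settings used for this model.

**Supervised fine-tuning.** A minimal causal-LM fine-tuning loop over instruction/response text with the `transformers` `Trainer`, assuming each conversation has already been flattened into a single text field:

```python
# Hypothetical SFT sketch: fine-tune bloomz-7b1-mt on instruction/response text.
# The data file, text format, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assume each example has been flattened into a single "text" field,
# e.g. "Human: ...\nAssistant: ...".
dataset = load_dataset("json", data_files="sft_data.json", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sft-bloomz-7b1-mt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=2,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    # Causal-LM collator: labels are the input ids, no masked-LM objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```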
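
**Reward modeling.** The standard pairwise ranking objective used in RLHF reward modeling, `-log(sigmoid(r_chosen - r_rejected))`; this is a generic sketch, not the exact implementation in the reward-modeling repo:

```python
# Hypothetical reward-model sketch using the standard pairwise ranking loss.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bigscience/bloomz-7b1-mt"  # in practice, initialized from the SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
# A single scalar head on top of the LM serves as the reward score.
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

def ranking_loss(chosen_texts, rejected_texts):
    """Score chosen/rejected responses and push the chosen scores above the rejected ones."""
    chosen = tokenizer(chosen_texts, padding=True, truncation=True, return_tensors="pt")
    rejected = tokenizer(rejected_texts, padding=True, truncation=True, return_tensors="pt")
    r_chosen = reward_model(**chosen).logits.squeeze(-1)     # one scalar reward per sample
    r_rejected = reward_model(**rejected).logits.squeeze(-1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy example pair; real pairs come from oasst1, full-hh-rlhf, and rm-static-m2m100-zh.
loss = ranking_loss(["Human: Hi\nAssistant: Hello! How can I help?"],
                    ["Human: Hi\nAssistant: Go away."])
loss.backward()
```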
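
**Reinforcement learning.** A minimal trlx PPO call that scores sampled responses with the trained reward model. It assumes a trlx version that exposes `default_ppo_config` (older releases load a YAML config instead); the checkpoint paths and prompts are placeholders, and real prompts would be drawn from fnlp/moss-002-sft-data:

```python
# Hypothetical trlx PPO sketch; paths and prompts are placeholders.
import torch
import trlx
from trlx.data.default_configs import default_ppo_config
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path to the reward model trained in the previous step.
rm_path = "reward-model-checkpoint"
rm_tokenizer = AutoTokenizer.from_pretrained(rm_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_path, num_labels=1).eval()

def reward_fn(samples, **kwargs):
    """Score full prompt+response strings with the reward model."""
    with torch.no_grad():
        batch = rm_tokenizer(samples, padding=True, truncation=True, return_tensors="pt")
        return reward_model(**batch).logits.squeeze(-1).tolist()

config = default_ppo_config()
config.model.model_path = "sft-bloomz-7b1-mt"          # SFT checkpoint as the policy init
config.tokenizer.tokenizer_path = "sft-bloomz-7b1-mt"

# Placeholder prompt list; in practice, prompts are extracted from fnlp/moss-002-sft-data.
prompts = ["Explain what RLHF is.", "用中文介绍一下 BLOOM 模型。"]

trainer = trlx.train(
    reward_fn=reward_fn,
    prompts=prompts,
    eval_prompts=prompts[:2],
    config=config,
)
```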