license: cc-by-sa-4.0
This model is the RLHF version of HuggingFaceH4/mistral-7b-sft-beta, trained without any external responses.
We apply the GSHF algorithm to the SFT baseline. The only external signals are (1) a reward model and (2) AI-generated prompts.
The model obtains a 35.95% win rate (34.79% LC win rate) on Alpaca Eval v2, whereas the base model's win rate is only 4.63%.
On MT-bench, it scores about 7.5, whereas the base model scores only 5.3.
We have demonstrated the significant potential of the iterative RLHF algorithm for LLMs to deliver appropriate and well-structured responses, even without any external responses.
Model Details
We perform 3 iterations of the GSHF algorithm on HuggingFaceH4/mistral-7b-sft-beta, with responses labeled by a reward model and prompts generated by ChatGPT using self-instruct-style prompt augmentation.
We use 60K AI-generated prompts in the training process.
Example prompts are shown below:
{"prompt": "Why is gold considered a good reserve asset for central banks?"}
{"prompt": "What are the top 5 yoga poses for stress relief?"}
{"prompt": "Craft a blog title about the health implications of eating avocados daily based on their caloric value."}
{"prompt": "Design a simple HTML chat interface that simulates a conversation between a user and a bot, displaying two messages from each."}
{"prompt": "List 10 names from different cultures that embody the meanings of peace, harmony, or compassion."}
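The data-collection step of one iteration can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `load_prompts`, `build_preference_pairs`, and the toy `generate` and `reward` callables are hypothetical names, and pairing the highest-scored response with the lowest-scored one is an assumption, since the card does not state the exact pairing rule. The policy update that would follow each iteration is omitted.

```python
import json
import random
from typing import Callable, Dict, List


def load_prompts(jsonl_text: str) -> List[str]:
    """Parse one JSON object per line and return the "prompt" fields."""
    return [
        json.loads(line)["prompt"]
        for line in jsonl_text.splitlines()
        if line.strip()
    ]


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], str],       # stand-in for sampling from the current policy
    reward: Callable[[str, str], float],  # stand-in for the reward model
    n_samples: int = 4,
) -> List[Dict[str, str]]:
    """Sample n responses per prompt and keep (best, worst) as a preference pair.

    Assumption: best-vs-worst pairing is used here for illustration only.
    """
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        scored = [(reward(prompt, c), c) for c in candidates]
        best = max(scored, key=lambda s: s[0])
        worst = min(scored, key=lambda s: s[0])
        # Skip prompts where every sample receives the same score.
        if best[0] > worst[0]:
            pairs.append({"prompt": prompt, "chosen": best[1], "rejected": worst[1]})
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    words = ["gold", "reserve", "asset", "bank", "stable"]

    def toy_generate(prompt: str) -> str:
        # Toy policy: a random bag of words of random length.
        return " ".join(rng.choice(words) for _ in range(rng.randint(1, 8)))

    def toy_reward(prompt: str, response: str) -> float:
        # Toy reward: longer responses score higher.
        return float(len(response.split()))

    data = '{"prompt": "Why is gold considered a good reserve asset for central banks?"}'
    for pair in build_preference_pairs(load_prompts(data), toy_generate, toy_reward):
        print(pair)
```

In a real run, `generate` would sample from the current model and `reward` would call the reward model; only the prompts come from outside, which matches the "no external responses" setting described above.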
Uses
The usage and chat template format follow the SFT model HuggingFaceH4/mistral-7b-sft-beta.
# Install transformers from source - only needed for versions <= v4.34
# pip install git+https://github.com/huggingface/transformers.git
# pip install accelerate
import torch
from transformers import pipeline
pipe = pipeline("text-generation", model="sfairXC/FsfairX-Zephyr-Chat-v0.1", torch_dtype=torch.bfloat16, device_map="auto")
# We use the tokenizer's chat template to format each message - see https://huggingface.co/docs/transformers/main/en/chat_templating
messages = [
    {"role": "system", "content": "You are a friendly chatbot who always responds in the style of a pirate"},
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
]
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
# <|system|>
# You are a friendly chatbot who always responds in the style of a pirate.</s>
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# Ah, me hearty matey! But yer question be a puzzler! A human cannot eat a helicopter in one sitting, as helicopters are not edible. They be made of metal, plastic, and other materials, not food!
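To make the prompt layout in the printed output concrete, here is a small pure-Python sketch that renders the same structure and strips the prompt from the generated text. The helpers `render_chat` and `extract_reply` are hypothetical, and the exact whitespace is an assumption inferred from the output shown above; the template shipped with the tokenizer remains the authoritative version.

```python
from typing import Dict, List

EOS = "</s>"  # end-of-sequence marker as it appears in the printed output


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Approximate the chat layout: <|role|>, newline, content, EOS, newline."""
    parts = [f"<|{m['role']}|>\n{m['content']}{EOS}\n" for m in messages]
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append("<|assistant|>\n")
    return "".join(parts)


def extract_reply(generated_text: str, prompt: str) -> str:
    """Return only the new text when the output begins with the prompt."""
    if generated_text.startswith(prompt):
        generated_text = generated_text[len(prompt):]
    return generated_text.strip()


if __name__ == "__main__":
    messages = [
        {"role": "system", "content": "You are a friendly chatbot."},
        {"role": "user", "content": "Hello!"},
    ]
    prompt = render_chat(messages)
    print(prompt)
    print(extract_reply(prompt + "Ahoy there, matey!", prompt))
```

With the pipeline example above, `extract_reply(outputs[0]["generated_text"], prompt)` would return just the assistant's answer, since the pipeline output there includes the prompt.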
Evaluation
The evaluation results on Alpaca Eval v2 are provided below:
Model | Win Rate (%) | LC Win Rate (%) | Avg Length |
---|---|---|---|
Base | 4.63 | 8.01 | 916 |
Iteration 1 | 13.26 | 20.81 | 1205 |
Iteration 2 | 23.57 | 27.63 | 1623 |
Iteration 3 | 35.95 | 34.79 | 2275 |
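As a quick arithmetic check on the table, the snippet below computes each iteration's gain over the base model and the growth in average response length. The `summarize` helper is hypothetical and the numbers are copied from the table; it is only a reading aid, not part of the evaluation pipeline.

```python
from typing import Dict, List, Tuple

# (model, win rate %, LC win rate %, average length), copied from the table
RESULTS: List[Tuple[str, float, float, int]] = [
    ("Base", 4.63, 8.01, 916),
    ("Iteration 1", 13.26, 20.81, 1205),
    ("Iteration 2", 23.57, 27.63, 1623),
    ("Iteration 3", 35.95, 34.79, 2275),
]


def summarize(rows: List[Tuple[str, float, float, int]]) -> List[Dict[str, object]]:
    """Compare every later row against the first (base) row."""
    _, base_wr, base_lc, base_len = rows[0]
    summary = []
    for name, wr, lc, length in rows[1:]:
        summary.append({
            "model": name,
            "win_rate_gain": round(wr - base_wr, 2),       # percentage points
            "lc_win_rate_gain": round(lc - base_lc, 2),    # percentage points
            "length_ratio": round(length / base_len, 2),   # relative to base
        })
    return summary


if __name__ == "__main__":
    for row in summarize(RESULTS):
        print(row)
```

For Iteration 3 this gives a gain of 31.32 points in win rate and 26.78 points in LC win rate, with responses about 2.48 times as long as the base model's. Comparing the raw and length-controlled columns side by side is useful here because average length rises at every iteration.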
Citation
If you find this work helpful, please cite the following papers.
@article{dong2023raft,
title={Raft: Reward ranked finetuning for generative foundation model alignment},
author={Dong, Hanze and Xiong, Wei and Goyal, Deepanshu and Pan, Rui and Diao, Shizhe and Zhang, Jipeng and Shum, Kashun and Zhang, Tong},
journal={arXiv preprint arXiv:2304.06767},
year={2023}
}
@misc{xiong2024iterative,
title={Iterative Preference Learning from Human Feedback: Bridging Theory and Practice for RLHF under KL-Constraint},
author={Wei Xiong and Hanze Dong and Chenlu Ye and Ziqi Wang and Han Zhong and Heng Ji and Nan Jiang and Tong Zhang},
year={2024},
eprint={2312.11456},
archivePrefix={arXiv},
primaryClass={cs.LG}
}