This model was trained with Iterative DPO using OpenRLHF.

## Datasets and Hyperparameters

- Reward Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-rm-700k
- SFT Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-sft-mixture
- Prompt Dataset: https://huggingface.co/datasets/OpenLLMAI/prompt-collection-v0.1

```
Max Prompt Length: 2048
Max Response Length: 2048
best_of_n: 2 (2 samples per prompt)
Learning Rate: 5e-7
Beta: 0.1
Scheduler: Cosine with Warmup (0.03) and MinLR (0.1 * init_lr)
Rollout Batch Size: 20000
Training Batch Size: 256
Number of Iterations: 9
```

## Evaluation

Turn-level and average scores for the DPO model against the SFT baseline:

| Model               | First turn | Second turn | Average  |
|---------------------|------------|-------------|----------|
| Llama3-iter-dpo     | 8.55       | 7.95625     | 8.253125 |
| Llama3-sft-baseline |            |             | 7.69     |
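
## Notes on the Training Setup

For orientation, the iterative loop implied by the settings above (best_of_n = 2 samples per prompt, ranked by the reward model into chosen/rejected pairs, repeated over 9 iterations) looks roughly like the following sketch. The stub functions are illustrative placeholders, not OpenRLHF APIs.

```python
import random

# Sketch of an iterative DPO round: sample best_of_n responses per prompt,
# rank them with the reward model, then run DPO on the resulting pairs.
# All three stubs below are placeholders, not real OpenRLHF functions.

def generate(policy, prompt, max_len=2048):
    return f"response to {prompt} #{random.random():.3f}"  # stub sampler

def score(reward_model, prompt, response):
    return random.random()  # stub reward score

def train_dpo(policy, pairs, beta=0.1, lr=5e-7, batch_size=256):
    return policy  # stub: one DPO round over the collected preference pairs

policy, reward_model = object(), object()  # stand-ins for the real models
BEST_OF_N, NUM_ITERATIONS = 2, 9

for iteration in range(NUM_ITERATIONS):
    prompts = [f"prompt-{i}" for i in range(4)]  # real run: 20000 rollout prompts
    pairs = []
    for prompt in prompts:
        responses = [generate(policy, prompt) for _ in range(BEST_OF_N)]
        scores = [score(reward_model, prompt, r) for r in responses]
        # With n = 2, the higher-scored sample is "chosen", the other "rejected".
        ranked = sorted(zip(scores, responses), reverse=True)
        pairs.append((prompt, ranked[0][1], ranked[-1][1]))
    # Each round trains on freshly collected pairs; the reference model is
    # typically the previous round's policy.
    policy = train_dpo(policy, pairs)
```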
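
`Beta: 0.1` is the coefficient of the implicit KL penalty in the standard DPO objective. A minimal PyTorch rendering of that loss (textbook DPO, not necessarily OpenRLHF's exact implementation):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * log-ratio margin).

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(margin).mean()

# Toy usage with dummy log-probabilities for one preference pair:
loss = dpo_loss(torch.tensor([-45.0]), torch.tensor([-52.0]),
                torch.tensor([-46.0]), torch.tensor([-50.0]))
print(loss)  # tensor(0.5544); lower when the chosen response is favored more
```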
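
The scheduler line describes a cosine decay with linear warmup over the first 3% of steps and a floor at 10% of the initial learning rate. A self-contained sketch under those assumed semantics (the exact OpenRLHF schedule may differ in detail):

```python
import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def cosine_with_warmup_min_lr(optimizer, total_steps,
                              warmup_ratio=0.03, min_lr_ratio=0.1):
    """Linear warmup, then cosine decay down to min_lr_ratio * init_lr."""
    warmup_steps = int(total_steps * warmup_ratio)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)  # linear warmup from 0
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return min_lr_ratio + (1.0 - min_lr_ratio) * cosine  # floored decay

    return LambdaLR(optimizer, lr_lambda)

# Toy usage: LR starts at 5e-7 (as above) and never falls below 5e-8.
opt = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=5e-7)
sched = cosine_with_warmup_min_lr(opt, total_steps=1000)
for _ in range(1000):
    opt.step()
    sched.step()
print(opt.param_groups[0]["lr"])  # ~5e-08 at the end of training
```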