This model was trained with Iterative DPO in OpenRLHF.

Datasets and Hyperparameters
```
Reward Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-rm-700k
SFT Model: https://huggingface.co/OpenLLMAI/Llama-3-8b-sft-mixture
Prompt Dataset: https://huggingface.co/datasets/OpenLLMAI/prompt-collection-v0.1
best_of_n: 2 (2 samples per prompt)
Learning Rate: 5e-7
Beta: 0.1
Scheduler: Cosine with Warmup and MinLR
Rollout Batch Size: 20000
Training Batch Size: 256
Number of Iterations: 9
```
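For context on the `Beta: 0.1` setting, here is a minimal sketch of the standard DPO objective for one preference pair, where beta scales the implicit reward margin against the reference (SFT) model. The log-probability values in the example are made-up placeholders, not numbers from this training run.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single (chosen, rejected) pair.

    beta=0.1 matches the Beta hyperparameter listed above.
    """
    # Log-ratios of the policy against the frozen reference model
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # Loss = -log sigmoid(beta * (chosen_margin - rejected_margin))
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# Illustrative sequence log-probs (hypothetical values):
loss = dpo_loss(-10.0, -12.0, -11.0, -11.5, beta=0.1)
```

In the iterative setup described above, each of the 9 iterations samples `best_of_n` responses per prompt, ranks them with the reward model to build (chosen, rejected) pairs, and minimizes this loss before the next round of sampling.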