Training dataset and fine-tuning method


Hi,

Thanks for your open-sourced reward models. I have two extra questions:
1. Is the Gemma-2B-rewardmodel-baseline model trained on the hendrydong/preference_700K dataset or the weqweasdas/preference_dataset_mixture2_and_safe_pku dataset? If I'm not mistaken, these are two different datasets.
2. Is Gemma-2B-rewardmodel-baseline a full-parameter fine-tuned model or a LoRA fine-tuned model? Also, what max_length, lora_r, and learning rate were used for the model?
Question 2 applies to the GRM-Gemma2-2B-sftreg model as well.

Thanks!

Hi, thank you for your interest in our work!

The 2B reward models were trained on the dataset weqweasdas/preference_dataset_mixture2_and_safe_pku (500k size). Interestingly, I observed that scaling the 2B model to 700k data led to a performance drop, whereas 7B/8B models showed improved performance with larger datasets.
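For reference, the dataset can be inspected directly from the Hub. A minimal sketch, assuming the standard "train" split name:

```python
from datasets import load_dataset

# Load the preference dataset mentioned above (split name "train" is assumed).
ds = load_dataset("weqweasdas/preference_dataset_mixture2_and_safe_pku", split="train")
print(len(ds))           # roughly 500k preference pairs, per the reply
print(ds.column_names)   # inspect which fields hold the chosen/rejected responses
```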

Most of my released models are trained with full-parameter fine-tuning, except Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback. For full-parameter training, we truncate inputs exceeding 4096 tokens and train for one epoch with a learning rate of 2 × 10⁻⁶ and a batch size of 512 (using gradient accumulation). The max length can be tuned; I've found that 3000-3500 tokens sometimes yields better results.
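For concreteness, here is a minimal sketch of what a reward-model run with those hyperparameters could look like using TRL's RewardTrainer. The base-model id, the per-device/accumulation split of the 512 batch size, and the exact TRL API (which varies between versions) are assumptions, not the authors' actual setup; the real training code is in the repository linked below.

```python
# Sketch of reward-model fine-tuning with the hyperparameters mentioned above.
# Base model id, batch-size split, and dataset column handling are assumptions.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base_model = "google/gemma-2b"  # assumed 2B base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)

train_dataset = load_dataset(
    "weqweasdas/preference_dataset_mixture2_and_safe_pku", split="train"
)

config = RewardConfig(
    output_dir="gemma-2b-reward-model",
    num_train_epochs=1,               # one epoch, as stated above
    learning_rate=2e-6,               # 2 × 10⁻⁶
    per_device_train_batch_size=4,    # illustrative single-GPU assumption
    gradient_accumulation_steps=128,  # 4 × 128 = 512 effective batch size
    max_length=4096,                  # inputs longer than this are truncated
    bf16=True,
)

trainer = RewardTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,         # expects chosen/rejected preference pairs
    processing_class=tokenizer,          # `tokenizer=` in older TRL versions
)
trainer.train()
```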

Also, you can find our code here: https://github.com/YangRui2015/Generalizable-Reward-Model.
