Training dataset and fine-tuning method


Hi,

Thanks for your open-sourced reward models. I have two extra questions:
1. Is the Gemma-2B-rewardmodel-baseline model trained on the hendrydong/preference_700K dataset or the weqweasdas/preference_dataset_mixture2_and_safe_pku dataset? If I'm not mistaken, these are two different datasets.
2. Is Gemma-2B-rewardmodel-baseline a full-parameter fine-tuned model or a LoRA fine-tuned model? Also, what max_length, lora_r, and learning rate were used for the model?
Question 2 applies to the GRM-Gemma2-2B-sftreg model as well.

Thanks!

Hi, thank you for your interest in our work!

The 2B reward models were trained on the dataset weqweasdas/preference_dataset_mixture2_and_safe_pku (500k size). Interestingly, I observed that scaling the 2B model to 700k data led to a performance drop, whereas 7B/8B models showed improved performance with larger datasets.
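For reference, the dataset can be inspected directly from the Hub. A minimal sketch, assuming the standard "train" split name:

```python
from datasets import load_dataset

# Load the preference dataset mentioned above (split name "train" is assumed).
ds = load_dataset("weqweasdas/preference_dataset_mixture2_and_safe_pku", split="train")
print(len(ds))           # roughly 500k preference pairs, per the reply
print(ds.column_names)   # inspect which fields hold the chosen/rejected responses
```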

Most of my released models are trained with full-parameter fine-tuning, except Ray2333/reward-model-Mistral-7B-instruct-Unified-Feedback. For full-parameter training, we truncate inputs exceeding 4096 tokens and train for one epoch with a learning rate of 2 × 10⁻⁶ and a batch size of 512 (using gradient accumulation). The max length can be tuned; I've found that 3000-3500 tokens sometimes yields better results.
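For concreteness, here is a minimal sketch of what a reward-model run with those hyperparameters could look like using TRL's RewardTrainer. The base-model id, the per-device/accumulation split of the 512 batch size, and the exact TRL API (which varies between versions) are assumptions, not the authors' actual setup; the real training code is in the repository linked below.

```python
# Sketch of reward-model fine-tuning with the hyperparameters mentioned above.
# Base model id, batch-size split, and dataset column handling are assumptions.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base_model = "google/gemma-2b"  # assumed 2B base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)

train_dataset = load_dataset(
    "weqweasdas/preference_dataset_mixture2_and_safe_pku", split="train"
)

config = RewardConfig(
    output_dir="gemma-2b-reward-model",
    num_train_epochs=1,               # one epoch, as stated above
    learning_rate=2e-6,               # 2 × 10⁻⁶
    per_device_train_batch_size=4,    # illustrative single-GPU assumption
    gradient_accumulation_steps=128,  # 4 × 128 = 512 effective batch size
    max_length=4096,                  # inputs longer than this are truncated
    bf16=True,
)

trainer = RewardTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,         # expects chosen/rejected preference pairs
    processing_class=tokenizer,          # `tokenizer=` in older TRL versions
)
trainer.train()
```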

Also, you can find our code here: https://github.com/YangRui2015/Generalizable-Reward-Model.
