Model reproducibility
Would a "common man" developer be able to (roughly) reproduce a reward model like this simply by training on the Skywork dataset with a library like Hugging Face's trl, which provides a RewardTrainer?
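For context, this is roughly the training script I have in mind. It's only a sketch: the dataset ID, base model, and hyperparameters are placeholders I picked for illustration, not a tested recipe, and depending on the trl version the chosen/rejected pairs may need extra preprocessing first.

```python
# Rough sketch only -- dataset ID, base model, and hyperparameters are
# placeholders, not a tested recipe.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardConfig, RewardTrainer

base_model = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model

tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# A reward model is just the base model with a single-logit scoring head.
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

# Assuming a Skywork preference set with "chosen"/"rejected" columns;
# the exact dataset ID on the Hub may differ.
dataset = load_dataset("Skywork/Skywork-Reward-Preference-80K-v0.2", split="train")

training_args = RewardConfig(
    output_dir="skywork-reward-model",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    max_length=2048,  # truncate very long chosen/rejected pairs
)

trainer = RewardTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # `tokenizer=` in older trl releases
)
trainer.train()
trainer.save_model(training_args.output_dir)
```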
I'm mostly looking for a budget way to train a model that is excellent at following instructions but also has plenty of domain-specific structure, style, and knowledge. I plan to mix my domain-specific training examples into a much larger instruction dataset, because when I fine-tuned on my specific examples alone in the past, the model lost its more general instruction-following ability. After training the reward model, I believe I would then fine-tune an existing model with PPO (also possible with the trl library); a rough sketch of the loop I'm imagining is below. My domain-specific training examples are often rather complex, so I think I need a reward model.
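This is how I picture the PPO stage, following the older step-based trl API. I understand the PPOTrainer interface has changed quite a bit between releases, so treat the details as an assumption; the policy model, the reward model path, the prompt file, the "query" column, and the generation/hyperparameter settings are all placeholders.

```python
# Sketch of the PPO loop I'm imagining, using the older step-based trl API;
# newer trl releases reorganize PPOTrainer, so this is an assumption, not a
# recipe. Model and dataset names are placeholders.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

policy_name = "Qwen/Qwen2.5-0.5B-Instruct"   # placeholder policy to fine-tune
rm_path = "skywork-reward-model"             # reward model trained above

tokenizer = AutoTokenizer.from_pretrained(policy_name)
model = AutoModelForCausalLMWithValueHead.from_pretrained(policy_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(policy_name)

rm_tokenizer = AutoTokenizer.from_pretrained(rm_path)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_path, num_labels=1)

# Placeholder prompt set: my domain-specific prompts mixed into a larger
# general instruction set, with a "query" text column.
dataset = load_dataset("json", data_files="mixed_prompts.json", split="train")
dataset = dataset.map(lambda x: {"input_ids": tokenizer.encode(x["query"])})

config = PPOConfig(batch_size=8, mini_batch_size=2, learning_rate=1e-5)
collator = lambda data: {key: [d[key] for d in data] for key in data[0]}
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer,
                         dataset=dataset, data_collator=collator)

generation_kwargs = {"do_sample": True, "top_p": 0.9, "max_new_tokens": 256,
                     "pad_token_id": tokenizer.eos_token_id}

for batch in ppo_trainer.dataloader:
    query_tensors = [torch.tensor(q) for q in batch["input_ids"]]
    # Generate responses (response tokens only, without the prompt).
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False,
                                            **generation_kwargs)
    responses = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)

    # Score prompt+response pairs with the reward model's scalar logit.
    texts = [q + r for q, r in zip(batch["query"], responses)]
    rm_inputs = rm_tokenizer(texts, padding=True, truncation=True,
                             max_length=2048, return_tensors="pt")
    with torch.no_grad():
        scores = reward_model(**rm_inputs).logits[:, 0]
    rewards = list(scores)  # step expects a list of scalar tensors

    ppo_trainer.step(query_tensors, response_tensors, rewards)
```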
Thanks!