Experimenting with different training objectives for an AI evaluator
Lots of research has been published on LLM-as-a-judge, as it's becoming a popular approach to evaluate models cheaply and quickly. A pretty cool paper recently came out from the Salesforce AI Research team; tl;dr: they found that preference optimisation techniques like DPO and RPO can yield better results than supervised fine-tuning (SFT) alone as a training objective for LLM-as-a-judge models. Our team wanted to test this hypothesis, as it's not yet clear which training objective performs best for aligning eval models.
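For anyone newer to the setup, here's a minimal sketch of what pairwise LLM-as-a-judge looks like in practice. The prompt template and the `query_judge` callable are placeholders for illustration, not the exact setup from our experiments.

```python
# Minimal pairwise LLM-as-a-judge sketch: the judge model reads a question plus two
# candidate responses and picks the better one. `query_judge` is a stand-in for
# whatever client you use to call the judge model.

JUDGE_TEMPLATE = """You are an impartial evaluator.

Question:
{question}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Answer with exactly "A" or "B"."""


def judge_pairwise(question: str, response_a: str, response_b: str, query_judge) -> str:
    """Return "A" or "B" according to the judge model behind `query_judge`."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, response_a=response_a, response_b=response_b
    )
    return "A" if query_judge(prompt).strip().upper().startswith("A") else "B"
```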
Our experiments
We trained Llama-3.1-70B-Instruct with SFT and compared it to the base Llama-3.1-70B-Instruct on core benchmarks to see how SFT fares on its own.
We also trained a Llama-3.1-8B-Instruct model on two training datasets with:
- SFT alone
- DPO
- RPO (a compound loss objective that combines SFT and DPO; see the sketch after this list)
and compared their performance against the base model across four core benchmarks covering both Pairwise Preference and Direct Scoring.
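For context on what those three objectives actually optimise, here's a rough PyTorch sketch of the loss functions, assuming the standard DPO formulation and an RPO-style compound loss that adds an SFT (negative log-likelihood) term on the chosen response to the DPO term. `beta` and `alpha` are illustrative hyperparameters, not the values we trained with.

```python
import torch
import torch.nn.functional as F

def sft_loss(policy_chosen_logps: torch.Tensor) -> torch.Tensor:
    # Plain SFT: maximise the log-likelihood of the chosen (good) judgement.
    return -policy_chosen_logps.mean()

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1) -> torch.Tensor:
    # DPO: push the policy's chosen-vs-rejected margin above the reference model's margin.
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

def rpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps,
             beta: float = 0.1, alpha: float = 1.0) -> torch.Tensor:
    # RPO-style compound objective: DPO preference term plus an SFT term on the chosen response.
    return dpo_loss(policy_chosen_logps, policy_rejected_logps,
                    ref_chosen_logps, ref_rejected_logps, beta) \
        + alpha * sft_loss(policy_chosen_logps)
```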
Here's a summary of our key findings:
- SFT (Atla Caprioska 70B) improved on in-distribution tasks but dropped in quality on out-of-distribution tasks, underperforming the base Llama-70B on aggregate metrics
- DPO performed best on PreferenceCollection with 98.89% accuracy
- RPO performed best on RewardBench with 81.96% accuracy
- RPO outperformed both SFT and DPO on UltraFeedback (No CoT), with a score of 0.57
- RPO achieved the highest average Pearson correlation on evaluation scores (0.49), compared to SFT (0.43) and DPO (0.43)
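To make that last metric concrete, here's an illustrative sketch of how an average Pearson correlation across direct-scoring benchmarks can be computed; the benchmark names and scores below are placeholders, not our data.

```python
import numpy as np

def pearson(model_scores, human_scores) -> float:
    # Pearson correlation between the judge's scores and the reference (e.g. human) scores.
    return float(np.corrcoef(model_scores, human_scores)[0, 1])

# Placeholder benchmarks and scores, purely for illustration.
per_benchmark = {
    "benchmark_a": pearson([4, 2, 5, 1], [5, 2, 4, 1]),
    "benchmark_b": pearson([3, 3, 1, 5], [2, 4, 1, 5]),
}
average_pearson = sum(per_benchmark.values()) / len(per_benchmark)
```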
If you want the experiment details, here's our blog post, with extra information on why we think this works. We're working on scaling this up to see how far we can push it :)
Open questions for you all
- Will this trend hold for larger models?
- What kind of data might be particularly useful for training an LLM-as-a-judge?