Label Distribution Learning for Reward Model
Authors: Shikai Chen, Jin Yuan, Yang Zhang, Zhongchao Shi, Jianping Fan, Xin Geng, Yong Rui
Tech report coming soon...
Method Overview: This reward model applies Label Distribution Learning (LDL), representing human ratings as probability distributions rather than single values to account for uncertainty and subjectivity. For example, a sample rated 3.5 also describes, to some extent, a sample rated 3.4 or 3.6. To capture this, each real-valued score is mapped to a discrete Gaussian distribution. The model is trained in two stages: first, a regression layer learns to predict the label distribution; second, a gating layer, trained on paired preference data, outputs a set of weights that combines the predicted distributions into a final aggregated distribution. This aggregated distribution is then mapped back to a real-valued score, which is used to compute the Bradley-Terry loss.
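As an illustration of the idea, below is a minimal sketch of the score-to-distribution mapping and the Bradley-Terry objective described above. The bin range, bin width, sigma, and expected-value decoding are assumptions for this example only (not the released training configuration), and the gating layer is omitted.

import torch

# Discretize the score range into bins (here 1.0 to 5.0 in steps of 0.1; illustrative values only)
bins = torch.arange(1.0, 5.05, 0.1)

def score_to_distribution(score: float, sigma: float = 0.3) -> torch.Tensor:
    # Turn a real-valued rating into a discrete Gaussian label distribution over the bins
    logits = -((bins - score) ** 2) / (2 * sigma ** 2)
    return torch.softmax(logits, dim=0)

def expected_score(dist: torch.Tensor) -> torch.Tensor:
    # Map a (predicted or aggregated) distribution back to a scalar score via its expectation
    return (dist * bins).sum()

# A 3.5-rated sample places most of its mass near 3.5 but also covers 3.4 and 3.6
p_chosen = score_to_distribution(3.5)
p_rejected = score_to_distribution(2.0)

# Pairwise Bradley-Terry loss on the recovered scalar scores
s_chosen, s_rejected = expected_score(p_chosen), expected_score(p_rejected)
bt_loss = -torch.nn.functional.logsigmoid(s_chosen - s_rejected)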
Demo Code
import torch
from lenovo import LDLRewardModel27B
from transformers import AutoTokenizer
path = 'lenovo/LDL-Reward-Gemma-2-27B-v0.1'
# Load the tokenizer and the reward model
tokenizer = AutoTokenizer.from_pretrained(path)
model = LDLRewardModel27B.from_pretrained(path, device_map='auto')
prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
response1 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
response2 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."
conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
conv2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]
# Apply the chat template and move the inputs to the model's device
conv1_tokenized = tokenizer.apply_chat_template(conv1, tokenize=True, return_tensors="pt").to(model.device)
conv2_tokenized = tokenizer.apply_chat_template(conv2, tokenize=True, return_tensors="pt").to(model.device)
# Get the reward scores
with torch.no_grad():
    score1 = model(conv1_tokenized).logits[0].item()
    score2 = model(conv2_tokenized).logits[0].item()
print(f"Score for response 1: {score1}")
print(f"Score for response 2: {score2}")
# Score for response 1: 7.082010269165039
# Score for response 2: -3.455564498901367
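The higher score indicates the preferred response: the reward model strongly favors response 1, whose arithmetic is correct, over response 2, which splits the apples among the wrong number of people.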
RewardBench Leaderboard
Rank | Model | Model Type | Score | Chat | Chat Hard | Safety | Reasoning |
---|---|---|---|---|---|---|---|
1 | infly/INF-ORM-Llama3.1-70B | Seq. Classifier | 95.1 | 96.6 | 91.0 | 93.6 | 99.1 |
2 | lenovo/LDL-Reward-Gemma-2-27B-v0.1 | Seq. Classifier | 95.0 | 96.4 | 90.8 | 93.8 | 99.0 |
3 | nicolinho/ORM-Gemma-2-27B | Seq. Classifier | 94.4 | 96.6 | 90.1 | 92.7 | 98.3 |
4 | Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | Seq. Classifier | 94.3 | 96.1 | 89.9 | 93.0 | 98.1 |
5 | nvidia/Llama-3.1-Nemotron-70B-Reward | Custom Classifier | 94.1 | 97.5 | 85.7 | 95.1 | 98.1 |
6 | Skywork/Skywork-Reward-Gemma-2-27B | Seq. Classifier | 93.8 | 95.8 | 91.4 | 91.9 | 96.1 |
7 | SF-Foundation/TextEval-llama3.1-70B | Generative | 93.5 | 94.1 | 90.1 | 93.2 | 96.4 |
8 | meta-metrics/MetaMetrics-RM-v1.0 | Custom Classifier | 93.4 | 98.3 | 86.4 | 90.8 | 98.2 |
9 | Skywork/Skywork-Critic-llama-3.1-70B | Generative | 93.3 | 96.6 | 87.9 | 93.1 | 95.5 |
10 | nicolinho/ORM-Llama3.1-8B-v2 | Seq. Classifier | 93.1 | 96.4 | 86.8 | 92.6 | 96.8 |
11 | Skywork/Skywork-Reward-llama-3.1-8B-v0.2 | Seq. Classifier | 93.1 | 94.7 | 88.4 | 92.7 | 96.7 |
12 | nicolinho/ORM-llama3.1-8B | Seq. Classifier | 93.1 | 94.4 | 89.7 | 92.3 | 95.8 |
Base model: google/gemma-2-27b