Label Distribution Learning for Reward Model
Authors: Shikai Chen, Jin Yuan, Yang Zhang, Zhongchao Shi, Jianping Fan, Xin Geng, Yong Rui
Tech report coming soon...
Method Overview: This reward model applies Label Distribution Learning (LDL), representing human ratings as probability distributions rather than single values to account for uncertainty and subjectivity. For example, a sample rated 3.5 also describes, to some extent, a sample rated 3.4 or 3.6. To capture this, each real-valued score is mapped to a discrete Gaussian distribution. The model is trained in two stages: first, a regression layer learns to predict the label distribution; second, a gating layer, trained on paired preference data, outputs a set of weights that combines the predicted distributions into a final aggregated distribution. This aggregated distribution is then mapped back to a real-valued score, which is used to compute the Bradley-Terry loss.
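As an illustration of the idea, below is a minimal sketch of the score-to-distribution mapping and the Bradley-Terry objective described above. The bin range, bin width, sigma, and expected-value decoding are assumptions for this example only (not the released training configuration), and the gating layer is omitted.

import torch

# Discretize the score range into bins (here 1.0 to 5.0 in steps of 0.1; illustrative values only)
bins = torch.arange(1.0, 5.05, 0.1)

def score_to_distribution(score: float, sigma: float = 0.3) -> torch.Tensor:
    # Turn a real-valued rating into a discrete Gaussian label distribution over the bins
    logits = -((bins - score) ** 2) / (2 * sigma ** 2)
    return torch.softmax(logits, dim=0)

def expected_score(dist: torch.Tensor) -> torch.Tensor:
    # Map a (predicted or aggregated) distribution back to a scalar score via its expectation
    return (dist * bins).sum()

# A 3.5-rated sample places most of its mass near 3.5 but also covers 3.4 and 3.6
p_chosen = score_to_distribution(3.5)
p_rejected = score_to_distribution(2.0)

# Pairwise Bradley-Terry loss on the recovered scalar scores
s_chosen, s_rejected = expected_score(p_chosen), expected_score(p_rejected)
bt_loss = -torch.nn.functional.logsigmoid(s_chosen - s_rejected)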
Demo Code
import torch
from lenovo import LDLRewardModel27B
from transformers import AutoTokenizer
path = 'lenovo/LDL-Reward-Gemma-2-27B-v0.1'
# Load the tokenizer and the reward model
tokenizer = AutoTokenizer.from_pretrained(path)
model = LDLRewardModel27B.from_pretrained(path, device_map='auto')
prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
response1 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
response2 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."
conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
conv2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]
# Apply the chat template and move the inputs to the model's device
conv1_tokenized = tokenizer.apply_chat_template(conv1, tokenize=True, return_tensors="pt").to(model.device)
conv2_tokenized = tokenizer.apply_chat_template(conv2, tokenize=True, return_tensors="pt").to(model.device)
# Get the reward scores
with torch.no_grad():
    score1 = model(conv1_tokenized).logits[0].item()
    score2 = model(conv2_tokenized).logits[0].item()
print(f"Score for response 1: {score1}")
print(f"Score for response 2: {score2}")
# Score for response 1: 7.082010269165039
# Score for response 2: -3.455564498901367
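The higher score indicates the preferred response: the reward model strongly favors response 1, whose arithmetic is correct, over response 2, which splits the apples among the wrong number of people.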
RewardBench Leaderboard
Rank | Model | Model Type | Score | Chat | Chat Hard | Safety | Reasoning |
---|---|---|---|---|---|---|---|
1 | infly/INF-ORM-Llama3.1-70B | Seq. Classifier | 95.1 | 96.6 | 91.0 | 93.6 | 99.1 |
2 | lenovo/LDL-Reward-Gemma-2-27B-v0.1 | Seq. Classifier | 95.0 | 96.4 | 90.8 | 93.8 | 99.0 |
3 | nicolinho/ORM-Gemma-2-27B | Seq. Classifier | 94.4 | 96.6 | 90.1 | 92.7 | 98.3 |
4 | Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | Seq. Classifier | 94.3 | 96.1 | 89.9 | 93.0 | 98.1 |
5 | nvidia/Llama-3.1-Nemotron-70B-Reward | Custom Classifier | 94.1 | 97.5 | 85.7 | 95.1 | 98.1 |
6 | Skywork/Skywork-Reward-Gemma-2-27B | Seq. Classifier | 93.8 | 95.8 | 91.4 | 91.9 | 96.1 |
7 | SF-Foundation/TextEval-llama3.1-70B | Generative | 93.5 | 94.1 | 90.1 | 93.2 | 96.4 |
8 | meta-metrics/MetaMetrics-RM-v1.0 | Custom Classifier | 93.4 | 98.3 | 86.4 | 90.8 | 98.2 |
9 | Skywork/Skywork-Critic-llama-3.1-70B | Generative | 93.3 | 96.6 | 87.9 | 93.1 | 95.5 |
10 | nicolinho/ORM-Llama3.1-8B-v2 | Seq. Classifier | 93.1 | 96.4 | 86.8 | 92.6 | 96.8 |
11 | Skywork/Skywork-Reward-llama-3.1-8B-v0.2 | Seq. Classifier | 93.1 | 94.7 | 88.4 | 92.7 | 96.7 |
12 | nicolinho/ORM-llama3.1-8B | Seq. Classifier | 93.1 | 94.4 | 89.7 | 92.3 | 95.8 |
Base model: google/gemma-2-27b