Label Distribution Learning for Reward Model

  • Authors: Shikai Chen, Jin Yuan, Yang Zhang, Zhongchao Shi, Jianping Fan, Xin Geng, Yong Rui

  • Tech Report: coming soon.

  • Method Overview: This reward model applies Label Distribution Learning (LDL), representing human ratings as probability distributions rather than single values to account for the uncertainty and subjectivity of annotation. For example, a sample rated 3.5 also describes, to some extent, a sample rated 3.4 or 3.6. To capture this, each real-valued score is mapped to a discrete Gaussian distribution. The model is trained in two stages: first, a regression layer learns to predict the label distribution; second, a gating layer, trained on paired preference data, outputs weights that combine the predicted distributions into a single aggregated distribution. This aggregated distribution is then mapped back to a real-valued score, which is used to compute the Bradley-Terry loss (a minimal sketch of this pipeline follows below).

    (figure: method overview)
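To make the mapping concrete, here is a minimal sketch of the score-to-distribution step, the gating-based aggregation, and the final score recovery. The support range, bin count, Gaussian width, and the stand-in predictions/weights are illustrative assumptions, not the values used in training.

import torch

# Illustrative settings; the actual support range, bin count, and sigma are not specified here.
NUM_BINS, SIGMA = 10, 0.5
bins = torch.linspace(1.0, 10.0, NUM_BINS)  # discrete rating support

def score_to_distribution(score: float) -> torch.Tensor:
    """Map a real-valued rating to a discrete Gaussian label distribution over the bins."""
    logits = -((bins - score) ** 2) / (2 * SIGMA ** 2)
    return torch.softmax(logits, dim=-1)

def distribution_to_score(dist: torch.Tensor) -> torch.Tensor:
    """Map a distribution back to a real-valued score via its expectation over the bins."""
    return (dist * bins).sum(dim=-1)

# A 3.5-rated sample puts most of its mass on nearby bins, so it partly describes 3.4/3.6-rated samples.
target = score_to_distribution(3.5)

# Stage 2 (illustrative): a gating layer outputs weights that mix several predicted distributions
# into one aggregated distribution, which is reduced to a scalar reward.
predicted = torch.stack([score_to_distribution(s) for s in (3.0, 4.0, 6.0)])   # stand-in predictions
weights = torch.softmax(torch.tensor([0.2, 0.5, 0.3]), dim=-1)                 # stand-in gating output
aggregated = (weights.unsqueeze(-1) * predicted).sum(dim=0)
reward_chosen = distribution_to_score(aggregated)

# Bradley-Terry loss for a (chosen, rejected) pair of such rewards.
reward_rejected = distribution_to_score(score_to_distribution(2.0))
bt_loss = -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected)
print(reward_chosen.item(), bt_loss.item())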

Demo Code

import torch
from lenovo import LDLRewardModel27B
from transformers import AutoTokenizer

# Load the tokenizer and the reward model; device_map='auto' shards it across available GPUs.
path = 'lenovo/LDL-Reward-Gemma-2-27B-v0.1'
tokenizer = AutoTokenizer.from_pretrained(path)
model = LDLRewardModel27B.from_pretrained(path, device_map='auto')

prompt = "Jane has 12 apples. She gives 4 apples to her friend Mark, then buys 1 more apple, and finally splits all her apples equally among herself and her 2 siblings. How many apples does each person get?"
response1 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among herself and her 2 siblings (3 people in total). 9 ÷ 3 = 3 apples each. Each person gets 3 apples."
response2 = "1. Jane starts with 12 apples and gives 4 to Mark. 12 - 4 = 8. Jane now has 8 apples.\n2. Jane buys 1 more apple. 8 + 1 = 9. Jane now has 9 apples.\n3. Jane splits the 9 apples equally among her 2 siblings (2 people in total). 9 ÷ 2 = 4.5 apples each. Each person gets 4 apples."

conv1 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}]
conv2 = [{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}]

# Tokenize the conversations and move them to the model's device.
conv1_tokenized = tokenizer.apply_chat_template(conv1, tokenize=True, return_tensors="pt").to(model.device)
conv2_tokenized = tokenizer.apply_chat_template(conv2, tokenize=True, return_tensors="pt").to(model.device)

# Get the reward scores
with torch.no_grad():
    score1 = model(conv1_tokenized).logits[0].item()
    score2 = model(conv2_tokenized).logits[0].item()
print(f"Score for response 1: {score1}")
print(f"Score for response 2: {score2}")
#Score for response 1: 7.082010269165039
#Score for response 2: -3.455564498901367
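
Because the model is trained with a Bradley-Terry objective, the two scalar scores are directly comparable for the same prompt. A minimal, purely illustrative sketch of turning them into a preference, reusing the variables above:

import math

# Higher reward means the response is preferred.
preferred = response1 if score1 > score2 else response2

# Under the Bradley-Terry model, P(response 1 preferred over response 2) = sigmoid(score1 - score2).
p_win = 1.0 / (1.0 + math.exp(-(score1 - score2)))
print(f"P(response 1 preferred) = {p_win:.3f}")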

RewardBench Leaderboard

| Rank | Model | Model Type | Score | Chat | Chat Hard | Safety | Reasoning |
|------|-------|------------|-------|------|-----------|--------|-----------|
| 1 | infly/INF-ORM-Llama3.1-70B | Seq. Classifier | 95.1 | 96.6 | 91.0 | 93.6 | 99.1 |
| 2 | lenovo/LDL-Reward-Gemma-2-27B-v0.1 | Seq. Classifier | 95.0 | 96.4 | 90.8 | 93.8 | 99.0 |
| 3 | nicolinho/ORM-Gemma-2-27B | Seq. Classifier | 94.4 | 96.6 | 90.1 | 92.7 | 98.3 |
| 4 | Skywork/Skywork-Reward-Gemma-2-27B-v0.2 | Seq. Classifier | 94.3 | 96.1 | 89.9 | 93.0 | 98.1 |
| 5 | nvidia/Llama-3.1-Nemotron-70B-Reward | Custom Classifier | 94.1 | 97.5 | 85.7 | 95.1 | 98.1 |
| 6 | Skywork/Skywork-Reward-Gemma-2-27B | Seq. Classifier | 93.8 | 95.8 | 91.4 | 91.9 | 96.1 |
| 7 | SF-Foundation/TextEval-llama3.1-70B | Generative | 93.5 | 94.1 | 90.1 | 93.2 | 96.4 |
| 8 | meta-metrics/MetaMetrics-RM-v1.0 | Custom Classifier | 93.4 | 98.3 | 86.4 | 90.8 | 98.2 |
| 9 | Skywork/Skywork-Critic-llama-3.1-70B | Generative | 93.3 | 96.6 | 87.9 | 93.1 | 95.5 |
| 10 | nicolinho/ORM-Llama3.1-8B-v2 | Seq. Classifier | 93.1 | 96.4 | 86.8 | 92.6 | 96.8 |
| 11 | Skywork/Skywork-Reward-llama-3.1-8B-v0.2 | Seq. Classifier | 93.1 | 94.7 | 88.4 | 92.7 | 96.7 |
| 12 | nicolinho/ORM-llama3.1-8B | Seq. Classifier | 93.1 | 94.4 | 89.7 | 92.3 | 95.8 |

Base Model

lenovo/LDL-Reward-Gemma-2-27B-v0.1 is fine-tuned from google/gemma-2-27b.