Why are the chosen rewards negative?

#34
by GeneZC - opened

Why are both the chosen rewards and the rejected rewards negative, even though the reward margins are positive?
A negative chosen reward essentially means that the optimized model assigns a lower probability to the chosen example than the reference model does, which does not seem reasonable.
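To make the question concrete, here is a minimal sketch. It assumes the logged rewards are DPO-style implicit rewards, i.e. `beta * (log pi_theta(y|x) - log pi_ref(y|x))`; that is my assumption about how they are computed. The log-probabilities, the value of `BETA`, and the helper `implicit_reward` are all made up for illustration.

```python
# Sketch under an assumption: reward = beta * (policy log-prob - reference log-prob).
# All numbers below are hypothetical sequence-level log-probabilities (summed over tokens).

BETA = 0.1  # assumed scaling coefficient, not taken from any real config


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = BETA) -> float:
    """Scaled log-ratio between the optimized model and the reference model."""
    return beta * (policy_logp - ref_logp)


# The optimized model is slightly less likely than the reference on the chosen answer...
chosen_reward = implicit_reward(policy_logp=-52.0, ref_logp=-50.0)
# ...and much less likely than the reference on the rejected answer.
rejected_reward = implicit_reward(policy_logp=-70.0, ref_logp=-55.0)

# The margin only depends on the difference between the two rewards.
margin = chosen_reward - rejected_reward

print(f"chosen={chosen_reward:.2f} rejected={rejected_reward:.2f} margin={margin:.2f}")
```

With these numbers both rewards are below zero while the margin is above zero, which is the pattern I am asking about.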

One potential explanation is that the optimized model emphasizes some tokens rather than all of them, while the reference model emphasizes all tokens equally. The same may apply to the rejected examples. Therefore, the reward margins are still positive.
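A toy illustration of this hypothesis, again assuming the reward is a scaled difference of summed token log-probabilities. The per-token values are invented, and this only shows that the arithmetic is possible, not that it is what actually happens during training.

```python
# Hypothetical per-token log-probabilities for one chosen response (values are made up).
ref_token_logps = [-1.0, -1.0, -1.0, -1.0, -1.0]     # reference: uniform emphasis
policy_token_logps = [-0.2, -0.3, -2.5, -2.5, -2.5]  # optimized: sharper on two tokens

# Per-token shift of the optimized model relative to the reference.
per_token_shift = [p - r for p, r in zip(policy_token_logps, ref_token_logps)]

# The sequence-level log-ratio is the sum of the per-token shifts.
sequence_shift = sum(per_token_shift)

improved = sum(1 for d in per_token_shift if d > 0)
print(f"tokens with higher probability: {improved}/{len(per_token_shift)}")
print(f"sequence-level log-ratio: {sequence_shift:.2f}")
```

Here two tokens become more likely under the optimized model, yet the summed log-ratio is negative, so the reward for the whole response would be negative as well.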
