Elrich akira

elrich666
AI & ML interests

None yet

Recent Activity

Organizations

None yet

elrich666's activity

reacted to Jaward's post with 👍🔥 14 days ago
The beauty of GRPO is that it doesn't care whether the rewards are rule-based or learned. The hack: let the data self-normalize. Trajectories in a batch compete against their own mean, so there's no value model and no extra params, just clean, efficient RL that cuts memory usage by roughly 50% while maintaining SOTA performance. Btw, it was introduced 9 months prior to R1: arxiv.org/pdf/2402.03300
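The self-normalization the post describes can be sketched in a few lines. This is a hypothetical helper, not the DeepSeekMath reference implementation: for each prompt, a group of sampled trajectories is scored, and each trajectory's advantage is its reward normalized against the group's own mean and standard deviation, with no learned value model involved.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Compute group-relative advantages for one group of trajectories.

    rewards: list of scalar rewards, one per sampled trajectory.
    Each advantage is (reward - group mean) / (group std + eps),
    so trajectories compete against their batch mean.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four trajectories sampled for the same prompt.
# Above-average rewards get positive advantages, below-average negative.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the baseline is the group mean rather than a critic's estimate, the value network (and its optimizer state) disappears entirely, which is where the memory savings come from.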