Elrich akira

elrich666
AI & ML interests

None yet

Recent Activity

Organizations

None yet

elrich666's activity

reacted to Jaward's post with 👍🔥 14 days ago
The beauty of GRPO is that it doesn't care whether the rewards are rule-based or learned. The hack: let the data self-normalize. Trajectories in a batch compete against their own mean, so there's no value model and no extra params, just clean, efficient RL that cuts memory usage by roughly 50% while maintaining SOTA performance. Btw, it was introduced 9 months prior to R1: arxiv.org/pdf/2402.03300
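The self-normalization the post describes can be sketched in a few lines. This is a hypothetical helper, not the DeepSeekMath reference implementation: for each prompt, a group of sampled trajectories is scored, and each trajectory's advantage is its reward normalized against the group's own mean and standard deviation, with no learned value model involved.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """Compute group-relative advantages for one group of trajectories.

    rewards: list of scalar rewards, one per sampled trajectory.
    Each advantage is (reward - group mean) / (group std + eps),
    so trajectories compete against their batch mean.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four trajectories sampled for the same prompt.
# Above-average rewards get positive advantages, below-average negative.
advs = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

Because the baseline is the group mean rather than a critic's estimate, the value network (and its optimizer state) disappears entirely, which is where the memory savings come from.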