arxiv:2410.18252

Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Published on Oct 23 · Submitted by mnoukhov on Oct 25


Authors:

Michael Noukhovitch ,

Abstract

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
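The separation of generation and learning described in the abstract can be illustrated with a small, deterministic toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: `generate`, `train_step`, `Batch`, and `max_staleness` are hypothetical names, the "policy" is reduced to a version counter, and the concurrency of separate generation and training devices is modelled by a simple queue. It shows how letting the generator run ahead makes the learner train on samples from earlier policy versions, which is the online but off-policy regime the abstract refers to.

```python
from collections import deque
from dataclasses import dataclass
from typing import List


@dataclass
class Batch:
    policy_version: int  # version of the policy that produced these samples
    samples: List[str]


def generate(policy_version: int, batch_size: int = 2) -> Batch:
    """Hypothetical stand-in for LLM generation plus reward labelling."""
    samples = [f"sample-v{policy_version}-{i}" for i in range(batch_size)]
    return Batch(policy_version, samples)


def train_step(policy_version: int, batch: Batch) -> int:
    """Hypothetical stand-in for one learner update; returns the new version."""
    return policy_version + 1


def run_synchronous(num_steps: int) -> List[int]:
    """On-policy loop: generate, then train, strictly alternating."""
    version = 0
    staleness = []
    for _ in range(num_steps):
        batch = generate(version)
        staleness.append(version - batch.policy_version)  # always 0
        version = train_step(version, batch)
    return staleness


def run_asynchronous(num_steps: int, max_staleness: int = 1) -> List[int]:
    """Off-policy loop: the generator stays `max_staleness` batches ahead.

    In a real system generation and training would overlap in time;
    here the overlap is modelled by a queue of already generated batches.
    """
    if max_staleness < 1:
        raise ValueError("max_staleness must be at least 1")
    version = 0
    # Pre-fill the queue with batches from the initial policy.
    pending = deque(generate(version) for _ in range(max_staleness))
    staleness = []
    for _ in range(num_steps):
        next_batch = generate(version)  # generator uses the latest weights
        batch = pending.popleft()       # learner consumes the oldest batch
        staleness.append(version - batch.policy_version)
        version = train_step(version, batch)
        pending.append(next_batch)
    return staleness


if __name__ == "__main__":
    print("sync :", run_synchronous(4))
    print("async:", run_asynchronous(4, max_staleness=1))
```

In this toy model the synchronous loop always trains on samples from the current policy (staleness 0), while the asynchronous loop settles at a staleness equal to `max_staleness`. How much staleness a given RLHF algorithm tolerates is the empirical question the paper investigates; the simulation says nothing about that.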


Community

mnoukhov

Paper author Paper submitter 18 days ago

Asynchronous RLHF! A faster, more efficient paradigm for language model and RL training.

Standard RLHF is forced to be synchronous: online, on-policy RL. To take advantage of LLM generation libraries and their efficiencies (e.g. vLLM), we put generation and training on separate GPUs. This makes training off-policy but allows us to achieve big speedups. These speedups increase with scale, and performance is matched!

paper: https://arxiv.org/abs/2410.18252
code: https://github.com/mnoukhov/async_rlhf
hf collection: https://huggingface.co/collections/mnoukhov/asynchronous-rlhf-6717bee31de7be3bcb0ce800


