Description

Llama3-Instruct-8B model finetuned by hybrid WPO (GPT-4-turbo + on-policy sampling + Ultrafeedback). Details in WPO: Enhancing RLHF with Weighted Preference Optimization. The model is trained based on wzhouad/llama3-ultrafeedback-hybrid.

License

This model is licensed under the Zoom software license and is permitted for use only for noncommercial, educational, or academic research purposes.

Downloads last month: 6

Safetensors

Model size

8.03B params

Tensor type

F32

Inference Providers NEW

Text Generation

This model is not currently available via any of the supported third-party Inference Providers, and the model is not deployed on the HF Inference API.

Dataset used to train wzhouad/Llama3-Instruct-8B-WPO-HB

Collection including wzhouad/Llama3-Instruct-8B-WPO-HB

WPO

Collection

Models and datasets in paper "WPO: Enhancing RLHF with Weighted Preference Optimization". • 11 items • Updated Aug 22, 2024 • 6