Description

mistral-7b-sft-beta model finetuned by hybrid WPO (GPT-4-turbo + on-policy sampling + Ultrafeedback). Details in WPO: Enhancing RLHF with Weighted Preference Optimization. The training data is wzhouad/zephyr-ultrafeedback-hybrid.

License

This model is licensed under the Zoom software license and is permitted for use only for noncommercial, educational, or academic research purposes.

Downloads last month: 4

Safetensors

Model size

7.24B params

Tensor type

F32

Inference Providers NEW

Text Generation

This model is not currently available via any of the supported third-party Inference Providers, and the model is not deployed on the HF Inference API.

Dataset used to train wzhouad/zephyr-7B-WPO-HB

Collection including wzhouad/zephyr-7B-WPO-HB

WPO

Collection

Models and datasets in paper "WPO: Enhancing RLHF with Weighted Preference Optimization". • 11 items • Updated Aug 22, 2024 • 6