Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

🖥️Code | 🤗Data | 📄Paper

This repo contains the Qwen2-57B-A14B-SFT-Step-DPO model. It is obtained by performing Step-DPO on Qwen2-57B-A14B-SFT.

Step-DPO is a simple, effective, and data-efficient method for boosting the mathematical reasoning ability of LLMs. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K without bells and wistles, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.

Contact

Please submit an issue here or send me an email here.

Downloads last month
4
Safetensors
Model size
57.4B params
Tensor type
BF16
·
Inference Providers NEW
This model is not currently available via any of the supported third-party Inference Providers, and the model is not deployed on the HF Inference API.

Model tree for xinlai/Qwen2-57B-A14B-SFT-Step-DPO

Quantizations
1 model

Collection including xinlai/Qwen2-57B-A14B-SFT-Step-DPO