Model Issue
The current merged version struggles with long-chain reasoning and tends to answer immediately instead of thinking step by step. Would it be possible to explore re-merging the model to address this limitation?
We have noticed this problem and are trying to fix it. It might be due to the significantly different parameter spaces of Qwen2.5-Coder-32B and DeepSeek-R1-32B.
I see a new version has been uploaded, @Wanfq. Any comments on the changes in this new release? Does it fix the issue discussed here?
Yes, we changed the base pretrained model from Qwen2.5-32B to Qwen2.5-Coder-32B. This indeed fixes the issue. Results are shown below:
Models | LiveCodeBench | LiveCodeBench-Easy | LiveCodeBench-Medium | LiveCodeBench-Hard |
---|---|---|---|---|
OpenAI o1 | 63.4 | 98.5 | 80.9 | 31.7 |
OpenAI o1-preview | 42.7 | 97.0 | 47.2 | 9.8 |
OpenAI o1-mini | 52.0 | 91.0 | 67.4 | 19.5 |
DeepSeek R1 | 62.8 | 98.4 | 78.3 | 32.2 |
deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 56.1 | 93.6 | 73.1 | 23.4 |
Qwen/QwQ-32B-Preview | 44.4 | 94.9 | 53.8 | 10.0 |
NovaSky-AI/Sky-T1-32B-Preview | 37.3 | 89.7 | 40.4 | 6.6 |
FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview | 56.4 | 92.9 | 73.5 | 24.2 |
FuseAI/FuseO1-DeepSeekR1-QwQ-32B-Preview | 54.8 | 93.9 | 71.7 | 21.3 |
FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-Flash-32B-Preview | 58.2 | 94.3 | 77.1 | 25.0 |
FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview | 57.9 | 93.6 | 76.0 | 25.5 |
@Wanfq Any chance you will do one with Qwen2.5-Coder-32B-Instruct?
This model is now merged from Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Qwen-32B. Qwen2.5-Coder-32B is only used to automatically calculate the merging weights for these two models, since in our SCE merging method the merging weight is proportional to the delta parameters from the pivot model to each target model.
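For intuition, here is a minimal PyTorch sketch of that delta-based weighting. This is a rough reading of the SCE (Select-Calculate-Erase) procedure, not the actual mergekit implementation; the function name and the exact selection/weighting details are illustrative assumptions:

```python
import torch

def sce_merge_sketch(pivot: torch.Tensor,
                     targets: list[torch.Tensor],
                     select_topk: float = 1.0) -> torch.Tensor:
    """Illustrative SCE-style merge of a single parameter tensor.

    pivot   -- the pivot model's weights (Qwen2.5-Coder-32B here)
    targets -- the target models' corresponding weights
    NOTE: a rough sketch of the idea, not the actual mergekit code.
    """
    # Delta parameters from the pivot model to each target model.
    deltas = torch.stack([t - pivot for t in targets])

    # Select: keep only the top-k fraction of elements with the largest
    # variance across targets (select_topk=1.0 keeps every element).
    if select_topk < 1.0:
        variance = deltas.var(dim=0)
        k = max(1, int(select_topk * variance.numel()))
        threshold = variance.flatten().topk(k).values.min()
        deltas = deltas * (variance >= threshold)

    # Calculate: per-element merge weights proportional to the squared
    # delta, so the model that moved further from the pivot dominates.
    weights = deltas.pow(2)
    weights = weights / weights.sum(dim=0).clamp_min(1e-12)

    # Erase: drop elements whose sign disagrees with the majority
    # direction, then add the weighted delta back onto the pivot.
    majority_sign = deltas.sum(dim=0).sign()
    agree = deltas.sign() == majority_sign
    return pivot + (weights * deltas * agree).sum(dim=0)
```

With two target models, each element of the merged delta is weighted by delta_i^2 / (delta_1^2 + delta_2^2), which is what "proportional to the delta parameters" means in practice.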
Here is the merging config for this model:
```yaml
models:
  # Pivot model
  - model: Qwen/Qwen2.5-Coder-32B
  # Target models
  - model: Qwen/Qwen2.5-Coder-32B-Instruct
  - model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
merge_method: sce
base_model: Qwen/Qwen2.5-Coder-32B
parameters:
  select_topk: 1.0
dtype: bfloat16
```
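Assuming this is a standard mergekit config (recent mergekit releases support the `sce` merge method), the merge can be reproduced with something like `mergekit-yaml sce_config.yaml ./merged-model --cuda`, where the config filename and output directory are placeholders.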