Model Issue

#1
by YOYO-AI - opened

The current merged version struggles with long-chain reasoning and tends to give final answers immediately instead of reasoning step by step. Would it be possible to explore re-merging the model to address this limitation?

FuseAI org

> The current merged version struggles with long-chain reasoning and tends to provide immediate answers directly. Would it be possible to explore re-merging the model to address this limitation?

We have noticed this problem and are trying to fix it. It might be due to the significantly different parameter spaces of Qwen2.5-Coder-32B and DeepSeek-R1-32B.
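As a rough illustration of what "different parameter spaces" means, one could compare the flattened weight tensors of two same-architecture checkpoints. This is a hypothetical diagnostic sketch, not part of the FuseAI pipeline:

```python
import numpy as np

def param_space_distance(a, b):
    """Cosine distance between flattened parameter tensors of two
    same-architecture checkpoints -- a crude proxy for how far apart
    their parameter spaces are (0 = same direction, 1 = orthogonal)."""
    a = np.ravel(a).astype(np.float64)
    b = np.ravel(b).astype(np.float64)
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos
```

A large distance between two fine-tunes of the same base would suggest their deltas point in very different directions, which makes naive weight merging lossy.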

I see a new version has been uploaded, @Wanfq. Any comments on the changes in this new release? Does it fix the issue discussed here?

FuseAI org

> I see a new version has been uploaded @Wanfq . Any comments on the changes in this new release? Does this fix the issue that was discussed here?

Yes, we changed the base pretrained model from Qwen2.5-32B to Qwen2.5-Coder-32B, and this indeed fixes the issue. Results are shown below:

| Models | LiveCodeBench | LiveCodeBench-Easy | LiveCodeBench-Medium | LiveCodeBench-Hard |
| --- | --- | --- | --- | --- |
| OpenAI o1 | 63.4 | 98.5 | 80.9 | 31.7 |
| OpenAI o1-preview | 42.7 | 97.0 | 47.2 | 9.8 |
| OpenAI o1-mini | 52.0 | 91.0 | 67.4 | 19.5 |
| DeepSeek R1 | 62.8 | 98.4 | 78.3 | 32.2 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 56.1 | 93.6 | 73.1 | 23.4 |
| Qwen/QwQ-32B-Preview | 44.4 | 94.9 | 53.8 | 10.0 |
| NovaSky-AI/Sky-T1-32B-Preview | 37.3 | 89.7 | 40.4 | 6.6 |
| FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview | 56.4 | 92.9 | 73.5 | 24.2 |
| FuseAI/FuseO1-DeepSeekR1-QwQ-32B-Preview | 54.8 | 93.9 | 71.7 | 21.3 |
| FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-Flash-32B-Preview | 58.2 | 94.3 | 77.1 | 25.0 |
| FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview | 57.9 | 93.6 | 76.0 | 25.5 |

@Wanfq Any chance you will do one with Qwen2.5-Coder-32B-Instruct?

FuseAI org

> @Wanfq Any chance you will do one with Qwen2.5-Coder-32B-Instruct?

This model is now merged from Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Qwen-32B. Qwen2.5-Coder-32B is used only as the pivot model, to automatically calculate the merging weights for these two target models: in our SCE merging method, each merging weight is proportional to the delta parameters from the pivot model to the corresponding target model.
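The delta-proportional weighting described above can be sketched as follows. This is a deliberately simplified, hypothetical illustration operating on flat NumPy vectors (it omits SCE's per-tensor handling and the sign-conflict erase step), not FuseAI's or mergekit's actual implementation:

```python
import numpy as np

def sce_merge_sketch(pivot, targets, select_topk=1.0):
    """Toy sketch of SCE-style merging: weight each target by the
    magnitude of its delta from the pivot, optionally keeping only the
    highest-variance parameter positions (select_topk < 1.0)."""
    deltas = np.stack([t - pivot for t in targets])  # (n_targets, n_params)

    # Select: keep only the top-k fraction of positions by variance
    # across targets, zeroing out the rest.
    if select_topk < 1.0:
        var = deltas.var(axis=0)
        k = max(1, int(select_topk * deltas.shape[1]))
        mask = np.zeros(deltas.shape[1], dtype=bool)
        mask[np.argsort(var)[-k:]] = True
        deltas = deltas * mask

    # Calculate: merging weight proportional to each delta's squared magnitude.
    mags = (deltas ** 2).sum(axis=1)
    weights = mags / mags.sum()

    # Merge: pivot plus the weighted sum of deltas.
    return pivot + (weights[:, None] * deltas).sum(axis=0)
```

With `select_topk: 1.0`, as in the config for this model, no positions are dropped and the weighting depends purely on delta magnitude.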

Here is the merging config for this model:

```yaml
models:
  # Pivot model
  - model: Qwen/Qwen2.5-Coder-32B
  # Target models
  - model: Qwen/Qwen2.5-Coder-32B-Instruct
  - model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
merge_method: sce
base_model: Qwen/Qwen2.5-Coder-32B
parameters:
  select_topk: 1.0
dtype: bfloat16
```
