Model Issue

#1
by YOYO-AI - opened

The current merged version struggles with long-chain reasoning and tends to give final answers immediately instead of reasoning step by step. Would it be possible to explore re-merging the model to address this limitation?

FuseAI org

> The current merged version struggles with long-chain reasoning and tends to provide immediate answers directly. Would it be possible to explore re-merging the model to address this limitation?

We have noticed this problem and are trying to fix it. It might be due to the significantly different parameter spaces of Qwen2.5-Coder-32B and DeepSeek-R1-32B.
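As a rough illustration of what "different parameter spaces" means, one could compare the flattened weight tensors of two same-architecture checkpoints. This is a hypothetical diagnostic sketch, not part of the FuseAI pipeline:

```python
import numpy as np

def param_space_distance(a, b):
    """Cosine distance between flattened parameter tensors of two
    same-architecture checkpoints -- a crude proxy for how far apart
    their parameter spaces are (0 = same direction, 1 = orthogonal)."""
    a = np.ravel(a).astype(np.float64)
    b = np.ravel(b).astype(np.float64)
    cos = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return 1.0 - cos
```

A large distance between two fine-tunes of the same base would suggest their deltas point in very different directions, which makes naive weight merging lossy.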

I see a new version has been uploaded, @Wanfq. Any comments on the changes in this new release? Does it fix the issue discussed here?

FuseAI org

> I see a new version has been uploaded @Wanfq . Any comments on the changes in this new release? Does this fix the issue that was discussed here?

Yes, we changed the base pretrained model from Qwen2.5-32B to Qwen2.5-Coder-32B, and this indeed fixes the issue. Results are shown below:

| Models | LiveCodeBench | LiveCodeBench-Easy | LiveCodeBench-Medium | LiveCodeBench-Hard |
| --- | --- | --- | --- | --- |
| OpenAI o1 | 63.4 | 98.5 | 80.9 | 31.7 |
| OpenAI o1-preview | 42.7 | 97.0 | 47.2 | 9.8 |
| OpenAI o1-mini | 52.0 | 91.0 | 67.4 | 19.5 |
| DeepSeek R1 | 62.8 | 98.4 | 78.3 | 32.2 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 56.1 | 93.6 | 73.1 | 23.4 |
| Qwen/QwQ-32B-Preview | 44.4 | 94.9 | 53.8 | 10.0 |
| NovaSky-AI/Sky-T1-32B-Preview | 37.3 | 89.7 | 40.4 | 6.6 |
| FuseAI/FuseO1-DeepSeekR1-Qwen2.5-Coder-32B-Preview | 56.4 | 92.9 | 73.5 | 24.2 |
| FuseAI/FuseO1-DeepSeekR1-QwQ-32B-Preview | 54.8 | 93.9 | 71.7 | 21.3 |
| FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-Flash-32B-Preview | 58.2 | 94.3 | 77.1 | 25.0 |
| FuseAI/FuseO1-DeepSeekR1-QwQ-SkyT1-32B-Preview | 57.9 | 93.6 | 76.0 | 25.5 |

@Wanfq Any chance you will do one with Qwen2.5-Coder-32B-Instruct?

FuseAI org

> @Wanfq Any chance you will do one with Qwen2.5-Coder-32B-Instruct?

This model is now merged from Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Qwen-32B. Qwen2.5-Coder-32B is used only as the pivot model, to automatically calculate the merging weights for these two target models: in our SCE merging method, each merging weight is proportional to the delta parameters from the pivot model to the corresponding target model.
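The delta-proportional weighting described above can be sketched as follows. This is a deliberately simplified, hypothetical illustration operating on flat NumPy vectors (it omits SCE's per-tensor handling and the sign-conflict erase step), not FuseAI's or mergekit's actual implementation:

```python
import numpy as np

def sce_merge_sketch(pivot, targets, select_topk=1.0):
    """Toy sketch of SCE-style merging: weight each target by the
    magnitude of its delta from the pivot, optionally keeping only the
    highest-variance parameter positions (select_topk < 1.0)."""
    deltas = np.stack([t - pivot for t in targets])  # (n_targets, n_params)

    # Select: keep only the top-k fraction of positions by variance
    # across targets, zeroing out the rest.
    if select_topk < 1.0:
        var = deltas.var(axis=0)
        k = max(1, int(select_topk * deltas.shape[1]))
        mask = np.zeros(deltas.shape[1], dtype=bool)
        mask[np.argsort(var)[-k:]] = True
        deltas = deltas * mask

    # Calculate: merging weight proportional to each delta's squared magnitude.
    mags = (deltas ** 2).sum(axis=1)
    weights = mags / mags.sum()

    # Merge: pivot plus the weighted sum of deltas.
    return pivot + (weights[:, None] * deltas).sum(axis=0)
```

With `select_topk: 1.0`, as in the config for this model, no positions are dropped and the weighting depends purely on delta magnitude.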

Here is the merging config for this model:

```yaml
models:
  # Pivot model
  - model: Qwen/Qwen2.5-Coder-32B
  # Target models
  - model: Qwen/Qwen2.5-Coder-32B-Instruct
  - model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
merge_method: sce
base_model: Qwen/Qwen2.5-Coder-32B
parameters:
  select_topk: 1.0
dtype: bfloat16
```
