System Prompt

#2 · opened by Wanfq

What is the system prompt for the distilled model?

I'm using the QwQ system prompt; it seems to work just fine:

You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.
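For reference, a minimal sketch of wiring that prompt through the chat template (assuming the `transformers` tokenizer API; the user question is just a placeholder):

```python
# Minimal sketch: render the QwQ-style system prompt through the model's chat template.
# Assumes the transformers library; the user question is only a placeholder.
from transformers import AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": "You are a helpful and harmless assistant. "
                   "You are Qwen developed by Alibaba. You should think step-by-step.",
    },
    {"role": "user", "content": "How many primes are there below 100?"},
]

# Produce the exact prompt string the model will see.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```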

I posted a quick example output with system/user prompts and benchmarks for a 3090 Ti FE running a bnb-4bit quant locally over at r/LocalLLaMA.

I am still experimenting, as even with the temperature in the suggested 0.5~0.8 range it can get hung up second-guessing itself in a loop.
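For reference, here is roughly what that local setup looks like as a sketch (assuming bitsandbytes NF4 via transformers; the sampling values are experimental starting points, not official recommendations):

```python
# Rough sketch of a local bnb-4bit setup (assumes bitsandbytes + transformers installed).
# The sampling values below are experimental starting points, not official recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You should think step-by-step."},
    {"role": "user", "content": "What is 17 * 23?"},  # placeholder question
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Temperature inside the suggested 0.5~0.8 range, plus a mild repetition penalty
# to discourage the second-guessing loops mentioned above.
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.05,
    max_new_tokens=4096,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```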

We have tested the following system prompt with a temperature of 0.7.

You are a helpful and harmless assistant. You should think step-by-step.

Here are the evaluation results.

| Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |

More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview

@Wanfq Wow, you guys are fast. I see you just released a merge, FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview, today?!

The benchmark numbers on your merge are looking good! I was using Sky-T1-32B up until today, when DeepSeek-R1-Distill-Qwen-32B landed.

Can't wait to try out your merge once the GGUFs land! Though it's been a busy day for @bartowski already... haha! 🎉

cheers!

@Wanfq Well, those scores are significantly lower than DeepSeek's. I wonder if they included how they set up the test environment in the paper.

Have you tried no system prompt? Since DeepSeek V3 also never got an official system prompt, maybe their new models perform best without any system prompt.

DeepSeek-R1-Evaluation

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.

Distilled Model Evaluation

| Model | AIME 2024 pass@1 | AIME 2024 cons@64 |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 |

cons@64 is majority voting over 64 model calls, and pass@1 is the success rate of a single call; there is no inference-time search beyond that sampling.
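For concreteness, a toy sketch of how pass@1 and cons@64 are computed from the 64 samples for a single question (illustrative numbers only, not the actual AIME runs):

```python
# Toy illustration of pass@1 vs. cons@64 for one question (not the actual AIME data).
from collections import Counter

def pass_at_1(samples, reference):
    """Average accuracy over all sampled answers to a single question."""
    return sum(answer == reference for answer in samples) / len(samples)

def cons_at_k(samples, reference):
    """Accuracy of the majority-vote (consensus) answer over the k samples."""
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return float(majority_answer == reference)

# 64 sampled answers; "42" is the reference answer for this toy question.
samples = ["42"] * 40 + ["41"] * 24
print(pass_at_1(samples, "42"))  # 0.625
print(cons_at_k(samples, "42"))  # 1.0 -- majority voting recovers the correct answer
```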


I tried no system prompt in my early attempts. The results are close to those with "You are a helpful and harmless assistant. You should think step-by-step."


We use a temperature of 0.7 and a maximum generation length of 32768 tokens; the evaluation code is based on https://github.com/NovaSky-AI/SkyThought to calculate pass@1.


The readme says DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.
Maybe you could try setting the system prompt to "You are a helpful assistant.", the same as Qwen2.5.
@Wanfq


The evaluation code is modified from SkyThought. In our evaluation, we set the temperature to 0.7 and max_tokens to 32768. We provide an example for reproducing our results in the evaluation code.

The system prompt for evaluation is set to:

You are a helpful and harmless assistant. You should think step-by-step.
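A minimal sketch of that generation setup (assuming vLLM for inference; this mirrors the temperature 0.7 / max_tokens 32768 settings above but is not the actual SkyThought harness, and the question is a placeholder):

```python
# Minimal sketch of the generation side of the evaluation described above.
# Assumes vLLM; this is not the actual SkyThought harness, and grading is omitted.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
SYSTEM_PROMPT = "You are a helpful and harmless assistant. You should think step-by-step."

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)  # a 32B model needs a lot of VRAM; consider tensor_parallel_size
params = SamplingParams(temperature=0.7, max_tokens=32768)

question = "If x + y = 10 and xy = 21, what is x^2 + y^2?"  # placeholder question
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```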

We are currently attempting to reproduce the results reported in the DeepSeek-R1 paper by experimenting with different system prompts. We will update our findings once we have acquired the original system prompt used in their study.

The updated evaluation results are presented here:

| Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | 93.71 | 57.58 | 95.90 | 68.70 | 82.17 | 59.69 |

More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview


The paper says they used a top-p of 0.95 and a temperature of 0.6 for their benchmarks.
How have people been setting top-k, repeat penalty, and min-p to get the best results?
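No official numbers for those beyond temperature and top-p, as far as I can tell. As a hedged starting point with llama-cpp-python on a GGUF quant (the file name and the extra sampler values are placeholders, not settings from the paper):

```python
# Hedged sketch of the extra sampler knobs (top-k, min-p, repeat penalty) on a GGUF quant.
# The model path and the extra values are placeholders; the paper only specifies
# temperature 0.6 and top-p 0.95.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf",  # placeholder file name
    n_ctx=32768,
    n_gpu_layers=-1,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful and harmless assistant. You should think step-by-step."},
        {"role": "user", "content": "How many primes are there below 30?"},  # placeholder question
    ],
    temperature=0.6,     # from the paper
    top_p=0.95,          # from the paper
    top_k=40,            # placeholder: common llama.cpp default
    min_p=0.05,          # placeholder: common llama.cpp default
    repeat_penalty=1.0,  # effectively off; raise slightly if it loops
    max_tokens=4096,
)
print(out["choices"][0]["message"]["content"])
```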
