System Prompt

#2 · opened by Wanfq

What is the system prompt for the distilled model?

I'm using the QwQ system prompt; it seems to work just fine:

You are a helpful and harmless assistant. You are Qwen developed by Alibaba. You should think step-by-step.
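For reference, a minimal sketch of wiring that prompt through the chat template (assuming the `transformers` tokenizer API; the user question is just a placeholder):

```python
# Minimal sketch: render the QwQ-style system prompt through the model's chat template.
# Assumes the transformers library; the user question is only a placeholder.
from transformers import AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": "You are a helpful and harmless assistant. "
                   "You are Qwen developed by Alibaba. You should think step-by-step.",
    },
    {"role": "user", "content": "How many primes are there below 100?"},
]

# Produce the exact prompt string the model will see.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```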

I posted a quick example output with system/user prompts and benchmarks for a 3090 Ti FE running a bnb-4bit quant locally over at r/LocalLLaMA.

I am still experimenting, as even with the temperature in the suggested 0.5~0.8 range it can get hung up second-guessing itself in a loop.
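For reference, here is roughly what that local setup looks like as a sketch (assuming bitsandbytes NF4 via transformers; the sampling values are experimental starting points, not official recommendations):

```python
# Rough sketch of a local bnb-4bit setup (assumes bitsandbytes + transformers installed).
# The sampling values below are experimental starting points, not official recommendations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful and harmless assistant. You should think step-by-step."},
    {"role": "user", "content": "What is 17 * 23?"},  # placeholder question
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Temperature inside the suggested 0.5~0.8 range, plus a mild repetition penalty
# to discourage the second-guessing loops mentioned above.
output = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.05,
    max_new_tokens=4096,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```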

We have tested the following system prompt with a temperature of 0.7.

You are a helpful and harmless assistant. You should think step-by-step.

Here are the evaluation results.

| Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |

More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview

@Wanfq Wow, you guys are fast. I see you just released a merge, FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview, today?!

The benchmark numbers on your merge are looking good! I was using Sky-T1-32B up until today, when DeepSeek-R1-Distill-Qwen-32B landed.

Can't wait to try out your merge once the GGUFs land! Though it's been a busy day for @bartowski already... haha! 🎉

cheers!

@Wanfq Well, those scores are significantly lower than DeepSeek's. I wonder if they included how they set up the test environment in the paper.

Have you tried no system prompt? Since DeepSeek V3 also never got an official system prompt, maybe their new models perform best without any system prompt.

DeepSeek-R1-Evaluation

For all our models, the maximum generation length is set to 32,768 tokens. For benchmarks requiring sampling, we use a temperature of 0.6, a top-p value of 0.95, and generate 64 responses per query to estimate pass@1.

Distilled Model Evaluation

| Model | AIME 2024 pass@1 | AIME 2024 cons@64 |
|---|---|---|
| DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 |

cons@64 is majority voting over 64 model calls, and pass@1 is the success rate of a single call; there is no inference-time search beyond that sampling.
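For concreteness, a toy sketch of how pass@1 and cons@64 are computed from the 64 samples for a single question (illustrative numbers only, not the actual AIME runs):

```python
# Toy illustration of pass@1 vs. cons@64 for one question (not the actual AIME data).
from collections import Counter

def pass_at_1(samples, reference):
    """Average accuracy over all sampled answers to a single question."""
    return sum(answer == reference for answer in samples) / len(samples)

def cons_at_k(samples, reference):
    """Accuracy of the majority-vote (consensus) answer over the k samples."""
    majority_answer, _ = Counter(samples).most_common(1)[0]
    return float(majority_answer == reference)

# 64 sampled answers; "42" is the reference answer for this toy question.
samples = ["42"] * 40 + ["41"] * 24
print(pass_at_1(samples, "42"))  # 0.625
print(cons_at_k(samples, "42"))  # 1.0 -- majority voting recovers the correct answer
```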


I tried no system prompt in my early attempts. The results are close to those with "You are a helpful and harmless assistant. You should think step-by-step."


We use a temperature of 0.7 and a maximum generation length of 32768 tokens; the evaluation code is based on https://github.com/NovaSky-AI/SkyThought to calculate pass@1.


The readme says DeepSeek-R1-Distill models can be utilized in the same manner as Qwen or Llama models.
Maybe you could try setting the system prompt to "You are a helpful assistant.", the same as Qwen2.5.
@Wanfq


The evaluation code is modified from SkyThought. In our evaluation, we set the temperature to 0.7 and max_tokens to 32768. We provide an example for reproducing our results in the evaluation code.

The system prompt for evaluation is set to:

You are a helpful and harmless assistant. You should think step-by-step.
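A minimal sketch of that generation setup (assuming vLLM for inference; this mirrors the temperature 0.7 / max_tokens 32768 settings above but is not the actual SkyThought harness, and the question is a placeholder):

```python
# Minimal sketch of the generation side of the evaluation described above.
# Assumes vLLM; this is not the actual SkyThought harness, and grading is omitted.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
SYSTEM_PROMPT = "You are a helpful and harmless assistant. You should think step-by-step."

tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id)  # a 32B model needs a lot of VRAM; consider tensor_parallel_size
params = SamplingParams(temperature=0.7, max_tokens=32768)

question = "If x + y = 10 and xy = 21, what is x^2 + y^2?"  # placeholder question
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ],
    tokenize=False,
    add_generation_prompt=True,
)

outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```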

We are currently attempting to reproduce the results reported in the DeepSeek-R1 paper by experimenting with different system prompts. We will update our findings once we have acquired the original system prompt used in their study.

The updated evaluation results are presented here:

| Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | 93.71 | 57.58 | 95.90 | 68.70 | 82.17 | 59.69 |

More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview


The paper says they used a top-p of 0.95 and a temperature of 0.6 for their benchmarks.
How have people been setting top-k, repeat penalty, and min-p to get the best results?
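No official numbers for those beyond temperature and top-p, as far as I can tell. As a hedged starting point with llama-cpp-python on a GGUF quant (the file name and the extra sampler values are placeholders, not settings from the paper):

```python
# Hedged sketch of the extra sampler knobs (top-k, min-p, repeat penalty) on a GGUF quant.
# The model path and the extra values are placeholders; the paper only specifies
# temperature 0.6 and top-p 0.95.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-Distill-Qwen-32B-Q4_K_M.gguf",  # placeholder file name
    n_ctx=32768,
    n_gpu_layers=-1,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful and harmless assistant. You should think step-by-step."},
        {"role": "user", "content": "How many primes are there below 30?"},  # placeholder question
    ],
    temperature=0.6,     # from the paper
    top_p=0.95,          # from the paper
    top_k=40,            # placeholder: common llama.cpp default
    min_p=0.05,          # placeholder: common llama.cpp default
    repeat_penalty=1.0,  # effectively off; raise slightly if it loops
    max_tokens=4096,
)
print(out["choices"][0]["message"]["content"])
```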
