System Prompt

#2 opened by Wanfq

We have tested the following system prompt with a temperature of 0.7:

You are a helpful and harmless assistant. You should think step-by-step.

Here are the evaluation results.

| Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | - | 57.58 | - | - | - | - |

More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview

The results seem biased.

I'm also confused about getting much lower results than they reported, especially on AIME24...

The evaluation code is modified from SkyThought. In our evaluation, we set the temperature to 0.7 and max_tokens to 32768. We provide an example to reproduce our results in evaluation.

The system prompt for evaluation is set to:

You are a helpful and harmless assistant. You should think step-by-step.
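
For concreteness, here is a minimal sketch of how these settings can be applied, assuming a vLLM backend and the model's chat template (the actual SkyThought-based harness may differ, and the question string is just a placeholder):

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B"
SYSTEM_PROMPT = "You are a helpful and harmless assistant. You should think step-by-step."

# Settings described in this thread: temperature 0.7, max_tokens 32768.
tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL)
params = SamplingParams(temperature=0.7, max_tokens=32768)

question = "..."  # placeholder for one benchmark problem
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
output = llm.generate([prompt], params)[0]
print(output.outputs[0].text)
```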

We are currently attempting to reproduce the results reported in the DeepSeek-R1 paper by experimenting with different system prompts. We will update our findings once we have acquired the original system prompt used in their study.

The updated evaluation results are presented here:

| Models | AIME24 | MATH500 | GSM8K | GPQA-Diamond | ARC-Challenge | MMLU-Pro | MMLU | LiveCodeBench |
|---|---|---|---|---|---|---|---|---|
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 46.67 | 88.20 | 93.71 | 57.58 | 95.90 | 68.70 | 82.17 | 59.69 |

More evaluation results can be found at https://huggingface.co/FuseAI/FuseO1-DeekSeekR1-QwQ-SkyT1-32B-Preview


DeepSeek org

Please kindly refer to the following link:
https://github.com/deepseek-ai/DeepSeek-R1#usage-recommendations
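
For context, the linked recommendations (paraphrased here; please verify them at the link) advise sampling at a temperature around 0.6, omitting the system prompt entirely, and putting the \boxed{} directive for math problems in the user turn. Relative to the sketch above, that would look roughly like:

```python
from vllm import SamplingParams

# Recommended settings, paraphrased from the linked README (verify there).
params = SamplingParams(temperature=0.6, max_tokens=32768)

question = "..."  # placeholder for one benchmark problem
# No system message; the math directive goes into the user turn instead.
messages = [{
    "role": "user",
    "content": question
    + "\nPlease reason step by step, and put your final answer within \\boxed{}.",
}]
```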

Show an example of a Telegram bot in Python with aiogram that receives a video from a user, processes it, and sends it back. Database: MongoDB.

Do it.
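
A minimal sketch of such a bot, assuming aiogram v3 and motor for async MongoDB access; BOT_TOKEN, MONGO_URI, and process_video are placeholders for your own token, connection string, and processing step:

```python
import asyncio
import os

from aiogram import Bot, Dispatcher, F
from aiogram.types import FSInputFile, Message
from motor.motor_asyncio import AsyncIOMotorClient

bot = Bot(token=os.environ["BOT_TOKEN"])
dp = Dispatcher()
mongo = AsyncIOMotorClient(os.environ.get("MONGO_URI", "mongodb://localhost:27017"))
db = mongo["video_bot"]


def process_video(src: str, dst: str) -> None:
    # Placeholder "processing": just copy the file. Replace with ffmpeg, etc.
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read())


@dp.message(F.video)
async def handle_video(message: Message) -> None:
    src = f"{message.video.file_unique_id}_in.mp4"
    dst = f"{message.video.file_unique_id}_out.mp4"

    # Download the incoming video to disk.
    await bot.download(message.video, destination=src)

    # Run the (blocking) processing step off the event loop.
    await asyncio.to_thread(process_video, src, dst)

    # Record the job in MongoDB.
    await db.videos.insert_one({
        "user_id": message.from_user.id,
        "file_id": message.video.file_id,
        "status": "done",
    })

    # Send the processed video back to the user.
    await message.answer_video(FSInputFile(dst))


async def main() -> None:
    await dp.start_polling(bot)


if __name__ == "__main__":
    asyncio.run(main())
```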

Can you confirm that this model can't answer the coding question which even standard qwen-7b-instruct answers?

Explain the bug in the following code:

from time import sleep
from multiprocessing.pool import ThreadPool
 
def task():
    sleep(1)
    return 'all done'

if __name__ == '__main__':
    with ThreadPool() as pool:
        result = pool.apply_async(task())
        value = result.get()
        print(value)

After long thinking it always answers that there is no bug. But the bug is that `task()` is called immediately and its return value (the string) is passed to `apply_async` instead of the callable; it should be `result = pool.apply_async(task)`. Almost all recent models of a similar size answer this easily.
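
For reference, the corrected version passes the function object itself, so the pool schedules it on a worker thread:

```python
from time import sleep
from multiprocessing.pool import ThreadPool

def task():
    sleep(1)
    return 'all done'

if __name__ == '__main__':
    with ThreadPool() as pool:
        # Pass the callable, not its return value, so the pool calls it.
        result = pool.apply_async(task)
        value = result.get()
        print(value)
```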
