Temperature's effect on the performance of long chain-of-thought reasoning models: why was 0.7 used for the evals?
In the evals for this model, a temperature of 0.7 was used. This is higher than what DeepSeek recommends for R1, and than what the Qwen team set in its QwQ Space (the latter may be unrepresentative, as it is meant as a demo); the stated reason for using lower temperature values was to "prevent endless repetitions or incoherent outputs."
So I'd like to ask whether 0.7 was chosen for a specific reason, and whether you have any observations on how temperature and other sampling parameters affect the performance of long chain-of-thought reasoning models.
https://github.com/deepseek-ai/DeepSeek-R1
"0.5-0.7 (0.6 is recommended) "
https://huggingface.co/spaces/Qwen/QwQ-32B-preview/blob/main/app.py
"'temperature': 0.001, 'repetition_penalty': 1.0, "top_k": 20, "top_p": 0.8"
Thank you for your question!
For math and code evaluation, we use a temperature of 0.6 to ensure consistency with DeepSeek-R1.
For science evaluation, we finished that part a long time ago with a temperature of 0.7. However, we do not have enough GPUs for this preview version to re-run it at a temperature of 0.6.
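Concretely, the per-domain settings amount to something like the following (illustrative only; this is not our actual eval code, and the names are placeholders):

```python
# Illustrative only: per-domain eval temperatures for this preview version.
EVAL_TEMPERATURES = {
    "math": 0.6,     # matches DeepSeek-R1's recommended setting
    "code": 0.6,     # matches DeepSeek-R1's recommended setting
    "science": 0.7,  # run earlier; not re-run at 0.6 due to GPU limits
}
```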