Temperature's effect on the performance of long chain-of-thought reasoning models: why was 0.7 used for the evals?
In the evals for this model, a temperature of 0.7 was used. This is higher than what DeepSeek recommends for R1, and than what the Qwen team set in its QwQ Space (the latter may be unrepresentative, as it is meant as a demo); the stated reason for using lower temperature values was to "prevent endless repetitions or incoherent outputs."
So I'd like to ask whether 0.7 was chosen for a specific reason, and whether you have any observations on how temperature and other sampling parameters affect the performance of long chain-of-thought reasoning models.
https://github.com/deepseek-ai/DeepSeek-R1
"0.5-0.7 (0.6 is recommended) "
https://huggingface.co/spaces/Qwen/QwQ-32B-preview/blob/main/app.py
"'temperature': 0.001, 'repetition_penalty': 1.0, "top_k": 20, "top_p": 0.8"
Thank you for your question!
For math and code evaluation, we use a temperature of 0.6 to ensure consistency with DeepSeek-R1.
For science evaluation, we finished that part a long time ago with a temperature of 0.7. However, we do not have enough GPUs for this preview version to re-run it at a temperature of 0.6.
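Concretely, the per-domain settings amount to something like the following (illustrative only; this is not our actual eval code, and the names are placeholders):

```python
# Illustrative only: per-domain eval temperatures for this preview version.
EVAL_TEMPERATURES = {
    "math": 0.6,     # matches DeepSeek-R1's recommended setting
    "code": 0.6,     # matches DeepSeek-R1's recommended setting
    "science": 0.7,  # run earlier; not re-run at 0.6 due to GPU limits
}
```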