prithivMLmods committed
Commit 04b662c · verified · 1 Parent(s): 18fa5a1

Adding Evaluation Results


This is an automated PR created with [this space](https://huggingface.co/spaces/T145/open-llm-leaderboard-results-to-modelcard)!

The purpose of this PR is to add evaluation results from the Open LLM Leaderboard to your model card.

Please report any issues here: https://huggingface.co/spaces/T145/open-llm-leaderboard-results-to-modelcard/discussions
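
Once this PR is merged, the `model-index` block it adds becomes machine-readable model-card metadata, so the same scores can be read programmatically rather than scraped from the README. Below is a minimal sketch, assuming a recent `huggingface_hub` release (whose `ModelCard` helper parses `model-index` entries into `eval_results`); the repo id is this model's:

```python
# Sketch: read the eval results that this PR's model-index block adds to the card.
# Assumes huggingface_hub is installed and the metadata has been merged.
from huggingface_hub import ModelCard

card = ModelCard.load("prithivMLmods/QwQ-LCoT2-7B-Instruct")

# card.data.eval_results is populated from the model-index section, one entry per metric.
for result in card.data.eval_results or []:
    print(f"{result.dataset_name}: {result.metric_type} = {result.metric_value}")
```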

Files changed (1)
  1. README.md +114 -1
README.md CHANGED
@@ -16,6 +16,105 @@ datasets:
   - prithivMLmods/Math-Solve
   - amphora/QwQ-LongCoT-130K
   - prithivMLmods/Deepthink-Reasoning
+ model-index:
+ - name: QwQ-LCoT2-7B-Instruct
+   results:
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: IFEval (0-Shot)
+       type: wis-k/instruction-following-eval
+       split: train
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: inst_level_strict_acc and prompt_level_strict_acc
+       value: 55.76
+       name: averaged accuracy
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FQwQ-LCoT2-7B-Instruct
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: BBH (3-Shot)
+       type: SaylorTwift/bbh
+       split: test
+       args:
+         num_few_shot: 3
+     metrics:
+     - type: acc_norm
+       value: 34.37
+       name: normalized accuracy
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FQwQ-LCoT2-7B-Instruct
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MATH Lvl 5 (4-Shot)
+       type: lighteval/MATH-Hard
+       split: test
+       args:
+         num_few_shot: 4
+     metrics:
+     - type: exact_match
+       value: 22.21
+       name: exact match
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FQwQ-LCoT2-7B-Instruct
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: GPQA (0-shot)
+       type: Idavidrein/gpqa
+       split: train
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: acc_norm
+       value: 6.38
+       name: acc_norm
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FQwQ-LCoT2-7B-Instruct
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MuSR (0-shot)
+       type: TAUR-Lab/MuSR
+       args:
+         num_few_shot: 0
+     metrics:
+     - type: acc_norm
+       value: 15.75
+       name: acc_norm
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FQwQ-LCoT2-7B-Instruct
+       name: Open LLM Leaderboard
+   - task:
+       type: text-generation
+       name: Text Generation
+     dataset:
+       name: MMLU-PRO (5-shot)
+       type: TIGER-Lab/MMLU-Pro
+       config: main
+       split: test
+       args:
+         num_few_shot: 5
+     metrics:
+     - type: acc
+       value: 37.13
+       name: accuracy
+     source:
+       url: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=prithivMLmods%2FQwQ-LCoT2-7B-Instruct
+       name: Open LLM Leaderboard
   ---
 
 
@@ -80,4 +179,18 @@ The QwQ-LCoT2-7B-Instruct model is designed for advanced reasoning and instructi
   3. **Complexity Ceiling**: While optimized for multi-step reasoning, exceedingly complex or abstract problems may result in incomplete or incorrect outputs.
   4. **Dependency on Prompt Quality**: The quality and specificity of the user prompt heavily influence the model's responses.
   5. **Non-Factual Outputs**: Despite being fine-tuned for reasoning, the model can still generate hallucinated or factually inaccurate content, particularly for niche or unverified topics.
- 6. **Computational Requirements**: Running the model effectively requires significant computational resources, particularly when generating long sequences or handling high-concurrency workloads.
+ 6. **Computational Requirements**: Running the model effectively requires significant computational resources, particularly when generating long sequences or handling high-concurrency workloads.
+ # [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard)
+ Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/prithivMLmods__QwQ-LCoT2-7B-Instruct-details)!
+ Summarized results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/contents/viewer/default/train?q=prithivMLmods%2FQwQ-LCoT2-7B-Instruct&sort[column]=Average%20%E2%AC%86%EF%B8%8F&sort[direction]=desc)!
+
+ | Metric              | Value (%) |
+ |---------------------|----------:|
+ | **Average**         |     28.60 |
+ | IFEval (0-Shot)     |     55.76 |
+ | BBH (3-Shot)        |     34.37 |
+ | MATH Lvl 5 (4-Shot) |     22.21 |
+ | GPQA (0-shot)       |      6.38 |
+ | MuSR (0-shot)       |     15.75 |
+ | MMLU-PRO (5-shot)   |     37.13 |
+
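
As a quick sanity check on the summary table added above, the **Average** row matches the unweighted mean of the six benchmark scores (values copied from the table):

```python
# Sanity check: the leaderboard "Average" is the unweighted mean of the six scores.
scores = {
    "IFEval (0-Shot)": 55.76,
    "BBH (3-Shot)": 34.37,
    "MATH Lvl 5 (4-Shot)": 22.21,
    "GPQA (0-shot)": 6.38,
    "MuSR (0-shot)": 15.75,
    "MMLU-PRO (5-shot)": 37.13,
}
average = sum(scores.values()) / len(scores)
print(round(average, 2))  # 28.6, matching the 28.60 reported in the table
```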