</details>

### Evaluation

#### [HAERAE-HUB/HRM8K](https://huggingface.co/datasets/HAERAE-HUB/HRM8K)

- werty1248/EXAONE-3.5-7.8B-Stratos-Ko: temperature=0.0, max think tokens = 4096
  - In the first stage, the model generates up to 4096 tokens.
  - In the second stage, for generations that fail to emit ```<|end_of_thought|>``` within the 4096-token budget, I append ```\n\n<|end_of_thought|>\n\n<|begin_of_solution|>``` to close the thought and force the model to output the answer (see the sketch after this list).
- [LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct](https://huggingface.co/LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct): temperature=0.7, top_p=0.95, max_tokens = 2048
- Other models: scores reported by [OneLineAI](https://www.onelineai.com/blog/hrm8k), the creator of the HRM8K benchmark
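
A minimal sketch of the two-stage decoding above, assuming a plain ```transformers``` greedy-decoding setup; the 1024-token second-stage budget and the ```two_stage_generate``` helper are illustrative assumptions, not the exact evaluation harness.

```python
# Minimal sketch of the two-stage decoding described above (assumptions:
# greedy decoding for temperature=0.0, a 1024-token second-stage budget,
# plain-string prompts; this is not the exact evaluation harness).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "werty1248/EXAONE-3.5-7.8B-Stratos-Ko"
FORCED_SUFFIX = "\n\n<|end_of_thought|>\n\n<|begin_of_solution|>"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # EXAONE repos ship custom model code
)

def two_stage_generate(prompt: str, max_think_tokens: int = 4096) -> str:
    # Stage 1: generate up to 4096 new tokens with temperature=0.0 (greedy).
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_think_tokens, do_sample=False)
    completion = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False
    )
    if "<|end_of_thought|>" in completion:
        return completion

    # Stage 2: the thought never closed, so force it closed and let the
    # model write the solution from there.
    forced = prompt + completion + FORCED_SUFFIX
    inputs = tokenizer(forced, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
    solution = tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False
    )
    return completion + FORCED_SUFFIX + solution
```
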
| Model | GSM8K | KSM | MATH | MMMLU | OMNI_MATH | Average |
|-------|-------|-------|-------|-------|-----------|---------|
| GPT-4o | 91.21 | 22.83 | 74.45 | 68.72 | 30.75 | 57.99 |
| GPT-4o-mini | 87.57 | 19.40 | 70.68 | 63.40 | 26.45 | 53.50 |
| \*EXAONE-3.5-7.8B-Stratos-Ko | 83.02 | 15.97 | 67.49 | \*\*44.68 | 24.62 | 49.98 |
| EXAONE-3.5-7.8B-Instruct | 81.58 | 14.71 | 63.50 | \*\*\*41.49? | 21.69 | 44.19 |
| Qwen2.5-14B-Instruct | 66.34 | 15.55 | 53.38 | 61.49 | 20.64 | 43.88 |
| Llama-3.1-8B-Instruct | 77.79 | 7.21 | 49.01 | 47.02 | 15.92 | 39.39 |
| Qwen2.5-7B-Instruct | 58.38 | 13.10 | 48.04 | 48.94 | 16.55 | 37.80 |
| EXAONE-3.0-7.8B-Instruct | 72.33 | 7.98 | 46.79 | 37.66 | 15.35 | 36.02 |
| \*Ko-R1-1.5B-preview | 43.3 | ? | 73.1 | ? | 29.8 | ? |

\* Korean Reasoning Models

\*\* In a 4-option multiple-choice question, both the correct answer number and the corresponding answer letter are accepted as correct ('A' == 1, 'B' == 2, ...).

\*\*\* In a 4-option multiple-choice question, the correct **answer** *(not the answer number)* is accepted as correct.
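
A minimal sketch of the lenient matching the footnotes describe, assuming answers arrive as short strings; ```canon``` and ```is_correct``` are illustrative names, not HRM8K's official scorer.

```python
# Hedged sketch of the lenient multiple-choice matching in the footnotes:
# an option letter ('A'-'D') and its 1-based option number are treated as
# the same answer. Not the benchmark's official scoring code.
LETTER_TO_NUMBER = {"A": 1, "B": 2, "C": 3, "D": 4}

def canon(answer: str):
    """Map 'B', 'b', 'B.', '2', or ' 2 ' to a single canonical value."""
    answer = answer.strip().rstrip(".").upper()
    if answer in LETTER_TO_NUMBER:
        return LETTER_TO_NUMBER[answer]
    return int(answer) if answer.isdigit() else answer

def is_correct(prediction: str, gold: str) -> bool:
    return canon(prediction) == canon(gold)

assert is_correct("B", "2")  # letter matches its option number
assert is_correct("3", "C")  # and the other way around
```
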
### Example

#### Q1: From [ChuGyouk/GSM8k-Ko](https://huggingface.co/datasets/ChuGyouk/GSM8k-Ko)