Update README.md
README.md
CHANGED
@@ -176,40 +176,40 @@ This Ahma-3B-Instruct model was evaluated using [FIN-bench by TurkuNLP](https://

0-shot results:

| Benchmark                  | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
|:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
| Analogies                  | 50.77 | 48.46 | 56.92 | 41.54 | 49.23 | 40.00 | 54.62 |
| Arithmetic                 | 27.64 | 22.14 | 11.50 | 14.70 | 33.15 | 30.16 | 30.34 |
| Cause and Effect           | 59.48 | 58.82 | 59.48 | 53.60 | 66.01 | 58.82 | 62.74 |
| Emotions                   | 36.25 | 28.12 | 36.25 | 27.50 | 22.50 | 26.25 | 35.63 |
| Empirical Judgements       | 33.33 | 35.35 | 33.33 | 33.33 | 27.27 | 33.33 | 49.49 |
| General Knowledge          | 44.29 | 48.57 | 51.43 | 37.14 | 40.00 | 24.29 | 51.43 |
| HHH Alignment              | 42.09 | 41.66 | 44.23 | 43.22 | 41.81 | 42.51 | 42.92 |
| Intent Recognition         | 24.42 | 26.16 | 43.64 | 56.94 | 17.49 | 22.40 | 68.35 |
| Misconceptions             | 46.27 | 47.01 | 46.27 | 47.01 | 53.73 | 53.73 | 52.24 |
| Paraphrase                 | 59.50 | 73.00 | 67.00 | 70.50 | 51.00 | 50.00 | 51.00 |
| Sentence Ambiguity         | 53.33 | 65.00 | 60.00 | 63.33 | 51.67 | 48.33 | 50.00 |
| Similarities Abstraction   | 65.79 | 68.42 | 71.05 | 61.84 | 60.53 | 65.79 | 60.53 |
| **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | **48.30** | **46.17** | **44.42** | **52.08** |
| **Overall Average**        | **36.49** | **34.06** | **29.20** | **29.64** | **38.93** | **36.50** | **40.00** |

3-shot results:

| Benchmark                  | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
|:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
| Analogies                  | 50.77 | 49.23 | 49.23 | 43.08 | 40.77 | 54.62 | 76.92 |
| Arithmetic                 | 38.38 | 43.89 | 20.88 | 26.81 | 43.63 | 45.78 | 53.68 |
| Cause and Effect           | 60.78 | 64.71 | 66.01 | 62.74 | 64.05 | 58.17 | 67.32 |
| Emotions                   | 30.00 | 41.25 | 30.00 | 53.75 | 44.37 | 48.13 | 56.87 |
| Empirical Judgements       | 46.46 | 44.44 | 39.39 | 39.39 | 32.32 | 43.43 | 63.64 |
| General Knowledge          | 47.14 | 40.00 | 27.14 | 44.29 | 54.29 | 28.57 | 74.29 |
| HHH Alignment              | 43.53 | 44.80 | 43.80 | 45.09 | 45.39 | 44.80 | 46.07 |
| Intent Recognition         | 20.52 | 44.22 | 36.42 | 39.02 | 51.45 | 58.82 | 83.67 |
| Misconceptions             | 50.75 | 52.24 | 46.27 | 51.49 | 52.99 | 46.27 | 52.99 |
| Paraphrase                 | 50.50 | 58.50 | 57.50 | 65.00 | 53.00 | 54.50 | 55.00 |
| Sentence Ambiguity         | 53.33 | 48.33 | 53.33 | 51.67 | 51.67 | 53.33 | 66.67 |
| Similarities Abstraction   | 69.74 | 72.37 | 72.37 | 69.74 | 64.47 | 73.68 | 75.00 |
| **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | **51.63** | **51.19** | **50.94** | **61.96** |
| **Overall Average**        | **42.87** | **47.27** | **33.41** | **37.84** | **46.99** | **48.07** | **57.36** |

As we can see, the Ahma-3B-Instruct model outperforms models more than twice its size, such as FinGPT 8B and Viking 7B, especially on non-arithmetic tasks in 0-shot usage. Even the roughly 10X larger Poro 34B model, while generally stronger, does not show a huge performance gap considering its size, and Ahma-3B-Instruct actually surpasses it on some tasks, for example Paraphrase, Sentence Ambiguity, and Similarities Abstraction in the 0-shot setting.
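To make that size-versus-quality comparison concrete, here is a small illustrative snippet (plain Python; the scores are copied from the 0-shot table above, a subset of rows only, and this is not part of the actual evaluation code):

```python
# 0-shot FIN-bench scores for a subset of tasks, copied from the table above.
ahma_3b_instruct = {
    "Analogies": 48.46, "Cause and Effect": 58.82, "Emotions": 28.12,
    "General Knowledge": 48.57, "Paraphrase": 73.00,
    "Sentence Ambiguity": 65.00, "Similarities Abstraction": 68.42,
}
poro_34b_8bit = {
    "Analogies": 54.62, "Cause and Effect": 62.74, "Emotions": 35.63,
    "General Knowledge": 51.43, "Paraphrase": 51.00,
    "Sentence Ambiguity": 50.00, "Similarities Abstraction": 60.53,
}

# Tasks where the roughly 10X smaller Ahma-3B-Instruct scores higher than Poro 34B.
wins = [task for task, score in ahma_3b_instruct.items() if score > poro_34b_8bit[task]]
print(wins)  # ['Paraphrase', 'Sentence Ambiguity', 'Similarities Abstraction']
```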

@@ -221,31 +221,31 @@ This Ahma-3B-Instruct model was primarily evaluated using [MTBench Finnish by Lu

Single-turn results:

| Benchmark           | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) |
|:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|
| Coding              | 1.00 | 1.00 | 1.70 | 1.10 |
| Extraction          | 2.00 | 1.30 | 3.10 | 3.00 |
| Humanities          | 4.05 | 6.20 | 6.60 | 8.00 |
| Math                | 3.00 | 3.20 | 3.90 | 2.90 |
| Reasoning           | 2.90 | 4.60 | 3.70 | 5.70 |
| Roleplay            | 4.80 | 6.50 | 6.60 | 7.20 |
| STEM                | 5.10 | 5.95 | 6.75 | 7.30 |
| Writing             | 6.60 | 9.00 | 7.10 | 8.80 |
| **Overall Average** | **3.68** | **4.72** | **4.93** | **5.50** |

Multi-turn results:

| Benchmark           | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | Poro 34B Chat |
|:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:--------------|
| Coding              | 1.00 | 1.00 | 1.40 | 1.05 | 3.70 |
| Extraction          | 1.55 | 1.15 | 2.05 | 2.65 | 6.37 |
| Humanities          | 3.25 | 6.20 | 4.95 | 7.85 | 9.25 |
| Math                | 2.20 | 2.70 | 2.50 | 2.40 | 1.20 |
| Reasoning           | 2.45 | 3.50 | 2.55 | 4.50 | 4.35 |
| Roleplay            | 4.90 | 6.40 | 6.35 | 6.60 | 7.35 |
| STEM                | 4.20 | 4.78 | 4.28 | 5.40 | 7.80 |
| Writing             | 3.80 | 6.65 | 4.10 | 6.25 | 8.50 |
| **Overall Average** | **2.92** | **4.05** | **3.52** | **4.59** | **6.06** |

As we can see, the Ahma-3B-Instruct model clearly improves on the base Ahma-3B model, especially in tasks like writing. It is also worth noting that the instruct model degrades less in the multi-turn setting than the base model, which highlights the value of the multi-turn training examples used in the fine-tuning process: Ahma-3B-Instruct lost 14% of its single-turn overall score in the multi-turn setting, while the base Ahma-3B lost 21%. This instruct model might therefore be better suited for chat use cases as well. As expected, coding performance was poor, since the Ahma models are not trained on code data.
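The 14% and 21% figures follow directly from the single-turn and multi-turn overall averages reported above; a minimal sketch of that calculation (plain Python, shown only to make the arithmetic explicit):

```python
# Fraction of the single-turn MTBench overall score lost in the multi-turn setting,
# using the overall averages from the two tables above.
def relative_drop(single_turn: float, multi_turn: float) -> float:
    return (single_turn - multi_turn) / single_turn

print(f"Ahma 3B base:     {relative_drop(3.68, 2.92):.0%}")  # ~21%
print(f"Ahma 3B Instruct: {relative_drop(4.72, 4.05):.0%}")  # ~14%
```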