aapot committed (verified)
Commit cba5ac0 · 1 parent: 0a3c4de

Update README.md

Files changed (1): README.md (+50 -50)
README.md CHANGED
@@ -176,40 +176,40 @@ This Ahma-3B-Instruct model was evaluated using [FIN-bench by TurkuNLP](https://
 
 | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
 |:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
- | Analogies | 50.77 | 48.46 | TBA | TBA | 49.23 | 40.00 | 54.62 |
- | Arithmetic | 27.64 | 22.14 | TBA | TBA | 33.15 | 30.16 | 30.34 |
- | Cause and Effect | 59.48 | 58.82 | TBA | TBA | 66.01 | 58.82 | 62.74 |
- | Emotions | 36.25 | 28.12 | TBA | TBA | 22.50 | 26.25 | 35.63 |
- | Empirical Judgements | 33.33 | 35.35 | TBA | TBA | 27.27 | 33.33 | 49.49 |
- | General Knowledge | 44.29 | 48.57 | TBA | TBA | 40.00 | 24.29 | 51.43 |
- | HHH Alignment | 42.09 | 41.66 | TBA | TBA | 41.81 | 42.51 | 42.92 |
- | Intent Recognition | 24.42 | 26.16 | TBA | TBA | 17.49 | 22.40 | 68.35 |
- | Misconceptions | 46.27 | 47.01 | TBA | TBA | 53.73 | 53.73 | 52.24 |
- | Paraphrase | 59.50 | 73.00 | TBA | TBA | 51.00 | 50.00 | 51.00 |
- | Sentence Ambiguity | 53.33 | 65.00 | TBA | TBA | 51.67 | 48.33 | 50.00 |
- | Similarities Abstraction | 65.79 | 68.42 | TBA | TBA | 60.53 | 65.79 | 60.53 |
- | **Non-Arithmetic Average** | **47.55** | **48.95** | TBA | TBA | **46.17** | **44.42** | **52.08** |
- | **Overall Average** | **36.49** | **34.06** | TBA | TBA | **38.93** | **36.50** | **40.00** |
+ | Analogies | 50.77 | 48.46 | 56.92 | 41.54 | 49.23 | 40.00 | 54.62 |
+ | Arithmetic | 27.64 | 22.14 | 11.50 | 14.70 | 33.15 | 30.16 | 30.34 |
+ | Cause and Effect | 59.48 | 58.82 | 59.48 | 53.60 | 66.01 | 58.82 | 62.74 |
+ | Emotions | 36.25 | 28.12 | 36.25 | 27.50 | 22.50 | 26.25 | 35.63 |
+ | Empirical Judgements | 33.33 | 35.35 | 33.33 | 33.33 | 27.27 | 33.33 | 49.49 |
+ | General Knowledge | 44.29 | 48.57 | 51.43 | 37.14 | 40.00 | 24.29 | 51.43 |
+ | HHH Alignment | 42.09 | 41.66 | 44.23 | 43.22 | 41.81 | 42.51 | 42.92 |
+ | Intent Recognition | 24.42 | 26.16 | 43.64 | 56.94 | 17.49 | 22.40 | 68.35 |
+ | Misconceptions | 46.27 | 47.01 | 46.27 | 47.01 | 53.73 | 53.73 | 52.24 |
+ | Paraphrase | 59.50 | 73.00 | 67.00 | 70.50 | 51.00 | 50.00 | 51.00 |
+ | Sentence Ambiguity | 53.33 | 65.00 | 60.00 | 63.33 | 51.67 | 48.33 | 50.00 |
+ | Similarities Abstraction | 65.79 | 68.42 | 71.05 | 61.84 | 60.53 | 65.79 | 60.53 |
+ | **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | **48.30** | **46.17** | **44.42** | **52.08** |
+ | **Overall Average** | **36.49** | **34.06** | **29.20** | **29.64** | **38.93** | **36.50** | **40.00** |
 
 
 3-shot results:
 
 | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
 |:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
- | Analogies | 50.77 | 49.23 | TBA | TBA | 40.77 | 54.62 | 76.92 |
- | Arithmetic | 38.38 | 43.89 | TBA | TBA | 43.63 | 45.78 | 53.68 |
- | Cause and Effect | 60.78 | 64.71 | TBA | TBA | 64.05 | 58.17 | 67.32 |
- | Emotions | 30.00 | 41.25 | TBA | TBA | 44.37 | 48.13 | 56.87 |
- | Empirical Judgements | 46.46 | 44.44 | TBA | TBA | 32.32 | 43.43 | 63.64 |
- | General Knowledge | 47.14 | 40.00 | TBA | TBA | 54.29 | 28.57 | 74.29 |
- | HHH Alignment | 43.53 | 44.80 | TBA | TBA | 45.39 | 44.80 | 46.07 |
- | Intent Recognition | 20.52 | 44.22 | TBA | TBA | 51.45 | 58.82 | 83.67 |
- | Misconceptions | 50.75 | 52.24 | TBA | TBA | 52.99 | 46.27 | 52.99 |
- | Paraphrase | 50.50 | 58.50 | TBA | TBA | 53.00 | 54.50 | 55.00 |
- | Sentence Ambiguity | 53.33 | 48.33 | TBA | TBA | 51.67 | 53.33 | 66.67 |
- | Similarities Abstraction | 69.74 | 72.37 | TBA | TBA | 64.47 | 73.68 | 75.00 |
- | **Non-Arithmetic Average** | **48.48** | **51.49** | TBA | TBA | **51.19** | **50.94** | **61.96** |
- | **Overall Average** | **42.87** | **47.27** | TBA | TBA | **46.99** | **48.07** | **57.36** |
+ | Analogies | 50.77 | 49.23 | 49.23 | 43.08 | 40.77 | 54.62 | 76.92 |
+ | Arithmetic | 38.38 | 43.89 | 20.88 | 26.81 | 43.63 | 45.78 | 53.68 |
+ | Cause and Effect | 60.78 | 64.71 | 66.01 | 62.74 | 64.05 | 58.17 | 67.32 |
+ | Emotions | 30.00 | 41.25 | 30.00 | 53.75 | 44.37 | 48.13 | 56.87 |
+ | Empirical Judgements | 46.46 | 44.44 | 39.39 | 39.39 | 32.32 | 43.43 | 63.64 |
+ | General Knowledge | 47.14 | 40.00 | 27.14 | 44.29 | 54.29 | 28.57 | 74.29 |
+ | HHH Alignment | 43.53 | 44.80 | 43.80 | 45.09 | 45.39 | 44.80 | 46.07 |
+ | Intent Recognition | 20.52 | 44.22 | 36.42 | 39.02 | 51.45 | 58.82 | 83.67 |
+ | Misconceptions | 50.75 | 52.24 | 46.27 | 51.49 | 52.99 | 46.27 | 52.99 |
+ | Paraphrase | 50.50 | 58.50 | 57.50 | 65.00 | 53.00 | 54.50 | 55.00 |
+ | Sentence Ambiguity | 53.33 | 48.33 | 53.33 | 51.67 | 51.67 | 53.33 | 66.67 |
+ | Similarities Abstraction | 69.74 | 72.37 | 72.37 | 69.74 | 64.47 | 73.68 | 75.00 |
+ | **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | **51.63** | **51.19** | **50.94** | **61.96** |
+ | **Overall Average** | **42.87** | **47.27** | **33.41** | **37.84** | **46.99** | **48.07** | **57.36** |
 
 As we can see, the Ahma-3B-Instruct model outperforms 2X larger models like FinGPT 8B and Viking 7B, especially in non-arithmetic tasks in 0-shot usage. Even the 10X larger Poro 34B model, which is generally better, doesn't show a huge performance difference considering its size, and Ahma-3B-Instruct actually surpasses it in some tasks.
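For readers who want to poke at these numbers, per-task deltas make the effect of the instruct tuning easier to see than the raw tables. Below is a minimal sketch in plain Python, with the 0-shot scores hand-copied from the table above; it is illustrative only and not part of any evaluation harness.

```python
# 0-shot FIN-bench scores copied from the table above:
# task -> (Ahma 3B base, Ahma 3B Instruct).
scores = {
    "Analogies":                (50.77, 48.46),
    "Arithmetic":               (27.64, 22.14),
    "Cause and Effect":         (59.48, 58.82),
    "Emotions":                 (36.25, 28.12),
    "Empirical Judgements":     (33.33, 35.35),
    "General Knowledge":        (44.29, 48.57),
    "HHH Alignment":            (42.09, 41.66),
    "Intent Recognition":       (24.42, 26.16),
    "Misconceptions":           (46.27, 47.01),
    "Paraphrase":               (59.50, 73.00),
    "Sentence Ambiguity":       (53.33, 65.00),
    "Similarities Abstraction": (65.79, 68.42),
}

# Sort by delta so the biggest instruct-tuning gains come first;
# a positive delta means the instruct model beats the base model.
for task, (base, instruct) in sorted(
        scores.items(), key=lambda kv: kv[1][1] - kv[1][0], reverse=True):
    print(f"{task:26s} {instruct - base:+6.2f}")
```

Paraphrase (+13.50) and Sentence Ambiguity (+11.67) benefit the most, while Arithmetic (-5.50) regresses, consistent with the average rows in the table.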
 
@@ -221,31 +221,31 @@ This Ahma-3B-Instruct model was primarily evaluated using [MTBench Finnish by Lu
 
 Single-turn results:
 
- | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct |
- |:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|
- | Coding | 1.00 | 1.00 | TBA | TBA |
- | Extraction | 2.00 | 1.30 | TBA | TBA |
- | Humanities | 4.05 | 6.20 | TBA | TBA |
- | Math | 3.00 | 3.20 | TBA | TBA |
- | Reasoning | 2.90 | 4.60 | TBA | TBA |
- | Roleplay | 4.80 | 6.50 | TBA | TBA |
- | STEM | 5.10 | 5.95 | TBA | TBA |
- | Writing | 6.60 | 9.00 | TBA | TBA |
- | **Overall Average** | **3.68** | **4.72** | TBA | TBA |
+ | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) |
+ |:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|
+ | Coding | 1.00 | 1.00 | 1.70 | 1.10 |
+ | Extraction | 2.00 | 1.30 | 3.10 | 3.00 |
+ | Humanities | 4.05 | 6.20 | 6.60 | 8.00 |
+ | Math | 3.00 | 3.20 | 3.90 | 2.90 |
+ | Reasoning | 2.90 | 4.60 | 3.70 | 5.70 |
+ | Roleplay | 4.80 | 6.50 | 6.60 | 7.20 |
+ | STEM | 5.10 | 5.95 | 6.75 | 7.30 |
+ | Writing | 6.60 | 9.00 | 7.10 | 8.80 |
+ | **Overall Average** | **3.68** | **4.72** | **4.93** | **5.50** |
 
 Multi-turn results:
 
- | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct | Poro 34B Chat |
- |:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|:--------------|
- | Coding | 1.00 | 1.00 | TBA | TBA | 3.70 |
- | Extraction | 1.55 | 1.15 | TBA | TBA | 6.37 |
- | Humanities | 3.25 | 6.20 | TBA | TBA | 9.25 |
- | Math | 2.20 | 2.70 | TBA | TBA | 1.20 |
- | Reasoning | 2.45 | 3.50 | TBA | TBA | 4.35 |
- | Roleplay | 4.90 | 6.40 | TBA | TBA | 7.35 |
- | STEM | 4.20 | 4.78 | TBA | TBA | 7.80 |
- | Writing | 3.80 | 6.65 | TBA | TBA | 8.50 |
- | **Overall Average** | **2.92** | **4.05** | TBA | TBA | **6.06** |
+ | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | Poro 34B Chat |
+ |:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:--------------|
+ | Coding | 1.00 | 1.00 | 1.40 | 1.05 | 3.70 |
+ | Extraction | 1.55 | 1.15 | 2.05 | 2.65 | 6.37 |
+ | Humanities | 3.25 | 6.20 | 4.95 | 7.85 | 9.25 |
+ | Math | 2.20 | 2.70 | 2.50 | 2.40 | 1.20 |
+ | Reasoning | 2.45 | 3.50 | 2.55 | 4.50 | 4.35 |
+ | Roleplay | 4.90 | 6.40 | 6.35 | 6.60 | 7.35 |
+ | STEM | 4.20 | 4.78 | 4.28 | 5.40 | 7.80 |
+ | Writing | 3.80 | 6.65 | 4.10 | 6.25 | 8.50 |
+ | **Overall Average** | **2.92** | **4.05** | **3.52** | **4.59** | **6.06** |
 
 
 As we can see, the Ahma-3B-Instruct model significantly improves upon the base Ahma-3B model, especially in tasks like writing. It's also worth noting that the Ahma-3B-Instruct model shows enhanced performance in multi-turn tasks compared to the base model, which highlights the value of the multi-turn training examples used in the fine-tuning process. The Ahma-3B-Instruct model lost 14% of its single-turn overall score in a multi-turn setting, while the base Ahma-3B model lost 21%. Therefore, this instruct model might be better suited for chat use cases as well. As expected, coding performance was poor since the Ahma models aren't trained on code data.
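The 14% and 21% degradation figures quoted above follow directly from the Overall Average rows of the two MTBench tables. A minimal sketch in plain Python (values copied from the tables above; purely illustrative):

```python
# MTBench Finnish overall averages from the tables above:
# model -> (single-turn score, multi-turn score).
overall = {
    "Ahma 3B base":     (3.68, 2.92),
    "Ahma 3B Instruct": (4.72, 4.05),
}

# Share of the single-turn score lost when moving to the multi-turn setting.
for model, (single, multi) in overall.items():
    lost = (single - multi) / single
    print(f"{model}: {lost:.0%} of single-turn score lost")
# -> Ahma 3B base: 21% ...; Ahma 3B Instruct: 14% ...
```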
 