aapot committed (verified)
Commit cba5ac0 · 1 parent: 0a3c4de

Update README.md

Files changed (1): README.md (+50 -50)
README.md CHANGED
@@ -176,40 +176,40 @@ This Ahma-3B-Instruct model was evaluated using [FIN-bench by TurkuNLP](https://
 
 | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
 |:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
- | Analogies | 50.77 | 48.46 | TBA | TBA | 49.23 | 40.00 | 54.62 |
- | Arithmetic | 27.64 | 22.14 | TBA | TBA | 33.15 | 30.16 | 30.34 |
- | Cause and Effect | 59.48 | 58.82 | TBA | TBA | 66.01 | 58.82 | 62.74 |
- | Emotions | 36.25 | 28.12 | TBA | TBA | 22.50 | 26.25 | 35.63 |
- | Empirical Judgements | 33.33 | 35.35 | TBA | TBA | 27.27 | 33.33 | 49.49 |
- | General Knowledge | 44.29 | 48.57 | TBA | TBA | 40.00 | 24.29 | 51.43 |
- | HHH Alignment | 42.09 | 41.66 | TBA | TBA | 41.81 | 42.51 | 42.92 |
- | Intent Recognition | 24.42 | 26.16 | TBA | TBA | 17.49 | 22.40 | 68.35 |
- | Misconceptions | 46.27 | 47.01 | TBA | TBA | 53.73 | 53.73 | 52.24 |
- | Paraphrase | 59.50 | 73.00 | TBA | TBA | 51.00 | 50.00 | 51.00 |
- | Sentence Ambiguity | 53.33 | 65.00 | TBA | TBA | 51.67 | 48.33 | 50.00 |
- | Similarities Abstraction | 65.79 | 68.42 | TBA | TBA | 60.53 | 65.79 | 60.53 |
- | **Non-Arithmetic Average** | **47.55** | **48.95** | TBA | TBA | **46.17** | **44.42** | **52.08** |
- | **Overall Average** | **36.49** | **34.06** | TBA | TBA | **38.93** | **36.50** | **40.00** |
+ | Analogies | 50.77 | 48.46 | 56.92 | 41.54 | 49.23 | 40.00 | 54.62 |
+ | Arithmetic | 27.64 | 22.14 | 11.50 | 14.70 | 33.15 | 30.16 | 30.34 |
+ | Cause and Effect | 59.48 | 58.82 | 59.48 | 53.60 | 66.01 | 58.82 | 62.74 |
+ | Emotions | 36.25 | 28.12 | 36.25 | 27.50 | 22.50 | 26.25 | 35.63 |
+ | Empirical Judgements | 33.33 | 35.35 | 33.33 | 33.33 | 27.27 | 33.33 | 49.49 |
+ | General Knowledge | 44.29 | 48.57 | 51.43 | 37.14 | 40.00 | 24.29 | 51.43 |
+ | HHH Alignment | 42.09 | 41.66 | 44.23 | 43.22 | 41.81 | 42.51 | 42.92 |
+ | Intent Recognition | 24.42 | 26.16 | 43.64 | 56.94 | 17.49 | 22.40 | 68.35 |
+ | Misconceptions | 46.27 | 47.01 | 46.27 | 47.01 | 53.73 | 53.73 | 52.24 |
+ | Paraphrase | 59.50 | 73.00 | 67.00 | 70.50 | 51.00 | 50.00 | 51.00 |
+ | Sentence Ambiguity | 53.33 | 65.00 | 60.00 | 63.33 | 51.67 | 48.33 | 50.00 |
+ | Similarities Abstraction | 65.79 | 68.42 | 71.05 | 61.84 | 60.53 | 65.79 | 60.53 |
+ | **Non-Arithmetic Average** | **47.55** | **48.95** | **51.33** | **48.30** | **46.17** | **44.42** | **52.08** |
+ | **Overall Average** | **36.49** | **34.06** | **29.20** | **29.64** | **38.93** | **36.50** | **40.00** |
 
 
 3-shot results:
 
 | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | FinGPT 8B | Viking 7B | Poro 34B (8bit quant) |
 |:---------------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:----------|:----------|:----------------------|
- | Analogies | 50.77 | 49.23 | TBA | TBA | 40.77 | 54.62 | 76.92 |
- | Arithmetic | 38.38 | 43.89 | TBA | TBA | 43.63 | 45.78 | 53.68 |
- | Cause and Effect | 60.78 | 64.71 | TBA | TBA | 64.05 | 58.17 | 67.32 |
- | Emotions | 30.00 | 41.25 | TBA | TBA | 44.37 | 48.13 | 56.87 |
- | Empirical Judgements | 46.46 | 44.44 | TBA | TBA | 32.32 | 43.43 | 63.64 |
- | General Knowledge | 47.14 | 40.00 | TBA | TBA | 54.29 | 28.57 | 74.29 |
- | HHH Alignment | 43.53 | 44.80 | TBA | TBA | 45.39 | 44.80 | 46.07 |
- | Intent Recognition | 20.52 | 44.22 | TBA | TBA | 51.45 | 58.82 | 83.67 |
- | Misconceptions | 50.75 | 52.24 | TBA | TBA | 52.99 | 46.27 | 52.99 |
- | Paraphrase | 50.50 | 58.50 | TBA | TBA | 53.00 | 54.50 | 55.00 |
- | Sentence Ambiguity | 53.33 | 48.33 | TBA | TBA | 51.67 | 53.33 | 66.67 |
- | Similarities Abstraction | 69.74 | 72.37 | TBA | TBA | 64.47 | 73.68 | 75.00 |
- | **Non-Arithmetic Average** | **48.48** | **51.49** | TBA | TBA | **51.19** | **50.94** | **61.96** |
- | **Overall Average** | **42.87** | **47.27** | TBA | TBA | **46.99** | **48.07** | **57.36** |
+ | Analogies | 50.77 | 49.23 | 49.23 | 43.08 | 40.77 | 54.62 | 76.92 |
+ | Arithmetic | 38.38 | 43.89 | 20.88 | 26.81 | 43.63 | 45.78 | 53.68 |
+ | Cause and Effect | 60.78 | 64.71 | 66.01 | 62.74 | 64.05 | 58.17 | 67.32 |
+ | Emotions | 30.00 | 41.25 | 30.00 | 53.75 | 44.37 | 48.13 | 56.87 |
+ | Empirical Judgements | 46.46 | 44.44 | 39.39 | 39.39 | 32.32 | 43.43 | 63.64 |
+ | General Knowledge | 47.14 | 40.00 | 27.14 | 44.29 | 54.29 | 28.57 | 74.29 |
+ | HHH Alignment | 43.53 | 44.80 | 43.80 | 45.09 | 45.39 | 44.80 | 46.07 |
+ | Intent Recognition | 20.52 | 44.22 | 36.42 | 39.02 | 51.45 | 58.82 | 83.67 |
+ | Misconceptions | 50.75 | 52.24 | 46.27 | 51.49 | 52.99 | 46.27 | 52.99 |
+ | Paraphrase | 50.50 | 58.50 | 57.50 | 65.00 | 53.00 | 54.50 | 55.00 |
+ | Sentence Ambiguity | 53.33 | 48.33 | 53.33 | 51.67 | 51.67 | 53.33 | 66.67 |
+ | Similarities Abstraction | 69.74 | 72.37 | 72.37 | 69.74 | 64.47 | 73.68 | 75.00 |
+ | **Non-Arithmetic Average** | **48.48** | **51.49** | **49.05** | **51.63** | **51.19** | **50.94** | **61.96** |
+ | **Overall Average** | **42.87** | **47.27** | **33.41** | **37.84** | **46.99** | **48.07** | **57.36** |
 
 As we can see, the Ahma-3B-Instruct model outperforms 2X larger models like FinGPT 8B and Viking 7B, especially in non-arithmetic tasks in 0-shot usage. Even the 10X larger Poro 34B model, which is generally better, doesn't show a huge performance difference considering its size, and Ahma-3B-Instruct actually surpasses it in some tasks.
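For readers who want to poke at these numbers, per-task deltas make the effect of the instruct tuning easier to see than the raw tables. Below is a minimal sketch in plain Python, with the 0-shot scores hand-copied from the table above; it is illustrative only and not part of any evaluation harness.

```python
# 0-shot FIN-bench scores copied from the table above:
# task -> (Ahma 3B base, Ahma 3B Instruct).
scores = {
    "Analogies":                (50.77, 48.46),
    "Arithmetic":               (27.64, 22.14),
    "Cause and Effect":         (59.48, 58.82),
    "Emotions":                 (36.25, 28.12),
    "Empirical Judgements":     (33.33, 35.35),
    "General Knowledge":        (44.29, 48.57),
    "HHH Alignment":            (42.09, 41.66),
    "Intent Recognition":       (24.42, 26.16),
    "Misconceptions":           (46.27, 47.01),
    "Paraphrase":               (59.50, 73.00),
    "Sentence Ambiguity":       (53.33, 65.00),
    "Similarities Abstraction": (65.79, 68.42),
}

# Sort by delta so the biggest instruct-tuning gains come first;
# a positive delta means the instruct model beats the base model.
for task, (base, instruct) in sorted(
        scores.items(), key=lambda kv: kv[1][1] - kv[1][0], reverse=True):
    print(f"{task:26s} {instruct - base:+6.2f}")
```

Paraphrase (+13.50) and Sentence Ambiguity (+11.67) benefit the most, while Arithmetic (-5.50) regresses, consistent with the average rows in the table.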
 
@@ -221,31 +221,31 @@ This Ahma-3B-Instruct model was primarily evaluated using [MTBench Finnish by Lu
 
 Single-turn results:
 
- | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct |
- |:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|
- | Coding | 1.00 | 1.00 | TBA | TBA |
- | Extraction | 2.00 | 1.30 | TBA | TBA |
- | Humanities | 4.05 | 6.20 | TBA | TBA |
- | Math | 3.00 | 3.20 | TBA | TBA |
- | Reasoning | 2.90 | 4.60 | TBA | TBA |
- | Roleplay | 4.80 | 6.50 | TBA | TBA |
- | STEM | 5.10 | 5.95 | TBA | TBA |
- | Writing | 6.60 | 9.00 | TBA | TBA |
- | **Overall Average** | **3.68** | **4.72** | TBA | TBA |
+ | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) |
+ |:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|
+ | Coding | 1.00 | 1.00 | 1.70 | 1.10 |
+ | Extraction | 2.00 | 1.30 | 3.10 | 3.00 |
+ | Humanities | 4.05 | 6.20 | 6.60 | 8.00 |
+ | Math | 3.00 | 3.20 | 3.90 | 2.90 |
+ | Reasoning | 2.90 | 4.60 | 3.70 | 5.70 |
+ | Roleplay | 4.80 | 6.50 | 6.60 | 7.20 |
+ | STEM | 5.10 | 5.95 | 6.75 | 7.30 |
+ | Writing | 6.60 | 9.00 | 7.10 | 8.80 |
+ | **Overall Average** | **3.68** | **4.72** | **4.93** | **5.50** |
 
 Multi-turn results:
 
- | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct | Poro 34B Chat |
- |:--------------------|:--------------------------------------|:-----------------|:--------------------------------------|:-----------------|:--------------|
- | Coding | 1.00 | 1.00 | TBA | TBA | 3.70 |
- | Extraction | 1.55 | 1.15 | TBA | TBA | 6.37 |
- | Humanities | 3.25 | 6.20 | TBA | TBA | 9.25 |
- | Math | 2.20 | 2.70 | TBA | TBA | 1.20 |
- | Reasoning | 2.45 | 3.50 | TBA | TBA | 4.35 |
- | Roleplay | 4.90 | 6.40 | TBA | TBA | 7.35 |
- | STEM | 4.20 | 4.78 | TBA | TBA | 7.80 |
- | Writing | 3.80 | 6.65 | TBA | TBA | 8.50 |
- | **Overall Average** | **2.92** | **4.05** | TBA | TBA | **6.06** |
+ | Benchmark | Ahma 3B base (instruct prompt format) | Ahma 3B Instruct (instruct prompt format) | Ahma 7B base (instruct prompt format) | Ahma 7B Instruct (instruct prompt format) | Poro 34B Chat |
+ |:--------------------|:--------------------------------------|:------------------------------------------|:--------------------------------------|:------------------------------------------|:--------------|
+ | Coding | 1.00 | 1.00 | 1.40 | 1.05 | 3.70 |
+ | Extraction | 1.55 | 1.15 | 2.05 | 2.65 | 6.37 |
+ | Humanities | 3.25 | 6.20 | 4.95 | 7.85 | 9.25 |
+ | Math | 2.20 | 2.70 | 2.50 | 2.40 | 1.20 |
+ | Reasoning | 2.45 | 3.50 | 2.55 | 4.50 | 4.35 |
+ | Roleplay | 4.90 | 6.40 | 6.35 | 6.60 | 7.35 |
+ | STEM | 4.20 | 4.78 | 4.28 | 5.40 | 7.80 |
+ | Writing | 3.80 | 6.65 | 4.10 | 6.25 | 8.50 |
+ | **Overall Average** | **2.92** | **4.05** | **3.52** | **4.59** | **6.06** |
 
 
 As we can see, the Ahma-3B-Instruct model significantly improves upon the base Ahma-3B model, especially in tasks like writing. It's also worth noting that the Ahma-3B-Instruct model shows enhanced performance in multi-turn tasks compared to the base model, which highlights the value of the multi-turn training examples used in the fine-tuning process. The Ahma-3B-Instruct model lost 14% of its single-turn overall score in a multi-turn setting, while the base Ahma-3B model lost 21%. Therefore, this instruct model might be better suited for chat use cases as well. As expected, coding performance was poor since the Ahma models aren't trained on code data.
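The 14% and 21% degradation figures quoted above follow directly from the Overall Average rows of the two MTBench tables. A minimal sketch in plain Python (values copied from the tables above; purely illustrative):

```python
# MTBench Finnish overall averages from the tables above:
# model -> (single-turn score, multi-turn score).
overall = {
    "Ahma 3B base":     (3.68, 2.92),
    "Ahma 3B Instruct": (4.72, 4.05),
}

# Share of the single-turn score lost when moving to the multi-turn setting.
for model, (single, multi) in overall.items():
    lost = (single - multi) / single
    print(f"{model}: {lost:.0%} of single-turn score lost")
# -> Ahma 3B base: 21% ...; Ahma 3B Instruct: 14% ...
```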
 