puneeshkhanna committed: Update eval results with fewshot_as_multiturn
README.md CHANGED
@@ -184,7 +184,7 @@ print(response)
 ## Benchmarks
 We report in the following table our internal pipeline benchmarks.
 - We use [lm-evaluation harness](https://github.com/EleutherAI/lm-evaluation-harness).
-- We report **raw scores** obtained by applying chat template
+- We report **raw scores** obtained by applying chat template and fewshot_as_multiturn.
 - We use same batch-size across all models.


@@ -210,51 +210,51 @@ We report in the following table our internal pipeline benchmarks.
 <tr>
 <td rowspan="3">General</td>
 <td>MMLU (5-shot)</td>
-<td>
-<td>
-<td><b>
+<td>68.8</td>
+<td>66.0</td>
+<td><b>73.9</b></td>
 </tr>
 <tr>
 <td>MMLU-PRO (5-shot)</td>
-<td>
-<td>
-<td><b>44</td>
+<td>38.8</td>
+<td>34.3</td>
+<td><b>44</b></td>
 </tr>
 <tr>
 <td>IFEval</td>
 <td>57.6</td>
 <td>63.4</td>
-<td><b>78</td>
+<td><b>78</b></td>
 </tr>
 <tr>
 <td rowspan="3">Math</td>
 <td>GSM8K (5-shot)</td>
-<td>
-<td>
-<td><b>
+<td>77.1</td>
+<td>77.6</td>
+<td><b>84.9</b></td>
 </tr>
 <tr>
 <td>GSM8K (8-shot, COT)</td>
 <td>78.5</td>
 <td>73.6</td>
-<td><b>81.3</td>
+<td><b>81.3</b></td>
 </tr>
 <tr>
 <td>MATH Lvl-5 (4-shot)</td>
-<td>
-<td>
-<td><b>22.1</td>
+<td>3.3</td>
+<td>5.9</td>
+<td><b>22.1</b></td>
 </tr>
 <tr>
 <td rowspan="5">Reasoning</td>
 <td>Arc Challenge (25-shot)</td>
-<td>
-<td>
-<td><b>
+<td>58.3</td>
+<td>63.4</td>
+<td><b>66.2</b></td>
 </tr>
 <tr>
 <td>GPQA (0-shot)</td>
-<td><b>35.
+<td><b>35.6</b></td>
 <td>33.2</td>
 <td>33.5</td>
 </tr>
@@ -262,32 +262,32 @@ We report in the following table our internal pipeline benchmarks.
 <td>GPQA (0-shot, COT)</td>
 <td>16</td>
 <td>12.7</td>
-<td><b>32.6</td>
+<td><b>32.6</b></td>
 </tr>
 <tr>
 <td>MUSR (0-shot)</td>
-<td><b>41.9</td>
+<td><b>41.9</b></td>
 <td>38.1</td>
 <td>41.1</td>
 </tr>
 <tr>
 <td>BBH (3-shot)</td>
-<td>
-<td>
-<td><b>58.4</td>
+<td>50.6</td>
+<td>47.5</td>
+<td><b>58.4</b></td>
 </tr>
 <tr>
 <td rowspan="4">CommonSense Understanding</td>
 <td>PIQA (0-shot)</td>
 <td>76.4</td>
 <td>78.2</td>
-<td><b>78.4</td>
+<td><b>78.4</b></td>
 </tr>
 <tr>
 <td>SciQ (0-shot)</td>
 <td>61.7</td>
 <td>76.4</td>
-<td><b>90.4</td>
+<td><b>90.4</b></td>
 </tr>
 <tr>
 <td>Winogrande (0-shot)</td>
@@ -299,19 +299,19 @@ We report in the following table our internal pipeline benchmarks.
 <td>OpenbookQA (0-shot)</td>
 <td>43.2</td>
 <td>47.4</td>
-<td><b>48.2</td>
+<td><b>48.2</b></td>
 </tr>
 <tr>
 <td rowspan="2">Instructions following</td>
 <td>MT-Bench (avg)</td>
 <td>8.28</td>
-<td><b>8.6</td>
+<td><b>8.6</b></td>
 <td>8.17</td>
 </tr>
 <tr>
 <td>Alpaca (WC)</td>
 <td>25.81</td>
-<td><b>45.44</td>
+<td><b>45.44</b></td>
 <td>24.7</td>
 </tr>
 <tr>
@@ -319,7 +319,7 @@ We report in the following table our internal pipeline benchmarks.
 <td>BFCL AST (avg)</td>
 <td>48.4</td>
 <td>74.2</td>
-<td><b>86.3</td>
+<td><b>86.3</b></td>
 </tr>
 <tr>
 <td rowspan="2">Code</td>