Update README.md
README.md
CHANGED
@@ -43,14 +43,16 @@ It achieves superb reasoning performance as well as exellent chat & instruction-
 ## Evaluation
 We conducted overall coding, math, reasoning, knowledge, instruction-following and chat benchmarking. Results are shown below:
 
-
-
-
-
-| GPT-
-
-
-| Eurux-8x22b-
+| Models | Tasks | Coding | | | Math | | | Reasoning | Knowledge | Ins-Following | Chat |
+|-----------------|:---------:|:-----:|:--------:|:-------:|:-----:|:---------:|:---------:|:---------:|:-------------:|:--------:|
+| Datasets | HumanEval | MBPP | LeetCode | GSMPLUS | MATH | TheoremQA | BBH (CoT) | MMLU | IFEval | MT-Bench |
+|-----------------|:---------:|:-----:|:--------:|:-------:|:-----:|:---------:|:---------:|:---------:|:-------------:|:--------:|
+| GPT-3.5-Turbo | 76.8 | 82.5 | 23.3 | 61.2 | 37.8 | 35.6 | 70.1 | 70.0 | 56.6 | 7.94 |
+| GPT-4 | 85.4 | 83.5 | 41.8 | 85.6 | 69.7 | 52.4 | 86.7 | 86.4 | 79.7 | 8.96 |
+| Eurus-70b-NCA | 79.3 | 71.9 | 33.3 | 62.8 | 41.7 | 32.6 | 80.0 | 59.4 | 49.2 | 7.54 |
+| Eurux-8x22b-KTO | 71.3 | 68.9 | 29.4 | 68.3 | 48.4 | 35.3 | 83.6 | 75.9 | 67.1 | 8.58 |
+| Eurux-8x22b-NCA | 75.0 | 69.7 | 35.0 | 68.1 | 49.0 | 35.5 | 83.5 | 75.6 | 67.1 | 8.46 |
+
 ## Usage
 
 ```python