Update README.md
README.md
@@ -86,7 +86,7 @@ print(response)
π: Proprietary

-### 3.1 Arena-Hard-Auto
+### 3.1 Arena-Hard-Auto-v0.1

All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).

@@ -148,6 +148,26 @@ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Har
| Yi-Large-Preview π | 7.20 |

+### 3.3 MT-Bench
+
+> [!IMPORTANT]
+> We replaced the original judge model in MT-Bench, `GPT-4`, with the more powerful `GPT-4o-0513`. To keep the comparison fair, all results below were generated with `GPT-4o-0513` as the judge; they may therefore differ from MT-Bench scores reported elsewhere.
+
+|                               | Score                    |
+| ----------------------------- | ------------------------ |
+| **Xwen-72B-Chat** π           | **8.64** (Top-1 Among π) |
+| Qwen2.5-72B-Chat π            | 8.62                     |
+| Deepseek V2.5 π               | 8.43                     |
+| Mistral-Large-Instruct-2407 π | 8.53                     |
+| Llama3.1-70B-Instruct π       | 8.23                     |
+| Llama-3.1-405B-Instruct-FP8 π | 8.36                     |
+| GPT-4o-0513 π                 | 8.59                     |
+| Claude-3.5-Sonnet-20240620 π  | 6.96                     |
+| Yi-Lightning π                | **8.75** (Top-1 Among π) |
+| Yi-Large-Preview π            | 8.32                     |
+
## References
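To make the judge swap described in the new MT-Bench note concrete, here is a minimal, illustrative sketch of single-answer grading with `GPT-4o-0513` as the judge. It is not the harness used to produce the table above (MT-Bench's official pipeline lives in FastChat's `llm_judge` and averages scores over its 80 multi-turn questions); the prompt wording, the `judge()` helper, and the example question are assumptions for illustration only.

```python
# Illustrative only: score one model answer with GPT-4o-0513 as the judge,
# in the spirit of MT-Bench single-answer grading.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_MODEL = "gpt-4o-2024-05-13"  # "GPT-4o-0513" in the table above


def judge(question: str, answer: str) -> str:
    """Ask the judge model to rate an answer on a 1-10 scale (hypothetical prompt)."""
    prompt = (
        "Please act as an impartial judge and rate the quality of the "
        "assistant's answer to the user question on a scale of 1 to 10.\n\n"
        f"[Question]\n{question}\n\n[Assistant's Answer]\n{answer}\n\n"
        "Reply with the rating only, e.g. Rating: 7."
    )
    response = client.chat.completions.create(
        model=JUDGE_MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(judge("What is the capital of France?", "The capital of France is Paris."))
```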