shenzhi-wang committed on
Commit ff919ce · verified · 1 Parent(s): 7d4f2de

Update README.md

Files changed (1): README.md (+21 −1)
README.md CHANGED
@@ -86,7 +86,7 @@ print(response)
 
 🔒: Proprietary
 
-### 3.1 Arena-Hard-Auto
+### 3.1 Arena-Hard-Auto-v0.1
 
 All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).
 
@@ -148,6 +148,26 @@ All results below, except those for `Xwen-72B-Chat`, are sourced from [Arena-Hard-Auto](https://github.com/lmarena/arena-hard-auto) (accessed on February 1, 2025).
 | Yi-Large-Preview 🔒 | 7.20 |
 
 
+### 3.3 MT-Bench
+
+> [!IMPORTANT]
+> We replaced the original judge model in MT-Bench, `GPT-4`, with the more powerful `GPT-4o-0513`. For fairness, all results below were generated with `GPT-4o-0513` as the judge; as a result, they may differ from MT-Bench scores reported elsewhere.
+
+| Model                          | Score                     |
+| ------------------------------ | ------------------------- |
+| **Xwen-72B-Chat** 🔑           | **8.64** (Top-1 among 🔑) |
+| Qwen2.5-72B-Chat 🔑            | 8.62                      |
+| Deepseek V2.5 🔑               | 8.43                      |
+| Mistral-Large-Instruct-2407 🔑 | 8.53                      |
+| Llama3.1-70B-Instruct 🔑       | 8.23                      |
+| Llama-3.1-405B-Instruct-FP8 🔑 | 8.36                      |
+| GPT-4o-0513 🔒                 | 8.59                      |
+| Claude-3.5-Sonnet-20240620 🔒  | 6.96                      |
+| Yi-Lightning 🔒                | **8.75** (Top-1 among 🔒) |
+| Yi-Large-Preview 🔒            | 8.32                      |
+
+
+
 
 ## References
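The MT-Bench note above swaps the judge model but keeps the usual LLM-as-judge scoring flow: the judge is prompted to end its verdict with a bracketed rating, and per-question ratings are averaged into the reported score. Below is a minimal sketch of that post-processing step, assuming the `[[rating]]` reply convention used by the FastChat MT-Bench harness; it is an illustration, not the Xwen evaluation code, and `parse_judgment`/`mt_bench_score` are hypothetical helper names.

```python
import re
from statistics import mean
from typing import Optional

# Judges in the FastChat MT-Bench harness are asked to finish with a
# rating like "[[8.5]]" on a 1-10 scale; this regex pulls that number out.
_RATING_RE = re.compile(r"\[\[(\d+(?:\.\d+)?)\]\]")

def parse_judgment(judge_reply: str) -> Optional[float]:
    """Extract the numeric rating from an MT-Bench-style judge reply.

    Returns None when the judge failed to emit a parseable rating,
    so the caller can skip (or retry) that sample.
    """
    match = _RATING_RE.search(judge_reply)
    return float(match.group(1)) if match else None

def mt_bench_score(judge_replies: list[str]) -> float:
    """Average parsed ratings across questions, ignoring unparseable replies."""
    ratings = [r for r in (parse_judgment(reply) for reply in judge_replies) if r is not None]
    if not ratings:
        raise ValueError("no parseable judge ratings")
    return mean(ratings)

if __name__ == "__main__":
    replies = [
        "The answer is helpful and accurate. Rating: [[9]]",
        "Minor factual slip in turn two. Rating: [[8]]",
        "The judge rambled and gave no rating.",
    ]
    print(mt_bench_score(replies))  # averages only the two parsed ratings
```

Replacing `GPT-4` with `GPT-4o-0513` changes only which model produces `judge_reply`; the parsing and averaging stay the same, which is why scores judged by different models are not directly comparable.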