leaderboard-pr-bot committed on
Commit 16e5cde
1 Parent(s): 927e6c9

Adding Evaluation Results

This is an automated PR created with https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr

The purpose of this PR is to add evaluation results from the Open LLM Leaderboard to your model card.

If you encounter any issues, please report them to https://huggingface.co/spaces/Weyaxi/open-llm-leaderboard-results-pr/discussions
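For reference, the `model-index` block this PR extends is machine-readable model card metadata, not just display text. Below is a minimal sketch of reading those entries back once the PR is merged, assuming `huggingface_hub`'s `ModelCard` loader and its parsing of `model-index` into `eval_results`; the snippet is illustrative and not part of the PR itself.

```python
from huggingface_hub import ModelCard

# Load the model card (README.md) for the target repo; the repo id is taken
# from the leaderboard URLs referenced in this PR.
card = ModelCard.load("sail/Sailor-4B")

# `eval_results` is populated from the `model-index` section of the card's
# YAML front matter, one entry per reported metric (assumption: the installed
# huggingface_hub version exposes this attribute).
for result in card.data.eval_results or []:
    print(f"{result.dataset_name}: {result.metric_type} = {result.metric_value}")
```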

Files changed (1)
  1. README.md +151 -38
README.md CHANGED
@@ -7,16 +7,16 @@ language:
 - vi
 - ms
 - lo
+license: apache-2.0
+tags:
+- multilingual
+- sea
+- sailor
 datasets:
 - cerebras/SlimPajama-627B
 - Skywork/SkyPile-150B
 - allenai/MADLAD-400
 - cc100
-tags:
-- multilingual
-- sea
-- sailor
-license: apache-2.0
 base_model: Qwen/Qwen1.5-4B
 model-index:
 - name: Sailor-4B
@@ -27,117 +27,217 @@ model-index:
       name: XQuAD-Thai
       type: XQuAD-Thai
     metrics:
-    - name: EM (3-Shot)
-      type: EM (3-Shot)
+    - type: EM (3-Shot)
       value: 46.82
-    - name: F1 (3-Shot)
-      type: F1 (3-Shot)
+      name: EM (3-Shot)
+    - type: F1 (3-Shot)
       value: 63.34
+      name: F1 (3-Shot)
   - task:
       type: text-generation
     dataset:
       name: TyDiQA-Indonesian
       type: TyDiQA-Indonesian
     metrics:
-    - name: EM (3-Shot)
-      type: EM (3-Shot)
+    - type: EM (3-Shot)
       value: 53.98
-    - name: F1 (3-Shot)
-      type: F1 (3-Shot)
+      name: EM (3-Shot)
+    - type: F1 (3-Shot)
       value: 73.48
+      name: F1 (3-Shot)
   - task:
       type: text-generation
     dataset:
       name: XQuAD-Vietnamese
       type: XQuAD-Vietnamese
     metrics:
-    - name: EM (3-Shot)
-      type: EM (3-Shot)
+    - type: EM (3-Shot)
       value: 47.65
-    - name: F1 (3-Shot)
-      type: F1 (3-Shot)
+      name: EM (3-Shot)
+    - type: F1 (3-Shot)
       value: 67.09
+      name: F1 (3-Shot)
  - task:
      type: text-generation
    dataset:
      name: XCOPA-Thai
      type: XCOPA-Thai
    metrics:
-    - name: EM (3-Shot)
-      type: EM (3-Shot)
+    - type: EM (3-Shot)
       value: 53.4
+      name: EM (3-Shot)
  - task:
      type: text-generation
    dataset:
      name: XCOPA-Indonesian
      type: XCOPA-Indonesian
    metrics:
-    - name: EM (3-Shot)
-      type: EM (3-Shot)
-      value: 69.20
+    - type: EM (3-Shot)
+      value: 69.2
+      name: EM (3-Shot)
  - task:
      type: text-generation
    dataset:
      name: XCOPA-Vietnamese
      type: XCOPA-Vietnamese
    metrics:
-    - name: EM (3-Shot)
-      type: EM (3-Shot)
-      value: 68.20
+    - type: EM (3-Shot)
+      value: 68.2
+      name: EM (3-Shot)
  - task:
      type: text-generation
    dataset:
      name: M3Exam-Thai
      type: M3Exam-Thai
    metrics:
-    - name: EM (3-Shot)
-      type: EM (3-Shot)
+    - type: EM (3-Shot)
       value: 27.88
+      name: EM (3-Shot)
  - task:
      type: text-generation
    dataset:
      name: M3Exam-Indonesian
      type: M3Exam-Indonesian
    metrics:
-    - name: EM (3-Shot)
-      type: EM (3-Shot)
+    - type: EM (3-Shot)
       value: 31.27
+      name: EM (3-Shot)
  - task:
      type: text-generation
    dataset:
      name: M3Exam-Vietnamese
      type: M3Exam-Vietnamese
    metrics:
-    - name: EM (3-Shot)
-      type: EM (3-Shot)
+    - type: EM (3-Shot)
       value: 40.69
+      name: EM (3-Shot)
  - task:
      type: text-generation
    dataset:
      name: BELEBELE-Thai
      type: BELEBELE-Thai
    metrics:
-    - name: EM (3-Shot)
-      type: EM (3-Shot)
+    - type: EM (3-Shot)
       value: 36.11
+      name: EM (3-Shot)
  - task:
      type: text-generation
    dataset:
      name: BELEBELE-Indonesian
      type: BELEBELE-Indonesian
    metrics:
-    - name: EM (3-Shot)
-      type: EM (3-Shot)
+    - type: EM (3-Shot)
       value: 41.33
+      name: EM (3-Shot)
  - task:
      type: text-generation
    dataset:
      name: BELEBELE-Vietnamese
      type: BELEBELE-Vietnamese
    metrics:
-    - name: EM (3-Shot)
-      type: EM (3-Shot)
+    - type: EM (3-Shot)
       value: 38.89
+      name: EM (3-Shot)
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: AI2 Reasoning Challenge (25-Shot)
+      type: ai2_arc
+      config: ARC-Challenge
+      split: test
+      args:
+        num_few_shot: 25
+    metrics:
+    - type: acc_norm
+      value: 44.45
+      name: normalized accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sail/Sailor-4B
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: HellaSwag (10-Shot)
+      type: hellaswag
+      split: validation
+      args:
+        num_few_shot: 10
+    metrics:
+    - type: acc_norm
+      value: 69.53
+      name: normalized accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sail/Sailor-4B
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: MMLU (5-Shot)
+      type: cais/mmlu
+      config: all
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 38.99
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sail/Sailor-4B
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: TruthfulQA (0-shot)
+      type: truthful_qa
+      config: multiple_choice
+      split: validation
+      args:
+        num_few_shot: 0
+    metrics:
+    - type: mc2
+      value: 37.02
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sail/Sailor-4B
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: Winogrande (5-shot)
+      type: winogrande
+      config: winogrande_xl
+      split: validation
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 66.06
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sail/Sailor-4B
+      name: Open LLM Leaderboard
+  - task:
+      type: text-generation
+      name: Text Generation
+    dataset:
+      name: GSM8k (5-shot)
+      type: gsm8k
+      config: main
+      split: test
+      args:
+        num_few_shot: 5
+    metrics:
+    - type: acc
+      value: 9.1
+      name: accuracy
+    source:
+      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=sail/Sailor-4B
+      name: Open LLM Leaderboard
 ---
 
 <div align="center">
@@ -210,4 +310,17 @@ No restrict on the research and the commercial use, but should comply with the [
 
 # Contact Us
 
-If you have any questions, please raise an issue or contact us at [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected]).
+If you have any questions, please raise an issue or contact us at [[email protected]](mailto:[email protected]) or [[email protected]](mailto:[email protected]).
+# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
+Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_sail__Sailor-4B)
+
+| Metric                          |Value|
+|---------------------------------|----:|
+|Avg.                             |44.19|
+|AI2 Reasoning Challenge (25-Shot)|44.45|
+|HellaSwag (10-Shot)              |69.53|
+|MMLU (5-Shot)                    |38.99|
+|TruthfulQA (0-shot)              |37.02|
+|Winogrande (5-shot)              |66.06|
+|GSM8k (5-shot)                   | 9.10|
+
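As a quick sanity check on the results table added above, the "Avg." row is just the arithmetic mean of the six benchmark scores. A minimal sketch, using the values copied from that table:

```python
# Recompute the Open LLM Leaderboard average from the per-benchmark scores above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 44.45,
    "HellaSwag (10-Shot)": 69.53,
    "MMLU (5-Shot)": 38.99,
    "TruthfulQA (0-shot)": 37.02,
    "Winogrande (5-shot)": 66.06,
    "GSM8k (5-shot)": 9.10,
}

average = sum(scores.values()) / len(scores)
print(f"Avg. = {average:.2f}")  # prints 44.19, matching the reported average
```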