add mmlu_pro benchmark results
README.md (CHANGED)
# DeepSeek-R1-Distill-Qwen-32B-AWQ wint4

A distillation of DeepSeek-R1 into Qwen 32B, quantized with AWQ to wint4 (4-bit integer weights).
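As a quick usage sketch (one option among several), an AWQ wint4 checkpoint like this can be served with vLLM, which supports AWQ quantization; the model path, parallelism, and sampling settings below are placeholders to adapt to your own setup:

```python
# Minimal sketch: serving the AWQ wint4 checkpoint with vLLM.
# The model path, tensor_parallel_size, and sampling settings are placeholders;
# vLLM normally detects AWQ from the checkpoint's quantization config, so the
# explicit quantization argument is optional.
from vllm import LLM, SamplingParams

llm = LLM(
    model="./DeepSeek-R1-Distill-Qwen-32B-AWQ",  # placeholder local path or repo id
    quantization="awq",
    dtype="half",
    tensor_parallel_size=2,  # adjust to the number of GPUs available
)

params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)
outputs = llm.generate(["Prove that the square root of 2 is irrational."], params)
print(outputs[0].outputs[0].text)
```
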
## Benchmarks

## MMLU-PRO

MMLU-PRO evaluates models across 14 subject areas using 5-shot prompting. Each task follows the methodology of the original MMLU implementation, but every question has ten answer choices instead of four.
### Measure

- **Accuracy**: evaluated as `exact_match`

### Shots

- **Shots**: 5-shot
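The table layout below matches the output of EleutherAI's lm-evaluation-harness; assuming that harness produced these numbers, a run along the following lines should give a comparable evaluation (the model path and batch size are placeholders, not taken from this repo):

```python
# Hypothetical reproduction sketch, assuming lm-evaluation-harness was used
# (the results table below follows its output format).
# The model path and batch size are placeholders; a 32B model needs one or
# more large GPUs even at wint4.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=./DeepSeek-R1-Distill-Qwen-32B-AWQ,dtype=float16",
    tasks=["mmlu_pro"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"])
```
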
### Results Table
| Tasks            | Version | Filter         | n-shot | Metric      | Direction | Value  | Stderr |
|------------------|---------|----------------|--------|-------------|-----------|--------|--------|
| mmlu_pro         | 2       | custom-extract |        | exact_match | ↑         | 0.5875 | 0.0044 |
| biology          | 1       | custom-extract | 5      | exact_match | ↑         | 0.7978 | 0.0150 |
| business         | 1       | custom-extract | 5      | exact_match | ↑         | 0.5982 | 0.0175 |
| chemistry        | 1       | custom-extract | 5      | exact_match | ↑         | 0.4691 | 0.0148 |
| computer_science | 1       | custom-extract | 5      | exact_match | ↑         | 0.6122 | 0.0241 |
| economics        | 1       | custom-extract | 5      | exact_match | ↑         | 0.7346 | 0.0152 |
| engineering      | 1       | custom-extract | 5      | exact_match | ↑         | 0.3891 | 0.0157 |
| health           | 1       | custom-extract | 5      | exact_match | ↑         | 0.6345 | 0.0168 |
| history          | 1       | custom-extract | 5      | exact_match | ↑         | 0.6168 | 0.0249 |
| law              | 1       | custom-extract | 5      | exact_match | ↑         | 0.4596 | 0.0150 |
| math             | 1       | custom-extract | 5      | exact_match | ↑         | 0.6425 | 0.0130 |
| other            | 1       | custom-extract | 5      | exact_match | ↑         | 0.6223 | 0.0160 |
| philosophy       | 1       | custom-extract | 5      | exact_match | ↑         | 0.5731 | 0.0222 |
| physics          | 1       | custom-extract | 5      | exact_match | ↑         | 0.5073 | 0.0139 |
| psychology       | 1       | custom-extract | 5      | exact_match | ↑         | 0.7494 | 0.0154 |

## Groups

| Groups   | Version | Filter         | n-shot | Metric      | Direction | Value  | Stderr |
|----------|---------|----------------|--------|-------------|-----------|--------|--------|
| mmlu_pro | 2       | custom-extract |        | exact_match | ↑         | 0.5875 | 0.0044 |