update readme doc
constants.py CHANGED (+5 -3)
@@ -29,14 +29,14 @@ We aim to provide cost-effective and accurate evaluation for multimodal models,
## Results & Takeaways from Evaluating Top Models
-### 🔥
+### 🔥 2025.01
- **Gemini 2.0 Experimental (1206)** and **Gemini 2.0 Flash Experimental** outperform **GPT-4o** and **Claude 3.5 Sonnet**.
- We add **Grok-2-vision-1212** to the single-image leaderboard. The model appears to consume many tokens per image, so it cannot run many of our multi-image and video tasks.
- We will evaluate the o1-series models when budget allows.
-###
+### 2024.11
- **GPT-4o (0513)** and **Claude 3.5 Sonnet (1022)** lead the benchmark. **Claude 3.5 Sonnet (1022)** clearly improves over **Claude 3.5 Sonnet (0620)** in planning tasks (application dimension) and on UI/Infographics inputs (input-format dimension).
- **Qwen2-VL** stands out among open-source models, and its flagship model comes close to some proprietary flagship models.
@@ -84,7 +84,9 @@ CITATION_BUTTON_TEXT = r"""
SUBMIT_INTRODUCTION = """# Submit on MEGA-Bench Leaderboard
-Our evaluation pipeline is released on our [GitHub repository](https://github.com/TIGER-AI-Lab/MEGA-Bench).
+Our evaluation pipeline is released on our [GitHub repository](https://github.com/TIGER-AI-Lab/MEGA-Bench).
+
+The evaluation results, processed by the [breakdown analysis script](https://github.com/TIGER-AI-Lab/MEGA-Bench?tab=readme-ov-file#get-multi-dimensional-breakdown-analysis), are shown on this leaderboard.
"""
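For context on how string constants like these are typically consumed, here is a minimal sketch of a Gradio leaderboard app that renders `SUBMIT_INTRODUCTION` and a table built from breakdown-analysis output. The file name `analysis_scores.json`, its shape, and the tab layout are assumptions for illustration, not taken from this Space's actual code.

```python
import json

import gradio as gr
import pandas as pd

from constants import SUBMIT_INTRODUCTION

# Load the output of the breakdown analysis script; the file name and
# its shape (a list of per-model records) are assumptions.
with open("analysis_scores.json") as f:
    records = json.load(f)
df = pd.DataFrame(records)

with gr.Blocks() as demo:
    with gr.Tab("Leaderboard"):
        # One row per model, one column per breakdown dimension.
        gr.Dataframe(value=df)
    with gr.Tab("Submit"):
        # The constant edited in this commit is rendered as Markdown.
        gr.Markdown(SUBMIT_INTRODUCTION)

demo.launch()
```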