cccjc committed
Commit 61704fb · 1 Parent(s): ca8eb95

update readme doc

Files changed (1): constants.py (+5 -3)
constants.py CHANGED
@@ -29,14 +29,14 @@ We aim to provide cost-effective and accurate evaluation for multimodal models,
 ## 📊🔍 Results & Takeaways from Evaluating Top Models


-### 🔥📝 January 2025
+### 🔥📝 2025.01

 - **Gemini 2.0 Experimental (1206)** and **Gemini 2.0 Flash Experimental** outperform **GPT-4o** and **Claude 3.5 Sonnet**.
 - We add **Grok-2-vision-1212** to the single-image leaderboard. The model seems to use many tokens per image, and it cannot run many of our multi-image and video tasks.
 - We will evaluate the o1 series models when budget allows.


-### 📝 November 2024
+### 📝 2024.11

 - **GPT-4o (0513)** and **Claude 3.5 Sonnet (1022)** lead the benchmark. **Claude 3.5 Sonnet (1022)** clearly improves over **Claude 3.5 Sonnet (0620)** in planning tasks (application dimension) and on UI/Infographics inputs (input format dimension).
 - **Qwen2-VL** stands out among open-source models, and its flagship model gets close to some proprietary flagship models.
@@ -84,7 +84,9 @@ CITATION_BUTTON_TEXT = r"""

 SUBMIT_INTRODUCTION = """# Submit on MEGA-Bench Leaderboard

-Our evaluation pipeline is released in our [GitHub repository](https://github.com/TIGER-AI-Lab/MEGA-Bench). We will provide details on how to submit third-party results to this leaderboard.
+Our evaluation pipeline is released in our [GitHub repository](https://github.com/TIGER-AI-Lab/MEGA-Bench).
+
+Evaluation results processed by the [breakdown analysis script](https://github.com/TIGER-AI-Lab/MEGA-Bench?tab=readme-ov-file#get-multi-dimensional-breakdown-analysis) are loaded into this leaderboard.

 """
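For context on how the edited string is consumed: Hugging Face leaderboard Spaces typically render module-level constants like `SUBMIT_INTRODUCTION` with Gradio. Below is a minimal sketch of that pattern; the `app.py` filename and the tab layout are assumptions for illustration, not part of this commit.

```python
# app.py -- hypothetical sketch; only SUBMIT_INTRODUCTION comes from constants.py
import gradio as gr

from constants import SUBMIT_INTRODUCTION  # module-level string edited in this commit

with gr.Blocks() as demo:
    with gr.Tab("Submit"):
        # Render the Markdown submission instructions in the leaderboard UI.
        gr.Markdown(SUBMIT_INTRODUCTION)

if __name__ == "__main__":
    demo.launch()
```

Because the UI only interpolates these strings at render time, edits like this commit's change ship without touching the app logic.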