import os

HF_TOKEN = os.environ.get("HF_TOKEN")

LEADERBOARD_INTRODUCTION = """# MEGA-Bench Leaderboard

## πŸš€ Introduction

[MEGA-Bench](https://tiger-ai-lab.github.io/MEGA-Bench/) is a comprehensive benchmark scaling multimodal evaluation to 500+ real-world tasks! 

We aim to provide cost-effective and accurate evaluation for multimodal models, covering a wide range of real-world tasks. You don't have to run models on dozens of benchmarks -- MEGA-Bench delivers a comprehensive performance report in a single benchmark.

## 🧐 Highlights of MEGA-Bench

- 505 diverse tasks evaluating multimodal models across 8 grand application types, 7 input visual formats, 6 output formats, and 10 general multimodal skills, covering single-image, multi-image, and video tasks
- Moves beyond multiple-choice questions, offering diverse output formats like numbers, code, LaTeX, phrases, free-form responses, and more. We developed 45 customized metrics to accurately evaluate these diverse outputs
- Focuses on task diversity rather than repetitive examples, ensuring cost-efficient evaluation
- Provides fine-grained capability reports across application type, input/output formats, and required skills


## πŸ”¨ Systematic Annotation Process

- Guided by an initial application-driven taxonomy tree
- 16 expert annotators contributing to a 2-round process to develop 505 tasks
- Utilizes advanced tools for task design, review, and quality control
- Ensures high-quality data through continuous refinement and balanced task distribution


## πŸ“ŠπŸ” Results & Takeaways from Evaluating Top Models

- GPT-4o (0513) and Claude 3.5 Sonnet (1022) lead the benchmark. Claude 3.5 Sonnet (1022) improves noticeably over Claude 3.5 Sonnet (0620) on planning tasks (application dimension) and UI/Infographics inputs (input format dimension).
- Qwen2-VL stands out among open-source models, and its flagship model comes close to some proprietary flagship models
- Chain-of-Thought (CoT) prompting improves proprietary models but has limited impact on open-source models
- Gemini 1.5 Flash performs best among the evaluated efficiency models, but struggles with UI and document tasks
- Many open-source models face challenges in adhering to output format instructions

## 🎯 Interactive Visualization 

Visit our [project page](https://tiger-ai-lab.github.io/MEGA-Bench/) to explore the interactive task taxonomy and radar maps, offering deep insights into model capabilities across multiple dimensions. Discover a comprehensive breakdown far beyond single-score evaluations.


## πŸ“š More Information

- Our evaluation pipeline is available on our [GitHub repo](https://github.com/TIGER-AI-Lab/MEGA-Bench).
- Check full details of our paper at [https://arxiv.org/abs/2410.10563](https://arxiv.org/abs/2410.10563)
- Hugging Face Datasets: [https://huggingface.co/datasets/TIGER-Lab/MEGA-Bench](https://huggingface.co/datasets/TIGER-Lab/MEGA-Bench)

"""

TABLE_INTRODUCTION = """
"""

DATA_INFO = """
### Data Sources
The data sources of MEGA-Bench tasks fall into three main types:
- **Purely Self-designed:** The task is designed entirely by the annotator, who collects image or video resources from the Internet or even generates them with code or simulators.
- **Inspired by and adapted from existing benchmarks:** The task is inspired by existing benchmarks or datasets. The annotator collects the raw image/video data from existing datasets but does not use the original annotations. The annotator redesigns or repurposes the data by writing concrete task descriptions and creating new questions and answers, or by using scripts to re-process the data for the designed task.
- **Directly converted from existing benchmarks:** The task is directly converted from existing benchmarks or datasets. The annotator randomly samples a subset from the existing benchmark, directly using its images/videos and annotations without redesign.

In our annotation process, the first two task types are encouraged. The task reviewers strictly limit the number of tasks of the third type and reject submissions when an annotator contributes too many of them.

Please refer to Table 17 of our [paper](https://arxiv.org/abs/2410.10563) for the detailed data source of all tasks in MEGA-Bench.
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite our paper and the evaluation results"
CITATION_BUTTON_TEXT = r"""
@article{chen2024mega-bench,
        title={MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks},
        author={Chen, Jiacheng and Liang, Tianhao and Siu, Sherman and Wang, Zhengqing and Wang, Kai and Wang, Yubo and Ni, Yuansheng and Zhu, Wang and Jiang, Ziyan and Lyu, Bohan and Jiang, Dongfu and He, Xuan and Liu, Yuan and Hu, Hexiang and Yue, Xiang and Chen, Wenhu},
        journal={arXiv preprint arXiv:2410.10563},
        year={2024},
}
"""

SUBMIT_INTRODUCTION = """# Submit on MEGA-Bench Leaderboard

Our evaluation pipeline is released on our [GitHub repository](https://github.com/TIGER-AI-Lab/MEGA-Bench). We will provide details on how to submit third-party results to this leaderboard.

"""



## Constants related to the leaderboard display


# Keep all the constant mappings outside the class
MODEL_NAME_MAP = {
    "Claude_3.5_new": "Claude-3.5-Sonnet (1022)",
    "GPT_4o": "GPT-4o (0513)",
    "Claude_3.5": "Claude-3.5-Sonnet (0620)",
    "Gemini_1.5_pro_002": "Gemini-1.5-Pro-002",
    "InternVL2_76B": "InternVL2-Llama3-76B",
    "Qwen2_VL_72B": "Qwen2-VL-72B",
    "llava_onevision_72B": "Llava-OneVision-72B",
    "NVLM": "NVLM-D-72B",
    "GPT_4o_mini": "GPT-4o mini",
    "Gemini_1.5_flash_002": "Gemini-1.5-Flash-002",
    "Pixtral_12B": "Pixtral 12B",
    "Aria": "Aria-MoE-25B",
    "Qwen2_VL_7B": "Qwen2-VL-7B",
    "InternVL2_8B": "InternVL2-8B",
    "llava_onevision_7B": "Llava-OneVision-7B",
    "Llama_3_2_11B": "Llama-3.2-11B",
    "Phi-3.5-vision": "Phi-3.5-Vision",
    "MiniCPM_v2.6": "MiniCPM-V2.6",
    "Idefics3": "Idefics3-8B-Llama3",
    "Aquila_VL_2B": "Aquila-VL-2B-llava-qwen",
    "POINTS_7B": "POINTS-Qwen2.5-7B",
    "Qwen2_VL_2B": "Qwen2-VL-2B",
    "InternVL2_2B": "InternVL2-2B",
    "Molmo_7B_D": "Molmo-7B-D-0924",
    "Molmo_72B": "Molmo-72B-0924",
    "Mammoth_VL": "Mammoth-VL-8B",
    "SmolVLM": "SmolVLM-1.7B",
    "POINTS_15_7B": "POINTS-1.5-8B",
    "InternVL2_5_78B": "InternVL2.5-78B",
    "InternVL2_5_2B": "InternVL2.5-2B",
    "InternVL2_5_8B": "InternVL2.5-8B",
    "Grok-2-vision-1212": "Grok-2-vision-1212",
    "Gemini-2.0-thinking": "Gemini-2.0-flash-thinking",
}

DIMENSION_NAME_MAP = {
    "skills": "Skills",
    "input_format": "Input Format",
    "output_format": "Output Format",
    "input_num": "Visual Input Number",
    "app": "Application"
}

KEYWORD_NAME_MAP = {
    # Skills
    "Object Recognition and Classification": "Object Recognition",
    "Text Recognition (OCR)": "OCR",
    "Language Understanding and Generation": "Language",
    "Scene and Event Understanding": "Scene/Event",
    "Mathematical and Logical Reasoning": "Math/Logic",
    "Commonsense and Social Reasoning": "Commonsense",
    "Ethical and Safety Reasoning": "Ethics/Safety",
    "Domain-Specific Knowledge and Skills": "Domain-Specific",
    "Spatial and Temporal Reasoning": "Spatial/Temporal",
    "Planning and Decision Making": "Planning/Decision",
    # Input Format
    'User Interface Screenshots': "UI related", 
    'Text-Based Images and Documents': "Documents", 
    'Diagrams and Data Visualizations': "Infographics", 
    'Videos': "Videos", 
    'Artistic and Creative Content': "Arts/Creative", 
    'Photographs': "Photographs", 
    '3D Models and Aerial Imagery': "3D related",
    # Application
    'Information_Extraction': "Info Extraction", 
    'Planning' : "Planning", 
    'Coding': "Coding", 
    'Perception': "Perception", 
    'Metrics': "Metrics", 
    'Science': "Science", 
    'Knowledge': "Knowledge", 
    'Mathematics': "Math",
    # Output format
    'contextual_formatted_text': "Contextual", 
    'structured_output': "Structured", 
    'exact_text': "Exact", 
    'numerical_data': "Numerical", 
    'open_ended_output': "Open-ended", 
    'multiple_choice': "MC",
    "6-8 images": "6-8 imgs",
    "1-image": "1 img",
    "2-3 images": "2-3 imgs",
    "4-5 images": "4-5 imgs",
    "9-image or more": "9+ imgs",
    "video": "Video",
}
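
# Usage sketch (hypothetical helper, not in the original file): the assumption is that
# KEYWORD_NAME_MAP shortens raw per-dimension keywords into the compact labels shown in
# the breakdown tables, with unknown keywords passing through unchanged
# (DIMENSION_NAME_MAP plays the analogous role for the dimension names themselves).
def shorten_keyword(keyword):
    """Map a raw dimension keyword to its abbreviated display label, if one exists."""
    return KEYWORD_NAME_MAP.get(keyword, keyword)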

MODEL_URLS = {
    "Claude_3.5_new": "https://www.anthropic.com/news/3-5-models-and-computer-use",
    "GPT_4o": "https://platform.openai.com/docs/models/gpt-4o",
    "Claude_3.5": "https://www.anthropic.com/news/claude-3-5-sonnet", 
    "Gemini_1.5_pro_002": "https://ai.google.dev/gemini-api/docs/models/gemini",
    "Gemini_1.5_flash_002": "https://ai.google.dev/gemini-api/docs/models/gemini",
    "GPT_4o_mini": "https://platform.openai.com/docs/models#gpt-4o-mini",
    "Qwen2_VL_72B": "https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct",
    "InternVL2_76B": "https://huggingface.co/OpenGVLab/InternVL2-Llama3-76B",
    "llava_onevision_72B": "https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-ov-chat",
    "NVLM": "https://huggingface.co/nvidia/NVLM-D-72B",
    "Molmo_72B": "https://huggingface.co/allenai/Molmo-72B-0924",
    "Qwen2_VL_7B": "https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct",
    "Pixtral_12B": "https://huggingface.co/mistralai/Pixtral-12B-2409",
    "Aria": "https://huggingface.co/rhymes-ai/Aria",
    "InternVL2_8B": "https://huggingface.co/OpenGVLab/InternVL2-8B",
    "Phi-3.5-vision": "https://huggingface.co/microsoft/Phi-3.5-vision-instruct",
    "MiniCPM_v2.6": "https://huggingface.co/openbmb/MiniCPM-V-2_6",
    "llava_onevision_7B": "https://huggingface.co/lmms-lab/llava-onevision-qwen2-7b-ov",
    "Llama_3_2_11B": "https://huggingface.co/meta-llama/Llama-3.2-11B-Vision",
    "Idefics3": "https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3",
    "Molmo_7B_D": "https://huggingface.co/allenai/Molmo-7B-D-0924",
    "Aquila_VL_2B": "https://huggingface.co/BAAI/Aquila-VL-2B-llava-qwen",
    "POINTS_7B": "https://huggingface.co/WePOINTS/POINTS-Qwen-2-5-7B-Chat",
    "Qwen2_VL_2B": "https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct",
    "InternVL2_2B": "https://huggingface.co/OpenGVLab/InternVL2-2B",
    "POINTS_7B": "https://huggingface.co/WePOINTS/POINTS-Qwen-2-5-7B-Chat",
    "POINTS_15_7B": "https://huggingface.co/WePOINTS/POINTS-1-5-Qwen-2-5-7B-Chat",
    "SmolVLM": "https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct",
    "Mammoth_VL": "https://huggingface.co/MAmmoTH-VL/MAmmoTH-VL-8B",
    "InternVL2_5_78B": "https://huggingface.co/OpenGVLab/InternVL2_5-78B",
    "InternVL2_5_2B": "https://huggingface.co/OpenGVLab/InternVL2_5-2B",
    "InternVL2_5_8B": "https://huggingface.co/OpenGVLab/InternVL2_5-8B",
    "Grok-2-vision-1212": "https://x.ai/blog/grok-1212",
    "Gemini-2.0-thinking": "://ai.google.dev/gemini-api/docs/thinking-mode",
}
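
# Usage sketch (hypothetical helper, not part of the original constants): it assumes the
# leaderboard table renders each model name as a markdown hyperlink, combining
# MODEL_NAME_MAP with MODEL_URLS and falling back to the raw key when an entry is missing.
def make_model_hyperlink(model_key):
    """Return "[Display Name](url)" for a model key, or a plain label if no URL is known."""
    display_name = MODEL_NAME_MAP.get(model_key, model_key)
    url = MODEL_URLS.get(model_key)
    return f"[{display_name}]({url})" if url else display_name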

# Define the base MODEL_GROUPS structure
BASE_MODEL_GROUPS = {
    "All": list(MODEL_NAME_MAP.keys()),
    "Flagship Models": ['Claude_3.5_new', 'GPT_4o', 'Claude_3.5', 'Gemini_1.5_pro_002', 'Qwen2_VL_72B', 'InternVL2_76B', 'llava_onevision_72B', 'NVLM', 'Molmo_72B', 'InternVL2_5_78B', 'Grok-2-vision-1212', "Gemini-2.0-thinking"],
    "Efficiency Models": ['Gemini_1.5_flash_002', 'GPT_4o_mini', 'Qwen2_VL_7B', 'Pixtral_12B', 'Aria', 'InternVL2_8B', 'Phi-3.5-vision', 'MiniCPM_v2.6', 'llava_onevision_7B', 'Llama_3_2_11B', 'Idefics3', 'Molmo_7B_D', "Aquila_VL_2B", "POINTS_7B", "Qwen2_VL_2B", "InternVL2_2B", "InternVL2_5_2B", "InternVL2_5_8B"],
    "Proprietary Flagship models": ['Claude_3.5_new', 'GPT_4o', 'Claude_3.5', 'Gemini_1.5_pro_002', 'Grok-2-vision-1212', "Gemini-2.0-thinking"],
    "Proprietary Efficiency Models": ['Gemini_1.5_flash_002', 'GPT_4o_mini'],
    "Open-source Flagship Models": ['Qwen2_VL_72B', 'InternVL2_76B', 'llava_onevision_72B', 'NVLM', "Molmo_72B", "InternVL2_5_78B"],
    "Open-source Efficiency Models": ['Qwen2_VL_7B', 'Pixtral_12B', 'Aria', 'InternVL2_8B', 'Phi-3.5-vision', 'MiniCPM_v2.6', 'llava_onevision_7B', 'Llama_3_2_11B', 'Idefics3', 'Molmo_7B_D', "Aquila_VL_2B", "POINTS_7B", "Qwen2_VL_2B", "InternVL2_2B", "InternVL2_5_2B", "InternVL2_5_8B"]
}
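
# Usage sketch (hypothetical, not part of the original file): resolve a group name from
# BASE_MODEL_GROUPS into the display names shown in the leaderboard, e.g.
# get_group_display_names("Proprietary Efficiency Models") -> ["Gemini-1.5-Flash-002", "GPT-4o mini"].
def get_group_display_names(group_name):
    """Return display names for every model in the requested group (empty list if unknown)."""
    return [MODEL_NAME_MAP.get(key, key) for key in BASE_MODEL_GROUPS.get(group_name, [])]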