Spaces: TIGER-Lab/MEGA-Bench
cccjc committed on Nov 9, 2024

Commit 44b6d4e
1 parent: 4301eca

Update doc info and table style

Files changed (3)

constants.py

@@ -28,10 +28,10 @@ We aim to provide cost-effective and accurate evaluation for multimodal models,
 ## 📊🔍 Results & Takeaways from Evaluating Top Models
-- GPT4o leads the benchmark, outperforming others by 3.5% over Claude3.5
-- Qwen2VL stands out among open-source models, nearing flagship-level performance
-- Chain-of-Thought (CoT) improves proprietary models but has limited impact on open-source models
-- Efficiency models like Gemini 1.5 Flash perform well but struggle with UI and document tasks
+- GPT-4o (0513) and Claude 3.5 Sonnet (1022) lead the benchmark. Claude 3.5 Sonnet (1022) improves over Claude 3.5 Sonnet (0622) obviously in planning tasks (application dimension) and UI/Infographics inputs (input format dimension).
+- Qwen2-VL stands out among open-source models, and its flagship model gets close to some proprietary flagship models
+- Chain-of-Thought (CoT) prompting improves proprietary models but has limited impact on open-source models
+- Gemini 1.5 Flash performs the best among all the evaluated efficiency models, but struggles with UI and document tasks
 - Many open-source models face challenges in adhering to output format instructions
 ## 🎯 Interactive Visualization

static/css/style.css

@@ -45,3 +45,14 @@
     margin-top: 10px;
     color: var(--text-color);
 }
+.custom-dataframe td:first-child {
+    min-width: 220px !important;  /* Adjust minimum width for model names */
+    white-space: nowrap !important;  /* Prevent text wrapping */
+}
+.custom-dataframe a {
+    text-decoration: none;
+    color: #2196F3;
+    white-space: nowrap !important;
+}

utils.py

@@ -241,10 +241,10 @@ class DefaultDataLoader(BaseDataLoader):
         # Define headers with task counts
         column_headers = {
             "Models": "Models",
-            "Overall": f"Overall ({total_tasks})",
-            "Core w/o CoT": f"Core(w/o CoT) ({total_core_tasks})",
-            "Core w/ CoT": f"Core(w/ CoT) ({total_core_tasks})",
-            "Open-ended": f"Open-ended ({total_open_tasks})"
+            "Overall": f"Overall({total_tasks})",
+            "Core w/o CoT": f"Core w/o CoT({total_core_tasks})",
+            "Core w/ CoT": f"Core w/ CoT({total_core_tasks})",
+            "Open-ended": f"Open-ended({total_open_tasks})"
         }
         # Rename the columns in DataFrame to match headers
@@ -317,9 +317,9 @@ class SingleImageDataLoader(BaseDataLoader):
         # Define headers with task counts
         column_headers = {
             "Models": "Models",
-            "Overall": f"Overall ({total_tasks})",
-            "Core": f"Core ({total_core_tasks})",
-            "Open-ended": f"Open-ended ({total_open_tasks})"
+            "Overall": f"Overall({total_tasks})",
+            "Core": f"Core({total_core_tasks})",
+            "Open-ended": f"Open-ended({total_open_tasks})"
         }
         # Rename the columns in DataFrame to match headers
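
Note on the utils.py hunks: the header change drops the space before the task count, presumably to keep the leaderboard columns narrow. Below is a minimal sketch of how such a mapping could be built and applied. It is not the repository's code: the helper names `build_column_headers` and `rename_columns` and the task counts are hypothetical, and a plain list of column names stands in for the DataFrame.

```python
def build_column_headers(total_tasks, total_core_tasks, total_open_tasks):
    """Map internal column keys to display headers in the compact style.

    Hypothetical helper; the key names mirror the DefaultDataLoader hunk.
    """
    return {
        "Models": "Models",
        "Overall": f"Overall({total_tasks})",
        "Core w/o CoT": f"Core w/o CoT({total_core_tasks})",
        "Core w/ CoT": f"Core w/ CoT({total_core_tasks})",
        "Open-ended": f"Open-ended({total_open_tasks})",
    }


def rename_columns(columns, headers):
    """Return display names, leaving columns without a mapping unchanged."""
    return [headers.get(name, name) for name in columns]


# Task counts below are made up for illustration only.
headers = build_column_headers(total_tasks=10, total_core_tasks=7, total_open_tasks=3)
display = rename_columns(
    ["Models", "Overall", "Core w/o CoT", "Core w/ CoT", "Open-ended"],
    headers,
)
print(display)
```

With a pandas DataFrame, the same dictionary could be passed as `df.rename(columns=headers)`, which also leaves unmapped columns untouched.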