cccjc committed on
Commit ca8eb95 · 1 Parent(s): d9bed00

update readme doc

Files changed (1)
  1. constants.py +13 -3
constants.py CHANGED
@@ -28,10 +28,20 @@ We aim to provide cost-effective and accurate evaluation for multimodal models,
  
  ## 📊🔍 Results & Takeaways from Evaluating Top Models
  
- - GPT-4o (0513) and Claude 3.5 Sonnet (1022) lead the benchmark. Claude 3.5 Sonnet (1022) clearly improves over Claude 3.5 Sonnet (0620) in planning tasks (application dimension) and UI/Infographics inputs (input format dimension).
- - Qwen2-VL stands out among open-source models, and its flagship model comes close to some proprietary flagship models
+
+ ### 🔥📝 January 2025
+
+ - **Gemini 2.0 Experimental (1206)** and **Gemini 2.0 Flash Experimental** outperform **GPT-4o** and **Claude 3.5 Sonnet**.
+ - We add **Grok-2-vision-1212** to the single-image leaderboard. The model seems to use many tokens per image and cannot run many of our multi-image and video tasks.
+ - We will evaluate the o1-series models when budget allows.
+
+
+ ### 📝 November 2024
+
+ - **GPT-4o (0513)** and **Claude 3.5 Sonnet (1022)** lead the benchmark. **Claude 3.5 Sonnet (1022)** clearly improves over **Claude 3.5 Sonnet (0620)** in planning tasks (application dimension) and UI/Infographics inputs (input format dimension).
+ - **Qwen2-VL** stands out among open-source models, and its flagship model comes close to some proprietary flagship models
  - Chain-of-Thought (CoT) prompting improves proprietary models but has limited impact on open-source models
- - Gemini 1.5 Flash performs best among the evaluated efficiency models, but struggles with UI and document tasks
+ - **Gemini 1.5 Flash** performs best among the evaluated efficiency models, but struggles with UI and document tasks
  - Many open-source models face challenges in adhering to output format instructions
  
  ## 🎯 Interactive Visualization