InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback
Abstract
Existing benchmarks do not test Large Multimodal Models (LMMs) on their interactive intelligence with human users, an ability that is vital for developing general-purpose AI assistants. We design InterFeedback, an interactive framework that can be applied to any LMM and dataset to assess this ability autonomously. On top of this, we introduce InterFeedback-Bench, which evaluates interactive intelligence using two representative datasets, MMMU-Pro and MathVerse, to test 10 different open-source LMMs. Additionally, we present InterFeedback-Human, a newly collected dataset of 120 cases designed for manually testing interactive performance in leading models such as OpenAI-o1 and Claude-3.5-Sonnet. Our evaluation results show that even state-of-the-art LMMs (such as OpenAI-o1) correct their results through human feedback in fewer than 50% of cases. Our findings point to the need for methods that enhance LMMs' ability to interpret and benefit from feedback.
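The page does not include code, but the evaluation protocol the abstract describes (answer, receive feedback on an incorrect answer, revise) can be sketched as a simple loop. This is a minimal illustration, assuming hypothetical `model` and `give_feedback` callables and a `max_rounds` feedback budget; it is not the authors' implementation.

```python
from typing import Callable, Optional

def interactive_eval(model: Callable[[str], str],
                     give_feedback: Callable[[str, str], Optional[str]],
                     question: str,
                     max_rounds: int = 3) -> tuple[bool, int]:
    """Hypothetical sketch of an interactive feedback loop: query the model,
    and while the answer is judged incorrect, feed the critique back and ask
    the model to revise, up to max_rounds rounds of feedback."""
    prompt = question
    for rounds_used in range(max_rounds + 1):
        answer = model(prompt)
        feedback = give_feedback(question, answer)  # None signals a correct answer
        if feedback is None:
            return True, rounds_used  # solved after this many feedback rounds
        # Fold the feedback into the next prompt so the model can self-correct.
        prompt = (f"{question}\n"
                  f"Your previous answer: {answer}\n"
                  f"Feedback: {feedback}\n"
                  f"Please provide a revised answer.")
    return False, max_rounds  # never corrected within the feedback budget
```

A headline metric like the abstract's "fewer than 50% of cases" would then be the fraction of initially incorrect cases for which such a loop returns `True`.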
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- EmbodiedEval: Evaluate Multimodal LLMs as Embodied Agents (2025)
- Multimodal RewardBench: Holistic Evaluation of Reward Models for Vision Language Models (2025)
- MM-RLHF: The Next Step Forward in Multimodal LLM Alignment (2025)
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark (2025)
- Towards Interactive Deepfake Analysis (2025)
- EmoBench-M: Benchmarking Emotional Intelligence for Multimodal Large Language Models (2025)
- Human-AI Collaborative Game Testing with Vision Language Models (2025)