Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models
Abstract
While the situation has improved for text-only models, it currently again seems to be the case that multimodal (text and image) models develop faster than ways to evaluate them. In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models, namely evaluation through goal-oriented game (self-)play, complementing reference-based and preference-based evaluation. Specifically, we define games that challenge a model's capability to represent a situation from visual information and to align such representations through dialogue. We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them. On further analysis, we find that the exceptional deep captioning capabilities of the largest models drive some of the performance. There is still room to grow for both kinds of models, ensuring the continued relevance of the benchmark.
Community
The paper proposes a new benchmark for evaluating multimodal LLMs on instruction following, long-context dialogue understanding, and goal-oriented dialogue state tracking, with all of this happening automatically, without any need for human-annotated data points. The main contribution is the idea of having the LLMs self-play with each other and evaluating whether the models reach the goal state. The self-play is set up by prompting the models with the rules, the goal state, etc., and the states are tracked by a programmatic game master that controls the flow of the played dialogue games; a minimal sketch of this loop is given below.
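To make that setup concrete, here is a minimal sketch of a prompted self-play loop moderated by a programmatic game master. All names here (`query_model`, `GameMaster`, the move tags) are illustrative assumptions, not the benchmark's actual API.

```python
# Sketch of prompted self-play with a programmatic game master (illustrative only).

from dataclasses import dataclass, field


def query_model(model_name: str, messages: list[dict]) -> str:
    """Placeholder for a call to an LLM backend; replace with a real API client."""
    raise NotImplementedError


@dataclass
class GameMaster:
    rules_prompt_a: str            # rules and goal state prompted to player A
    rules_prompt_b: str            # rules and goal state prompted to player B
    max_turns: int = 10
    transcript: list[dict] = field(default_factory=list)

    def is_valid(self, move: str) -> bool:
        """Check that a move follows the prompted format; concrete games override this."""
        return move.strip().startswith(("DESCRIPTION:", "QUESTION:", "ANSWER:", "DECISION:"))

    def is_terminal(self, move: str) -> bool:
        """An episode ends once a player commits to a final decision."""
        return move.strip().startswith("DECISION:")

    def play(self, model_a: str, model_b: str) -> list[dict]:
        history_a = [{"role": "system", "content": self.rules_prompt_a}]
        history_b = [{"role": "system", "content": self.rules_prompt_b}]
        for _ in range(self.max_turns):
            for name, own, other in ((model_a, history_a, history_b),
                                     (model_b, history_b, history_a)):
                move = query_model(name, own)
                self.transcript.append({"player": name, "move": move})
                if not self.is_valid(move):   # rule violations abort the episode
                    return self.transcript
                own.append({"role": "assistant", "content": move})
                other.append({"role": "user", "content": move})
                if self.is_terminal(move):
                    return self.transcript
        return self.transcript
```

The key design point is that the game master, not the models, decides when an episode ends or is aborted, which is what makes the evaluation fully automatic.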
The current benchmark includes three different games, each with both a textual and a multimodal (image+text) variant. The table below shows the current standing of multimodal commercial and open-weight models.
The current framework supports adding new games with minimal effort and already covers models that accept multimodal input (image+text); a hypothetical sketch of such an extension point follows.
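As an illustration of what "adding a game with minimal effort" could look like, here is an assumed registration pattern in which a new game only supplies its prompts, a move validator, and an episode scorer. This is a sketch of the idea, not the framework's real interface.

```python
# Hypothetical plug-in structure for adding a new dialogue game (illustrative only).

from typing import Callable


class GameSpec:
    """Bundle of the pieces a new game needs to provide."""

    def __init__(self, name: str,
                 build_prompts: Callable[[dict], tuple[str, str]],
                 validate_move: Callable[[str], bool],
                 score_episode: Callable[[list[str]], float]) -> None:
        self.name = name
        self.build_prompts = build_prompts    # game instance -> (prompt for A, prompt for B)
        self.validate_move = validate_move    # reject turns that break the rules
        self.score_episode = score_episode    # transcript -> episode score in [0, 1]


GAME_REGISTRY: dict[str, GameSpec] = {}


def register_game(spec: GameSpec) -> None:
    GAME_REGISTRY[spec.name] = spec


# Registering a trivial text-only game; a multimodal game would additionally attach
# image paths or URLs to the prompts it builds.
register_game(GameSpec(
    name="echo",
    build_prompts=lambda inst: (f"Repeat the word: {inst['word']}", "Judge the echo."),
    validate_move=lambda move: bool(move.strip()),
    score_episode=lambda transcript: 1.0 if transcript else 0.0,
))
```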
Here is one example dialogue for the game MatchIt, where two players (LLMs) need to decide whether they were given the same image or not by discussing the content of their images in text.
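A MatchIt-style episode can be scored mechanically once both players have committed to a decision. The sketch below assumes the game master knows the ground truth and that each player ends with a line like `DECISION: same image`; the tag format and function names are assumptions for illustration.

```python
# Minimal sketch of scoring a MatchIt-style episode (illustrative tag format).


def parse_decision(final_turn: str) -> bool | None:
    """Map a player's final message to True (same), False (different) or None (invalid)."""
    text = final_turn.lower()
    if "decision:" not in text:
        return None
    decision = text.split("decision:", 1)[1]
    if "same" in decision:
        return True
    if "different" in decision:
        return False
    return None


def score_matchit(decision_a: str, decision_b: str, images_are_same: bool) -> float:
    """Episode success: both players must decide correctly; invalid turns count as aborted."""
    parsed = [parse_decision(decision_a), parse_decision(decision_b)]
    if None in parsed:
        return 0.0
    return 1.0 if all(p == images_are_same for p in parsed) else 0.0


# Example: both players correctly conclude the images differ.
print(score_matchit("DECISION: different image", "DECISION: different image", False))  # 1.0
```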
Here is another example for the game Reference, where one player describes the target item by differentiating it from distractor items, and the other player needs to guess which item is the target by looking at the generated referring expression. It can be seen that most models struggle with such items (Pentomino puzzle pieces), and we speculate that this is due to the limits of their vision encoders.
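For clarity, here is a sketch of how a Reference episode could proceed end to end, under stated assumptions: the describer sees the images with the target marked, the guesser sees the same images in a shuffled order plus the referring expression, and success means the guess resolves to the target. `query_model` is a placeholder, and the prompt/answer formats are assumptions, not the benchmark's actual ones.

```python
# Sketch of a Reference-game episode: describe the target, then resolve the guess.

import random
import re


def query_model(model: str, prompt: str, images: list[str]) -> str:
    """Placeholder for a multimodal LLM call (text prompt plus image attachments)."""
    raise NotImplementedError


def play_reference(model_a: str, model_b: str, images: list[str], target_idx: int) -> float:
    # Player A produces a referring expression that should single out the target.
    expression = query_model(
        model_a,
        f"Describe image {target_idx + 1} so it can be told apart from the others. "
        "Answer with 'EXPRESSION: <description>'.",
        images,
    )
    # Player B sees the images in a different order and must pick the described one.
    order = list(range(len(images)))
    random.shuffle(order)
    guess_raw = query_model(
        model_b,
        f"Which image matches this description: {expression} "
        "Answer with 'GUESS: <number>'.",
        [images[i] for i in order],
    )
    match = re.search(r"GUESS:\s*(\d+)", guess_raw)
    if match is None:
        return 0.0  # invalid move, episode aborted
    guess = int(match.group(1)) - 1
    if not 0 <= guess < len(order):
        return 0.0
    return 1.0 if order[guess] == target_idx else 0.0
```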
Here is another example for the game MapNavigation, where a player is expected to navigate a map in the minimal number of steps (efficiency) while visiting all rooms (exploration).
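The two aspects mentioned, efficiency and exploration, can be quantified once the game master has the room graph and the sequence of rooms the model visited. The metric definitions below are illustrative, not the paper's exact formulas.

```python
# Sketch of exploration and efficiency measures for a MapNavigation-style episode.


def exploration(visited: list[str], all_rooms: set[str]) -> float:
    """Fraction of rooms the player entered at least once."""
    return len(set(visited) & all_rooms) / len(all_rooms)


def min_moves_lower_bound(graph: dict[str, set[str]]) -> int:
    """Simple lower bound: visiting n rooms takes at least n - 1 moves."""
    return len(graph) - 1


def efficiency(visited: list[str], graph: dict[str, set[str]]) -> float:
    """Ratio of the lower bound on required moves to the moves actually taken (capped at 1)."""
    moves = max(len(visited) - 1, 1)
    return min(1.0, min_moves_lower_bound(graph) / moves)


# Example: a 4-room corridor where the player backtracks once before finishing.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
trace = ["A", "B", "A", "B", "C", "D"]
print(exploration(trace, set(graph)))  # 1.0: every room was visited
print(efficiency(trace, graph))        # 0.6: 3 necessary moves vs 5 taken
```

A loop between two rooms, as observed with some models, shows up directly here: exploration stalls below 1.0 while efficiency keeps dropping.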
Overall, models are capable of playing most of these game episodes but still lack certain capabilities needed to achieve high success scores. Most of these shortcomings can be traced to the limits of vision encoders (the Reference game with Pentomino pieces) or to reasoning abilities (in MapNavigation, models sometimes get stuck in a loop between the same rooms). The performance gap between commercial and open-weight models is substantial, in contrast to how text-only open-weight models (e.g. Llama-3) are catching up to GPT models.