Abstract
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity. While most agents are language-based, relying on closed-source APIs with text-rich meta-information (e.g., HTML or accessibility trees), they show limitations in perceiving UI visuals as humans do, highlighting the need for GUI visual agents. In this work, we develop a vision-language-action model for the digital world, namely ShowUI, which features the following innovations: (i) UI-Guided Visual Token Selection, which reduces computational costs by formulating screenshots as a UI connected graph, adaptively identifying redundant relationships among patches and using them as the criteria for token selection in self-attention blocks; (ii) Interleaved Vision-Language-Action Streaming, which flexibly unifies diverse needs within GUI tasks, enabling effective management of visual-action history in navigation or pairing multi-turn query-action sequences per screenshot to improve training efficiency; (iii) Small-scale, High-quality GUI Instruction-following Datasets, built through careful data curation and a resampling strategy that addresses significant data-type imbalances. With the above components, ShowUI, a lightweight 2B model trained on 256K samples, achieves a strong 75.1% accuracy in zero-shot screenshot grounding. Its UI-guided token selection further reduces redundant visual tokens by 33% during training and speeds up performance by 1.4x. Navigation experiments across web (Mind2Web), mobile (AITW), and online (MiniWob) environments further underscore the effectiveness and potential of our model in advancing GUI visual agents. The models are available at https://github.com/showlab/ShowUI.
Community
TLDR: ShowUI is a lightweight vision-language-action model for GUI agents.
Github: https://github.com/showlab/ShowUI/
ArXiv: https://arxiv.org/abs/2411.17465
HF Models: https://huggingface.co/showlab/ShowUI-2B
HF Spaces: https://huggingface.co/spaces/showlab/ShowUI
HF Datasets: https://huggingface.co/datasets/showlab/ShowUI-desktop-8K
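For anyone who wants to try the released checkpoint, here is a minimal inference sketch. It assumes ShowUI-2B can be loaded through the standard Qwen2-VL classes in transformers (the model is built on Qwen2-VL-2B); the prompt wording, screenshot path, and expected output format are illustrative, not the official recipe.

```python
# Minimal sketch, assuming ShowUI-2B loads via the Qwen2-VL classes in transformers.
# The prompt text, image path, and expected output format below are illustrative.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "showlab/ShowUI-2B", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("showlab/ShowUI-2B")

image = Image.open("screenshot.png")  # hypothetical screenshot of the target UI
query = "Click the search button."

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": query},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # e.g. a grounding result such as a normalized click coordinate
```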
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data (2024)
- Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents (2024)
- Harnessing Webpage UIs for Text-Rich Visual Understanding (2024)
- OS-ATLAS: A Foundation Action Model for Generalist GUI Agents (2024)
- Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms (2024)
- Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks (2024)
- DOGE: Towards Versatile Visual Document Grounding and Referring (2024)
My read from this paper:
A team from NUS and Microsoft just released an agent that can act on any UI (desktop, Android, web) without needing additional text information. It works extremely well: they applied their method to a tiny Qwen2-VL-2B and managed to beat methods that use much more powerful vision models (like GPT-4V), without relying on any additional information (e.g., the DOM of a webpage) as previous methods did!
They started from the idea that most existing methods rely heavily on text, which makes them less generalizable, while setting aside the rich UI structure that users actually rely on when navigating these interfaces.
They put several good ideas to work:
💡 Simplify screenshots to the max:
They heavily prune the visual content of UI screenshots by removing redundant image patches (e.g., any large area of uniform color is reduced to a single representative patch, while positional embeddings are kept), then group patches belonging to the same GUI element together to simplify things even further (see the sketch below).
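To make the patch-pruning idea concrete, here is a toy sketch of how one could build such a "UI connected graph" over screenshot patches and keep one visual token per redundant region. This is not the paper's implementation: the patch size, flatness threshold, and selection rule are assumptions for illustration.

```python
# Toy sketch of UI-guided visual token selection (illustrative, not the paper's code):
# group near-uniform neighboring patches into components and keep one token per component,
# while informative patches (text, icons, edges) are always kept.
import numpy as np
from scipy.ndimage import label  # connected-component labeling over the patch grid

def select_ui_tokens(image: np.ndarray, patch: int = 28, flat_thresh: float = 2.0):
    """image: HxWx3 uint8 screenshot; returns flat indices of patches kept as tokens."""
    H, W, _ = image.shape
    gh, gw = H // patch, W // patch
    grid = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, 3)
    # A patch whose pixels barely vary is a redundancy candidate (e.g. solid background).
    flat = grid.std(axis=(1, 3)).mean(axis=-1) < flat_thresh   # (gh, gw) boolean grid
    components, _ = label(flat)                                # the "UI connected graph"
    keep, seen = [], set()
    for idx in range(gh * gw):
        r, c = divmod(idx, gw)
        if not flat[r, c]:
            keep.append(idx)                   # informative patch: always keep
        elif components[r, c] not in seen:
            seen.add(components[r, c])
            keep.append(idx)                   # one representative per flat region
    return keep  # kept patches retain their original positional embeddings

# A mostly blank 560x560 screenshot keeps far fewer than the full (560/28)^2 = 400 tokens.
```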
💡 Build a truly generalist dataset:
To train a general UI agent, you need trajectories from every kind of UI, expressed in a common language. The authors merge datasets such as OmniAct for desktop, Mind2Web for websites, and AMEX for Android trajectories to create a high-quality and diverse dataset (a toy schema is sketched below).
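Merging these sources only works if each dataset's actions are mapped into one shared format. Below is a hypothetical sketch of such a unified action schema; the field names and the converter are illustrative, not the paper's actual action space.

```python
# Hypothetical unified action schema for merging heterogeneous GUI datasets.
# Field names and the converter below are illustrative, not the paper's definitions.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GUIAction:
    action: str                                      # e.g. "CLICK", "INPUT", "SCROLL"
    value: Optional[str] = None                      # text to type, scroll direction, ...
    position: Optional[Tuple[float, float]] = None   # (x, y) normalized to [0, 1]

def from_web_step(step: dict, width: int, height: int) -> GUIAction:
    """Convert a (hypothetical) web-trajectory step into the shared schema."""
    return GUIAction(
        action=step["operation"].upper(),            # "click" -> "CLICK"
        value=step.get("value"),
        position=(step["x"] / width, step["y"] / height),
    )

# Desktop (OmniAct), web (Mind2Web) and mobile (AMEX) steps can all be converted into
# this single format, so one model can be trained across devices.
```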
➡️ Nice results ensued:
They fine-tuned a tiny Qwen2-VL-2B with their method, and it reaches SOTA on several tasks (element identification, web navigation), even beating methods that either use additional info from the DOM or use much bigger VLMs like GPT-4V!
And performance could certainly jump with a slightly bigger vision model. Let's hope the community builds this soon!
Paper added to my "Agents" collection:
m-ric/agents-65ba776fbd9e29f771c07d4e