ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Abstract
Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to Large Language Models and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.
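To illustrate the pix2struct-style flexible patching the abstract refers to, here is a minimal sketch of how an aspect-ratio-preserving patch grid can be computed under a fixed patch budget. The function name and parameter defaults are illustrative assumptions, not ScreenAI's actual implementation.

```python
import math

def flexible_patch_grid(height, width, patch_size=16, max_patches=4096):
    """Illustrative pix2struct-style flexible patching.

    Instead of resizing every screenshot to a fixed square, scale the image
    so that an aspect-ratio-preserving grid of patch_size x patch_size
    patches fits within the max_patches budget.
    Returns the patch grid (rows, cols) and the resized (height, width).
    """
    # Largest uniform scale s such that
    # (height * s / patch_size) * (width * s / patch_size) <= max_patches
    scale = math.sqrt(max_patches * patch_size ** 2 / (height * width))
    rows = max(int(height * scale / patch_size), 1)
    cols = max(int(width * scale / patch_size), 1)
    return rows, cols, rows * patch_size, cols * patch_size

# Example: a tall 2400x1080 phone screenshot keeps its aspect ratio,
# receiving more vertical than horizontal patches.
rows, cols, new_h, new_w = flexible_patch_grid(2400, 1080)
print(rows, cols, new_h, new_w)  # e.g. 95 42 1520 672
```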
Community
Exciting release! Wonder when the code and demo come out.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions (2024)
- DocLLM: A layout-aware generative language model for multimodal document understanding (2023)
- InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models (2023)
- 3DMIT: 3D Multi-modal Instruction Tuning for Scene Understanding (2024)
- MouSi: Poly-Visual-Expert Vision-Language Models (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot recommend
Could try using the Crit Design AI extension for now. It basically does the same job as ScreenAI at the moment and can also switch between GPT-4 Vision, Gemini Vision Pro, Claude Vision, and LLaVA-34B.
ScreenAI: The Future of UI and Infographics Understanding Unveiled!
Links:
Subscribe: https://www.youtube.com/@Arxflix
Twitter: https://x.com/arxflix
LMNT (Partner): https://lmnt.com/