license: apache-2.0
language:
  - en
pipeline_tag: image-text-to-text
tags:
  - multimodal
  - gui
library_name: transformers

UI-TARS-2B-SFT

UI-TARS-2B-SFT  |  UI-TARS-2B-gguf  |  UI-TARS-7B-SFT  |  UI-TARS-7B-DPO (Recommended)  |  UI-TARS-7B-gguf  |  UI-TARS-72B-SFT  |  UI-TARS-72B-DPO (Recommended)

Introduction

UI-TARS is a next-generation native GUI agent model designed to interact with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components (perception, reasoning, grounding, and memory) within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.

This repository contains the model for the paper "UI-TARS: Pioneering Automated GUI Interaction with Native Agents".

Code: https://github.com/bytedance/UI-TARS
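
To make the single-model design from the introduction concrete, here is a minimal sketch of an agent loop in which one model call per step covers perception, reasoning, grounding, and memory. Every name in it (`run_episode`, `AgentState`, the `"finished"` action, the `click(10, 20)` action string) is hypothetical and chosen for illustration; none of it is taken from the UI-TARS codebase, and the model is replaced by a stub.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical types: none of these names come from the UI-TARS codebase.
Screenshot = bytes            # raw image bytes of the current screen
Step = Tuple[str, str]        # (thought, action) returned by the model


@dataclass
class AgentState:
    instruction: str
    history: List[Step] = field(default_factory=list)  # memory of past steps


def run_episode(
    instruction: str,
    capture: Callable[[], Screenshot],
    model: Callable[[str, List[Step], Screenshot], Step],
    execute: Callable[[str], None],
    max_steps: int = 15,
) -> AgentState:
    """Drive one task with a single model call per step.

    The model receives the instruction, the prior steps, and the current
    screenshot, and returns a (thought, action) pair. The loop has no
    separate planner, grounder, or memory module.
    """
    state = AgentState(instruction)
    for _ in range(max_steps):
        thought, action = model(state.instruction, state.history, capture())
        state.history.append((thought, action))
        if action == "finished":  # hypothetical terminal action
            break
        execute(action)
    return state


def fake_model(instruction, history, screenshot):
    # Stub policy standing in for the VLM: click once, then stop.
    if history:
        return ("the menu is open", "finished")
    return ("open the menu", "click(10, 20)")


executed: List[str] = []
final = run_episode("Open the menu", lambda: b"", fake_model, executed.append)
```

In a real setup, `capture` would take a screenshot, `model` would call the VLM, and `execute` would send the action to the device; the loop itself would stay this small.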

Performance

Perception Capability Evaluation

| Model | VisualWebBench | WebSRC | SQAshort |
|---|---|---|---|
| Qwen2-VL-7B | 73.3 | 81.8 | 84.9 |
| Qwen-VL-Max | 74.1 | 91.1 | 78.6 |
| Gemini-1.5-Pro | 75.4 | 88.9 | 82.2 |
| UIX-Qwen2-7B | 75.9 | 82.9 | 78.8 |
| Claude-3.5-Sonnet | 78.2 | 90.4 | 83.1 |
| GPT-4o | 78.5 | 87.7 | 82.3 |
| UI-TARS-2B | 72.9 | 89.2 | 86.4 |
| UI-TARS-7B | 79.7 | 93.6 | 87.7 |
| UI-TARS-72B | 82.8 | 89.3 | 88.6 |

Grounding Capability Evaluation

  • ScreenSpot Pro
| Agent Model | Dev-Text | Dev-Icon | Dev-Avg | Creative-Text | Creative-Icon | Creative-Avg | CAD-Text | CAD-Icon | CAD-Avg | Scientific-Text | Scientific-Icon | Scientific-Avg | Office-Text | Office-Icon | Office-Avg | OS-Text | OS-Icon | OS-Avg | Avg-Text | Avg-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| QwenVL-7B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.1 |
| GPT-4o | 1.3 | 0.0 | 0.7 | 1.0 | 0.0 | 0.6 | 2.0 | 0.0 | 1.5 | 2.1 | 0.0 | 1.2 | 1.1 | 0.0 | 0.9 | 0.0 | 0.0 | 0.0 | 1.3 | 0.0 | 0.8 |
| SeeClick | 0.6 | 0.0 | 0.3 | 1.0 | 0.0 | 0.6 | 2.5 | 0.0 | 1.9 | 3.5 | 0.0 | 2.0 | 1.1 | 0.0 | 0.9 | 2.8 | 0.0 | 1.5 | 1.8 | 0.0 | 1.1 |
| Qwen2-VL-7B | 2.6 | 0.0 | 1.3 | 1.5 | 0.0 | 0.9 | 0.5 | 0.0 | 0.4 | 6.3 | 0.0 | 3.5 | 3.4 | 1.9 | 3.0 | 0.9 | 0.0 | 0.5 | 2.5 | 0.2 | 1.6 |
| OS-Atlas-4B | 7.1 | 0.0 | 3.7 | 3.0 | 1.4 | 2.3 | 2.0 | 0.0 | 1.5 | 9.0 | 5.5 | 7.5 | 5.1 | 3.8 | 4.8 | 5.6 | 0.0 | 3.1 | 5.0 | 1.7 | 3.7 |
| ShowUI-2B | 16.9 | 1.4 | 9.4 | 9.1 | 0.0 | 5.3 | 2.5 | 0.0 | 1.9 | 13.2 | 7.3 | 10.6 | 15.3 | 7.5 | 13.5 | 10.3 | 2.2 | 6.6 | 10.8 | 2.6 | 7.7 |
| CogAgent-18B | 14.9 | 0.7 | 8.0 | 9.6 | 0.0 | 5.6 | 7.1 | 3.1 | 6.1 | 22.2 | 1.8 | 13.4 | 13.0 | 0.0 | 10.0 | 5.6 | 0.0 | 3.1 | 12.0 | 0.8 | 7.7 |
| Aria-UI | 16.2 | 0.0 | 8.4 | 23.7 | 2.1 | 14.7 | 7.6 | 1.6 | 6.1 | 27.1 | 6.4 | 18.1 | 20.3 | 1.9 | 16.1 | 4.7 | 0.0 | 2.6 | 17.1 | 2.0 | 11.3 |
| UGround-7B | 26.6 | 2.1 | 14.7 | 27.3 | 2.8 | 17.0 | 14.2 | 1.6 | 11.1 | 31.9 | 2.7 | 19.3 | 31.6 | 11.3 | 27.0 | 17.8 | 0.0 | 9.7 | 25.0 | 2.8 | 16.5 |
| Claude Computer Use | 22.0 | 3.9 | 12.6 | 25.9 | 3.4 | 16.8 | 14.5 | 3.7 | 11.9 | 33.9 | 15.8 | 25.8 | 30.1 | 16.3 | 26.9 | 11.0 | 4.5 | 8.1 | 23.4 | 7.1 | 17.1 |
| OS-Atlas-7B | 33.1 | 1.4 | 17.7 | 28.8 | 2.8 | 17.9 | 12.2 | 4.7 | 10.3 | 37.5 | 7.3 | 24.4 | 33.9 | 5.7 | 27.4 | 27.1 | 4.5 | 16.8 | 28.1 | 4.0 | 18.9 |
| UGround-V1-7B | - | - | 35.5 | - | - | 27.8 | - | - | 13.5 | - | - | 38.8 | - | - | 48.8 | - | - | 26.1 | - | - | 31.1 |
| UI-TARS-2B | 47.4 | 4.1 | 26.4 | 42.9 | 6.3 | 27.6 | 17.8 | 4.7 | 14.6 | 56.9 | 17.3 | 39.8 | 50.3 | 17.0 | 42.6 | 21.5 | 5.6 | 14.3 | 39.6 | 8.4 | 27.7 |
| UI-TARS-7B | 58.4 | 12.4 | 36.1 | 50.0 | 9.1 | 32.8 | 20.8 | 9.4 | 18.0 | 63.9 | 31.8 | 50.0 | 63.3 | 20.8 | 53.5 | 30.8 | 16.9 | 24.5 | 47.8 | 16.2 | 35.7 |
| UI-TARS-72B | 63.0 | 17.3 | 40.8 | 57.1 | 15.4 | 39.6 | 18.8 | 12.5 | 17.2 | 64.6 | 20.9 | 45.7 | 63.3 | 26.4 | 54.8 | 42.1 | 15.7 | 30.1 | 50.9 | 17.5 | 38.1 |
  • ScreenSpot v2
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|---|---|---|---|---|---|---|---|
| Agent Framework | | | | | | | |
| GPT-4o (SeeClick) | 85.2 | 58.8 | 79.9 | 37.1 | 72.7 | 30.1 | 63.6 |
| GPT-4o (OS-Atlas-4B) | 95.5 | 75.8 | 79.4 | 49.3 | 90.2 | 66.5 | 79.1 |
| GPT-4o (OS-Atlas-7B) | 96.2 | 83.4 | 89.7 | 69.3 | 94.0 | 79.8 | 87.1 |
| Agent Model | | | | | | | |
| SeeClick | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 | 55.1 |
| OS-Atlas-4B | 87.2 | 59.7 | 72.7 | 46.4 | 85.9 | 63.1 | 71.9 |
| OS-Atlas-7B | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 |
| Our Model | | | | | | | |
| UI-TARS-2B | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 | 84.7 |
| UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 |
| UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 |
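
This README does not say how the grounding scores are computed. ScreenSpot-style benchmarks are commonly scored as click accuracy: a prediction counts as correct when the predicted point falls inside the target element's bounding box. The sketch below assumes that definition; the function names, the `(left, top, right, bottom)` box layout, and the sample data are illustrative only.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]                # (x, y) predicted click location
Box = Tuple[float, float, float, float]    # (left, top, right, bottom) of the target


def is_hit(point: Point, box: Box) -> bool:
    """True when the predicted point lies inside the box, edges included."""
    x, y = point
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom


def click_accuracy(samples: Iterable[Tuple[Point, Box]]) -> float:
    """Percentage of samples whose predicted point hits its target box."""
    samples = list(samples)
    if not samples:
        raise ValueError("no samples")
    hits = sum(is_hit(point, box) for point, box in samples)
    return 100.0 * hits / len(samples)


# Made-up predictions, not benchmark data.
samples = [
    ((50, 50), (0, 0, 100, 100)),    # inside the box: hit
    ((150, 50), (0, 0, 100, 100)),   # right of the box: miss
    ((10, 90), (0, 80, 20, 100)),    # inside the box: hit
    ((0, 0), (1, 1, 2, 2)),          # above and left of the box: miss
]
accuracy = click_accuracy(samples)   # 2 hits out of 4 samples
```

When accuracy is pooled over all samples like this, an overall average need not equal the plain mean of the per-category columns, because categories can contain different numbers of samples.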

Online Agent Capability Evaluation

| Method | OSWorld (Online) | AndroidWorld (Online) |
|---|---|---|
| Agent Framework | | |
| GPT-4o (UGround) | - | 32.8 |
| GPT-4o (Aria-UI) | 15.2 | 44.8 |
| GPT-4o (Aguvis-7B) | 14.8 | 37.1 |
| GPT-4o (Aguvis-72B) | 17.0 | - |
| GPT-4o (OS-Atlas-7B) | 14.6 | - |
| Agent Model | | |
| GPT-4o | 5.0 | 34.5 (SoM) |
| Gemini-Pro-1.5 | 5.4 | 22.8 (SoM) |
| Aguvis-72B | 10.3 | 26.1 |
| Claude Computer-Use | 14.9 (15 steps) | 27.9 |
| Claude Computer-Use | 22.0 (50 steps) | - |
| Our Model | | |
| UI-TARS-7B-SFT | 17.7 (15 steps) | 33.0 |
| UI-TARS-7B-DPO | 18.7 (15 steps) | - |
| UI-TARS-72B-SFT | 18.8 (15 steps) | 46.6 |
| UI-TARS-72B-DPO | 22.7 (15 steps) | - |
| UI-TARS-72B-DPO | 24.6 (50 steps) | - |
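
The "(15 steps)" and "(50 steps)" annotations are read here as the maximum number of actions the agent may take per task; that reading is an assumption, since this README does not define them. Under it, the same set of tasks can produce a higher success rate with a larger budget, as the sketch below illustrates with made-up outcomes. The function name and the data are hypothetical.

```python
from typing import List, Optional


def success_rate(steps_to_success: List[Optional[int]], max_steps: int) -> float:
    """Percentage of tasks solved within max_steps.

    steps_to_success[i] is the step at which task i was solved,
    or None if the task was never solved.
    """
    if not steps_to_success:
        raise ValueError("no tasks")
    solved = sum(
        1 for steps in steps_to_success
        if steps is not None and steps <= max_steps
    )
    return 100.0 * solved / len(steps_to_success)


# Made-up outcomes for five tasks, not benchmark data.
outcomes = [4, 12, 30, None, 48]
rate_15 = success_rate(outcomes, max_steps=15)  # tasks solved at steps 4 and 12
rate_50 = success_rate(outcomes, max_steps=50)  # also those solved at 30 and 48
```

This is why scores reported under different step budgets are best compared only against rows with the same budget.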

How to use

Usage is the same as for LLaVA, with a few additional tokens. See https://github.com/haotian-liu/LLaVA for details.
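
As a rough illustration of what LLaVA-style usage looks like at the prompt level, the sketch below composes a user turn the way LLaVA does: an `<image>` placeholder, a newline, then the text instruction. The helper name `build_user_turn` is hypothetical. The additional tokens mentioned above are not listed in this README, so the sketch leaves them out, and it does not load or run the model.

```python
# LLaVA marks where the image features are inserted with this placeholder.
IMAGE_PLACEHOLDER = "<image>"


def build_user_turn(instruction: str) -> str:
    """Compose a LLaVA-style user turn: image placeholder, newline, text.

    Hypothetical helper for illustration. Any UI-TARS-specific tokens are
    not represented because the README does not list them.
    """
    instruction = instruction.strip()
    if not instruction:
        raise ValueError("instruction must not be empty")
    return f"{IMAGE_PLACEHOLDER}\n{instruction}"


prompt = build_user_turn("Click the Settings icon.")
print(prompt)
```

The resulting string would then be wrapped in the conversation template and paired with the screenshot by whatever inference code is used; consult the linked repositories for the exact steps.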

License

UI-TARS is licensed under the Apache License 2.0.