metadata

license: apache-2.0
language:
  - en
pipeline_tag: image-text-to-text
tags:
  - multimodal
  - gui
library_name: transformers

UI-TARS-2B-SFT

Introduction

UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components—perception, reasoning, grounding, and memory—within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.

This repository contains the model for the paper UI-TARS: Pioneering Automated GUI Interaction with Native Agents.

Code: https://github.com/bytedance/UI-TARS

Performance

Perception Capabilty Evaluation

Model	VisualWebBench	WebSRC	SQAshort
Qwen2-VL-7B	73.3	81.8	84.9
Qwen-VL-Max	74.1	91.1	78.6
Gemini-1.5-Pro	75.4	88.9	82.2
UIX-Qwen2-7B	75.9	82.9	78.8
Claude-3.5-Sonnet	78.2	90.4	83.1
GPT-4o	78.5	87.7	82.3
UI-TARS-2B	72.9	89.2	86.4
UI-TARS-7B	79.7	93.6	87.7
UI-TARS-72B	82.8	89.3	88.6

Grounding Capability Evaluation

ScreenSpot Pro

Agent Model	Dev-Text	Dev-Icon	Dev-Avg	Creative-Text	Creative-Icon	Creative-Avg	CAD-Text	CAD-Icon	CAD-Avg	Scientific-Text	Scientific-Icon	Scientific-Avg	Office-Text	Office-Icon	Office-Avg	OS-Text	OS-Icon	OS-Avg	Avg-Text	Avg-Icon	Avg
QwenVL-7B	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.7	0.0	0.4	0.0	0.0	0.0	0.0	0.0	0.0	0.1	0.0	0.1
GPT-4o	1.3	0.0	0.7	1.0	0.0	0.6	2.0	0.0	1.5	2.1	0.0	1.2	1.1	0.0	0.9	0.0	0.0	0.0	1.3	0.0	0.8
SeeClick	0.6	0.0	0.3	1.0	0.0	0.6	2.5	0.0	1.9	3.5	0.0	2.0	1.1	0.0	0.9	2.8	0.0	1.5	1.8	0.0	1.1
Qwen2-VL-7B	2.6	0.0	1.3	1.5	0.0	0.9	0.5	0.0	0.4	6.3	0.0	3.5	3.4	1.9	3.0	0.9	0.0	0.5	2.5	0.2	1.6
OS-Atlas-4B	7.1	0.0	3.7	3.0	1.4	2.3	2.0	0.0	1.5	9.0	5.5	7.5	5.1	3.8	4.8	5.6	0.0	3.1	5.0	1.7	3.7
ShowUI-2B	16.9	1.4	9.4	9.1	0.0	5.3	2.5	0.0	1.9	13.2	7.3	10.6	15.3	7.5	13.5	10.3	2.2	6.6	10.8	2.6	7.7
CogAgent-18B	14.9	0.7	8.0	9.6	0.0	5.6	7.1	3.1	6.1	22.2	1.8	13.4	13.0	0.0	10.0	5.6	0.0	3.1	12.0	0.8	7.7
Aria-UI	16.2	0.0	8.4	23.7	2.1	14.7	7.6	1.6	6.1	27.1	6.4	18.1	20.3	1.9	16.1	4.7	0.0	2.6	17.1	2.0	11.3
UGround-7B	26.6	2.1	14.7	27.3	2.8	17.0	14.2	1.6	11.1	31.9	2.7	19.3	31.6	11.3	27.0	17.8	0.0	9.7	25.0	2.8	16.5
Claude Computer Use	22.0	3.9	12.6	25.9	3.4	16.8	14.5	3.7	11.9	33.9	15.8	25.8	30.1	16.3	26.9	11.0	4.5	8.1	23.4	7.1	17.1
OS-Atlas-7B	33.1	1.4	17.7	28.8	2.8	17.9	12.2	4.7	10.3	37.5	7.3	24.4	33.9	5.7	27.4	27.1	4.5	16.8	28.1	4.0	18.9
UGround-V1-7B	-	-	35.5	-	-	27.8	-	-	13.5	-	-	38.8	-	-	48.8	-	-	26.1	-	-	31.1
UI-TARS-2B	47.4	4.1	26.4	42.9	6.3	27.6	17.8	4.7	14.6	56.9	17.3	39.8	50.3	17.0	42.6	21.5	5.6	14.3	39.6	8.4	27.7
UI-TARS-7B	58.4	12.4	36.1	50.0	9.1	32.8	20.8	9.4	18.0	63.9	31.8	50.0	63.3	20.8	53.5	30.8	16.9	24.5	47.8	16.2	35.7
UI-TARS-72B	63.0	17.3	40.8	57.1	15.4	39.6	18.8	12.5	17.2	64.6	20.9	45.7	63.3	26.4	54.8	42.1	15.7	30.1	50.9	17.5	38.1

ScreenSpot v2

Method	Mobile-Text	Mobile-Icon/Widget	Desktop-Text	Desktop-Icon/Widget	Web-Text	Web-Icon/Widget	Avg
Agent Framework
GPT-4o (SeeClick)	85.2	58.8	79.9	37.1	72.7	30.1	63.6
GPT-4o (OS-Atlas-4B)	95.5	75.8	79.4	49.3	90.2	66.5	79.1
GPT-4o (OS-Atlas-7B)	96.2	83.4	89.7	69.3	94.0	79.8	87.1
Agent Model
SeeClick	78.4	50.7	70.1	29.3	55.2	32.5	55.1
OS-Atlas-4B	87.2	59.7	72.7	46.4	85.9	63.1	71.9
OS-Atlas-7B	95.2	75.8	90.7	63.6	90.6	77.3	84.1
Our Model
UI-TARS-2B	95.2	79.1	90.7	68.6	87.2	78.3	84.7
UI-TARS-7B	96.9	89.1	95.4	85.0	93.6	85.2	91.6
UI-TARS-72B	94.8	86.3	91.2	87.9	91.5	87.7	90.3

Online Agent Capability Evaluation

Method	OSWorld (Online)	AndroidWorld (Online)
Agent Framework
GPT-4o (UGround)	-	32.8
GPT-4o (Aria-UI)	15.2	44.8
GPT-4o (Aguvis-7B)	14.8	37.1
GPT-4o (Aguvis-72B)	17.0	-
GPT-4o (OS-Atlas-7B)	14.6	-
Agent Model
GPT-4o	5.0	34.5 (SoM)
Gemini-Pro-1.5	5.4	22.8 (SoM)
Aguvis-72B	10.3	26.1
Claude Computer-Use	14.9 (15 steps)	27.9
Claude Computer-Use	22.0 (50 steps)	-
Our Model
UI-TARS-7B-SFT	17.7 (15 steps)	33.0
UI-TARS-7B-DPO	18.7 (15 steps)	-
UI-TARS-72B-SFT	18.8 (15 steps)	46.6
UI-TARS-72B-DPO	22.7 (15 steps)	-
UI-TARS-72B-DPO	24.6 (50 steps)	-

How to use

The usage will be the same as LLaVA with a few additional tokens. Please refer to https://github.com/haotian-liu/LLaVA

License

UI-TARS is licensed under the Apache License 2.0.