Add link to paper

#1
by nielsr HF staff - opened
Files changed (1)
  1. README.md +354 -194
README.md CHANGED
@@ -1,195 +1,355 @@
1
- ---
2
- license: apache-2.0
3
- language:
4
- - en
5
- pipeline_tag: image-text-to-text
6
- tags:
7
- - multimodal
8
- - gui
9
- library_name: transformers
10
- ---
11
-
12
-
13
- # UI-TARS-7B-SFT
14
- [UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT)  | 
15
- [UI-TARS-2B-gguf](https://huggingface.co/bytedance-research/UI-TARS-2B-gguf)  | 
16
- [UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT)  | 
17
- [UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO)  | 
18
- [UI-TARS-7B-gguf](https://huggingface.co/bytedance-research/UI-TARS-7B-gguf)  | 
19
- [UI-TARS-72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT)  | 
20
- [UI-TARS-72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)
21
- ## Introduction
22
-
23
- UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components—perception, reasoning, grounding, and memory—within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.
24
- <!-- ![Local Image](figures/UI-TARS.png) -->
25
- <p align="center">
26
- <img src="https://github.com/bytedance/UI-TARS/blob/main/figures/UI-TARS-vs-Previous-SOTA.png?raw=true" width="90%"/>
27
- <p>
28
- <p align="center">
29
- <img src="https://github.com/bytedance/UI-TARS/blob/main/figures/UI-TARS.png?raw=true" width="90%"/>
30
- <p>
31
-
32
- <!-- ![Local Image](figures/UI-TARS-vs-Previous-SOTA.png) -->
33
-
34
-
35
- ## Performance
36
- **Perception Capability Evaluation**
37
- | Model | VisualWebBench | WebSRC | SQAshort |
38
- |---------------------------|---------------|---------|----------|
39
- | Qwen2-VL-7B | 73.3 | 81.8 | 84.9 |
40
- | Qwen-VL-Max | 74.1 | 91.1 | 78.6 |
41
- | Gemini-1.5-Pro | 75.4 | 88.9 | 82.2 |
42
- | UIX-Qwen2-7B | 75.9 | 82.9 | 78.8 |
43
- | Claude-3.5-Sonnet | 78.2 | 90.4 | 83.1 |
44
- | GPT-4o | 78.5 | 87.7 | 82.3 |
45
- | **UI-TARS-2B** | 72.9 | 89.2 | 86.4 |
46
- | **UI-TARS-7B** | 79.7 | **93.6** | 87.7 |
47
- | **UI-TARS-72B** | **82.8** | 89.3 | **88.6** |
48
-
49
- **Grounding Capability Evaluation**
50
- - **ScreenSpot Pro**
51
-
52
- | Agent Model | Dev-Text | Dev-Icon | Dev-Avg | Creative-Text | Creative-Icon | Creative-Avg | CAD-Text | CAD-Icon | CAD-Avg | Scientific-Text | Scientific-Icon | Scientific-Avg | Office-Text | Office-Icon | Office-Avg | OS-Text | OS-Icon | OS-Avg | Avg-Text | Avg-Icon | Avg |
53
- |--------------------------|----------|----------|----------|--------------|--------------|--------------|---------|---------|---------|---------------|---------------|---------------|------------|------------|------------|--------|--------|--------|---------|---------|------|
54
- | QwenVL-7B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | **0.1** |
55
- | GPT-4o | 1.3 | 0.0 | 0.7 | 1.0 | 0.0 | 0.6 | 2.0 | 0.0 | 1.5 | 2.1 | 0.0 | 1.2 | 1.1 | 0.0 | 0.9 | 0.0 | 0.0 | 0.0 | 1.3 | 0.0 | **0.8** |
56
- | SeeClick | 0.6 | 0.0 | 0.3 | 1.0 | 0.0 | 0.6 | 2.5 | 0.0 | 1.9 | 3.5 | 0.0 | 2.0 | 1.1 | 0.0 | 0.9 | 2.8 | 0.0 | 1.5 | 1.8 | 0.0 | **1.1** |
57
- | Qwen2-VL-7B | 2.6 | 0.0 | 1.3 | 1.5 | 0.0 | 0.9 | 0.5 | 0.0 | 0.4 | 6.3 | 0.0 | 3.5 | 3.4 | 1.9 | 3.0 | 0.9 | 0.0 | 0.5 | 2.5 | 0.2 | **1.6** |
58
- | OS-Atlas-4B | 7.1 | 0.0 | 3.7 | 3.0 | 1.4 | 2.3 | 2.0 | 0.0 | 1.5 | 9.0 | 5.5 | 7.5 | 5.1 | 3.8 | 4.8 | 5.6 | 0.0 | 3.1 | 5.0 | 1.7 | **3.7** |
59
- | ShowUI-2B | 16.9 | 1.4 | 9.4 | 9.1 | 0.0 | 5.3 | 2.5 | 0.0 | 1.9 | 13.2 | 7.3 | 10.6 | 15.3 | 7.5 | 13.5 | 10.3 | 2.2 | 6.6 | 10.8 | 2.6 | **7.7** |
60
- | CogAgent-18B | 14.9 | 0.7 | 8.0 | 9.6 | 0.0 | 5.6 | 7.1 | 3.1 | 6.1 | 22.2 | 1.8 | 13.4 | 13.0 | 0.0 | 10.0 | 5.6 | 0.0 | 3.1 | 12.0 | 0.8 | **7.7** |
61
- | Aria-UI | 16.2 | 0.0 | 8.4 | 23.7 | 2.1 | 14.7 | 7.6 | 1.6 | 6.1 | 27.1 | 6.4 | 18.1 | 20.3 | 1.9 | 16.1 | 4.7 | 0.0 | 2.6 | 17.1 | 2.0 | **11.3** |
62
- | UGround-7B | 26.6 | 2.1 | 14.7 | 27.3 | 2.8 | 17.0 | 14.2 | 1.6 | 11.1 | 31.9 | 2.7 | 19.3 | 31.6 | 11.3 | 27.0 | 17.8 | 0.0 | 9.7 | 25.0 | 2.8 | **16.5** |
63
- | Claude Computer Use | 22.0 | 3.9 | 12.6 | 25.9 | 3.4 | 16.8 | 14.5 | 3.7 | 11.9 | 33.9 | 15.8 | 25.8 | 30.1 | 16.3 | 26.9 | 11.0 | 4.5 | 8.1 | 23.4 | 7.1 | **17.1** |
64
- | OS-Atlas-7B | 33.1 | 1.4 | 17.7 | 28.8 | 2.8 | 17.9 | 12.2 | 4.7 | 10.3 | 37.5 | 7.3 | 24.4 | 33.9 | 5.7 | 27.4 | 27.1 | 4.5 | 16.8 | 28.1 | 4.0 | **18.9** |
65
- | UGround-V1-7B | - | - | 35.5 | - | - | 27.8 | - | - | 13.5 | - | - | 38.8 | - | - | 48.8 | - | - | 26.1 | - | - | **31.1** |
66
- | **UI-TARS-2B** | 47.4 | 4.1 | 26.4 | 42.9 | 6.3 | 27.6 | 17.8 | 4.7 | 14.6 | 56.9 | 17.3 | 39.8 | 50.3 | 17.0 | 42.6 | 21.5 | 5.6 | 14.3 | 39.6 | 8.4 | **27.7** |
67
- | **UI-TARS-7B** | 58.4 | 12.4 | 36.1 | 50.0 | 9.1 | 32.8 | **20.8**| 9.4 | **18.0**| 63.9 | **31.8** | **50.0** | **63.3** | 20.8 | 53.5 | 30.8 | **16.9**| 24.5 | 47.8 | 16.2 | **35.7** |
68
- | **UI-TARS-72B** | **63.0** | **17.3** | **40.8** | **57.1** | **15.4** | **39.6** | 18.8 | **12.5**| 17.2 | **64.6** | 20.9 | 45.7 | **63.3** | **26.4** | **54.8** | **42.1**| 15.7 | **30.1**| **50.9**| **17.5**| **38.1** |
69
-
70
-
71
- - **ScreenSpot**
72
-
73
- | Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
74
- |--------|-------------|-------------|-------------|-------------|-------------|---------|---------|
75
- | **Agent Framework** | | | | | | | |
76
- | GPT-4 (SeeClick) | 76.6 | 55.5 | 68.0 | 28.6 | 40.9 | 23.3 | **48.8** |
77
- | GPT-4 (OmniParser) | 93.9 | 57.0 | 91.3 | 63.6 | 81.3 | 51.0 | **73.0** |
78
- | GPT-4 (UGround-7B) | 90.1 | 70.3 | 87.1 | 55.7 | 85.7 | 64.6 | **75.6** |
79
- | GPT-4o (SeeClick) | 81.0 | 59.8 | 69.6 | 33.6 | 43.9 | 26.2 | **52.3** |
80
- | GPT-4o (UGround-7B) | 93.4 | 76.9 | 92.8 | 67.9 | 88.7 | 68.9 | **81.4** |
81
- | **Agent Model** | | | | | | | |
82
- | GPT-4 | 22.6 | 24.5 | 20.2 | 11.8 | 9.2 | 8.8 | **16.2** |
83
- | GPT-4o | 20.2 | 24.9 | 21.1 | 23.6 | 12.2 | 7.8 | **18.3** |
84
- | CogAgent | 67.0 | 24.0 | 74.2 | 20.0 | 70.4 | 28.6 | **47.4** |
85
- | SeeClick | 78.0 | 52.0 | 72.2 | 30.0 | 55.7 | 32.5 | **53.4** |
86
- | Qwen2-VL | 75.5 | 60.7 | 76.3 | 54.3 | 35.2 | 25.7 | **55.3** |
87
- | UGround-7B | 82.8 | 60.3 | 82.5 | 63.6 | 80.4 | 70.4 | **73.3** |
88
- | Aguvis-G-7B | 88.3 | 78.2 | 88.1 | 70.7 | 85.7 | 74.8 | **81.8** |
89
- | OS-Atlas-7B | 93.0 | 72.9 | 91.8 | 62.9 | 90.9 | 74.3 | **82.5** |
90
- | Claude Computer Use | - | - | - | - | - | - | **83.0** |
91
- | Gemini 2.0 (Project Mariner) | - | - | - | - | - | - | **84.0** |
92
- | Aguvis-7B | **95.6** | 77.7 | 93.8 | 67.1 | 88.3 | 75.2 | **84.4** |
93
- | Aguvis-72B | 94.5 | **85.2** | 95.4 | 77.9 | **91.3** | **85.9** | **89.2** |
94
- | **Our Model** | | | | | | | |
95
- | **UI-TARS-2B** | 93.0 | 75.5 | 90.7 | 68.6 | 84.3 | 74.8 | **82.3** |
96
- | **UI-TARS-7B** | 94.5 | **85.2** | **95.9** | 85.7 | 90.0 | 83.5 | **89.5** |
97
- | **UI-TARS-72B** | 94.9 | 82.5 | 89.7 | **88.6** | 88.7 | 85.0 | **88.4** |
98
-
99
-
100
- - **ScreenSpot v2**
101
-
102
- | Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
103
- |--------|-------------|-------------|-------------|-------------|-------------|---------|---------|
104
- | **Agent Framework** | | | | | | | |
105
- | GPT-4o (SeeClick) | 85.2 | 58.8 | 79.9 | 37.1 | 72.7 | 30.1 | **63.6** |
106
- | GPT-4o (OS-Atlas-4B) | 95.5 | 75.8 | 79.4 | 49.3 | 90.2 | 66.5 | **79.1** |
107
- | GPT-4o (OS-Atlas-7B) | 96.2 | 83.4 | 89.7 | 69.3 | **94.0** | 79.8 | **87.1** |
108
- | **Agent Model** | | | | | | | |
109
- | SeeClick | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 | **55.1** |
110
- | OS-Atlas-4B | 87.2 | 59.7 | 72.7 | 46.4 | 85.9 | 63.1 | **71.9** |
111
- | OS-Atlas-7B | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | **84.1** |
112
- | **Our Model** | | | | | | | |
113
- | **UI-TARS-2B** | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 | **84.7** |
114
- | **UI-TARS-7B** | **96.9** | **89.1** | **95.4** | 85.0 | 93.6 | 85.2 | **91.6** |
115
- | **UI-TARS-72B** | 94.8 | 86.3 | 91.2 | **87.9** | 91.5 | **87.7** | **90.3** |
116
-
117
-
118
- **Offline Agent Capability Evaluation**
119
- - **Multimodal Mind2Web**
120
-
121
- | Method | Cross-Task Ele.Acc | Cross-Task Op.F1 | Cross-Task Step SR | Cross-Website Ele.Acc | Cross-Website Op.F1 | Cross-Website Step SR | Cross-Domain Ele.Acc | Cross-Domain Op.F1 | Cross-Domain Step SR |
122
- |--------|----------------------|-------------------|--------------------|----------------------|--------------------|-------------------|--------------------|-------------------|-------------------|
123
- | **Agent Framework** | | | | | | | | | |
124
- | GPT-4o (SeeClick) | 32.1 | - | - | 33.1 | - | - | 33.5 | - | - |
125
- | GPT-4o (UGround) | 47.7 | - | - | 46.0 | - | - | 46.6 | - | - |
126
- | GPT-4o (Aria-UI) | 57.6 | - | - | 57.7 | - | - | 61.4 | - | - |
127
- | GPT-4V (OmniParser) | 42.4 | 87.6 | 39.4 | 41.0 | 84.8 | 36.5 | 45.5 | 85.7 | 42.0 |
128
- | **Agent Model** | | | | | | | | | |
129
- | GPT-4o | 5.7 | 77.2 | 4.3 | 5.7 | 79.0 | 3.9 | 5.5 | 86.4 | 4.5 |
130
- | GPT-4 (SOM) | 29.6 | - | 20.3 | 20.1 | - | 13.9 | 27.0 | - | 23.7 |
131
- | GPT-3.5 (Text-only) | 19.4 | 59.2 | 16.8 | 14.9 | 56.5 | 14.1 | 25.2 | 57.9 | 24.1 |
132
- | GPT-4 (Text-only) | 40.8 | 63.1 | 32.3 | 30.2 | 61.0 | 27.0 | 35.4 | 61.9 | 29.7 |
133
- | Claude | 62.7 | 84.7 | 53.5 | 59.5 | 79.6 | 47.7 | 64.5 | 85.4 | 56.4 |
134
- | Aguvis-7B | 64.2 | 89.8 | 60.4 | 60.7 | 88.1 | 54.6 | 60.4 | 89.2 | 56.6 |
135
- | CogAgent | - | - | 62.3 | - | - | 54.0 | - | - | 59.4 |
136
- | Aguvis-72B | 69.5 | 90.8 | 64.0 | 62.6 | 88.6 | 56.5 | 63.5 | 88.5 | 58.2 |
137
- | **Our Model** | | | | | | | | | |
138
- | **UI-TARS-2B** | 62.3 | 90.0 | 56.3 | 58.5 | 87.2 | 50.8 | 58.8 | 89.6 | 52.3 |
139
- | **UI-TARS-7B** | 73.1 | 92.2 | 67.1 | 68.2 | 90.9 | 61.7 | 66.6 | 90.9 | 60.5 |
140
- | **UI-TARS-72B** | **74.7** | **92.5** | **68.6** | **72.4** | **91.2** | **63.5** | **68.9** | **91.8** | **62.1** |
141
-
142
-
143
- - **Android Control and GUI Odyssey**
144
-
145
- | Agent Models | AndroidControl-Low Type | AndroidControl-Low Grounding | AndroidControl-Low SR | AndroidControl-High Type | AndroidControl-High Grounding | AndroidControl-High SR | GUIOdyssey Type | GUIOdyssey Grounding | GUIOdyssey SR |
146
- |---------------------|----------------------|----------------------|----------------|----------------------|----------------------|----------------|----------------|----------------|----------------|
147
- | Claude | 74.3 | 0.0 | 19.4 | 63.7 | 0.0 | 12.5 | 60.9 | 0.0 | 3.1 |
148
- | GPT-4o | 74.3 | 0.0 | 19.4 | 66.3 | 0.0 | 20.8 | 34.3 | 0.0 | 3.3 |
149
- | SeeClick | 93.0 | 73.4 | 75.0 | 82.9 | 62.9 | 59.1 | 71.0 | 52.4 | 53.9 |
150
- | InternVL-2-4B | 90.9 | 84.1 | 80.1 | 84.1 | 72.7 | 66.7 | 82.1 | 55.5 | 51.5 |
151
- | Qwen2-VL-7B | 91.9 | 86.5 | 82.6 | 83.8 | 77.7 | 69.7 | 83.5 | 65.9 | 60.2 |
152
- | Aria-UI | -- | 87.7 | 67.3 | -- | 43.2 | 10.2 | -- | 86.8 | 36.5 |
153
- | OS-Atlas-4B | 91.9 | 83.8 | 80.6 | 84.7 | 73.8 | 67.5 | 83.5 | 61.4 | 56.4 |
154
- | OS-Atlas-7B | 93.6 | 88.0 | 85.2 | 85.2 | 78.5 | 71.2 | 84.5 | 67.8 | 62.0 |
155
- | Aguvis-7B | -- | -- | 80.5 | -- | -- | 61.5 | -- | -- | -- |
156
- | Aguvis-72B | -- | -- | 84.4 | -- | -- | 66.4 | -- | -- | -- |
157
- | **UI-TARS-2B** | **98.1** | 87.3 | 89.3 | 81.2 | 78.4 | 68.9 | 93.9 | 86.8 | 83.4 |
158
- | **UI-TARS-7B** | 98.0 | 89.3 | 90.8 | 83.7 | 80.5 | 72.5 | 94.6 | 90.1 | 87.0 |
159
- | **UI-TARS-72B** | **98.1** | **89.9** | **91.3** | **85.2** | **81.5** | **74.7** | **95.4** | **91.4** | **88.6** |
160
-
161
- **Online Agent Capability Evaluation**
162
-
163
- | Method | OSWorld (Online) | AndroidWorld (Online) |
164
- |--------|-------------------|------------------|
165
- | **Agent Framework** | | |
166
- | GPT-4o (UGround) | - | 32.8 |
167
- | GPT-4o (Aria-UI) | 15.2 | 44.8 |
168
- | GPT-4o (Aguvis-7B) | 14.8 | 37.1 |
169
- | GPT-4o (Aguvis-72B) | 17.0 | - |
170
- | GPT-4o (OS-Atlas-7B) | 14.6 | - |
171
- | **Agent Model** | | |
172
- | GPT-4o | 5.0 | 34.5 (SoM) |
173
- | Gemini-Pro-1.5 | 5.4 | 22.8 (SoM) |
174
- | Aguvis-72B | 10.3 | 26.1 |
175
- | Claude Computer-Use | 14.9 (15 steps) | 27.9 |
176
- | Claude Computer-Use | 22.0 (50 steps) | - |
177
- | **Our Model** | | |
178
- | **UI-TARS-7B-SFT** | 17.7 (15 steps) | 33.0 |
179
- | **UI-TARS-7B-DPO** | 18.7 (15 steps) | - |
180
- | **UI-TARS-72B-SFT** | 18.8 (15 steps) | **46.6** |
181
- | **UI-TARS-72B-DPO** | **22.7** (15 steps) | - |
182
- | **UI-TARS-72B-DPO** | **24.6** (50 steps) | - |
183
-
184
- ## Citation
185
- If you find our paper and model useful in your research, please cite us.
186
-
187
- ```BibTeX
188
- @article{uitars2025,
189
- author = {Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi},
190
- title = {UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
191
- journal = {arXiv preprint arXiv:2501.12326},
192
- url = {https://github.com/bytedance/UI-TARS},
193
- year = {2025}
194
- }
195
  ```
 
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - en
5
+ pipeline_tag: image-text-to-text
6
+ tags:
7
+ - multimodal
8
+ - gui
9
+ library_name: transformers
10
+ ---
11
+
12
+ # UI-TARS-7B-SFT
13
+ [UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT) &nbsp;|&nbsp;
14
+ [UI-TARS-2B-gguf](https://huggingface.co/bytedance-research/UI-TARS-2B-gguf) &nbsp;|&nbsp;
15
+ [UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT) &nbsp;|&nbsp;
16
+ [UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO) &nbsp;|&nbsp;
17
+ [UI-TARS-7B-gguf](https://huggingface.co/bytedance-research/UI-TARS-7B-gguf) &nbsp;|&nbsp;
18
+ [UI-TARS-72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT) &nbsp;|&nbsp;
19
+ [UI-TARS-72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO)
20
+
21
+ This repository contains the model for the paper [UI-TARS: Pioneering Automated GUI Interaction with Native Agents](https://huggingface.co/papers/2501.12326).
22
+ ## Introduction
23
+
24
+ UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components—perception, reasoning, grounding, and memory—within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.
25
+ <!-- ![Local Image](figures/UI-TARS.png) -->
26
+ <p align="center">
27
+ <img src="https://github.com/bytedance/UI-TARS/blob/main/figures/UI-TARS-vs-Previous-SOTA.png?raw=true" width="90%"/>
28
+ <p>
29
+ <p align="center">
30
+ <img src="https://github.com/bytedance/UI-TARS/blob/main/figures/UI-TARS.png?raw=true" width="90%"/>
31
+ <p>
32
+
33
+ <!-- ![Local Image](figures/UI-TARS-vs-Previous-SOTA.png) -->
34
+
35
+ Code: https://github.com/bytedance/UI-TARS
36
+
37
+
38
+ ## Performance
39
+ **Perception Capability Evaluation**
40
+ | Model | VisualWebBench | WebSRC | SQAshort |
41
+ |---------------------------|---------------|---------|----------|
42
+ | Qwen2-VL-7B | 73.3 | 81.8 | 84.9 |
43
+ | Qwen-VL-Max | 74.1 | 91.1 | 78.6 |
44
+ | Gemini-1.5-Pro | 75.4 | 88.9 | 82.2 |
45
+ | UIX-Qwen2-7B | 75.9 | 82.9 | 78.8 |
46
+ | Claude-3.5-Sonnet | 78.2 | 90.4 | 83.1 |
47
+ | GPT-4o | 78.5 | 87.7 | 82.3 |
48
+ | **UI-TARS-2B** | 72.9 | 89.2 | 86.4 |
49
+ | **UI-TARS-7B** | 79.7 | **93.6** | 87.7 |
50
+ | **UI-TARS-72B** | **82.8** | 89.3 | **88.6** |
51
+
52
+ **Grounding Capability Evaluation**
53
+ - **ScreenSpot Pro**
54
+
55
+ | Agent Model | Dev-Text | Dev-Icon | Dev-Avg | Creative-Text | Creative-Icon | Creative-Avg | CAD-Text | CAD-Icon | CAD-Avg | Scientific-Text | Scientific-Icon | Scientific-Avg | Office-Text | Office-Icon | Office-Avg | OS-Text | OS-Icon | OS-Avg | Avg-Text | Avg-Icon | Avg |
56
+ |--------------------------|----------|----------|----------|--------------|--------------|--------------|---------|---------|---------|---------------|---------------|---------------|------------|------------|------------|--------|--------|--------|---------|---------|------|
57
+ | QwenVL-7B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | **0.1** |
58
+ | GPT-4o | 1.3 | 0.0 | 0.7 | 1.0 | 0.0 | 0.6 | 2.0 | 0.0 | 1.5 | 2.1 | 0.0 | 1.2 | 1.1 | 0.0 | 0.9 | 0.0 | 0.0 | 0.0 | 1.3 | 0.0 | **0.8** |
59
+ | SeeClick | 0.6 | 0.0 | 0.3 | 1.0 | 0.0 | 0.6 | 2.5 | 0.0 | 1.9 | 3.5 | 0.0 | 2.0 | 1.1 | 0.0 | 0.9 | 2.8 | 0.0 | 1.5 | 1.8 | 0.0 | **1.1** |
60
+ | Qwen2-VL-7B | 2.6 | 0.0 | 1.3 | 1.5 | 0.0 | 0.9 | 0.5 | 0.0 | 0.4 | 6.3 | 0.0 | 3.5 | 3.4 | 1.9 | 3.0 | 0.9 | 0.0 | 0.5 | 2.5 | 0.2 | **1.6** |
61
+ | OS-Atlas-4B | 7.1 | 0.0 | 3.7 | 3.0 | 1.4 | 2.3 | 2.0 | 0.0 | 1.5 | 9.0 | 5.5 | 7.5 | 5.1 | 3.8 | 4.8 | 5.6 | 0.0 | 3.1 | 5.0 | 1.7 | **3.7** |
62
+ | ShowUI-2B | 16.9 | 1.4 | 9.4 | 9.1 | 0.0 | 5.3 | 2.5 | 0.0 | 1.9 | 13.2 | 7.3 | 10.6 | 15.3 | 7.5 | 13.5 | 10.3 | 2.2 | 6.6 | 10.8 | 2.6 | **7.7** |
63
+ | CogAgent-18B | 14.9 | 0.7 | 8.0 | 9.6 | 0.0 | 5.6 | 7.1 | 3.1 | 6.1 | 22.2 | 1.8 | 13.4 | 13.0 | 0.0 | 10.0 | 5.6 | 0.0 | 3.1 | 12.0 | 0.8 | **7.7** |
64
+ | Aria-UI | 16.2 | 0.0 | 8.4 | 23.7 | 2.1 | 14.7 | 7.6 | 1.6 | 6.1 | 27.1 | 6.4 | 18.1 | 20.3 | 1.9 | 16.1 | 4.7 | 0.0 | 2.6 | 17.1 | 2.0 | **11.3** |
65
+ | UGround-7B | 26.6 | 2.1 | 14.7 | 27.3 | 2.8 | 17.0 | 14.2 | 1.6 | 11.1 | 31.9 | 2.7 | 19.3 | 31.6 | 11.3 | 27.0 | 17.8 | 0.0 | 9.7 | 25.0 | 2.8 | **16.5** |
66
+ | Claude Computer Use | 22.0 | 3.9 | 12.6 | 25.9 | 3.4 | 16.8 | 14.5 | 3.7 | 11.9 | 33.9 | 15.8 | 25.8 | 30.1 | 16.3 | 26.9 | 11.0 | 4.5 | 8.1 | 23.4 | 7.1 | **17.1** |
67
+ | OS-Atlas-7B | 33.1 | 1.4 | 17.7 | 28.8 | 2.8 | 17.9 | 12.2 | 4.7 | 10.3 | 37.5 | 7.3 | 24.4 | 33.9 | 5.7 | 27.4 | 27.1 | 4.5 | 16.8 | 28.1 | 4.0 | **18.9** |
68
+ | UGround-V1-7B | - | - | 35.5 | - | - | 27.8 | - | - | 13.5 | - | - | 38.8 | - | - | 48.8 | - | - | 26.1 | - | - | **31.1** |
69
+ | **UI-TARS-2B** | 47.4 | 4.1 | 26.4 | 42.9 | 6.3 | 27.6 | 17.8 | 4.7 | 14.6 | 56.9 | 17.3 | 39.8 | 50.3 | 17.0 | 42.6 | 21.5 | 5.6 | 14.3 | 39.6 | 8.4 | **27.7** |
70
+ | **UI-TARS-7B** | 58.4 | 12.4 | 36.1 | 50.0 | 9.1 | 32.8 | **20.8**| 9.4 | **18.0**| 63.9 | **31.8** | **50.0** | **63.3** | 20.8 | 53.5 | 30.8 | **16.9**| 24.5 | 47.8 | 16.2 | **35.7** |
71
+ | **UI-TARS-72B** | **63.0** | **17.3** | **40.8** | **57.1** | **15.4** | **39.6** | 18.8 | **12.5**| 17.2 | **64.6** | 20.9 | 45.7 | **63.3** | **26.4** | **54.8** | **42.1**| 15.7 | **30.1**| **50.9**| **17.5**| **38.1** |
72
+
73
+
74
+ - **ScreenSpot v2**
75
+
76
+ | Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
77
+ |--------|-------------|-------------|-------------|-------------|-------------|---------|---------|
78
+ | **Agent Framework** | | | | | | | |
79
+ | GPT-4o (SeeClick) | 85.2 | 58.8 | 79.9 | 37.1 | 72.7 | 30.1 | **63.6** |
80
+ | GPT-4o (OS-Atlas-4B) | 95.5 | 75.8 | 79.4 | 49.3 | 90.2 | 66.5 | **79.1** |
81
+ | GPT-4o (OS-Atlas-7B) | 96.2 | 83.4 | 89.7 | 69.3 | **94.0** | 79.8 | **87.1** |
82
+ | **Agent Model** | | | | | | | |
83
+ | SeeClick | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 | **55.1** |
84
+ | OS-Atlas-4B | 87.2 | 59.7 | 72.7 | 46.4 | 85.9 | 63.1 | **71.9** |
85
+ | OS-Atlas-7B | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | **84.1** |
86
+ | **Our Model** | | | | | | | |
87
+ | **UI-TARS-2B** | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 | **84.7** |
88
+ | **UI-TARS-7B** | **96.9** | **89.1** | **95.4** | 85.0 | 93.6 | 85.2 | **91.6** |
89
+ | **UI-TARS-72B** | 94.8 | 86.3 | 91.2 | **87.9** | 91.5 | **87.7** | **90.3** |
90
+
91
+
92
+ **Online Agent Capability Evaluation**
93
+
94
+ | Method | OSWorld (Online) | AndroidWorld (Online) |
95
+ |--------|-------------------|------------------|
96
+ | **Agent Framework** | | |
97
+ | GPT-4o (UGround) | - | 32.8 |
98
+ | GPT-4o (Aria-UI) | 15.2 | 44.8 |
99
+ | GPT-4o (Aguvis-7B) | 14.8 | 37.1 |
100
+ | GPT-4o (Aguvis-72B) | 17.0 | - |
101
+ | GPT-4o (OS-Atlas-7B) | 14.6 | - |
102
+ | **Agent Model** | | |
103
+ | GPT-4o | 5.0 | 34.5 (SoM) |
104
+ | Gemini-Pro-1.5 | 5.4 | 22.8 (SoM) |
105
+ | Aguvis-72B | 10.3 | 26.1 |
106
+ | Claude Computer-Use | 14.9 (15 steps) | 27.9 |
107
+ | Claude Computer-Use | 22.0 (50 steps) | - |
108
+ | **Our Model** | | |
109
+ | **UI-TARS-7B-SFT** | 17.7 (15 steps) | 33.0 |
110
+ | **UI-TARS-7B-DPO** | 18.7 (15 steps) | - |
111
+ | **UI-TARS-72B-SFT** | 18.8 (15 steps) | **46.6** |
112
+ | **UI-TARS-72B-DPO** | **22.7** (15 steps) | - |
113
+ | **UI-TARS-72B-DPO** | **24.6** (50 steps) | - |
114
+
115
+ ## Deployment
116
+
117
+ ### Cloud Deployment
118
+ We recommend using HuggingFace Inference Endpoints for fast deployment.
119
+ We provide two deployment guides for reference:
120
+
121
+ English version: [GUI Model Deployment Guide](https://juniper-switch-f10.notion.site/GUI-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)
122
+
123
+ Chinese version: [GUI Model Deployment Tutorial](https://bytedance.sg.larkoffice.com/docx/TCcudYwyIox5vyxiSDLlgIsTgWf#U94rdCxzBoJMLex38NPlHL21gNb)
124
+
125
+ ### Local Deployment [Transformers]
126
+ Local deployment with Transformers follows the same procedure as Qwen2-VL; see this [tutorial](https://github.com/QwenLM/Qwen2-VL?tab=readme-ov-file#using---transformers-to-chat) for details.
127
+
128
+ ### Local Deployment [vLLM]
129
+ We recommend using vLLM for fast deployment and inference. You need to use `vllm>=0.6.1`.
130
+ ```bash
131
+ pip install -U transformers
132
+ VLLM_VERSION=0.6.6
133
+ CUDA_VERSION=cu124
134
+ pip install vllm==${VLLM_VERSION} --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}
135
+
136
+ ```
137
+ #### Start an OpenAI API Service
138
+
139
+ Run the command below to start an OpenAI-compatible API service:
140
+
141
+ ```bash
142
+ python -m vllm.entrypoints.openai.api_server --served-model-name ui-tars --model <path to your model>
143
+ ```
144
+
145
+ You can then call the chat API as shown below, passing the GUI prompt (choose the mobile or the computer one) and a base64-encoded local image (see the [OpenAI API protocol document](https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images) for more details):
146
+ ```python
147
+ import base64
148
+ from openai import OpenAI
149
+
150
+
151
+ instruction = "search for today's weather"
152
+ screenshot_path = "screenshot.png"
153
+ client = OpenAI(
154
+ base_url="http://127.0.0.1:8000/v1",
155
+ api_key="empty",
156
+ )
157
+
158
+ ## Below is the prompt for mobile
159
+ prompt = r"""<|im_start|>system
160
+ You are a helpful assistant.<|im_end|>
161
+ <|im_start|>user
162
+ You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
163
+
164
+ ## Output Format
165
+ ```\nAction_Summary: ...
166
+ Action: ...\n```
167
+
168
+ ## Action Space
169
+ click(start_box='<|box_start|>(x1,y1)<|box_end|>')
170
+ long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
171
+ type(content='')
172
+ scroll(direction='down or up or right or left')
173
+ open_app(app_name='')
174
+ navigate_back()
175
+ navigate_home()
176
+ WAIT()
177
+ finished() # Submit the task regardless of whether it succeeds or fails.
178
+
179
+ ## Note
180
+ - Use English in `Action_Summary` part.
181
+
182
+ ## User Instruction
183
+ """
184
+
185
+ with open(screenshot_path, "rb") as image_file:
186
+     encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
187
+ response = client.chat.completions.create(
188
+ model="ui-tars",
189
+ messages=[
190
+ {
191
+ "role": "user",
192
+ "content": [
193
+ {"type": "text", "text": prompt + instruction},
194
+ {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
195
+ ],
196
+ },
197
+ ],
198
+ frequency_penalty=1,
199
+ max_tokens=128,
200
+ )
201
+ print(response.choices[0].message.content)
202
+ ```
203
+
204
+ ### Prompt Templates
205
+ We currently provide two prompt templates that give stable behavior and good performance: one for mobile scenarios and one for personal computer scenarios.
206
+ - Prompt template for mobile:
207
+ ```python
208
+ ## Below is the prompt for mobile
209
+ prompt = r"""<|im_start|>system
210
+ You are a helpful assistant.<|im_end|>
211
+ <|im_start|>user
212
+ You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
213
+
214
+ ## Output Format
215
+ ```\nThought: ...
216
+ Action: ...\n```
217
+
218
+ ## Action Space
219
+ click(start_box='<|box_start|>(x1,y1)<|box_end|>')
220
+ long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
221
+ type(content='')
222
+ scroll(direction='down or up or right or left')
223
+ open_app(app_name='')
224
+ navigate_back()
225
+ navigate_home()
226
+ WAIT()
227
+ finished() # Submit the task regardless of whether it succeeds or fails.
228
+
229
+ ## Note
230
+ - Use English in `Action_Summary` part.
231
+
232
+ ## User Instruction
233
+ """
234
+ ```
235
+
236
+ - Prompt template for computer:
237
+ ```python
238
+ ## Below is the prompt for computer
239
+ prompt = r"""<|im_start|>system
240
+ You are a helpful assistant.<|im_end|>
241
+ <|im_start|>user
242
+ You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
243
+
244
+ ## Output Format
245
+ ```\nThought: ...
246
+ Action: ...\n```
247
+
248
+ ## Action Space
249
+
250
+ click(start_box='<|box_start|>(x1,y1)<|box_end|>')
251
+ left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
252
+ right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
253
+ drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
254
+ hotkey(key='')
255
+ type(content='') #If you want to submit your input, use \"\n\" at the end of `content`.
257
+ scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
258
+ wait() #Sleep for 5s and take a screenshot to check for any changes.
259
+ finished()
260
+ call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.
261
+
262
+
263
+ ## Note
264
+ - Use Chinese in `Thought` part.
265
+ - Summarize your next action (with its target element) in one sentence in `Thought` part.
266
+
267
+ ## User Instruction
268
+ """
269
+ ```
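
The templates above ask the model to reply with a `Thought:` line followed by an `Action:` line drawn from the action space. The following is a minimal sketch of how such a reply could be parsed, assuming the reply follows that format exactly and that coordinates appear as `(x,y)` integer pairs inside the box tokens. The function name `parse_action` and the returned dictionary layout are illustrative, not part of UI-TARS.

```python
import re

# Matches "Action: name(args)" at the end of the reply (assumed format).
ACTION_RE = re.compile(r"Action:\s*(?P<name>\w+)\((?P<args>.*)\)\s*$", re.DOTALL)
# Matches integer coordinate pairs such as "(235,512)".
BOX_RE = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def parse_action(text):
    """Return the action name and any coordinate pairs, or None if no action is found."""
    # Drop surrounding whitespace and an optional wrapping code fence.
    cleaned = text.strip().strip("`").strip()
    match = ACTION_RE.search(cleaned)
    if match is None:
        return None
    args = match.group("args")
    boxes = [(int(x), int(y)) for x, y in BOX_RE.findall(args)]
    return {"name": match.group("name"), "boxes": boxes, "raw_args": args}


# Hypothetical reply, written by hand to match the template's output format.
sample = (
    "Thought: Click the search box.\n"
    "Action: click(start_box='<|box_start|>(235,512)<|box_end|>')"
)
parsed = parse_action(sample)
print(parsed["name"], parsed["boxes"])
```

The coordinates returned here are still the model's relative values; they need to be scaled to the screenshot size before being executed.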
270
+
271
+ ### Local Deployment [Ollama]
272
+ Ollama can deploy the model in GGUF format. Deploying from safetensors currently has bugs.
273
+
274
+ #### Get the model in GGUF format
275
+ We provide the 2B and 7B models in [GGUF](https://huggingface.co/docs/hub/en/gguf) format:
276
+
277
+ 2B: https://huggingface.co/bytedance-research/UI-TARS-2B-gguf
278
+
279
+ 7B: https://huggingface.co/bytedance-research/UI-TARS-7B-gguf
280
+
281
+ Users can convert the model into GGUF format by using the script from [llama.cpp](https://github.com/ggerganov/llama.cpp/blob/master/convert_hf_to_gguf.py):
282
+
283
+ ```bash
284
+ python3 convert_hf_to_gguf.py <path to your model>
285
+ ```
286
+
287
+ The GGUF file will be generated under the path provided.
288
+
289
+ #### Deploy GGUF model
290
+ To deploy the model, follow the Ollama [tutorial](https://github.com/ollama/ollama?tab=readme-ov-file#customize-a-model).
291
+
292
+ ```bash
293
+ # Create Modelfile, Windows users can just create a file named Modelfile
294
+ echo "FROM ./path/to/model.gguf" > Modelfile
295
+
296
+ # Create model in Ollama
297
+ ollama create ui-tars -f Modelfile
298
+
299
+ # Run the model
300
+ ollama run ui-tars
301
+
302
+ ```
303
+
304
+ The test script is the same as for vLLM, except for two changes:
305
+
306
+ ```python
307
+ ...
308
+ client = OpenAI(
309
+ base_url="http://127.0.0.1:11434/v1/",
310
+ ...
311
+ )
312
+ ...
313
+ response = client.chat.completions.create(
314
+ model="ui-tars", # the name created via the Ollama CLI
315
+ ...
316
+ )
317
+
318
+ ```
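
Since the vLLM and Ollama scripts differ only in the base URL, the request itself can be built once and reused. The sketch below is one possible way to factor that out; `BASE_URLS` and `build_request` are hypothetical names introduced here, while the URLs, model name, and sampling parameters are copied from the examples above.

```python
import base64

# Base URLs taken from the vLLM and Ollama examples in this README.
BASE_URLS = {
    "vllm": "http://127.0.0.1:8000/v1",
    "ollama": "http://127.0.0.1:11434/v1/",
}


def build_request(prompt, instruction, image_bytes, model="ui-tars"):
    """Build keyword arguments for client.chat.completions.create."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt + instruction},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{encoded}"},
                    },
                ],
            }
        ],
        "frequency_penalty": 1,
        "max_tokens": 128,
    }


# Placeholder prompt and image bytes, used only to show the payload shape.
request = build_request("PROMPT ", "search for today's weather", b"fake-png-bytes")

# With a running server and the openai package installed:
# client = OpenAI(base_url=BASE_URLS["ollama"], api_key="empty")
# response = client.chat.completions.create(**request)
```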
319
+
320
+ ### Explanation of Inference Results
321
+
322
+ #### Coordinate Mapping
323
+ The model generates a 2D coordinate output that represents relative positions. To convert these values to image-relative coordinates, divide each component by 1000 to obtain values in the range [0,1]. The absolute coordinates required by the Action can be calculated by:
324
+ - X absolute = X relative × image width
325
+ - Y absolute = Y relative × image height
326
+
327
+ For example, given a screen size of 1920 × 1080 and a model output of (235, 512), the absolute X is `round(1920*235/1000)=451` and the absolute Y is `round(1080*512/1000)=553`, so the absolute coordinate is (451, 553).
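
The mapping described in this section can be written as a small helper. This is a minimal sketch; the function name `to_absolute` and the `scale` parameter are illustrative choices, with the factor of 1000 taken from the description above.

```python
def to_absolute(x_rel, y_rel, width, height, scale=1000):
    """Convert model coordinates (0..scale) to absolute pixel coordinates."""
    return round(width * x_rel / scale), round(height * y_rel / scale)


# The worked example from the text: a 1920 x 1080 screen and output (235, 512).
point = to_absolute(235, 512, 1920, 1080)
print(point)  # (451, 553)
```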
328
+
329
+ ## Use in desktop and web automation
330
+
331
+ To try the UI-TARS agent on desktop, see [UI-TARS-desktop](https://github.com/bytedance/UI-TARS-desktop).
332
+
333
+ [Midscene.js](https://github.com/web-infra-dev/Midscene) is an open-source web automation SDK that supports the UI-TARS model. Developers can use JavaScript and natural language to control the browser. See [this guide](https://midscenejs.com/choose-a-model) for more details about setting up the model.
334
+
335
+ ## License
336
+
337
+ UI-TARS is licensed under the Apache License 2.0.
338
+
339
+ ## Acknowledgements
340
+ This project builds upon and extends the capabilities of Qwen2-VL, a powerful vision-language model that serves as the foundational architecture for UI-TARS. We would like to acknowledge the developers and researchers behind Qwen2-VL for their groundbreaking work in multimodal AI and for providing a robust base for further advancements.
341
+
342
+ Additionally, we thank the broader open-source community for their datasets, tools, and insights that have facilitated the development of UI-TARS. These collaborative efforts continue to push the boundaries of what GUI automation and AI-driven agents can achieve.
343
+
344
+ ## Citation
345
+ If you find our paper and code useful in your research, please cite us.
346
+
347
+ ```BibTeX
348
+ @article{uitars2025,
349
+ author = {Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Yang, Haifeng Liu, Feng Lin, Tao Peng, Xin Liu, Guang Shi},
350
+ title = {UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
351
+ journal = {arXiv preprint arXiv:2501.12326},
352
+ url = {https://github.com/bytedance/UI-TARS},
353
+ year = {2025}
354
+ }
355
  ```