--- license: apache-2.0 language: - en pipeline_tag: image-text-to-text tags: - multimodal - gui library_name: transformers --- # UI-TARS-7B-SFT [UI-TARS-2B-SFT](https://huggingface.co/bytedance-research/UI-TARS-2B-SFT) | [UI-TARS-2B-gguf](https://huggingface.co/bytedance-research/UI-TARS-2B-gguf) | [UI-TARS-7B-SFT](https://huggingface.co/bytedance-research/UI-TARS-7B-SFT) | [UI-TARS-7B-DPO](https://huggingface.co/bytedance-research/UI-TARS-7B-DPO) | [UI-TARS-7B-gguf](https://huggingface.co/bytedance-research/UI-TARS-7B-gguf) | [UI-TARS-72B-SFT](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT) | [UI-TARS-72B-DPO](https://huggingface.co/bytedance-research/UI-TARS-72B-DPO) This repository contains the model for the paper [UI-TARS: Pioneering Automated GUI Interaction with Native Agents](https://huggingface.co/papers/2501.12326). ## Introduction UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components—perception, reasoning, grounding, and memory—within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.
Code: https://github.com/bytedance/UI-TARS
## Performance
**Perception Capabilty Evaluation**
| Model | VisualWebBench | WebSRC | SQAshort |
|---------------------------|---------------|---------|----------|
| Qwen2-VL-7B | 73.3 | 81.8 | 84.9 |
| Qwen-VL-Max | 74.1 | 91.1 | 78.6 |
| Gemini-1.5-Pro | 75.4 | 88.9 | 82.2 |
| UIX-Qwen2-7B | 75.9 | 82.9 | 78.8 |
| Claude-3.5-Sonnet | 78.2 | 90.4 | 83.1 |
| GPT-4o | 78.5 | 87.7 | 82.3 |
| **UI-TARS-2B** | 72.9 | 89.2 | 86.4 |
| **UI-TARS-7B** | 79.7 | **93.6** | 87.7 |
| **UI-TARS-72B** | **82.8** | 89.3 | **88.6** |
**Grounding Capability Evaluation**
- **ScreenSpot Pro**
| Agent Model | Dev-Text | Dev-Icon | Dev-Avg | Creative-Text | Creative-Icon | Creative-Avg | CAD-Text | CAD-Icon | CAD-Avg | Scientific-Text | Scientific-Icon | Scientific-Avg | Office-Text | Office-Icon | Office-Avg | OS-Text | OS-Icon | OS-Avg | Avg-Text | Avg-Icon | Avg |
|--------------------------|----------|----------|----------|--------------|--------------|--------------|---------|---------|---------|---------------|---------------|---------------|------------|------------|------------|--------|--------|--------|---------|---------|------|
| QwenVL-7B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | **0.1** |
| GPT-4o | 1.3 | 0.0 | 0.7 | 1.0 | 0.0 | 0.6 | 2.0 | 0.0 | 1.5 | 2.1 | 0.0 | 1.2 | 1.1 | 0.0 | 0.9 | 0.0 | 0.0 | 0.0 | 1.3 | 0.0 | **0.8** |
| SeeClick | 0.6 | 0.0 | 0.3 | 1.0 | 0.0 | 0.6 | 2.5 | 0.0 | 1.9 | 3.5 | 0.0 | 2.0 | 1.1 | 0.0 | 0.9 | 2.8 | 0.0 | 1.5 | 1.8 | 0.0 | **1.1** |
| Qwen2-VL-7B | 2.6 | 0.0 | 1.3 | 1.5 | 0.0 | 0.9 | 0.5 | 0.0 | 0.4 | 6.3 | 0.0 | 3.5 | 3.4 | 1.9 | 3.0 | 0.9 | 0.0 | 0.5 | 2.5 | 0.2 | **1.6** |
| OS-Atlas-4B | 7.1 | 0.0 | 3.7 | 3.0 | 1.4 | 2.3 | 2.0 | 0.0 | 1.5 | 9.0 | 5.5 | 7.5 | 5.1 | 3.8 | 4.8 | 5.6 | 0.0 | 3.1 | 5.0 | 1.7 | **3.7** |
| ShowUI-2B | 16.9 | 1.4 | 9.4 | 9.1 | 0.0 | 5.3 | 2.5 | 0.0 | 1.9 | 13.2 | 7.3 | 10.6 | 15.3 | 7.5 | 13.5 | 10.3 | 2.2 | 6.6 | 10.8 | 2.6 | **7.7** |
| CogAgent-18B | 14.9 | 0.7 | 8.0 | 9.6 | 0.0 | 5.6 | 7.1 | 3.1 | 6.1 | 22.2 | 1.8 | 13.4 | 13.0 | 0.0 | 10.0 | 5.6 | 0.0 | 3.1 | 12.0 | 0.8 | **7.7** |
| Aria-UI | 16.2 | 0.0 | 8.4 | 23.7 | 2.1 | 14.7 | 7.6 | 1.6 | 6.1 | 27.1 | 6.4 | 18.1 | 20.3 | 1.9 | 16.1 | 4.7 | 0.0 | 2.6 | 17.1 | 2.0 | **11.3** |
| UGround-7B | 26.6 | 2.1 | 14.7 | 27.3 | 2.8 | 17.0 | 14.2 | 1.6 | 11.1 | 31.9 | 2.7 | 19.3 | 31.6 | 11.3 | 27.0 | 17.8 | 0.0 | 9.7 | 25.0 | 2.8 | **16.5** |
| Claude Computer Use | 22.0 | 3.9 | 12.6 | 25.9 | 3.4 | 16.8 | 14.5 | 3.7 | 11.9 | 33.9 | 15.8 | 25.8 | 30.1 | 16.3 | 26.9 | 11.0 | 4.5 | 8.1 | 23.4 | 7.1 | **17.1** |
| OS-Atlas-7B | 33.1 | 1.4 | 17.7 | 28.8 | 2.8 | 17.9 | 12.2 | 4.7 | 10.3 | 37.5 | 7.3 | 24.4 | 33.9 | 5.7 | 27.4 | 27.1 | 4.5 | 16.8 | 28.1 | 4.0 | **18.9** |
| UGround-V1-7B | - | - | 35.5 | - | - | 27.8 | - | - | 13.5 | - | - | 38.8 | - | - | 48.8 | - | - | 26.1 | - | - | **31.1** |
| **UI-TARS-2B** | 47.4 | 4.1 | 26.4 | 42.9 | 6.3 | 27.6 | 17.8 | 4.7 | 14.6 | 56.9 | 17.3 | 39.8 | 50.3 | 17.0 | 42.6 | 21.5 | 5.6 | 14.3 | 39.6 | 8.4 | **27.7** |
| **UI-TARS-7B** | 58.4 | 12.4 | 36.1 | 50.0 | 9.1 | 32.8 | **20.8**| 9.4 | **18.0**| 63.9 | **31.8** | **50.0** | **63.3** | 20.8 | 53.5 | 30.8 | **16.9**| 24.5 | 47.8 | 16.2 | **35.7** |
| **UI-TARS-72B** | **63.0** | **17.3** | **40.8** | **57.1** | **15.4** | **39.6** | 18.8 | **12.5**| 17.2 | **64.6** | 20.9 | 45.7 | **63.3** | **26.4** | **54.8** | **42.1**| 15.7 | **30.1**| **50.9**| **17.5**| **38.1** |
- **ScreenSpot v2**
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|--------|-------------|-------------|-------------|-------------|-------------|---------|---------|
| **Agent Framework** | | | | | | | |
| GPT-4o (SeeClick) | 85.2 | 58.8 | 79.9 | 37.1 | 72.7 | 30.1 | **63.6** |
| GPT-4o (OS-Atlas-4B) | 95.5 | 75.8 | 79.4 | 49.3 | 90.2 | 66.5 | **79.1** |
| GPT-4o (OS-Atlas-7B) | 96.2 | 83.4 | 89.7 | 69.3 | **94.0** | 79.8 | **87.1** |
| **Agent Model** | | | | | | | |
| SeeClick | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 | **55.1** |
| OS-Atlas-4B | 87.2 | 59.7 | 72.7 | 46.4 | 85.9 | 63.1 | **71.9** |
| OS-Atlas-7B | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | **84.1** |
| **Our Model** | | | | | | | |
| **UI-TARS-2B** | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 | **84.7** |
| **UI-TARS-7B** | **96.9** | **89.1** | **95.4** | 85.0 | 93.6 | 85.2 | **91.6** |
| **UI-TARS-72B** | 94.8 | 86.3 | 91.2 | **87.9** | 91.5 | **87.7** | **90.3** |
**Online Agent Capability Evaluation**
| Method | OSWorld (Online) | AndroidWorld (Online) |
|--------|-------------------|------------------|
| **Agent Framework** | | |
| GPT-4o (UGround) | - | 32.8 |
| GPT-4o (Aria-UI) | 15.2 | 44.8 |
| GPT-4o (Aguvis-7B) | 14.8 | 37.1 |
| GPT-4o (Aguvis-72B) | 17.0 | - |
| GPT-4o (OS-Atlas-7B) | 14.6 | - |
| **Agent Model** | | |
| GPT-4o | 5.0 | 34.5 (SoM) |
| Gemini-Pro-1.5 | 5.4 | 22.8 (SoM) |
| Aguvis-72B | 10.3 | 26.1 |
| Claude Computer-Use | 14.9 (15 steps) | 27.9 |
| Claude Computer-Use | 22.0 (50 steps) | - |
| **Our Model** | | |
| **UI-TARS-7B-SFT** | 17.7 (15 steps) | 33.0 |
| **UI-TARS-7B-DPO** | 18.7 (15 steps) | - |
| **UI-TARS-72B-SFT** | 18.8 (15 steps) | **46.6** |
| **UI-TARS-72B-DPO** | **22.7** (15 steps) | - |
| **UI-TARS-72B-DPO** | **24.6** (50 steps) | - |
## Deployment
### Cloud Deployment
We recommend using HuggingFace Inference Endpoints for fast deployment.
We provide two docs for users to refer:
English version: [GUI Model Deployment Guide](https://juniper-switch-f10.notion.site/GUI-Model-Deployment-Guide-17b5350241e280058e98cea60317de71)
中文版: [GUI模型部署教程](https://bytedance.sg.larkoffice.com/docx/TCcudYwyIox5vyxiSDLlgIsTgWf#U94rdCxzBoJMLex38NPlHL21gNb)
### Local Deployment [Transformers]
We follow the same way as Qwen2-VL, check this [tutorial](https://github.com/QwenLM/Qwen2-VL?tab=readme-ov-file#using---transformers-to-chat) for more details.
### Local Deployment [vLLM]
We recommend using vLLM for fast deployment and inference. You need to use `vllm>=0.6.1`.
```bash
pip install -U transformers
VLLM_VERSION=0.6.6
CUDA_VERSION=cu124
pip install vllm==${VLLM_VERSION} --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}
```
#### Start an OpenAI API Service
Run the command below to start an OpenAI-compatible API service:
```bash
python -m vllm.entrypoints.openai.api_server --served-model-name ui-tars --model