metadata
license: apache-2.0
language:
  - en
pipeline_tag: image-text-to-text
tags:
  - multimodal
  - gui
library_name: transformers

UI-TARS-7B-SFT

UI-TARS-2B-SFT  |  UI-TARS-2B-gguf  |  UI-TARS-7B-SFT  |  UI-TARS-7B-DPO  |  UI-TARS-7B-gguf  |  UI-TARS-72B-SFT  |  UI-TARS-72B-DPO

This repository contains the model for the paper UI-TARS: Pioneering Automated GUI Interaction with Native Agents.

Introduction

UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components—perception, reasoning, grounding, and memory—within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.

Code: https://github.com/bytedance/UI-TARS

Performance

Perception Capability Evaluation

| Model | VisualWebBench | WebSRC | SQAshort |
|---|---|---|---|
| Qwen2-VL-7B | 73.3 | 81.8 | 84.9 |
| Qwen-VL-Max | 74.1 | 91.1 | 78.6 |
| Gemini-1.5-Pro | 75.4 | 88.9 | 82.2 |
| UIX-Qwen2-7B | 75.9 | 82.9 | 78.8 |
| Claude-3.5-Sonnet | 78.2 | 90.4 | 83.1 |
| GPT-4o | 78.5 | 87.7 | 82.3 |
| UI-TARS-2B | 72.9 | 89.2 | 86.4 |
| UI-TARS-7B | 79.7 | 93.6 | 87.7 |
| UI-TARS-72B | 82.8 | 89.3 | 88.6 |

Grounding Capability Evaluation

  • ScreenSpot Pro
| Agent Model | Dev-Text | Dev-Icon | Dev-Avg | Creative-Text | Creative-Icon | Creative-Avg | CAD-Text | CAD-Icon | CAD-Avg | Scientific-Text | Scientific-Icon | Scientific-Avg | Office-Text | Office-Icon | Office-Avg | OS-Text | OS-Icon | OS-Avg | Avg-Text | Avg-Icon | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| QwenVL-7B | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 0.0 | 0.4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 | 0.1 |
| GPT-4o | 1.3 | 0.0 | 0.7 | 1.0 | 0.0 | 0.6 | 2.0 | 0.0 | 1.5 | 2.1 | 0.0 | 1.2 | 1.1 | 0.0 | 0.9 | 0.0 | 0.0 | 0.0 | 1.3 | 0.0 | 0.8 |
| SeeClick | 0.6 | 0.0 | 0.3 | 1.0 | 0.0 | 0.6 | 2.5 | 0.0 | 1.9 | 3.5 | 0.0 | 2.0 | 1.1 | 0.0 | 0.9 | 2.8 | 0.0 | 1.5 | 1.8 | 0.0 | 1.1 |
| Qwen2-VL-7B | 2.6 | 0.0 | 1.3 | 1.5 | 0.0 | 0.9 | 0.5 | 0.0 | 0.4 | 6.3 | 0.0 | 3.5 | 3.4 | 1.9 | 3.0 | 0.9 | 0.0 | 0.5 | 2.5 | 0.2 | 1.6 |
| OS-Atlas-4B | 7.1 | 0.0 | 3.7 | 3.0 | 1.4 | 2.3 | 2.0 | 0.0 | 1.5 | 9.0 | 5.5 | 7.5 | 5.1 | 3.8 | 4.8 | 5.6 | 0.0 | 3.1 | 5.0 | 1.7 | 3.7 |
| ShowUI-2B | 16.9 | 1.4 | 9.4 | 9.1 | 0.0 | 5.3 | 2.5 | 0.0 | 1.9 | 13.2 | 7.3 | 10.6 | 15.3 | 7.5 | 13.5 | 10.3 | 2.2 | 6.6 | 10.8 | 2.6 | 7.7 |
| CogAgent-18B | 14.9 | 0.7 | 8.0 | 9.6 | 0.0 | 5.6 | 7.1 | 3.1 | 6.1 | 22.2 | 1.8 | 13.4 | 13.0 | 0.0 | 10.0 | 5.6 | 0.0 | 3.1 | 12.0 | 0.8 | 7.7 |
| Aria-UI | 16.2 | 0.0 | 8.4 | 23.7 | 2.1 | 14.7 | 7.6 | 1.6 | 6.1 | 27.1 | 6.4 | 18.1 | 20.3 | 1.9 | 16.1 | 4.7 | 0.0 | 2.6 | 17.1 | 2.0 | 11.3 |
| UGround-7B | 26.6 | 2.1 | 14.7 | 27.3 | 2.8 | 17.0 | 14.2 | 1.6 | 11.1 | 31.9 | 2.7 | 19.3 | 31.6 | 11.3 | 27.0 | 17.8 | 0.0 | 9.7 | 25.0 | 2.8 | 16.5 |
| Claude Computer Use | 22.0 | 3.9 | 12.6 | 25.9 | 3.4 | 16.8 | 14.5 | 3.7 | 11.9 | 33.9 | 15.8 | 25.8 | 30.1 | 16.3 | 26.9 | 11.0 | 4.5 | 8.1 | 23.4 | 7.1 | 17.1 |
| OS-Atlas-7B | 33.1 | 1.4 | 17.7 | 28.8 | 2.8 | 17.9 | 12.2 | 4.7 | 10.3 | 37.5 | 7.3 | 24.4 | 33.9 | 5.7 | 27.4 | 27.1 | 4.5 | 16.8 | 28.1 | 4.0 | 18.9 |
| UGround-V1-7B | - | - | 35.5 | - | - | 27.8 | - | - | 13.5 | - | - | 38.8 | - | - | 48.8 | - | - | 26.1 | - | - | 31.1 |
| UI-TARS-2B | 47.4 | 4.1 | 26.4 | 42.9 | 6.3 | 27.6 | 17.8 | 4.7 | 14.6 | 56.9 | 17.3 | 39.8 | 50.3 | 17.0 | 42.6 | 21.5 | 5.6 | 14.3 | 39.6 | 8.4 | 27.7 |
| UI-TARS-7B | 58.4 | 12.4 | 36.1 | 50.0 | 9.1 | 32.8 | 20.8 | 9.4 | 18.0 | 63.9 | 31.8 | 50.0 | 63.3 | 20.8 | 53.5 | 30.8 | 16.9 | 24.5 | 47.8 | 16.2 | 35.7 |
| UI-TARS-72B | 63.0 | 17.3 | 40.8 | 57.1 | 15.4 | 39.6 | 18.8 | 12.5 | 17.2 | 64.6 | 20.9 | 45.7 | 63.3 | 26.4 | 54.8 | 42.1 | 15.7 | 30.1 | 50.9 | 17.5 | 38.1 |
  • ScreenSpot v2
| Method | Mobile-Text | Mobile-Icon/Widget | Desktop-Text | Desktop-Icon/Widget | Web-Text | Web-Icon/Widget | Avg |
|---|---|---|---|---|---|---|---|
| Agent Framework | | | | | | | |
| GPT-4o (SeeClick) | 85.2 | 58.8 | 79.9 | 37.1 | 72.7 | 30.1 | 63.6 |
| GPT-4o (OS-Atlas-4B) | 95.5 | 75.8 | 79.4 | 49.3 | 90.2 | 66.5 | 79.1 |
| GPT-4o (OS-Atlas-7B) | 96.2 | 83.4 | 89.7 | 69.3 | 94.0 | 79.8 | 87.1 |
| Agent Model | | | | | | | |
| SeeClick | 78.4 | 50.7 | 70.1 | 29.3 | 55.2 | 32.5 | 55.1 |
| OS-Atlas-4B | 87.2 | 59.7 | 72.7 | 46.4 | 85.9 | 63.1 | 71.9 |
| OS-Atlas-7B | 95.2 | 75.8 | 90.7 | 63.6 | 90.6 | 77.3 | 84.1 |
| Our Model | | | | | | | |
| UI-TARS-2B | 95.2 | 79.1 | 90.7 | 68.6 | 87.2 | 78.3 | 84.7 |
| UI-TARS-7B | 96.9 | 89.1 | 95.4 | 85.0 | 93.6 | 85.2 | 91.6 |
| UI-TARS-72B | 94.8 | 86.3 | 91.2 | 87.9 | 91.5 | 87.7 | 90.3 |

Online Agent Capability Evaluation

| Method | OSWorld (Online) | AndroidWorld (Online) |
|---|---|---|
| Agent Framework | | |
| GPT-4o (UGround) | - | 32.8 |
| GPT-4o (Aria-UI) | 15.2 | 44.8 |
| GPT-4o (Aguvis-7B) | 14.8 | 37.1 |
| GPT-4o (Aguvis-72B) | 17.0 | - |
| GPT-4o (OS-Atlas-7B) | 14.6 | - |
| Agent Model | | |
| GPT-4o | 5.0 | 34.5 (SoM) |
| Gemini-Pro-1.5 | 5.4 | 22.8 (SoM) |
| Aguvis-72B | 10.3 | 26.1 |
| Claude Computer-Use | 14.9 (15 steps) | 27.9 |
| Claude Computer-Use | 22.0 (50 steps) | - |
| Our Model | | |
| UI-TARS-7B-SFT | 17.7 (15 steps) | 33.0 |
| UI-TARS-7B-DPO | 18.7 (15 steps) | - |
| UI-TARS-72B-SFT | 18.8 (15 steps) | 46.6 |
| UI-TARS-72B-DPO | 22.7 (15 steps) | - |
| UI-TARS-72B-DPO | 24.6 (50 steps) | - |

Deployment

Cloud Deployment

We recommend using HuggingFace Inference Endpoints for fast deployment. We provide two deployment guides:

English version: GUI Model Deployment Guide

Chinese version: GUI Model Deployment Tutorial (GUI模型部署教程)

Local Deployment [Transformers]

Local inference with Transformers works the same way as for Qwen2-VL; check this tutorial for more details.

Local Deployment [vLLM]

We recommend using vLLM for fast deployment and inference. You need to use vllm>=0.6.1.

pip install -U transformers
VLLM_VERSION=0.6.6
CUDA_VERSION=cu124
pip install vllm==${VLLM_VERSION} --extra-index-url https://download.pytorch.org/whl/${CUDA_VERSION}
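
Before starting the server, it may be worth confirming that the installed vLLM meets the minimum version stated above. The following is a small sketch that uses only the standard library. The helper names `parse_version` and `meets_minimum` are illustrative and not part of vLLM, and the comparison only looks at the leading numeric components, so a suffix such as `.post1` is ignored.

```python
import re
from importlib import metadata

MIN_VLLM = "0.6.1"  # minimum version stated in this README


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_minimum(installed, minimum=MIN_VLLM):
    """True if `installed` is at least `minimum` (numeric components only)."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        print("vllm is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old"
        print(f"vllm {installed}: {status}")
```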

Start an OpenAI API Service

Run the command below to start an OpenAI-compatible API service:

python -m vllm.entrypoints.openai.api_server --served-model-name ui-tars --model <path to your model>

Then call the chat API as shown below, using the GUI prompt (choose the mobile or computer template) and base64-encoded local images (see the OpenAI API protocol document for more details):

import base64
from openai import OpenAI


instruction = "search for today's weather"
screenshot_path = "screenshot.png"
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="empty",
)

## Below is the prompt for mobile
prompt = r"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 

## Output Format
```\nAction_Summary: ...
Action: ...\n```

## Action Space
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
type(content='')
scroll(direction='down or up or right or left')
open_app(app_name='')
navigate_back()
navigate_home()
WAIT()
finished() # Submit the task regardless of whether it succeeds or fails.

## Note
- Use English in `Action_Summary` part.

## User Instruction
"""

with open(screenshot_path, "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read()).decode("utf-8")
response = client.chat.completions.create(
    model="ui-tars",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt + instruction},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded_string}"}},
            ],
        },
    ],
    frequency_penalty=1,
    max_tokens=128,
)
print(response.choices[0].message.content)
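
The request above hard-codes `image/png` in the data URL. If screenshots may arrive in other formats, one option is to infer the MIME type from the file extension. This is a minimal sketch using only the standard library; `image_to_data_url` is a hypothetical helper name, and it assumes the file extension reflects the actual image format.

```python
import base64
import mimetypes


def image_to_data_url(path):
    """Read an image file and return it as a base64 data URL."""
    # Guess the MIME type from the file extension (e.g. .png -> image/png).
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    with open(path, "rb") as image_file:
        encoded = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be passed as the `url` value of the `image_url` entry in the request.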

Prompt Templates

We currently provide two prompt templates for stable operation and performance: one for mobile scenarios and one for personal computer scenarios.

  • Prompt template for mobile:
## Below is the prompt for mobile
prompt = r"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 

## Output Format
```\nThought: ...
Action: ...\n```

## Action Space
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
long_press(start_box='<|box_start|>(x1,y1)<|box_end|>', time='')
type(content='')
scroll(direction='down or up or right or left')
open_app(app_name='')
navigate_back()
navigate_home()
WAIT()
finished() # Submit the task regardless of whether it succeeds or fails.

## Note
- Use English in `Action_Summary` part.

## User Instruction
"""
  • Prompt template for computer:
## Below is the prompt for computer
prompt = r"""<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task. 

## Output Format
```\nThought: ...
Action: ...\n```

## Action Space

click(start_box='<|box_start|>(x1,y1)<|box_end|>')
left_double(start_box='<|box_start|>(x1,y1)<|box_end|>')
right_single(start_box='<|box_start|>(x1,y1)<|box_end|>')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
hotkey(key='')
type(content='') #If you want to submit your input, use \"\n\" at the end of `content`.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left')
wait() #Sleep for 5s and take a screenshot to check for any changes.
finished()
call_user() # Submit the task and call the user when the task is unsolvable, or when you need the user's help.


## Note
- Use Chinese in `Thought` part.
- Summarize your next action (with its target element) in one sentence in `Thought` part.

## User Instruction
"""

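The templates above ask the model to reply with a `Thought:` part followed by an `Action:` part. A minimal sketch of parsing such a reply is shown here, assuming the reply follows that format closely. The function name `parse_reply` and the sample reply are hypothetical, and the point pattern simply picks up every `(x,y)` integer pair inside the action's arguments, so it would also match a pair that appears inside typed text.

```python
import re

# "Action: name(arguments)"; the greedy match runs to the last ")" in the reply.
ACTION_RE = re.compile(r"Action:\s*(?P<name>\w+)\((?P<args>.*)\)", re.DOTALL)
# Integer coordinate pairs such as "(235,512)".
POINT_RE = re.compile(r"\((\d+)\s*,\s*(\d+)\)")
THOUGHT_RE = re.compile(r"Thought:\s*(.*?)\s*(?=Action:)", re.DOTALL)


def parse_reply(reply):
    """Split a reply into its thought text, action name, and coordinate points."""
    action_match = ACTION_RE.search(reply)
    if action_match is None:
        raise ValueError("no 'Action: name(...)' part found in reply")
    thought_match = THOUGHT_RE.search(reply)
    points = [
        (int(x), int(y)) for x, y in POINT_RE.findall(action_match.group("args"))
    ]
    return {
        "thought": thought_match.group(1) if thought_match else None,
        "action": action_match.group("name"),
        "points": points,
    }


# Hypothetical reply, written to match the output format in the templates.
sample = (
    "Thought: Click the search box.\n"
    "Action: click(start_box='<|box_start|>(235,512)<|box_end|>')"
)
print(parse_reply(sample))
```

The extracted points are still on the model's relative scale and need to be mapped to pixels before use, as described under Coordinate Mapping.
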
Local Deployment [Ollama]

Ollama can deploy the model in GGUF format. Loading the model from safetensors is currently buggy.

Get the model in GGUF format

We provide the 2B and 7B models in GGUF format:

2B: https://huggingface.co/bytedance-research/UI-TARS-2B-gguf

7B: https://huggingface.co/bytedance-research/UI-TARS-7B-gguf

Users can convert the model into GGUF format by using the script from llama.cpp:

python3 convert_hf_to_gguf.py <path to your model>

The GGUF file will be generated under the path provided.

Deploy GGUF model

We deploy the model by following the Ollama tutorial.

# Create Modelfile, Windows users can just create a file named Modelfile
echo "FROM ./path/to/model.gguf" > Modelfile

# Create model in Ollama
ollama create ui-tars -f Modelfile

# Run the model
ollama run ui-tars

The test script is the same as for vLLM, except for two changes:

...
client = OpenAI(
    base_url="http://127.0.0.1:11434/v1/",
    ...
)
...
response = client.chat.completions.create(
    model="ui-tars",  # the name we created via the Ollama CLI
    ...
)

Explanation of Inference Results

Coordinate Mapping

The model generates a 2D coordinate output that represents relative positions. To convert these values to image-relative coordinates, divide each component by 1000 to obtain values in the range [0,1]. The absolute coordinates required by the Action can be calculated by:

  • X absolute = X relative × image width
  • Y absolute = Y relative × image height

For example, given a screen size of 1920 × 1080 and a model coordinate output of (235, 512), X absolute is round(1920*235/1000)=451 and Y absolute is round(1080*512/1000)=553, so the absolute coordinate is (451, 553).
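
The mapping can be sketched in a few lines of Python. The function name `to_absolute` and its `scale` parameter are illustrative; the factor of 1000 comes from the description above.

```python
def to_absolute(point, image_width, image_height, scale=1000):
    """Map a model-output point on a 0..scale grid to absolute pixel coordinates."""
    x_rel, y_rel = point
    x_abs = round(image_width * x_rel / scale)
    y_abs = round(image_height * y_rel / scale)
    return x_abs, y_abs


# Worked example from the text: a 1920 x 1080 screen and model output (235, 512).
print(to_absolute((235, 512), 1920, 1080))  # (451, 553)
```

Note that Python's `round` rounds exact halves to the nearest even integer, which only matters when a scaled value lands exactly on .5.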

Use in desktop and web automation

To try the UI-TARS agent on the desktop, see UI-TARS-desktop.

Midscene.js is an open-source web automation SDK that supports the UI-TARS model. Developers can use JavaScript and natural language to control the browser. See this guide for more details about setting up the model.

License

UI-TARS is licensed under the Apache License 2.0.

Acknowledgements

This project builds upon and extends the capabilities of Qwen2-VL, a powerful vision-language model that serves as the foundational architecture for UI-TARS. We would like to acknowledge the contributions of the developers and researchers behind Qwen2-VL for their groundbreaking work in the field of multimodal AI and for providing a robust base for further advancements.

Additionally, we thank the broader open-source community for their datasets, tools, and insights that have facilitated the development of UI-TARS. These collaborative efforts continue to push the boundaries of what GUI automation and AI-driven agents can achieve.

Citation

If you find our paper and code useful in your research, please consider citing our work.

@article{uitars2025,
  author    = {Yujia Qin and Yining Ye and Junjie Fang and Haoming Wang and Shihao Liang and Shizuo Tian and Junda Zhang and Jiahao Li and Yunxin Li and Shijue Huang and Wanjun Zhong and Kuanye Li and Jiale Yang and Yu Miao and Woyu Lin and Longxiang Liu and Xu Jiang and Qianli Ma and Jingyu Li and Xiaojun Xiao and Kai Cai and Chuang Li and Yaowei Zheng and Chaolin Jin and Chen Li and Xiao Zhou and Minchao Wang and Haoli Chen and Zhaojian Li and Haihua Yang and Haifeng Liu and Feng Lin and Tao Peng and Xin Liu and Guang Shi},
  title     = {UI-TARS: Pioneering Automated GUI Interaction with Native Agents},
  journal   = {arXiv preprint arXiv:2501.12326},
  url       = {https://github.com/bytedance/UI-TARS},
  year      = {2025}
}