---
license: apache-2.0
---
# UI-TARS: Pioneering Automated GUI Interaction with Native Agents

## Overview
UI-TARS is a next-generation native GUI agent model designed to interact seamlessly with graphical user interfaces (GUIs) using human-like perception, reasoning, and action capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components—perception, reasoning, grounding, and memory—within a single vision-language model (VLM), enabling end-to-end task automation without predefined workflows or manual rules.

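As noted in the Acknowledgements, UI-TARS builds on Qwen-2-VL, so it can typically be loaded through the Hugging Face `transformers` Qwen2-VL classes. The snippet below is a minimal sketch under that assumption; the checkpoint ID, prompt wording, and output handling are placeholders rather than official usage.

```python
# Minimal inference sketch, assuming a Qwen2-VL-compatible checkpoint.
# "<ui-tars-checkpoint>" and the example instruction are placeholders, not official values.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "<ui-tars-checkpoint>", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("<ui-tars-checkpoint>")

# One screenshot plus a natural-language instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Open the settings menu and enable dark mode."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
screenshot = Image.open("screenshot.png")

inputs = processor(text=[prompt], images=[screenshot], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```

In a typical agent loop, the generated text would then be parsed into the next GUI action and executed on the target platform.
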
## Core Features
### Perception
- **Comprehensive GUI Understanding**: Processes multimodal inputs (text, images, interactions) to build a coherent understanding of interfaces.
- **Real-Time Interaction**: Continuously monitors dynamic GUIs and responds accurately to changes in real time.

### Action
- **Unified Action Space**: Standardized action definitions across platforms (desktop, mobile, and web); see the sketch after this list.
- **Platform-Specific Actions**: Supports additional actions such as hotkeys, long press, and platform-specific gestures.

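The exact action schema is not spelled out in this README, so the following is a hypothetical sketch of what a unified, cross-platform action space with platform-specific extensions could look like; all names and fields are illustrative.

```python
# Hypothetical sketch of a unified action space; the names and fields are
# illustrative and are not the actual UI-TARS action schema.
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class Action:
    """A single GUI action expressed in a platform-agnostic format."""
    kind: str                                   # "click", "type", "scroll", "hotkey", "long_press", ...
    target: Optional[Tuple[int, int]] = None    # screen coordinates for positional actions
    text: Optional[str] = None                  # text payload for "type" actions
    keys: Tuple[str, ...] = field(default_factory=tuple)  # e.g. ("ctrl", "c") for hotkeys
    duration_ms: int = 0                        # e.g. press duration for "long_press"


# Core actions shared by desktop, mobile, and web, plus platform-specific extras.
click = Action(kind="click", target=(512, 384))
type_text = Action(kind="type", text="hello world")
copy = Action(kind="hotkey", keys=("ctrl", "c"))                            # desktop-specific
long_press = Action(kind="long_press", target=(200, 600), duration_ms=800)  # mobile-specific
```

A per-platform executor would then translate each `Action` into the corresponding native event (mouse click, touch gesture, key press, and so on).
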
### Reasoning
- **System 1 & System 2 Reasoning**: Combines fast, intuitive responses with deliberate, high-level planning for complex tasks.
- **Task Decomposition & Reflection**: Supports multi-step planning, reflection, and error correction for robust task execution.

### Memory
- **Short-Term Memory**: Captures task-specific context for situational awareness; see the sketch after this list.
- **Long-Term Memory**: Retains historical interactions and knowledge for improved decision-making.

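UI-TARS keeps this memory inside the model itself, so the sketch below is only a rough external analogy: a bounded buffer of recent steps (short-term) plus a persistent store of finished episodes (long-term).

```python
# Rough analogy only; UI-TARS handles memory within the VLM rather than as an
# external Python object like this one.
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    # Short-term: a bounded window of the most recent (screenshot, thought, action) steps.
    short_term: deque = field(default_factory=lambda: deque(maxlen=8))
    # Long-term: archived episodes and distilled knowledge kept across tasks.
    long_term: list = field(default_factory=list)

    def observe(self, step: dict) -> None:
        """Record one step of the current task for situational awareness."""
        self.short_term.append(step)

    def finish_task(self, summary: str) -> None:
        """Archive the finished task and clear the task-specific context."""
        self.long_term.append({"summary": summary, "steps": list(self.short_term)})
        self.short_term.clear()
```
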
## Capabilities
- **Cross-Platform Interaction**: Supports desktop, mobile, and web environments with a unified action framework.
- **Multi-Step Task Execution**: Trained to handle complex tasks through multi-step trajectories and reasoning.
- **Learning from Synthetic and Real Data**: Combines large-scale annotated and synthetic datasets for improved generalization and robustness.

## Training Pipeline
1. **Pre-Training**: Leveraging large-scale GUI-specific datasets for foundational learning.
2. **Supervised Fine-Tuning**: Fine-tuning on human-annotated and synthetic multi-step task data.
3. **Continual Learning**: Employing online trace bootstrapping and reinforcement learning for ongoing improvement.

## Evaluation Metrics
- **Step-Level Metrics**: Element accuracy, operation F1 score, and step success rate; see the example after this list.
- **Task-Level Metrics**: Complete match and partial match scores for overall task success.
- **Other Metrics**: Measures for execution efficiency, safety, robustness, and adaptability.

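As a concrete, illustrative example of the step-level metrics, element accuracy and step success rate can be computed from predicted versus reference steps roughly as below; the exact matching rules used in UI-TARS evaluations may differ.

```python
# Illustrative computation of two step-level metrics; the matching rules are
# simplified and may differ from the benchmarks used to evaluate UI-TARS.
from typing import List, Tuple

Step = Tuple[str, str]  # (target element, operation), e.g. ("search_box", "type")


def element_accuracy(pred: List[Step], gold: List[Step]) -> float:
    """Fraction of steps whose predicted target element matches the reference."""
    return sum(p[0] == g[0] for p, g in zip(pred, gold)) / len(gold)


def step_success_rate(pred: List[Step], gold: List[Step]) -> float:
    """Fraction of steps where both the element and the operation are correct."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


pred = [("search_box", "click"), ("search_box", "click"), ("submit_btn", "click")]
gold = [("search_box", "click"), ("search_box", "type"), ("submit_btn", "click")]
print(element_accuracy(pred, gold))   # 1.0: every target element is correct
print(step_success_rate(pred, gold))  # 2/3: the second operation is wrong
```
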
## License
UI-TARS is licensed under the Apache License 2.0.

## Acknowledgements
This project builds upon and extends the capabilities of Qwen-2-VL, a powerful vision-language model, which serves as the foundational architecture for UI-TARS. We would like to acknowledge the contributions of the developers and researchers behind Qwen-2-VL for their groundbreaking work in the field of multimodal AI and for providing a robust base for further advancements.

Additionally, we thank the broader open-source community for their datasets, tools, and insights that have facilitated the development of UI-TARS. These collaborative efforts continue to push the boundaries of what GUI automation and AI-driven agents can achieve.