---
license: apache-2.0
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
- gui
- llama-cpp
- gguf-my-repo
library_name: transformers
base_model: bytedance-research/UI-TARS-72B-SFT
---

Note: most Qwen2 weight dimensions aren't divisible by 256, so this is effectively a Q8/Q5 quant.

# main-horse/UI-TARS-72B-SFT-Q4_K_M-GGUF

This model was converted to GGUF format from [`bytedance-research/UI-TARS-72B-SFT`](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT) using llama.cpp.
Refer to the [original model card](https://huggingface.co/bytedance-research/UI-TARS-72B-SFT) for more details on the model.

## Use with llama.cpp

Install llama.cpp through brew (works on Mac and Linux):

```bash
brew install llama.cpp
```

Invoke the llama.cpp server or the CLI.

### CLI:

```bash
llama-cli --hf-repo main-horse/UI-TARS-72B-SFT-Q4_K_M-GGUF --hf-file UI-TARS-72B-SFT.Q4_K_M.gguf -p "The meaning of life and the universe is"
```

### Server:

```bash
llama-server --hf-repo main-horse/UI-TARS-72B-SFT-Q4_K_M-GGUF --hf-file UI-TARS-72B-SFT.Q4_K_M.gguf -c 2048
```

Note: You can also use this checkpoint directly through the [usage steps](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage) listed in the llama.cpp repo.

Step 1: Clone llama.cpp from GitHub.

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

Step 2: Build using CMake. The `GGML_CUDA` flags enable CUDA support and are optional; fill in `-DCMAKE_CUDA_ARCHITECTURES` to match your GPU.

```bash
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_F16=1 -DGGML_CUDA_FA_ALL_QUANTS=1 -DCMAKE_CUDA_ARCHITECTURES=...
cmake --build build --config Release -j
```

Step 3: Run inference through the server binary (built into `build/bin/`).

```bash
./build/bin/llama-server --hf-repo main-horse/UI-TARS-72B-SFT-Q4_K_M-GGUF --hf-file UI-TARS-72B-SFT.Q4_K_M.gguf -c 2048
```
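
Once the server is running, you can query it over HTTP via its OpenAI-compatible chat endpoint. A minimal sketch, assuming the default address `http://localhost:8080` (override with `--host`/`--port` if needed); the prompt and the `model` value are illustrative:

```bash
# Send a chat request to llama-server's OpenAI-compatible endpoint.
# Assumes the server is listening on the default http://localhost:8080;
# llama-server serves the single loaded model, so "model" is informational here.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "UI-TARS-72B-SFT.Q4_K_M.gguf",
    "messages": [
      {"role": "user", "content": "Describe what a GUI agent does in one sentence."}
    ],
    "max_tokens": 128
  }'
```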