---
language:
- en
license: mit
pipeline_tag: video-text-to-text
datasets:
- Kangheng/video-utr-7b-hf
---

# Video-UTR-7B Model Card

## 📄 Model details

**Model type:**
Video-UTR is a new family of state-of-the-art video MLLMs built on the LLaVA-NeXT architecture and trained with our proposed Unhackable Temporal Rewarding (UTR). UTR is a novel video-language modeling strategy guided by two principles of our temporal hacking theory, and it contains two key innovations:

1. **Spatiotemporal attributes:** extract trajectory, identity, and action features from video frames with a series of expert models to establish attribution trajectories.
2. **Bidirectional querying:** perform bidirectional querying of temporal and spatial attributes to generate dialogue data that enforces the learning of spatiotemporal dynamics.

![pipeline](https://ooo.0x0.ooo/2025/02/14/OGQbwI.jpg)
![pipeline](https://ooo.0x0.ooo/2025/02/14/OGQx1D.png)

**Paper or resources for more information:** https://github.com/linkangheng/Video-UTR

## 📚 Training dataset

![training dataset](https://ooo.0x0.ooo/2025/02/15/OG0BQ1.png)

## 📊 Main Performance

![video bmk](https://ooo.0x0.ooo/2025/02/14/OGQzHF.png)
![image bmk](https://ooo.0x0.ooo/2025/02/14/OGQF86.png)

## 🚀 How to use the model

First, make sure you have `transformers >= 4.42.0`.

The model supports multi-visual and multi-prompt generation, meaning you can pass multiple images/videos in a single prompt. Make sure to follow the correct prompt template (`USER: xxx\nASSISTANT:`) and add the token `<image>` or `<video>` at the location in the prompt where you want to query an image or a video.
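The snippet below is a minimal sketch of video inference, assuming the checkpoint can be loaded with the `LlavaNextVideoProcessor` / `LlavaNextVideoForConditionalGeneration` classes from `transformers` (the standard LLaVA-NeXT-Video interface). The model id `Kangheng/video-utr-7b-hf` and the file `sample_video.mp4` are placeholders; adjust them to the actual checkpoint and your own video.

```python
import av
import numpy as np
import torch
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

# Placeholder checkpoint id; replace with the released Video-UTR weights.
model_id = "Kangheng/video-utr-7b-hf"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = LlavaNextVideoProcessor.from_pretrained(model_id)


def read_video_pyav(container, indices):
    """Decode the frames at the given indices and return them as an (N, H, W, 3) array."""
    frames = []
    container.seek(0)
    start_index, end_index = indices[0], indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([f.to_ndarray(format="rgb24") for f in frames])


# Sample 8 frames uniformly from the video (path is a placeholder).
container = av.open("sample_video.mp4")
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

# Prompt template with the <video> token marking where the video is queried.
prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

For image inputs, use the `<image>` token in the prompt and pass the image(s) via the processor's `images` argument instead of `videos`.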