Qwen2-VL Collection Vision-language model series based on Qwen2 • 16 items • Updated Dec 6, 2024 • 200
EchoVideo: Identity-Preserving Human Video Generation by Multimodal Feature Fusion Paper • 2501.13452 • Published 12 days ago • 7
Temporal Preference Optimization for Long-Form Video Understanding Paper • 2501.13919 • Published 11 days ago • 21
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos Paper • 2501.13826 • Published 11 days ago • 22
AIMv2 Collection A collection of AIMv2 vision encoders that supports a number of resolutions, native resolution, and a distilled checkpoint. • 19 items • Updated Nov 22, 2024 • 71
BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks Paper • 2412.04626 • Published Dec 5, 2024 • 13
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Paper • 2411.14794 • Published Nov 22, 2024 • 13
TÜLU 3: Pushing Frontiers in Open Language Model Post-Training Paper • 2411.15124 • Published Nov 22, 2024 • 59
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models Paper • 2411.14432 • Published Nov 21, 2024 • 23
OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs Paper • 2411.14199 • Published Nov 21, 2024 • 30
Hymba: A Hybrid-head Architecture for Small Language Models Paper • 2411.13676 • Published Nov 20, 2024 • 41
Multimodal Autoregressive Pre-training of Large Vision Encoders Paper • 2411.14402 • Published Nov 21, 2024 • 43
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions Paper • 2411.14405 • Published Nov 21, 2024 • 58
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization Paper • 2411.10442 • Published Nov 15, 2024 • 73
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines Paper • 2410.21220 • Published Oct 28, 2024 • 10