Kimi k1.5: Scaling Reinforcement Learning with LLMs Paper • 2501.12599 • Published 18 days ago • 90
VideoLLaMA3 Collection Frontier Multimodal Foundation Models for Video Understanding • 14 items • Updated 2 days ago • 11
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding Paper • 2501.13106 • Published 18 days ago • 79
MMVU: Measuring Expert-Level Multi-Discipline Video Understanding Paper • 2501.12380 • Published 19 days ago • 81
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models Paper • 2501.03262 • Published Jan 4 • 90
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Paper • 2501.00958 • Published Jan 1 • 99
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper • 2501.00599 • Published Dec 31, 2024 • 41
PixMo Collection A set of vision-language datasets built by Ai2 and used to train the Molmo family of models. Read more at https://molmo.allenai.org/blog • 9 items • Updated Jan 6 • 56
Inf-CL Collection The corresponding demos/checkpoints/papers/datasets of Inf-CL. • 2 items • Updated 16 days ago • 3
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models Paper • 2410.23266 • Published Oct 30, 2024 • 20
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss Paper • 2410.17243 • Published Oct 22, 2024 • 89
Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective Paper • 2410.12490 • Published Oct 16, 2024 • 8
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio Paper • 2410.12787 • Published Oct 16, 2024 • 31
A Controlled Study on Long Context Extension and Generalization in LLMs Paper • 2409.12181 • Published Sep 18, 2024 • 44