- Large Motion Video Autoencoding with Cross-modal Video VAE • arXiv:2412.17805 • Published 2 days ago
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling • arXiv:2412.05271 • Published 19 days ago
- Apollo: An Exploration of Video Understanding in Large Multimodal Models • arXiv:2412.10360 • Published 12 days ago
- STIV: Scalable Text and Image Conditioned Video Generation • arXiv:2412.07730 • Published 15 days ago
- PaliGemma 2: A Family of Versatile VLMs for Transfer • arXiv:2412.03555 • Published 21 days ago
- Open-Sora Plan: Open-Source Large Video Generation Model • arXiv:2412.00131 • Published 27 days ago
- Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations • arXiv:2410.10792 • Published Oct 14
- SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory • arXiv:2411.11922 • Published Nov 18
- OminiControl: Minimal and Universal Control for Diffusion Transformer • arXiv:2411.15098 • Published Nov 22
- TÜLU 3: Pushing Frontiers in Open Language Model Post-Training • arXiv:2411.15124 • Published Nov 22
- Multimodal Autoregressive Pre-training of Large Vision Encoders • arXiv:2411.14402 • Published Nov 21
- BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions • arXiv:2411.07461 • Published Nov 12
- DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion • arXiv:2411.04928 • Published Nov 7
- SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers • arXiv:2401.08740 • Published Jan 16
- LLaVA-Video Collection (previously known as LLaVA-NeXT-Video) • Models focused on video understanding • 6 items • Updated Oct 5