TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding • Paper 2502.19400
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning • Paper 2502.19634
Multimodal Representation Alignment for Image Generation: Text-Image Interleaved Control Is Easier Than You Think • Paper 2502.20172
UniTok: A Unified Tokenizer for Visual Generation and Understanding • Paper 2502.20321
Mobius: Text to Seamless Looping Video Generation via Latent Shift • Paper 2502.20307
FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute • Paper 2502.20126
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference • Paper 2502.18411
Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs • Paper 2502.19413
How far can we go with ImageNet for Text-to-Image generation? • Paper 2502.21318
SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers • Paper 2502.20545
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs • Paper 2503.01743
Introducing Visual Perception Token into Multimodal Large Language Model • Paper 2502.17425
Slamming: Training a Speech Language Model on One GPU in a Day • Paper 2502.15814
VideoGrain: Modulating Space-Time Attention for Multi-grained Video Editing • Paper 2502.17258
GCC: Generative Color Constancy via Diffusing a Color Checker • Paper 2502.17435
Tell me why: Visual foundation models as self-explainable classifiers • Paper 2502.19577