multimodal - a zzfive Collection

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Paper • 2406.07476 • Published Jun 11, 2024 • 35

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

Paper • 2406.08407 • Published Jun 12, 2024 • 27

AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation

Paper • 2406.07686 • Published Jun 11, 2024 • 16

OpenVLA: An Open-Source Vision-Language-Action Model

Paper • 2406.09246 • Published Jun 13, 2024 • 37

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Paper • 2406.09403 • Published Jun 13, 2024 • 21

Explore the Limits of Omni-modal Pretraining at Scale

Paper • 2406.09412 • Published Jun 13, 2024 • 11

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

Paper • 2406.09406 • Published Jun 13, 2024 • 15

ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via Chart-to-Code Generation

Paper • 2406.09961 • Published Jun 14, 2024 • 55

Needle In A Multimodal Haystack

Paper • 2406.07230 • Published Jun 11, 2024 • 53

OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text

Paper • 2406.08418 • Published Jun 12, 2024 • 29

mDPO: Conditional Preference Optimization for Multimodal Large Language Models

Paper • 2406.11839 • Published Jun 17, 2024 • 38

LLaNA: Large Language and NeRF Assistant

Paper • 2406.11840 • Published Jun 17, 2024 • 18

Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs

Paper • 2406.14544 • Published Jun 20, 2024 • 35

PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

Paper • 2406.13923 • Published Jun 20, 2024 • 23

Improving Visual Commonsense in Language Models via Multiple Image Generation

Paper • 2406.13621 • Published Jun 19, 2024 • 13

Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report

Paper • 2406.11403 • Published Jun 17, 2024 • 4

Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters

Paper • 2406.16758 • Published Jun 24, 2024 • 20

Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Paper • 2406.16860 • Published Jun 24, 2024 • 60

Long Context Transfer from Language to Vision

Paper • 2406.16852 • Published Jun 24, 2024 • 33

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Paper • 2406.15704 • Published Jun 22, 2024 • 5

HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

Paper • 2406.19280 • Published Jun 27, 2024 • 62

Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity

Paper • 2406.17720 • Published Jun 25, 2024 • 8

OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents

Paper • 2407.00114 • Published Jun 27, 2024 • 12

Understanding Alignment in Multimodal LLMs: A Comprehensive Study

Paper • 2407.02477 • Published Jul 2, 2024 • 23

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Paper • 2407.03320 • Published Jul 3, 2024 • 94

TokenPacker: Efficient Visual Projector for Multimodal LLM

Paper • 2407.02392 • Published Jul 2, 2024 • 22

Unveiling Encoder-Free Vision-Language Models

Paper • 2406.11832 • Published Jun 17, 2024 • 52

RULE: Reliable Multimodal RAG for Factuality in Medical Vision Language Models

Paper • 2407.05131 • Published Jul 6, 2024 • 25

ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

Paper • 2407.04172 • Published Jul 4, 2024 • 24

Stark: Social Long-Term Multi-Modal Conversation with Persona Commonsense Knowledge

Paper • 2407.03958 • Published Jul 4, 2024 • 20

HEMM: Holistic Evaluation of Multimodal Foundation Models

Paper • 2407.03418 • Published Jul 3, 2024 • 10

ANOLE: An Open, Autoregressive, Native Large Multimodal Models for Interleaved Image-Text Generation

Paper • 2407.06135 • Published Jul 8, 2024 • 21

Vision language models are blind

Paper • 2407.06581 • Published Jul 9, 2024 • 83

Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

Paper • 2407.06189 • Published Jul 8, 2024 • 26

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Paper • 2407.07895 • Published Jul 10, 2024 • 40

PaliGemma: A versatile 3B VLM for transfer

Paper • 2407.07726 • Published Jul 10, 2024 • 68

FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models

Paper • 2407.11522 • Published Jul 16, 2024 • 9

VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models

Paper • 2407.11691 • Published Jul 16, 2024 • 14

OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

Paper • 2407.11895 • Published Jul 16, 2024 • 7

Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development

Paper • 2407.11784 • Published Jul 16, 2024 • 4

E5-V: Universal Embeddings with Multimodal Large Language Models

Paper • 2407.12580 • Published Jul 17, 2024 • 40

LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Paper • 2407.12772 • Published Jul 17, 2024 • 34

Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

Paper • 2407.12679 • Published Jul 17, 2024 • 8

EVLM: An Efficient Vision-Language Model for Visual Understanding

Paper • 2407.14177 • Published Jul 19, 2024 • 43

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Paper • 2407.15841 • Published Jul 22, 2024 • 40

VideoGameBunny: Towards vision assistants for video games

Paper • 2407.15295 • Published Jul 21, 2024 • 22

MIBench: Evaluating Multimodal Large Language Models over Multiple Images

Paper • 2407.15272 • Published Jul 21, 2024 • 10

Visual Haystacks: Answering Harder Questions About Sets of Images

Paper • 2407.13766 • Published Jul 18, 2024 • 2

INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal Large Language Model

Paper • 2407.16198 • Published Jul 23, 2024 • 13

VILA^2: VILA Augmented VILA

Paper • 2407.17453 • Published Jul 24, 2024 • 40

Efficient Inference of Vision Instruction-Following Models with Elastic Cache

Paper • 2407.18121 • Published Jul 25, 2024 • 17

Wolf: Captioning Everything with a World Summarization Framework

Paper • 2407.18908 • Published Jul 26, 2024 • 32

MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

Paper • 2407.21770 • Published Jul 31, 2024 • 22

OmniParser for Pure Vision Based GUI Agent

Paper • 2408.00203 • Published Aug 1, 2024 • 25

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Paper • 2408.01800 • Published Aug 3, 2024 • 82

Language Model Can Listen While Speaking

Paper • 2408.02622 • Published Aug 5, 2024 • 41

ExoViP: Step-by-step Verification and Exploration with Exoskeleton Modules for Compositional Visual Reasoning

Paper • 2408.02210 • Published Aug 5, 2024 • 9

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

Paper • 2408.02718 • Published Aug 5, 2024 • 61

VITA: Towards Open-Source Interactive Omni Multimodal LLM

Paper • 2408.05211 • Published Aug 9, 2024 • 47

VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

Paper • 2408.06327 • Published Aug 12, 2024 • 17

xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Paper • 2408.08872 • Published Aug 16, 2024 • 99

LongVILA: Scaling Long-Context Visual Language Models for Long Videos

Paper • 2408.10188 • Published Aug 19, 2024 • 51

Segment Anything with Multiple Modalities

Paper • 2408.09085 • Published Aug 17, 2024 • 22

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

Paper • 2408.11039 • Published Aug 20, 2024 • 59

GRAB: A Challenging GRaph Analysis Benchmark for Large Multimodal Models

Paper • 2408.11817 • Published Aug 21, 2024 • 9

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Paper • 2408.12528 • Published Aug 22, 2024 • 51

SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models

Paper • 2408.12114 • Published Aug 22, 2024 • 14

SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs

Paper • 2408.11813 • Published Aug 21, 2024 • 12

Building and better understanding vision-language models: insights and future directions

Paper • 2408.12637 • Published Aug 22, 2024 • 126

CogVLM2: Visual Language Models for Image and Video Understanding

Paper • 2408.16500 • Published Aug 29, 2024 • 57

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation

Paper • 2408.15881 • Published Aug 28, 2024 • 21

Law of Vision Representation in MLLMs

Paper • 2408.16357 • Published Aug 29, 2024 • 93

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios

Paper • 2408.17267 • Published Aug 30, 2024 • 23

VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images

Paper • 2408.16176 • Published Aug 28, 2024 • 8

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming

Paper • 2408.16725 • Published Aug 29, 2024 • 53

VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges

Paper • 2409.01071 • Published Sep 2, 2024 • 27

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

Paper • 2409.02889 • Published Sep 4, 2024 • 54

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

Paper • 2409.02813 • Published Sep 4, 2024 • 29

mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding

Paper • 2409.03420 • Published Sep 5, 2024 • 26

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

Paper • 2409.05840 • Published Sep 9, 2024 • 48

POINTS: Improving Your Vision-language Model with Affordable Strategies

Paper • 2409.04828 • Published Sep 7, 2024 • 24

LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Paper • 2409.06666 • Published Sep 10, 2024 • 57

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Paper • 2409.09269 • Published Sep 14, 2024 • 9

NVLM: Open Frontier-Class Multimodal LLMs

Paper • 2409.11402 • Published Sep 17, 2024 • 73

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Paper • 2409.12191 • Published Sep 18, 2024 • 76

Putting Data at the Centre of Offline Multi-Agent Reinforcement Learning

Paper • 2409.12001 • Published Sep 18, 2024 • 4

MonoFormer: One Transformer for Both Diffusion and Autoregression

Paper • 2409.16280 • Published Sep 24, 2024 • 18

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Paper • 2409.17146 • Published Sep 25, 2024 • 108

EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Paper • 2409.18042 • Published Sep 26, 2024 • 38

Emu3: Next-Token Prediction is All You Need

Paper • 2409.18869 • Published Sep 27, 2024 • 94

MIO: A Foundation Model on Multimodal Tokens

Paper • 2409.17692 • Published Sep 26, 2024 • 53

UniMuMo: Unified Text, Music and Motion Generation

Paper • 2410.04534 • Published Oct 6, 2024 • 19

NL-Eye: Abductive NLI for Images

Paper • 2410.02613 • Published Oct 3, 2024 • 23

Pixtral 12B

Paper • 2410.07073 • Published Oct 9, 2024 • 64

Personalized Visual Instruction Tuning

Paper • 2410.07113 • Published Oct 9, 2024 • 70

Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate

Paper • 2410.07167 • Published Oct 9, 2024 • 38

Aria: An Open Multimodal Native Mixture-of-Experts Model

Paper • 2410.05993 • Published Oct 8, 2024 • 109

Multimodal Situational Safety

Paper • 2410.06172 • Published Oct 8, 2024 • 11

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Paper • 2410.02740 • Published Oct 3, 2024 • 52

Video Instruction Tuning With Synthetic Data

Paper • 2410.02713 • Published Oct 3, 2024 • 38

LLaVA-Critic: Learning to Evaluate Multimodal Models

Paper • 2410.02712 • Published Oct 3, 2024 • 35

Distilling an End-to-End Voice Assistant Without Instruction Training Data

Paper • 2410.02678 • Published Oct 3, 2024 • 23

MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents

Paper • 2410.03450 • Published Oct 4, 2024 • 36

Baichuan-Omni Technical Report

Paper • 2410.08565 • Published Oct 11, 2024 • 85

From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning

Paper • 2410.06456 • Published Oct 9, 2024 • 36

LOKI: A Comprehensive Synthetic Data Detection Benchmark using Large Multimodal Models

Paper • 2410.09732 • Published Oct 13, 2024 • 55

MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models

Paper • 2410.10139 • Published Oct 14, 2024 • 52

MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

Paper • 2410.10563 • Published Oct 14, 2024 • 38

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Paper • 2410.10594 • Published Oct 14, 2024 • 26

TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models

Paper • 2410.10818 • Published Oct 14, 2024 • 17

TVBench: Redesigning Video-Language Evaluation

Paper • 2410.07752 • Published Oct 10, 2024 • 6

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio

Paper • 2410.12787 • Published Oct 16, 2024 • 31

MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

Paper • 2410.13085 • Published Oct 16, 2024 • 22

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Paper • 2410.13848 • Published Oct 17, 2024 • 34

MixEval-X: Any-to-Any Evaluations from Real-World Data Mixtures

Paper • 2410.13754 • Published Oct 17, 2024 • 75

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines

Paper • 2410.12705 • Published Oct 16, 2024 • 32

Remember, Retrieve and Generate: Understanding Infinite Visual Concepts as Your Personalized Assistant

Paper • 2410.13360 • Published Oct 17, 2024 • 9

γ-MoD: Exploring Mixture-of-Depth Adaptation for Multimodal Large Language Models

Paper • 2410.13859 • Published Oct 17, 2024 • 8

NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples

Paper • 2410.14669 • Published Oct 18, 2024 • 37

Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities

Paper • 2410.11190 • Published Oct 15, 2024 • 22

PUMA: Empowering Unified MLLM with Multi-granular Visual Generation

Paper • 2410.13861 • Published Oct 17, 2024 • 53

Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages

Paper • 2410.16153 • Published Oct 21, 2024 • 44

Improve Vision Language Model Chain-of-thought Reasoning

Paper • 2410.16198 • Published Oct 21, 2024 • 26

Mitigating Object Hallucination via Concentric Causal Attention

Paper • 2410.15926 • Published Oct 21, 2024 • 17

MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models

Paper • 2410.17637 • Published Oct 23, 2024 • 35

Can Knowledge Editing Really Correct Hallucinations?

Paper • 2410.16251 • Published Oct 21, 2024 • 55

Unbounded: A Generative Infinite Game of Character Life Simulation

Paper • 2410.18975 • Published Oct 24, 2024 • 37

WAFFLE: Multi-Modal Model for Automated Front-End Development

Paper • 2410.18362 • Published Oct 24, 2024 • 13

ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting

Paper • 2410.17856 • Published Oct 23, 2024 • 49

Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Paper • 2410.18558 • Published Oct 24, 2024 • 19

GPT-4o System Card

Paper • 2410.21276 • Published Oct 25, 2024 • 84

Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines

Paper • 2410.21220 • Published Oct 28, 2024 • 10

VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks

Paper • 2410.19100 • Published Oct 24, 2024 • 6

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Paper • 2410.23218 • Published Oct 30, 2024 • 47

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

Paper • 2410.23266 • Published Oct 30, 2024 • 20

DynaMath: A Dynamic Visual Benchmark for Evaluating Mathematical Reasoning Robustness of Vision Language Models

Paper • 2411.00836 • Published Oct 29, 2024 • 15

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

Paper • 2411.02327 • Published Nov 4, 2024 • 11

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

Paper • 2411.04996 • Published Nov 7, 2024 • 51

VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in Videos

Paper • 2411.04923 • Published Nov 7, 2024 • 21

Analyzing The Language of Visual Tokens

Paper • 2411.05001 • Published Nov 7, 2024 • 24

M-Longdoc: A Benchmark For Multimodal Super-Long Document Understanding And A Retrieval-Aware Tuning Framework

Paper • 2411.06176 • Published Nov 9, 2024 • 45

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Paper • 2411.10440 • Published Nov 15, 2024 • 114

Generative World Explorer

Paper • 2411.11844 • Published Nov 18, 2024 • 76

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Paper • 2411.10640 • Published Nov 16, 2024 • 45

Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of Experts

Paper • 2411.10669 • Published Nov 16, 2024 • 10

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation

Paper • 2411.13281 • Published Nov 20, 2024 • 20

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Paper • 2411.10442 • Published Nov 15, 2024 • 76

Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models

Paper • 2411.14432 • Published Nov 21, 2024 • 23

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models

Paper • 2411.14982 • Published Nov 22, 2024 • 16

GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI

Paper • 2411.14522 • Published Nov 21, 2024 • 34

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Paper • 2411.17465 • Published Nov 26, 2024 • 80

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Paper • 2411.15296 • Published Nov 22, 2024 • 20

Rethinking Token Reduction in MLLMs: Towards a Unified Paradigm for Training-Free Acceleration

Paper • 2411.17686 • Published Nov 26, 2024 • 20

VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models

Paper • 2411.17451 • Published Nov 26, 2024 • 11

FINECAPTION: Compositional Image Captioning Focusing on Wherever You Want at Any Granularity

Paper • 2411.15411 • Published Nov 23, 2024 • 8

ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Paper • 2411.18363 • Published Nov 27, 2024 • 10

VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Paper • 2411.17991 • Published Nov 27, 2024 • 5

Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning

Paper • 2411.18203 • Published Nov 27, 2024 • 34

ChatGen: Automatic Text-to-Image Generation From FreeStyle Chatting

Paper • 2411.17176 • Published Nov 26, 2024 • 23

On Domain-Specific Post-Training for Multimodal Large Language Models

Paper • 2411.19930 • Published Nov 29, 2024 • 27

X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models

Paper • 2412.01824 • Published Dec 2, 2024 • 65

PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

Paper • 2412.01800 • Published Dec 2, 2024 • 6

OmniCreator: Self-Supervised Unified Generation with Universal Editing

Paper • 2412.02114 • Published Dec 3, 2024 • 14

PaliGemma 2: A Family of Versatile VLMs for Transfer

Paper • 2412.03555 • Published Dec 4, 2024 • 129

VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding

Paper • 2412.02186 • Published Dec 3, 2024 • 22

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

Paper • 2412.03565 • Published Dec 4, 2024 • 11

VisionZip: Longer is Better but Not Necessary in Vision Language Models

Paper • 2412.04467 • Published Dec 5, 2024 • 107

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion

Paper • 2412.04424 • Published Dec 5, 2024 • 61

NVILA: Efficient Frontier Visual Language Models

Paper • 2412.04468 • Published Dec 5, 2024 • 58

Personalized Multimodal Large Language Models: A Survey

Paper • 2412.02142 • Published Dec 3, 2024 • 14

OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

Paper • 2412.01169 • Published Dec 2, 2024 • 13

p-MoD: Building Mixture-of-Depths MLLMs via Progressive Ratio Decay

Paper • 2412.04449 • Published Dec 5, 2024 • 7

CompCap: Improving Multimodal Large Language Models with Composite Captions

Paper • 2412.05243 • Published Dec 6, 2024 • 19

Maya: An Instruction Finetuned Multilingual Multimodal Model

Paper • 2412.07112 • Published Dec 10, 2024 • 27

ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance

Paper • 2412.06673 • Published Dec 9, 2024 • 11

POINTS1.5: Building a Vision-Language Model towards Real World Applications

Paper • 2412.08443 • Published Dec 11, 2024 • 38

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Paper • 2412.09596 • Published Dec 12, 2024 • 94

Multimodal Latent Language Modeling with Next-Token Diffusion

Paper • 2412.08635 • Published Dec 11, 2024 • 44

Lyra: An Efficient and Speech-Centric Framework for Omni-Cognition

Paper • 2412.09501 • Published Dec 12, 2024 • 45

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Paper • 2412.10360 • Published Dec 13, 2024 • 140

SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Paper • 2412.09604 • Published Dec 12, 2024 • 35

VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation

Paper • 2412.10704 • Published Dec 14, 2024 • 15

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Paper • 2412.14171 • Published Dec 18, 2024 • 24

Progressive Multimodal Reasoning via Active Retrieval

Paper • 2412.14835 • Published Dec 19, 2024 • 73

MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval

Paper • 2412.14475 • Published Dec 19, 2024 • 53

Diving into Self-Evolving Training for Multimodal Reasoning

Paper • 2412.17451 • Published Dec 23, 2024 • 43

Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search

Paper • 2412.18319 • Published Dec 24, 2024 • 37

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

Paper • 2412.18619 • Published Dec 16, 2024 • 55

Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment

Paper • 2412.19326 • Published Dec 26, 2024 • 18

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Paper • 2501.00958 • Published Jan 1 • 100

VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Paper • 2501.01957 • Published Jan 3 • 42

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

Paper • 2501.01904 • Published Jan 3 • 32

Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction

Paper • 2501.03218 • Published Jan 6 • 36

Cosmos World Foundation Model Platform for Physical AI

Paper • 2501.03575 • Published Jan 7 • 69

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Paper • 2501.03895 • Published Jan 7 • 50

OpenOmni: Large Language Models Pivot Zero-shot Omnimodal Alignment across Language with Real-time Self-Aware Emotional Speech Synthesis

Paper • 2501.04561 • Published Jan 8 • 16

Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

Paper • 2501.04003 • Published Jan 7 • 25

VideoRAG: Retrieval-Augmented Generation over Video Corpus

Paper • 2501.05874 • Published Jan 10 • 68

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Paper • 2501.06186 • Published Jan 10 • 61

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Paper • 2501.05510 • Published Jan 9 • 39

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

Paper • 2501.05452 • Published Jan 9 • 15

Infecting Generative AI With Viruses

Paper • 2501.05542 • Published Jan 9 • 13

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

Paper • 2501.05767 • Published Jan 10 • 28

MinMo: A Multimodal Large Language Model for Seamless Voice Interaction

Paper • 2501.06282 • Published Jan 10 • 48

Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding

Paper • 2501.07888 • Published Jan 14 • 15

Do generative video models learn physical principles from watching videos?

Paper • 2501.09038 • Published Jan 14 • 32

InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model

Paper • 2501.12368 • Published Jan 21 • 42

MSTS: A Multimodal Safety Test Suite for Vision-Language Models

Paper • 2501.10057 • Published Jan 17 • 8

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Paper • 2501.13106 • Published Jan 22 • 84

Temporal Preference Optimization for Long-Form Video Understanding

Paper • 2501.13919 • Published Jan 23 • 22

Baichuan-Omni-1.5 Technical Report

Paper • 2501.15368 • Published Jan 26 • 61

Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity

Paper • 2501.16295 • Published Jan 27 • 8

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Paper • 2502.01341 • Published Feb 3 • 36

Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment

Paper • 2502.04328 • Published 28 days ago • 28

EVEv2: Improved Baselines for Encoder-Free Vision-Language Models

Paper • 2502.06788 • Published 24 days ago • 12

Scaling Pre-training to One Hundred Billion Data for Vision Language Models

Paper • 2502.07617 • Published 23 days ago • 29

Magma: A Foundation Model for Multimodal AI Agents

Paper • 2502.13130 • Published 16 days ago • 55

Qwen2.5-VL Technical Report

Paper • 2502.13923 • Published 15 days ago • 156

Slamming: Training a Speech Language Model on One GPU in a Day

Paper • 2502.15814 • Published 15 days ago • 66

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

Paper • 2502.18411 • Published 9 days ago • 69

MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs

Paper • 2502.17422 • Published 10 days ago • 7

Introducing Visual Perception Token into Multimodal Large Language Model

Paper • 2502.17425 • Published 10 days ago • 14