btjhjeon's Collections
Multimodal LLM
DocLLM: A layout-aware generative language model for multimodal document
understanding
Paper
•
2401.00908
•
Published
•
181
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved
Pre-Training
Paper
•
2401.00849
•
Published
•
15
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper
•
2311.05437
•
Published
•
48
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation,
Generation and Editing
Paper
•
2311.00571
•
Published
•
41
LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model
Paper
•
2401.02330
•
Published
•
14
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision,
Language, Audio, and Action
Paper
•
2312.17172
•
Published
•
27
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
Paper
•
2206.08916
•
Published
•
1
ImageBind: One Embedding Space To Bind Them All
Paper
•
2305.05665
•
Published
•
5
Distilling Vision-Language Models on Millions of Videos
Paper
•
2401.06129
•
Published
•
15
LEGO: Language Enhanced Multi-modal Grounding Model
Paper
•
2401.06071
•
Published
•
10
Improving fine-grained understanding in image-text pre-training
Paper
•
2401.09865
•
Published
•
16
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large
Language Models
Paper
•
2402.05935
•
Published
•
15
ViGoR: Improving Visual Grounding of Large Vision Language Models with
Fine-Grained Reward Modeling
Paper
•
2402.06118
•
Published
•
13
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned
Language Models
Paper
•
2402.07865
•
Published
•
12
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large
Vision-Language Models
Paper
•
2402.13577
•
Published
•
8
A Touch, Vision, and Language Dataset for Multimodal Alignment
Paper
•
2402.13232
•
Published
•
14
TinyLLaVA: A Framework of Small-scale Large Multimodal Models
Paper
•
2402.14289
•
Published
•
19
Enhancing Vision-Language Pre-training with Rich Supervisions
Paper
•
2403.03346
•
Published
•
14
LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images
Paper
•
2403.11703
•
Published
•
16
HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal
Large Language Models
Paper
•
2403.13447
•
Published
•
18
When Do We Not Need Larger Vision Models?
Paper
•
2403.13043
•
Published
•
25
Cobra: Extending Mamba to Multi-Modal Large Language Model for Efficient
Inference
Paper
•
2403.14520
•
Published
•
33
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual
Math Problems?
Paper
•
2403.14624
•
Published
•
51
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper
•
2403.07508
•
Published
•
74
Mini-Gemini: Mining the Potential of Multi-modality Vision Language
Models
Paper
•
2403.18814
•
Published
•
45
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model
Handling Resolutions from 336 Pixels to 4K HD
Paper
•
2404.06512
•
Published
•
29
OmniFusion Technical Report
Paper
•
2404.06212
•
Published
•
74
BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper
•
2404.12390
•
Published
•
24
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language
Models
Paper
•
2404.12387
•
Published
•
38
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension
and Generation
Paper
•
2404.14396
•
Published
•
18
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal
Models with Open-Source Suites
Paper
•
2404.16821
•
Published
•
55
SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with
Text-Rich Visual Comprehension
Paper
•
2404.16790
•
Published
•
7
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video
Dense Captioning
Paper
•
2404.16994
•
Published
•
35
What matters when building vision-language models?
Paper
•
2405.02246
•
Published
•
101
An Introduction to Vision-Language Modeling
Paper
•
2405.17247
•
Published
•
87
Matryoshka Multimodal Models
Paper
•
2405.17430
•
Published
•
31
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal
Models
Paper
•
2405.15738
•
Published
•
43
Needle In A Multimodal Haystack
Paper
•
2406.07230
•
Published
•
53
ChartMimic: Evaluating LMM's Cross-Modal Reasoning Capability via
Chart-to-Code Generation
Paper
•
2406.09961
•
Published
•
54
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images
Interleaved with Text
Paper
•
2406.08418
•
Published
•
28
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal
Language Models
Paper
•
2406.09403
•
Published
•
19
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Paper
•
2406.08707
•
Published
•
15
CVQA: Culturally-diverse Multilingual Visual Question Answering
Benchmark
Paper
•
2406.05967
•
Published
•
5
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and
Instruction-Tuning Dataset for LVLMs
Paper
•
2406.11833
•
Published
•
61
mDPO: Conditional Preference Optimization for Multimodal Large Language
Models
Paper
•
2406.11839
•
Published
•
37
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal
Dataset with One Trillion Tokens
Paper
•
2406.11271
•
Published
•
20
TokenPacker: Efficient Visual Projector for Multimodal LLM
Paper
•
2407.02392
•
Published
•
21
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper
•
2407.02477
•
Published
•
21
Vision language models are blind
Paper
•
2407.06581
•
Published
•
82
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Paper
•
2406.16860
•
Published
•
59
ANOLE: An Open, Autoregressive, Native Large Multimodal Models for
Interleaved Image-Text Generation
Paper
•
2407.06135
•
Published
•
20
MAVIS: Mathematical Visual Instruction Tuning
Paper
•
2407.08739
•
Published
•
30
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning
Instruction Using Language Model
Paper
•
2407.07053
•
Published
•
42
LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large
Multimodal Models
Paper
•
2407.07895
•
Published
•
40
SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
Paper
•
2407.09413
•
Published
•
9
INF-LLaVA: Dual-perspective Perception for High-Resolution Multimodal
Large Language Model
Paper
•
2407.16198
•
Published
•
13
VILA^2: VILA Augmented VILA
Paper
•
2407.17453
•
Published
•
39
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Paper
•
2408.01800
•
Published
•
79
VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper
•
2408.05211
•
Published
•
47
mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal
Large Language Models
Paper
•
2408.04840
•
Published
•
32
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Paper
•
2408.08872
•
Published
•
98
LongVILA: Scaling Long-Context Visual Language Models for Long Videos
Paper
•
2408.10188
•
Published
•
51
Show-o: One Single Transformer to Unify Multimodal Understanding and
Generation
Paper
•
2408.12528
•
Published
•
50
Open-FinLLMs: Open Multimodal Large Language Models for Financial
Applications
Paper
•
2408.11878
•
Published
•
52
Building and better understanding vision-language models: insights and
future directions
Paper
•
2408.12637
•
Published
•
124
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of
Encoders
Paper
•
2408.15998
•
Published
•
84
CogVLM2: Visual Language Models for Image and Video Understanding
Paper
•
2408.16500
•
Published
•
56
Law of Vision Representation in MLLMs
Paper
•
2408.16357
•
Published
•
92
LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Paper
•
2408.15881
•
Published
•
21
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via
Hybrid Architecture
Paper
•
2409.02889
•
Published
•
55
mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page
Document Understanding
Paper
•
2409.03420
•
Published
•
26
NVLM: Open Frontier-Class Multimodal LLMs
Paper
•
2409.11402
•
Published
•
72
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at
Any Resolution
Paper
•
2409.12191
•
Published
•
74
Oryx MLLM: On-Demand Spatial-Temporal Understanding at Arbitrary
Resolution
Paper
•
2409.12961
•
Published
•
24
Phantom of Latent for Large Language and Vision Models
Paper
•
2409.14713
•
Published
•
27
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art
Multimodal Models
Paper
•
2409.17146
•
Published
•
104
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with
3D-awareness
Paper
•
2409.18125
•
Published
•
33
MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Paper
•
2409.20566
•
Published
•
53
MIO: A Foundation Model on Multimodal Tokens
Paper
•
2409.17692
•
Published
•
52
Emu3: Next-Token Prediction is All You Need
Paper
•
2409.18869
•
Published
•
93
LLaVA-Critic: Learning to Evaluate Multimodal Models
Paper
•
2410.02712
•
Published
•
35
LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks
Paper
•
2410.01744
•
Published
•
26
TLDR: Token-Level Detective Reward Model for Large Vision Language
Models
Paper
•
2410.04734
•
Published
•
16
MLLM can see? Dynamic Correction Decoding for Hallucination Mitigation
Paper
•
2410.11779
•
Published
•
24
Baichuan-Omni Technical Report
Paper
•
2410.08565
•
Published
•
84
From Generalist to Specialist: Adapting Vision Language Models via
Task-Specific Visual Instruction Tuning
Paper
•
2410.06456
•
Published
•
35
Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper
•
2410.05993
•
Published
•
107
Paper
•
2410.07073
•
Published
•
62
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding
and Generation
Paper
•
2410.13848
•
Published
•
31
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid
Visual Redundancy Reduction
Paper
•
2410.17247
•
Published
•
45
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
Paper
•
2410.13861
•
Published
•
52
Pangea: A Fully Open Multilingual Multimodal LLM for 39 Languages
Paper
•
2410.16153
•
Published
•
43
DM-Codec: Distilling Multimodal Representations for Speech Tokenization
Paper
•
2410.15017
•
Published
•
1
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex
Capabilities
Paper
•
2410.11190
•
Published
•
20
Distill Visual Chart Reasoning Ability from LLMs to MLLMs
Paper
•
2410.18798
•
Published
•
19
WAFFLE: Multi-Modal Model for Automated Front-End Development
Paper
•
2410.18362
•
Published
•
11
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language
Tuning
Paper
•
2410.17779
•
Published
•
7
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language
Understanding
Paper
•
2410.17434
•
Published
•
25
Document Parsing Unveiled: Techniques, Challenges, and Prospects for
Structured Information Extraction
Paper
•
2410.21169
•
Published
•
30
VideoWebArena: Evaluating Long Context Multimodal Agents with Video
Understanding Web Tasks
Paper
•
2410.19100
•
Published
•
6
Paper
•
2410.21276
•
Published
•
82
Infinity-MM: Scaling Multimodal Performance with Large-Scale and
High-Quality Instruction Data
Paper
•
2410.18558
•
Published
•
18
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Paper
•
2410.23218
•
Published
•
46
VideoGLaMM: A Large Multimodal Model for Pixel-Level Visual Grounding in
Videos
Paper
•
2411.04923
•
Published
•
20
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper
•
2411.10440
•
Published
•
111
Awaker2.5-VL: Stably Scaling MLLMs with Parameter-Efficient Mixture of
Experts
Paper
•
2411.10669
•
Published
•
10
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large
Language Models on Mobile Devices
Paper
•
2411.10640
•
Published
•
44
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
Paper
•
2411.17465
•
Published
•
76
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video
Comprehension with Video-Text Duet Interaction Format
Paper
•
2411.17991
•
Published
•
5
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Paper
•
2405.20797
•
Published
•
28
X-Prompt: Towards Universal In-Context Image Generation in
Auto-Regressive Vision Language Foundation Models
Paper
•
2412.01824
•
Published
•
65
VISTA: Enhancing Long-Duration and High-Resolution Video Understanding
by Video Spatiotemporal Augmentation
Paper
•
2412.00927
•
Published
•
26
On Domain-Specific Post-Training for Multimodal Large Language Models
Paper
•
2411.19930
•
Published
•
24
LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual
Preferences
Paper
•
2412.01292
•
Published
•
11
PaliGemma 2: A Family of Versatile VLMs for Transfer
Paper
•
2412.03555
•
Published
•
118
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and
Generation
Paper
•
2412.03069
•
Published
•
30
Video-3D LLM: Learning Position-Aware Video Representation for 3D Scene
Understanding
Paper
•
2412.00493
•
Published
•
16
VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models
Paper
•
2411.19103
•
Published
•
19
Expanding Performance Boundaries of Open-Source Multimodal Models with
Model, Data, and Test-Time Scaling
Paper
•
2412.05271
•
Published
•
121
NVILA: Efficient Frontier Visual Language Models
Paper
•
2412.04468
•
Published
•
54
Florence-VL: Enhancing Vision-Language Models with Generative Vision
Encoder and Depth-Breadth Fusion
Paper
•
2412.04424
•
Published
•
55
POINTS1.5: Building a Vision-Language Model towards Real World
Applications
Paper
•
2412.08443
•
Published
•
38
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity
Visual Descriptions
Paper
•
2412.08737
•
Published
•
51
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper
•
2412.10360
•
Published
•
131
BiMediX2: Bio-Medical EXpert LMM for Diverse Medical Modalities
Paper
•
2412.07769
•
Published
•
26
Multimodal Latent Language Modeling with Next-Token Diffusion
Paper
•
2412.08635
•
Published
•
41
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary
Embedding Distillation
Paper
•
2412.09585
•
Published
•
10
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced
Multimodal Understanding
Paper
•
2412.10302
•
Published
•
7
SynerGen-VL: Towards Synergistic Image Understanding and Generation with
Vision Experts and Token Folding
Paper
•
2412.09604
•
Published
•
35
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of
Thought and Look-ahead Spatial Reasoning
Paper
•
2412.11974
•
Published
•
8
LLaVA-UHD v2: an MLLM Integrating High-Resolution Feature Pyramid via
Hierarchical Window Transformer
Paper
•
2412.13871
•
Published
•
17
Descriptive Caption Enhancement with Visual Specialists for Multimodal
Perception
Paper
•
2412.14233
•
Published
•
6
Diving into Self-Evolving Training for Multimodal Reasoning
Paper
•
2412.17451
•
Published
•
27