Leopard-LLaVA

Paper | GitHub | Models-LLaVA | Models-Idefics2

Summaries

Text-rich images, where text serves as the central visual element guiding the overall understanding, are prevalent in real-world applications, such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but also reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction-tuning datasets for text-rich multi-image scenarios, and (2) the difficulty of balancing image resolution with visual feature sequence length. To address these challenges, we propose Leopard, an MLLM designed specifically for vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning instances tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module that dynamically optimizes the allocation of visual sequence length based on the original aspect ratios and resolutions of the input images. Experiments across a wide range of benchmarks demonstrate our model's superior capabilities in text-rich, multi-image evaluations and competitive performance in general-domain evaluations.

Architectures

For Leopard-LLaVA, we use SigLIP-SO-400M with a 364 × 364 input resolution as the visual encoder, as it supports a higher resolution than the commonly used CLIP visual encoder, which operates at 224 × 224. With a patch size of 14, each image is encoded into a sequence of 26 × 26 = 676 visual features. The visual feature pixel shuffling strategy then reduces each image to a sequence of 169 visual features. We limit the maximum number of images (M) in each sample to 50, which produces up to 50 × 169 = 8,450 visual features in total. Following LLaVA, we adopt a two-layer MLP as the vision-language connector. We use LLaMA-3.1 as the language model.
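
The sequence-length figures above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the Leopard codebase: the function name `visual_token_budget` is hypothetical, and the pixel-shuffle ratio of 2 per spatial dimension is an assumption inferred from the stated 676 to 169 reduction (a factor of 4).

```python
def visual_token_budget(image_size=364, patch_size=14, shuffle_ratio=2, max_images=50):
    """Return (patches per side, patches per image, tokens per image, max tokens per sample).

    shuffle_ratio is an assumed value: merging each 2 x 2 block of patch
    features into one token matches the 676 -> 169 reduction described above.
    """
    patches_per_side = image_size // patch_size            # 364 / 14 = 26
    patches_per_image = patches_per_side ** 2              # 26 * 26 = 676
    tokens_per_image = patches_per_image // shuffle_ratio ** 2  # 676 / 4 = 169
    max_tokens = tokens_per_image * max_images             # 169 * 50 = 8450
    return patches_per_side, patches_per_image, tokens_per_image, max_tokens


if __name__ == "__main__":
    print(visual_token_budget())  # (26, 676, 169, 8450)
```

Changing `max_images` or `shuffle_ratio` in this sketch shows how the visual sequence length trades off against the number of images and the per-image feature resolution.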

Citation

@article{jia2024leopard,
  title={LEOPARD: A Vision Language Model For Text-Rich Multi-Image Tasks},
  author={Jia, Mengzhao and Yu, Wenhao and Ma, Kaixin and Fang, Tianqing and Zhang, Zhihan and Ouyang, Siru and Zhang, Hongming and Jiang, Meng and Yu, Dong},
  journal={arXiv preprint arXiv:2410.01744},
  year={2024}
}