Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Abstract
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained, fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: a flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: by integrating an additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
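To make the multi-granularity visual encoding concrete, below is a minimal, hypothetical PyTorch sketch of how a global low-resolution view and several high-resolution local crops could be encoded by two separate ViT backbones (standing in for CLIP and DINOv2) and projected into the LLM token space. The class names, crop count, and dimensions are assumptions for illustration only, not the authors' exact implementation.

```python
# Hypothetical sketch of multi-granularity visual encoding as described in the
# abstract: one encoder for the global (downsampled) image plus another for
# high-resolution local crops, each with its own projector into the LLM
# embedding space. Encoder classes are dummy stand-ins, not pretrained models.
import torch
import torch.nn as nn


class DummyViT(nn.Module):
    """Placeholder for a pretrained ViT (e.g. CLIP or DINOv2) that maps an
    image to a grid of patch features."""

    def __init__(self, patch: int = 14, dim: int = 1024):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.proj(x)                      # (B, dim, H/ps, W/ps)
        return feats.flatten(2).transpose(1, 2)   # (B, num_patches, dim)


class MultiGranularityEncoder(nn.Module):
    """Fuse global low-resolution context with high-resolution local crops."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.global_encoder = DummyViT(dim=vis_dim)   # stands in for CLIP
        self.local_encoder = DummyViT(dim=vis_dim)    # stands in for DINOv2
        self.global_proj = nn.Linear(vis_dim, llm_dim)
        self.local_proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, global_img: torch.Tensor, local_crops: torch.Tensor):
        # global_img:  (B, 3, 336, 336)    -- whole image resized down
        # local_crops: (B, K, 3, 336, 336) -- K high-res sub-image crops
        g = self.global_proj(self.global_encoder(global_img))
        b, k = local_crops.shape[:2]
        crops = local_crops.flatten(0, 1)             # (B*K, 3, H, W)
        l = self.local_proj(self.local_encoder(crops))
        l = l.reshape(b, k * l.shape[1], -1)          # merge crop tokens
        return torch.cat([g, l], dim=1)               # visual tokens for the LLM


if __name__ == "__main__":
    enc = MultiGranularityEncoder()
    tokens = enc(torch.randn(1, 3, 336, 336), torch.randn(1, 4, 3, 336, 336))
    print(tokens.shape)  # torch.Size([1, 2880, 4096])
```

Keeping a separate projector per encoder lets each feature space be mapped into the LLM independently, which is one plausible way to realize the global-plus-fine-grained encoding described above.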
Community
Excited to run into your findings on training recipes, modality encoders, and experiments on resolution scaling. Well done!
A quick question on the mentioned transparency of Ferret-v2: is it already on GitHub?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring (2024)
- InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding (2024)
- The (R)Evolution of Multimodal Large Language Models: A Survey (2024)
- RegionGPT: Towards Region Understanding Vision Language Model (2024)
- Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend