Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Abstract
While Ferret seamlessly integrates regional understanding into the Large Language Model (LLM) to facilitate its referring and grounding capability, it has certain limitations: it is constrained by the pre-trained, fixed visual encoder and fails to perform well on broader tasks. In this work, we unveil Ferret-v2, a significant upgrade to Ferret, with three key designs. (1) Any-resolution grounding and referring: a flexible approach that effortlessly handles higher image resolution, improving the model's ability to process and understand images in greater detail. (2) Multi-granularity visual encoding: by integrating an additional DINOv2 encoder, the model learns better and more diverse underlying contexts for global and fine-grained visual information. (3) A three-stage training paradigm: besides image-caption alignment, an additional stage is proposed for high-resolution dense alignment before the final instruction tuning. Experiments show that Ferret-v2 provides substantial improvements over Ferret and other state-of-the-art methods, thanks to its high-resolution scaling and fine-grained visual processing.
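To make the multi-granularity visual encoding concrete, below is a minimal, hypothetical PyTorch sketch of how a global low-resolution view and several high-resolution local crops could be encoded by two separate ViT backbones (standing in for CLIP and DINOv2) and projected into the LLM token space. The class names, crop count, and dimensions are assumptions for illustration only, not the authors' exact implementation.

```python
# Hypothetical sketch of multi-granularity visual encoding as described in the
# abstract: one encoder for the global (downsampled) image plus another for
# high-resolution local crops, each with its own projector into the LLM
# embedding space. Encoder classes are dummy stand-ins, not pretrained models.
import torch
import torch.nn as nn


class DummyViT(nn.Module):
    """Placeholder for a pretrained ViT (e.g. CLIP or DINOv2) that maps an
    image to a grid of patch features."""

    def __init__(self, patch: int = 14, dim: int = 1024):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.proj(x)                      # (B, dim, H/ps, W/ps)
        return feats.flatten(2).transpose(1, 2)   # (B, num_patches, dim)


class MultiGranularityEncoder(nn.Module):
    """Fuse global low-resolution context with high-resolution local crops."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.global_encoder = DummyViT(dim=vis_dim)   # stands in for CLIP
        self.local_encoder = DummyViT(dim=vis_dim)    # stands in for DINOv2
        self.global_proj = nn.Linear(vis_dim, llm_dim)
        self.local_proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, global_img: torch.Tensor, local_crops: torch.Tensor):
        # global_img:  (B, 3, 336, 336)    -- whole image resized down
        # local_crops: (B, K, 3, 336, 336) -- K high-res sub-image crops
        g = self.global_proj(self.global_encoder(global_img))
        b, k = local_crops.shape[:2]
        crops = local_crops.flatten(0, 1)             # (B*K, 3, H, W)
        l = self.local_proj(self.local_encoder(crops))
        l = l.reshape(b, k * l.shape[1], -1)          # merge crop tokens
        return torch.cat([g, l], dim=1)               # visual tokens for the LLM


if __name__ == "__main__":
    enc = MultiGranularityEncoder()
    tokens = enc(torch.randn(1, 3, 336, 336), torch.randn(1, 4, 3, 336, 336))
    print(tokens.shape)  # torch.Size([1, 2880, 4096])
```

Keeping a separate projector per encoder lets each feature space be mapped into the LLM independently, which is one plausible way to realize the global-plus-fine-grained encoding described above.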
Community
Excited to run into your findings on training recipes, modality encoders, and experiments on resolution scaling. Well done!
A quick question on the mentioned transparency of Ferret-v2: is it already on GitHub?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Griffon v2: Advancing Multimodal Perception with High-Resolution Scaling and Visual-Language Co-Referring (2024)
- InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding (2024)
- The (R)Evolution of Multimodal Large Language Models: A Survey (2024)
- RegionGPT: Towards Region Understanding Vision Language Model (2024)
- Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend