---
license: mit
---
# 🔥 SPHINX: A Mixer of Tasks, Domains, and Embeddings
Official implementation of ['SPHINX: A Mixer of Tasks, Domains, and Embeddings Advances Multi-modal Large Language Models'](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX).
Try out our [web demo 🚀](http://imagebind-llm.opengvlab.com/) here!
<p align="left">
Github link: <a href="https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX" target="_blank">Github</a> • 👋 join our <a href="https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/docs/wechat.md" target="_blank">WeChat</a>
</p>
## Introduction
We present SPHINX, a versatile multi-modal large language model (MLLM) with a mixer of training tasks, data domains, and visual embeddings.
- **Task Mix.** For all-purpose capabilities, we mix a variety of vision-language tasks for mutual improvement: VQA, REC, REG, OCR, DET, POSE, REL DET, T2I, etc.
- **Embedding Mix.** We capture robust visual representations by fusing distinct visual architectures, pre-training, and granularity.
- **Domain Mix.** For data from real-world and synthetic domains, we mix the weights of two domain-specific models for complementarity.
<p align="left">
<img src="figs/pipeline1.png" width="100%"> <br>
</p>
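The Domain Mix idea can be illustrated with a minimal sketch. It assumes the mixing is a uniform linear interpolation between two checkpoints that share parameter names and shapes; the function name `mix_weights`, the coefficient `beta`, and the plain-list weight representation are illustrative assumptions, not the repository's API.

```python
from typing import Dict, List

# Hypothetical stand-in for a checkpoint: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def mix_weights(real: Weights, synthetic: Weights, beta: float = 0.5) -> Weights:
    """Linearly interpolate two checkpoints: beta * real + (1 - beta) * synthetic.

    Assumption: both checkpoints come from the same architecture, so they
    have identical parameter names and sizes.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be in [0, 1]")
    if real.keys() != synthetic.keys():
        raise ValueError("checkpoints must share the same parameter names")

    mixed: Weights = {}
    for name, a in real.items():
        b = synthetic[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for parameter {name!r}")
        mixed[name] = [beta * x + (1.0 - beta) * y for x, y in zip(a, b)]
    return mixed


if __name__ == "__main__":
    real_domain = {"proj.weight": [1.0, 3.0]}
    synthetic_domain = {"proj.weight": [3.0, 5.0]}
    print(mix_weights(real_domain, synthetic_domain, beta=0.5))
```

With tensor-based checkpoints the same loop would apply per tensor instead of per list; the choice of `beta` is a tuning decision that this sketch does not prescribe.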
On top of SPHINX, we propose to further mix visual scales and sub-images to better capture fine-grained semantics on high-resolution images.
<p align="left">
<img src="figs/pipeline2.png" width="100%"> <br>
</p>
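As a rough sketch of the sub-image idea, the snippet below computes the crop boxes that tile a high-resolution image into equally sized sub-images, to be used alongside a downsampled copy of the full image. The sizes (448 and 224) and the helper name `sub_image_boxes` are assumptions chosen for illustration; the actual pipeline is the one shown in the figure above.

```python
from typing import List, Tuple

# (left, top, right, bottom) in pixels, the convention used by common image libraries.
Box = Tuple[int, int, int, int]


def sub_image_boxes(width: int, height: int, tile: int) -> List[Box]:
    """Return non-overlapping tile boxes covering the image in row-major order.

    Assumption: the image dimensions are exact multiples of the tile size.
    """
    if tile <= 0 or width % tile != 0 or height % tile != 0:
        raise ValueError("width and height must be positive multiples of tile")
    return [
        (left, top, left + tile, top + tile)
        for top in range(0, height, tile)
        for left in range(0, width, tile)
    ]


if __name__ == "__main__":
    # Hypothetical sizes: a 448x448 input split into 224x224 sub-images.
    for box in sub_image_boxes(448, 448, 224):
        print(box)
```

With Pillow, each box could be passed to `Image.crop`, and the low-resolution global view could be produced with `Image.resize((tile, tile))`, so that every view matches the input size the visual encoder expects.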
### Installation
SPHINX is built upon LLaMA2-Accessory. Please follow the instructions [here](https://llama2-accessory.readthedocs.io/en/latest/install.html) for environment setup.
## Inference
This section provides a step-by-step guide for hosting a local SPHINX demo. If you're already familiar with the LLaMA2-Accessory toolkit, note that hosting a SPHINX demo follows the same pipeline as hosting demos for the other models supported by LLaMA2-Accessory.
### Weights
We provide the beta-version checkpoints on [HuggingFace🤗](https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/tree/main/finetune/mm/SPHINX). Please download them to your own machine. The file structure should appear as follows:
```
ckpt_path/
├── consolidated.00-of-02.model.pth
└── consolidated.01-of-02.model.pth
```
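Before launching the demo, it can help to confirm that the download is complete. The following is a small sketch using only the Python standard library; the helper names `expected_shards` and `missing_shards` are hypothetical, and the shard count of 2 is taken from the file structure listed above.

```python
from pathlib import Path
from typing import List


def expected_shards(num_shards: int = 2) -> List[str]:
    """Build the shard file names following the layout shown above."""
    return [
        f"consolidated.{i:02d}-of-{num_shards:02d}.model.pth"
        for i in range(num_shards)
    ]


def missing_shards(ckpt_path: str, num_shards: int = 2) -> List[str]:
    """Return the expected shard names that are not present in ckpt_path."""
    root = Path(ckpt_path)
    return [name for name in expected_shards(num_shards) if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_shards("ckpt_path")
    if missing:
        print("Missing checkpoint files:", ", ".join(missing))
    else:
        print("All checkpoint shards found.")
```

An empty list from `missing_shards` only indicates that the files exist; it does not verify their contents.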
### Host Local Demo
Please follow the instructions [here](https://github.com/Alpha-VLLM/LLaMA2-Accessory/tree/main/SPHINX#host-local-demo) to host a local demo and run the model.
## Result
We provide a comprehensive evaluation of SPHINX and showcase results across multiple benchmarks.
Our evaluation encompasses both **quantitative metrics** and **qualitative assessments**, providing a holistic understanding of the model's performance.
**Evaluation Prompt Design**
<p align="left">
<img src="figs/table1.png" width="100%"> <br>
</p>
* In evaluation, we prioritize aligning with each benchmark's desired output format.
* We employ distinct prompts tailored to benchmarks that necessitate long answers, short answers, and multiple-choice responses.
* For tasks involving visual grounding, we directly reuse the prompts used during training to enhance the model's performance on these particular challenges.
**Benchmarks on Multimodal Large Language Models**
<p align="left">
<img src="figs/table2.png" width="100%"> <br>
</p>
* We test our model on recently proposed VQA-based MLLM benchmarks that comprehensively evaluate a model's characteristics, such as MME, SEED-Bench, POPE, LLaVA-Bench (In-the-Wild), MM-Vet, MathVista, MMBench, and CCBench.
* Long-SPHINX achieves new state-of-the-art results on 5 out of 9 benchmarks.
**Visual Question Answering**
<p align="left">
<img src="figs/table3.png" width="100%"> <br>
</p>
* We evaluate on general VQA benchmarks, such as VQAv2, OKVQA, GQA, VizWiz, ScienceQA, Visual Spatial Reasoning (VSR), and IconQA.
* Additionally, we conduct experiments on text-oriented VQA benchmarks such as TextVQA and OCR-VQA.
* Long-SPHINX achieves competitive results across all benchmarks. We observe that Long-SPHINX outperforms SPHINX on VQA datasets that demand fine-grained visual information, showcasing the effectiveness of our visual mixed-up approach for achieving high resolution without relying on a visual encoder trained specifically on high-resolution images.
**Visual Grounding**
<p align="left">
<img src="figs/table4.png" width="100%"> <br>
</p>
* The table above reports the results of SPHINX and baseline models on REC benchmarks.
* SPHINX exhibits robust performance on visual grounding benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg, **surpassing other vision-language generalist models**.
* Notably, SPHINX outperforms the specialist model G-DINO-L by **more than 1.54%** in accuracy across all tasks within RefCOCO/RefCOCO+/RefCOCOg.
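For readers unfamiliar with how REC accuracy is usually computed, here is a minimal sketch. It assumes the common convention for RefCOCO-style benchmarks, where a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box exceeds 0.5; this README does not state the exact protocol, so the threshold and the function names are assumptions.

```python
from typing import Sequence, Tuple

# (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the ground truth exceeds the threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) > threshold)
    return hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 0, 3, 2)]
    print(rec_accuracy(predictions, ground_truth))
```

In the usage example, the first prediction matches its target exactly and the second overlaps its target with an IoU of one third, so only the first is counted as a hit.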
## Frequently Asked Questions (FAQ)
❓ Encountering issues or have further questions? Find answers to common inquiries [here](https://llama2-accessory.readthedocs.io/en/latest/faq.html). We're here to assist you!
## License
Llama 2 is licensed under the [LLAMA 2 Community License](LICENSE_llama2), Copyright (c) Meta Platforms, Inc. All Rights Reserved.