Official implementation of 'SPHINX: A Mixer of Tasks, Domains, and Embeddings …'

Try out our [web demo 🚀](http://imagebind-llm.opengvlab.com/) here!

<p align="left">
Github link: <a href="https://huggingface.co/Alpha-VLLM/SPHINX" target="_blank">Github</a> • 👋 join our <a href="https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/docs/wechat.md" target="_blank">WeChat</a>
</p>

## Introduction

We present SPHINX, a versatile multi-modal large language model (MLLM) with a mixer of tasks, domains, and embeddings.

<p align="left">
<img src="figs/pipeline1.png" width="100%"> <br>
</p>

On top of SPHINX, we further propose to mix visual scales and sub-images to better capture fine-grained semantics in high-resolution images.

<p align="left">
<img src="figs/pipeline2.png" width="100%"> <br>
</p>

### Installation

SPHINX is built upon LLaMA2-Accessory; please follow the instructions [here](https://llama2-accessory.readthedocs.io/en/latest/install.html) for environment setup.
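
For quick reference, the snippet below is a minimal setup sketch assuming a standard `git` + `pip` workflow; the linked installation guide remains the authoritative source.

```bash
# Clone the LLaMA2-Accessory repository, which contains SPHINX
git clone https://github.com/Alpha-VLLM/LLaMA2-Accessory.git
cd LLaMA2-Accessory

# Install Python dependencies (the exact requirements file and the recommended
# PyTorch/CUDA versions may differ -- follow the official installation guide)
pip install -r requirements.txt
```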
## Inference

This section provides a step-by-step guide for hosting a local SPHINX demo. If you're already familiar with the LLaMA2-Accessory toolkit, note that hosting a SPHINX demo follows the same pipeline as hosting demos for the other models supported by LLaMA2-Accessory.

### Weights

We provide the beta-version checkpoints on [HuggingFace🤗](https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/tree/main/finetune/mm/sphinx-sft). Please download them to your own machine, preserving the file structure of the HuggingFace repository.
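
If you prefer the command line, one hedged way to fetch the checkpoints is shown below; it assumes `git-lfs` is installed, and note that cloning pulls the entire Alpha-VLLM/LLaMA2-Accessory model repository, so downloading only the `finetune/mm/sphinx-sft` folder via the HuggingFace web UI may be faster.

```bash
# Clone the HuggingFace model repository (requires git-lfs; large download)
git lfs install
git clone https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory accessory-weights

# The SPHINX checkpoints live under finetune/mm/sphinx-sft
ls accessory-weights/finetune/mm/sphinx-sft
```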
Explanation of each argument (an example launch command combining them is sketched after the list):

+ `--tokenizer_path`: Path to the official LLaMA2 tokenizer. Note that the tokenizer file is the same for both LLaMA and LLaMA2. You may download it from [here](https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/blob/main/config/tokenizer.model).
+ `--llama_type`: The model architecture of SPHINX is defined in [accessory/model/LLM/llama_ens.py](../accessory/model/LLM/llama_ens.py); specifying `--llama_type=llama_ens` tells the demo program to use this architecture.
+ `--pretrained_path`: The path to the pre-trained checkpoint.
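
To illustrate how these arguments fit together, here is a hedged example of a launch command. The demo script path and GPU count are placeholders (use the actual demo entry point documented in LLaMA2-Accessory), and any other arguments the script requires are omitted here.

```bash
# Illustrative only: replace demos/single_turn_mm.py with the actual demo
# script shipped with LLaMA2-Accessory, and adjust paths and GPU count.
torchrun --nproc_per_node=1 demos/single_turn_mm.py \
    --llama_type=llama_ens \
    --tokenizer_path=/path/to/tokenizer.model \
    --pretrained_path=/path/to/sphinx-sft
```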
## Result

We provide a comprehensive evaluation of SPHINX and showcase results across multiple benchmarks.

Our evaluation encompasses both **quantitative metrics** and **qualitative assessments**, providing a holistic understanding of our VLM's performance.

**Evaluation Prompt Design**
<p align="left">
<img src="figs/table1.png" width="100%"> <br>
</p>

* In evaluation, we prioritize aligning with each benchmark's desired output format.
* We employ distinct prompts tailored to benchmarks that require long answers, short answers, and multiple-choice responses.
* For visual grounding tasks, we directly reuse the prompts from training to enhance the model's performance on these particular challenges.

**Benchmarks on Multimodal Large Language Models**
<p align="left">
<img src="figs/table2.png" width="100%"> <br>
</p>

* We test our model on recently proposed MLLM benchmarks built on VQA that comprehensively evaluate a model's capabilities: MME, SEED-Bench, POPE, LLaVA-Bench (In-the-Wild), MM-Vet, MathVista, MMBench, and CCBench.
* Long-SPHINX achieves new state-of-the-art results on 5 out of 9 benchmarks.

**Visual Question Answering**
<p align="left">
<img src="figs/table3.png" width="100%"> <br>
</p>

* We evaluate on general VQA benchmarks such as VQAv2, OKVQA, GQA, VizWiz, ScienceQA, Visual Spatial Reasoning (VSR), and IconQA.
* Additionally, we conduct experiments on text-oriented VQA benchmarks such as TextVQA and OCR-VQA.
* Long-SPHINX achieves competitive results across all benchmarks. We observe that Long-SPHINX outperforms SPHINX on VQA datasets that demand fine-grained visual information, showcasing the effectiveness of our visual mixing approach for achieving high resolution without relying on a visual encoder trained specifically on high-resolution images.

**Visual Grounding**
<p align="left">
<img src="figs/table4.png" width="100%"> <br>
</p>

* Table 4 reports results for SPHINX and baseline models on REC (referring expression comprehension) benchmarks.
* SPHINX exhibits robust performance in visual grounding tasks such as RefCOCO, RefCOCO+, and RefCOCOg, **surpassing other vision-language generalist models**.
* Notably, SPHINX outperforms the specialist model G-DINO-L by **more than 1.54%** in accuracy across all tasks within RefCOCO/RefCOCO+/RefCOCOg.