void0721 committed on
Commit b1ad2db · 1 Parent(s): 38cb900

Update README.md

Files changed (1)
  1. README.md +50 -22
README.md CHANGED
@@ -7,8 +7,10 @@ Official implementation of ['SPHINX: A Mixer of Tasks, Domains, and Embeddings A

  Try out our [web demo 🚀](http://imagebind-llm.opengvlab.com/) here!

- ## News
- * **[2023-10-17]** We release the demo, code, and model of SPHINX 🎉.
+ <p align="left">
+ Github link: <a href="https://huggingface.co/Alpha-VLLM/SPHINX" target="_blank">Github</a> 👋 join our <a href="https://github.com/Alpha-VLLM/LLaMA2-Accessory/blob/main/docs/wechat.md" target="_blank">WeChat</a>
+ </p>
+


  ## Introduction
@@ -23,37 +25,21 @@ We present SPHINX, a versatile multi-modal large language model (MLLM) with a mi
  <p align="left">
  <img src="figs/pipeline1.png"/ width="100%"> <br>
  </p>
+
+ On top of SPHINX, we further propose to mix visual scales and sub-images to better capture fine-grained semantics in high-resolution images.
  <p align="left">
  <img src="figs/pipeline2.png"/ width="100%"> <br>
  </p>

- ## Result
-
- **Evaluation Prompt Design**
- <p align="left">
- <img src="figs/table1.png"/ width="100%"> <br>
- </p>
-
- **Benchmarks on Multimodal Large Language Models**
- <p align="left">
- <img src="figs/table2.png"/ width="100%"> <br>
- </p
-
- **Visual Question Answering**
- <p align="left">
- <img src="figs/table3.png"/ width="100%"> <br>
- </p>
-
- **Visual Grounding**
- <p align="left">
- <img src="figs/table4.png"/ width="100%"> <br>
- </p>
+
+ ### Installation
+ SPHINX is built upon LLaMA2-Accessory; please follow the instructions [here](https://llama2-accessory.readthedocs.io/en/latest/install.html) for environment setup.

+
+
  ## Inference
  This section provides a step-by-step guide for hosting a local SPHINX demo. If you're already familiar with the LLAMA2-Accessory toolkit, note that hosting a SPHINX demo follows the same pipeline as hosting demos for the other models supported by LLAMA2-Accessory.

- ### Installation
- SPHINX is built upon LLaMA2-Accessory, please follow the instructions [here](https://llama2-accessory.readthedocs.io/en/latest/install.html) for environment setup.

  ### Weights
  We provide the beta-version checkpoints on [HuggingFace🤗](https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/tree/main/finetune/mm/sphinx-sft). Please download them to your own machine. The file structure should appear as follows:
@@ -77,3 +63,45 @@ Explanation of each argument:
  + `--tokenizer_path`: Path to the official LLaMA2 tokenizer. Note that the tokenizer file is the same for both LLaMA and LLaMA2. You may download it from [here](https://huggingface.co/Alpha-VLLM/LLaMA2-Accessory/blob/main/config/tokenizer.model).
  + `--llama_type`: The model architecture of SPHINX is defined in [accessory/model/LLM/llama_ens.py](../accessory/model/LLM/llama_ens.py), and specifying `--llama_type=llama_ens ` tells the demo program to use this architecture.
  + `--pretrained_path`: The path to pre-trained checkpoint.
+
+
+ ## Result
+
+ We provide a comprehensive evaluation of SPHINX and showcase results across multiple benchmarks.
+
+ Our evaluation encompasses both **quantitative metrics** and **qualitative assessments**, providing a holistic understanding of the model's performance.
+
+ **Evaluation Prompt Design**
+ <p align="left">
+ <img src="figs/table1.png"/ width="100%"> <br>
+ </p>
+
+ * In evaluation, we prioritize aligning with each benchmark's desired output format.
+ * We employ distinct prompts tailored to benchmarks that require long answers, short answers, and multiple-choice responses.
+ * For visual grounding tasks, we directly reuse the training prompts to enhance the model's performance on these particular challenges.
+
+ **Benchmarks on Multimodal Large Language Models**
+ <p align="left">
+ <img src="figs/table2.png"/ width="100%"> <br>
+ </p>
+
+ * We test our model on recently proposed MLLM benchmarks that build on VQA to comprehensively evaluate the model's characteristics, namely MME, Seedbench, POPE, LLaVA-Bench (In-the-Wild), MM-Vet, MathVista, MMbench, and CCbench.
+ * Long-SPHINX achieves new state-of-the-art results on 5 out of 9 benchmarks.
+
+ **Visual Question Answering**
+ <p align="left">
+ <img src="figs/table3.png"/ width="100%"> <br>
+ </p>
+
+ * We evaluate general VQA benchmarks such as VQAV2, OKVQA, GQA, vizwiz, scienceQA, visual spatial reasoning (VSR), and IconQA.
+ * Additionally, we conduct experiments on text-oriented VQA benchmarks such as TextVQA and OCR-VQA.
+ * Long-SPHINX achieves competitive results across all benchmarks. We observe that Long-SPHINX outperforms SPHINX on VQA datasets that demand fine-grained visual information, showcasing the effectiveness of mixing visual scales and sub-images to achieve high resolution without relying on a visual encoder trained specifically on high-resolution images.
+
+ **Visual Grounding**
+ <p align="left">
+ <img src="figs/table4.png"/ width="100%"> <br>
+ </p>
+
+ * Table 4 reports results for SPHINX and baseline models on REC benchmarks.
+ * SPHINX exhibits robust performance in visual grounding tasks such as RefCOCO, RefCOCO+, and RefCOCOg, **surpassing other vision-language generalist models**.
+ * Notably, SPHINX outperforms the specialist model G-DINO-L by **more than 1.54%** in accuracy across all tasks within RefCOCO/RefCOCO+/RefCOCOg.
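
For readers who want to try the checkpoints referenced in the `### Weights` section above, a minimal download sketch assuming the `huggingface_hub` Python package; the repo ID and sub-paths come from the links in the diff, while the local directory name is a placeholder.

```python
# Minimal sketch, assuming the huggingface_hub package is installed.
# The repo ID and sub-paths come from the links in the README diff above;
# "./sphinx-weights" is a placeholder local directory.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Alpha-VLLM/LLaMA2-Accessory",
    allow_patterns=["finetune/mm/sphinx-sft/*", "config/tokenizer.model"],  # SPHINX SFT weights + LLaMA2 tokenizer
    local_dir="./sphinx-weights",  # placeholder target directory
)
```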
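
Putting the three documented arguments together, a hypothetical launch sketch follows. The entry-point name `demo.py` and both paths are placeholders (the actual demo command appears in the full README, not in this diff); only `--llama_type`, `--tokenizer_path`, and `--pretrained_path` are taken from the argument list above.

```python
# Hypothetical invocation sketch built from the three flags documented above.
# "demo.py" and both paths are placeholders, not the official entry point.
import subprocess

subprocess.run(
    [
        "python", "demo.py",                        # placeholder demo script
        "--llama_type", "llama_ens",                # architecture defined in accessory/model/LLM/llama_ens.py
        "--tokenizer_path", "./sphinx-weights/config/tokenizer.model",   # official LLaMA2 tokenizer
        "--pretrained_path", "./sphinx-weights/finetune/mm/sphinx-sft",  # downloaded checkpoint directory
    ],
    check=True,  # raise if the demo process exits with an error
)
```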