Update README.md
README.md
CHANGED
@@ -12,7 +12,7 @@ Try out our [web demo 🚀](http://imagebind-llm.opengvlab.com/) here!
 
 ## Introduction
 
-We present
+We present SPHINX, a versatile multi-modal large language model (MLLM) with a mixer of training tasks, data domains, and visual embeddings.
 
 - **Task Mix.** For all-purpose capabilities, we mix a variety of vision-language tasks for mutual improvement: VQA, REC, REG, OCR, etc.
 
@@ -21,24 +21,32 @@ We present $\color{goldenrod}{SPHINX}$, a versatile multi-modal large language m
 - **Domain Mix.** For data from real-world and synthetic domains, we mix the weights of two domain-specific models for complementarity.
 
 <p align="left">
-<img src="figs/pipeline1.png"/ width="
+<img src="figs/pipeline1.png"/ width="100%"> <br>
 </p>
 <p align="left">
-<img src="figs/pipeline2.png"/ width="
+<img src="figs/pipeline2.png"/ width="100%"> <br>
 </p>
 
 ## Result
+
+**Evaluation Prompt Design**
 <p align="left">
-<img src="figs/table1.png"/ width="
+<img src="figs/table1.png"/ width="100%"> <br>
 </p>
+
+**Benchmarks on Multimodal Large Language Models**
 <p align="left">
-<img src="figs/table2.png"/ width="
-</p
+<img src="figs/table2.png"/ width="100%"> <br>
+</p
+
+**Visual Question Answering**
 <p align="left">
-<img src="figs/table3.png"/ width="
+<img src="figs/table3.png"/ width="100%"> <br>
 </p>
+
+**Visual Grounding**
 <p align="left">
-<img src="figs/table4.png"/ width="
+<img src="figs/table4.png"/ width="100%"> <br>
 </p>
 
 ## Inference