Upload folder using huggingface_hub
README.md CHANGED
@@ -15,7 +15,7 @@ We released [🤗 InternVL-Chat-V1-1](https://huggingface.co/OpenGVLab/InternVL-
As shown in the figure below, we connected our InternViT-6B to LLaMA2-13B through a simple MLP projector. Note that the LLaMA2-13B used here is not the original model but an internal chat version obtained by incrementally pre-training and fine-tuning the LLaMA2-13B base model for Chinese language tasks. Overall, our model has a total of 19 billion parameters.

<p align="center">
-  <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/HD29tU-g0An9FpQn1yK8X.png" style="width:
+  <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/HD29tU-g0An9FpQn1yK8X.png" style="width: 100%;">
</p>

In this version, we explored increasing the resolution to 448 × 448, enhancing OCR capabilities, and improving support for Chinese conversations. Since a 448 × 448 input image generates 1024 visual tokens after passing through the ViT, leading to a significant computational burden, we use a pixel shuffle (unshuffle) operation to reduce the 1024 tokens to 256 tokens.
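As a reader's aid, here is a minimal PyTorch sketch of the two mechanisms this hunk describes: the pixel shuffle (unshuffle) step that merges each 2 × 2 neighborhood of patch tokens to cut 1024 visual tokens down to 256, and a simple MLP projector feeding the language model. Tensor shapes follow the text (448 × 448 input → 32 × 32 = 1024 patch tokens); the helper name `pixel_unshuffle_tokens`, the hidden widths, and the two-layer projector are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

# Assumed feature widths, for illustration only.
vit_hidden = 3200   # InternViT-6B feature width (assumed)
llm_hidden = 5120   # LLaMA2-13B hidden size

def pixel_unshuffle_tokens(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Merge each `factor x factor` patch neighborhood into one token.

    x: [batch, n_tokens, channels], n_tokens a perfect square.
    Returns [batch, n_tokens / factor**2, channels * factor**2].
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)
    x = x.view(b, h, w, c).permute(0, 3, 1, 2)     # [b, c, h, w]
    x = nn.functional.pixel_unshuffle(x, factor)   # [b, c*f*f, h/f, w/f]
    return x.flatten(2).transpose(1, 2)            # [b, (h/f)*(w/f), c*f*f]

# A simple MLP projector from vision features to the LLM embedding space
# (depth and activation are assumptions for this sketch).
projector = nn.Sequential(
    nn.Linear(vit_hidden * 4, llm_hidden),
    nn.GELU(),
    nn.Linear(llm_hidden, llm_hidden),
)

vit_tokens = torch.randn(1, 1024, vit_hidden)  # 448x448 image -> 1024 ViT tokens
merged = pixel_unshuffle_tokens(vit_tokens)    # -> [1, 256, 4 * vit_hidden]
visual_embeds = projector(merged)              # -> [1, 256, llm_hidden]
print(merged.shape, visual_embeds.shape)
```

Note the trade-off: merging 2 × 2 neighborhoods multiplies the channel dimension by 4, so spatial detail is folded into per-token width rather than discarded, while the LLM sees 4× fewer visual tokens.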
@@ -56,23 +56,23 @@ This model can also conduct an in-depth analysis of AAAI's official website and

## Performance

-| model                          | LLaVA-1.5    | InternVL-Chat
-| :----------------------------: | :----------: |
-| resolution                     | 336          |
-| vision encoder                 | CLIP-L-336px |
-| language model                 | Vicuna-13B   |
-|                                |              |
-| VQAv2<sub>testdev</sub>        | 80.0         |
-| GQA<sub>testdev</sub>          | 63.3         |
-| VizWiz<sub>test</sub>          | 53.6         |
-| SQA<sub>test</sub>             | 71.6         |
-| TextVQA<sub>val, w/o OCR</sub> | -            |
-| TextVQA<sub>val, w/ OCR</sub>  | 61.3         |
-| POPE                           | 85.9         |
-| MME<sub>perception</sub>       | 1531.3       |
-| MMB-EN<sub>test</sub>          | 67.7         |
-| MMB-CN<sub>test</sub>          | 63.6         |
-| MMVet<sub>GPT-4-0613</sub>     | 35.4         |
+| model                          | LLaVA-1.5    | InternVL-Chat<br>-V1-0 | InternVL-Chat<br>-V1-0 | InternVL-Chat<br>-V1-1 |
+| :----------------------------: | :----------: | :--------------------: | :--------------------: | :--------------------: |
+| resolution                     | 336          | 336                    | 448                    | 448                    |
+| vision encoder                 | CLIP-L-336px | InternViT-6B-224px     | InternViT-6B-448px     | InternViT-6B-448px     |
+| language model                 | Vicuna-13B   | Vicuna-13B             | Vicuna-13B             | LLaMA2-13B             |
+|                                |              |                        |                        |                        |
+| VQAv2<sub>testdev</sub>        | 80.0         | 80.2                   | 82.0                   | 80.9                   |
+| GQA<sub>testdev</sub>          | 63.3         | 63.9                   | 64.1                   | 62.5                   |
+| VizWiz<sub>test</sub>          | 53.6         | 54.6                   | 60.1                   | 57.3                   |
+| SQA<sub>test</sub>             | 71.6         | 70.1                   | 71.6                   | 90.1                   |
+| TextVQA<sub>val, w/o OCR</sub> | -            | -                      | -                      | 64.2                   |
+| TextVQA<sub>val, w/ OCR</sub>  | 61.3         | 58.7                   | 64.8                   | 68.6                   |
+| POPE                           | 85.9         | 87.1                   | 87.2                   | 87.1                   |
+| MME<sub>perception</sub>       | 1531.3      | 1546.9                 | 1579.0                 | 1659.8                 |
+| MMB-EN<sub>test</sub>          | 67.7         | 66.5                   | 68.2                   | 75.4                   |
+| MMB-CN<sub>test</sub>          | 63.6         | 61.9                   | 64.0                   | 70.3                   |
+| MMVet<sub>GPT-4-0613</sub>     | 35.4         | 33.7                   | 36.7                   | 46.7                   |

- Note that we use the [official evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator) to test the MMVet scores, with `GPT-4-0613` serving as the judge model. Using different versions of GPT-4 as the judge can result in significant score variations.