cuierfei committed
Commit 2575015 · verified · 1 Parent(s): ba9fdff

Upload folder using huggingface_hub

Files changed (1): README.md (+18 −18)
README.md CHANGED
@@ -15,7 +15,7 @@ We released [🤗 InternVL-Chat-V1-1](https://huggingface.co/OpenGVLab/InternVL-
 As shown in the figure below, we connected our InternViT-6B to LLaMA2-13B through a simple MLP projector. Note that the LLaMA2-13B used here is not the original model but an internal chat version obtained by incrementally pre-training and fine-tuning the LLaMA2-13B base model for Chinese language tasks. Overall, our model has a total of 19 billion parameters.
 
 <p align="center">
- <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/HD29tU-g0An9FpQn1yK8X.png" style="width: 75%;">
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/HD29tU-g0An9FpQn1yK8X.png" style="width: 100%;">
 </p>
 
 In this version, we explored increasing the resolution to 448 × 448, enhancing OCR capabilities, and improving support for Chinese conversations. Since the 448 × 448 input image generates 1024 visual tokens after passing through the ViT, leading to a significant computational burden, we use a pixel shuffle (unshuffle) operation to reduce the 1024 tokens to 256 tokens.
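The visual pathway described in the hunk above — ViT tokens, a pixel shuffle (unshuffle) down to 256 tokens, then a simple MLP projector into the language model — can be sketched in PyTorch as below. This is a minimal sketch consistent with the card's arithmetic (1024 tokens for a 448 × 448 input implies a 32 × 32 token grid; a 0.5-scale shuffle gives 16 × 16 = 256 tokens with 4× the channels), not the repository's exact code: the tensor sizes, the projector's two-layer shape, and the shuffle-before-projection ordering are all assumptions here.

```python
import torch
import torch.nn as nn

# Stand-in sizes for illustration only; the released model uses
# InternViT-6B's and LLaMA2-13B's actual hidden sizes.
VIT_DIM, LLM_DIM = 64, 128


def pixel_shuffle_tokens(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """Fold each 2x2 neighborhood of visual tokens into the channel dim.

    x: (B, N, C) ViT tokens on a square grid; N = 1024 for a 448x448
    input (a 32 x 32 grid). With scale=0.5, returns (B, N/4, 4*C),
    i.e. 1024 tokens -> 256 tokens.
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)                   # 1024 tokens -> 32 x 32 grid
    x = x.view(b, h, w, c)
    # Merge pairs of columns into channels: (B, H, W/2, 2C)
    x = x.view(b, h, int(w * scale), int(c / scale))
    x = x.permute(0, 2, 1, 3).contiguous()
    # Merge pairs of rows into channels: (B, W/2, H/2, 4C)
    x = x.view(b, int(w * scale), int(h * scale), int(c / (scale ** 2)))
    return x.flatten(1, 2)                  # (B, 256, 4C)


# A "simple MLP projector" into the LLM embedding space; the two-layer
# shape and GELU activation are assumptions of this sketch.
projector = nn.Sequential(
    nn.Linear(VIT_DIM * 4, LLM_DIM),        # channels grew 4x in the shuffle
    nn.GELU(),
    nn.Linear(LLM_DIM, LLM_DIM),
)

vit_tokens = torch.randn(2, 1024, VIT_DIM)  # stand-in for InternViT output
visual_embeds = projector(pixel_shuffle_tokens(vit_tokens))
print(visual_embeds.shape)                  # torch.Size([2, 256, 128])
```

The motivation for the shuffle: self-attention cost in the LLM grows quadratically with sequence length, so folding each 2 × 2 token neighborhood into the channel dimension keeps the information while quartering the number of visual tokens the 13B language model has to attend over.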
@@ -56,23 +56,23 @@ This model can also conduct an in-depth analysis of AAAI's official website and
 
 ## Performance
 
-| model                          | LLaVA-1.5    | InternVL-Chat-V1-0 | InternVL-Chat-V1-0 | InternVL-Chat-V1-1 |
-| :----------------------------: | :----------: | :----------------: | :----------------: | :----------------: |
-| resolution                     | 336          | 336                | 448                | 448                |
-| vision encoder                 | CLIP-L-336px | InternViT-6B-224px | InternViT-6B-448px | InternViT-6B-448px |
-| language model                 | Vicuna-13B   | Vicuna-13B         | Vicuna-13B         | LLaMA2-13B         |
-|                                |              |                    |                    |                    |
-| VQAv2<sub>testdev</sub>        | 80.0         | 80.2               | 82.0               | 80.9               |
-| GQA<sub>testdev</sub>          | 63.3         | 63.9               | 64.1               | 62.5               |
-| VizWiz<sub>test</sub>          | 53.6         | 54.6               | 60.1               | 57.3               |
-| SQA<sub>test</sub>             | 71.6         | 70.1               | 71.6               | 90.1               |
-| TextVQA<sub>val, w/o OCR</sub> | -            | -                  | -                  | 64.2               |
-| TextVQA<sub>val, w/ OCR</sub>  | 61.3         | 58.7               | 64.8               | 68.6               |
-| POPE                           | 85.9         | 87.1               | 87.2               | 87.1               |
-| MME<sub>perception</sub>       | 1531.3       | 1546.9             | 1579.0             | 1659.8             |
-| MMB-EN<sub>test</sub>          | 67.7         | 66.5               | 68.2               | 75.4               |
-| MMB-CN<sub>test</sub>          | 63.6         | 61.9               | 64.0               | 70.3               |
-| MMVet<sub>GPT-4-0613</sub>     | 35.4         | 33.7               | 36.7               | 46.7               |
+| model                          | LLaVA-1.5    | InternVL-Chat<br>-V1-0 | InternVL-Chat<br>-V1-0 | InternVL-Chat<br>-V1-1 |
+| :----------------------------: | :----------: | :--------------------: | :--------------------: | :--------------------: |
+| resolution                     | 336          | 336                    | 448                    | 448                    |
+| vision encoder                 | CLIP-L-336px | InternViT-6B-224px     | InternViT-6B-448px     | InternViT-6B-448px     |
+| language model                 | Vicuna-13B   | Vicuna-13B             | Vicuna-13B             | LLaMA2-13B             |
+|                                |              |                        |                        |                        |
+| VQAv2<sub>testdev</sub>        | 80.0         | 80.2                   | 82.0                   | 80.9                   |
+| GQA<sub>testdev</sub>          | 63.3         | 63.9                   | 64.1                   | 62.5                   |
+| VizWiz<sub>test</sub>          | 53.6         | 54.6                   | 60.1                   | 57.3                   |
+| SQA<sub>test</sub>             | 71.6         | 70.1                   | 71.6                   | 90.1                   |
+| TextVQA<sub>val, w/o OCR</sub> | -            | -                      | -                      | 64.2                   |
+| TextVQA<sub>val, w/ OCR</sub>  | 61.3         | 58.7                   | 64.8                   | 68.6                   |
+| POPE                           | 85.9         | 87.1                   | 87.2                   | 87.1                   |
+| MME<sub>perception</sub>       | 1531.3       | 1546.9                 | 1579.0                 | 1659.8                 |
+| MMB-EN<sub>test</sub>          | 67.7         | 66.5                   | 68.2                   | 75.4                   |
+| MMB-CN<sub>test</sub>          | 63.6         | 61.9                   | 64.0                   | 70.3                   |
+| MMVet<sub>GPT-4-0613</sub>     | 35.4         | 33.7                   | 36.7                   | 46.7                   |
 
 - Note that we use the [official evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator) to test the MMVet scores, with `GPT-4-0613` serving as the judge model. Using different versions of GPT-4 as the judge can result in significant score variations.