Update README.md
README.md

pipeline_tag: visual-question-answering
---

# Model Card for InternVL-Chat-V1.2

<img src="https://cdn-uploads.huggingface.co/production/uploads/64119264f0f81eb569e0d569/Sx8dq7ReqSLOgvA_oTmXL.webp" alt="Image Description" width="300" height="300">

\[[Paper](https://arxiv.org/abs/2312.14238)\] \[[GitHub](https://github.com/OpenGVLab/InternVL)\] \[[Chat Demo](https://internvl.opengvlab.com/)\] \[[中文解读 (Chinese explanation)](https://zhuanlan.zhihu.com/p/675877376)\]

| Model | Date | Download | Note |
| ----------------------- | ---------- | ------------------------------------------------------------------------------ | ---------------------------------- |
| InternVL-Chat-V1.5 | 2024.04.18 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5) | supports 4K images; super strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista (🔥 new) |
| InternVL-Chat-V1.2-Plus | 2024.02.21 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2-Plus) | more SFT data and stronger performance |
| InternVL-Chat-V1.2 | 2024.02.11 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-2) | scales the LLM up to 34B |
| InternVL-Chat-V1.1 | 2024.01.24 | 🤗 [HF link](https://huggingface.co/OpenGVLab/InternVL-Chat-Chinese-V1-1) | supports Chinese; stronger OCR |
## InternVL-Chat-V1.2 Blog
\* Proprietary Model

| name | image size | MMMU<br>(val) | MMMU<br>(test) | MathVista<br>(testmini) | MMB<br>(test) | MMB-CN<br>(test) | MMVP | MME | ScienceQA<br>(image) | POPE | TextVQA<br>(val) | SEEDv1<br>(image) | VizWiz<br>(test) | GQA<br>(test) |
| ------------------ | ---------- | ------------- | -------------- | ----------------------- | ------------- | ---------------- | ---- | -------- | -------------------- | ---- | ---------------- | ----------------- | ---------------- | ------------- |
| GPT-4V\*           | unknown    | 56.8          | 55.7           | 49.9                    | 77.0          | 74.4             | 38.7 | 1409/517 | -                    | -    | 78.0             | 71.6              | -                | -             |
| Gemini Ultra\*     | unknown    | 59.4          | -              | 53.0                    | -             | -                | -    | -        | -                    | -    | 82.3             | -                 | -                | -             |
| Gemini Pro\*       | unknown    | 47.9          | -              | 45.2                    | 73.6          | 74.3             | 40.7 | 1497/437 | -                    | -    | 74.6             | 70.7              | -                | -             |
| Qwen-VL-Plus\*     | unknown    | 45.2          | 40.8           | 43.3                    | 67.0          | 70.7             | -    | 1681/502 | -                    | -    | 78.9             | 65.7              | -                | -             |
| Qwen-VL-Max\*      | unknown    | 51.4          | 46.8           | 51.0                    | 77.6          | 75.7             | -    | -        | -                    | -    | 79.5             | -                 | -                | -             |
|                    |            |               |                |                         |               |                  |      |          |                      |      |                  |                   |                  |               |
| LLaVA-NeXT-34B     | 672x672    | 51.1          | 44.7           | 46.5                    | 79.3          | 79.0             | -    | 1631/397 | 81.8                 | 87.7 | 69.5             | 75.9              | 63.8             | 67.1          |
| InternVL-Chat-V1.2 | 448x448    | 51.6          | 46.2           | 47.7                    | 82.2          | 81.2             | 56.7 | 1687/489 | 83.3                 | 88.0 | 72.5             | 75.6              | 60.0             | 64.0          |

- In most benchmarks, InternVL-Chat-V1.2 achieves better performance than LLaVA-NeXT-34B.
- Update (2024-04-21): We have fixed a bug in the evaluation code, and the TextVQA result has been corrected to 72.5.
### Training (Supervised Finetuning)
We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.
## Model Details

- **Model Type:** multimodal large language model (MLLM)
- **Model Stats:**
  - Architecture: [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2) + MLP + [Nous-Hermes-2-Yi-34B](https://huggingface.co/NousResearch/Nous-Hermes-2-Yi-34B)
  - Image size: 448 x 448 (256 visual tokens)
  - Params: 40B
- **Training Strategy:**
  - Pretraining Stage
    - Learnable Component: MLP
    - Data: Trained on 8192 x 4800 = 39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
    - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, to reduce the number of visual tokens, we use a pixel shuffle that merges 1024 tokens into 256 tokens (see the sketch after this list).
  - Supervised Finetuning Stage
    - Learnable Component: ViT + MLP + LLM
    - Data: A simplified, fully open-source dataset containing approximately 1.2 million samples.
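
For intuition, here is a minimal sketch of that pixel-shuffle step, assuming a 448 x 448 input split into a 32 x 32 grid of patch tokens; the function name and the channel width below are illustrative, not the exact implementation in the InternVL code.

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, scale: float = 0.5) -> torch.Tensor:
    """Merge each 2 x 2 block of visual tokens into a single, wider token.

    x: (batch, 1024, c) tokens from a 448 x 448 image (32 x 32 patch grid).
    Returns: (batch, 256, c / scale**2), i.e. 4x fewer tokens with 4x more channels.
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)                                   # 32 x 32 grid
    x = x.view(b, h, w, c)
    # space-to-depth: fold each 2 x 2 neighbourhood of tokens into the channel dimension
    x = x.view(b, h, int(w * scale), int(c / scale))        # (b, 32, 16, 2c)
    x = x.permute(0, 2, 1, 3)                               # (b, 16, 32, 2c)
    x = x.reshape(b, int(w * scale), int(h * scale), int(c / scale ** 2))  # (b, 16, 16, 4c)
    return x.reshape(b, int(n * scale ** 2), -1)            # (b, 256, 4c)

tokens = torch.randn(1, 1024, 3200)        # hypothetical ViT hidden width
print(pixel_shuffle_tokens(tokens).shape)  # torch.Size([1, 256, 12800])
```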
## Model Usage

We provide example code below to run InternVL-Chat-V1.2 using `transformers`.

You can also use our [online demo](https://internvl.opengvlab.com/) for a quick experience of this model.

```python
import torch
from PIL import Image

# ... model loading, image preprocessing, and the chat calls are omitted in this excerpt ...

print(question, response)
```
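
The excerpt above elides the middle of the example, so the following is a minimal sketch of the typical flow. It assumes the repository's custom modeling code loaded via `trust_remote_code=True` and the `model.chat(tokenizer, pixel_values, question, generation_config)` call used in the full example; the image path, the question, and the preprocessing details are illustrative.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, CLIPImageProcessor

path = "OpenGVLab/InternVL-Chat-Chinese-V1-2"

# load the custom InternVL model and tokenizer (requires trust_remote_code=True)
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
    device_map="auto",
).eval()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)

# preprocess one image to the model's 448 x 448 input resolution
image = Image.open("examples/image.jpg").convert("RGB").resize((448, 448))
image_processor = CLIPImageProcessor.from_pretrained(path)
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
pixel_values = pixel_values.to(torch.bfloat16).cuda()

# single-turn chat; the call mirrors the line shown in the full example
generation_config = dict(num_beams=1, max_new_tokens=512, do_sample=False)
question = "Please describe the image in detail."
response, history = model.chat(tokenizer, pixel_values, question, generation_config)
print(question, response)
```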
## Citation
If you find this project useful in your research, please consider citing: