Upload folder using huggingface_hub
README.md
CHANGED
@@ -1,6 +1,20 @@
|
|
1 |
---
|
2 |
license: mit
|
3 |
pipeline_tag: image-text-to-text
|
4 |
---
|
5 |
|
6 |
# InternVL-Chat-V1-2
|
@@ -31,13 +45,13 @@ For better training reproducibility, we follow the minimalist design and data ef
|
|
31 |
|
32 |
- **Training Strategy:**
|
33 |
|
34 |
-
-
|
35 |
- Learnable Component: ViT + MLP
|
36 |
-
- Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR
|
37 |
-
- Note: In this stage, we load the
|
38 |
-
- Supervised
|
39 |
- Learnable Component: ViT + MLP + LLM
|
40 |
-
- Data: A simplified, fully open-source dataset, containing approximately 1.2 million samples.
|
41 |
|
42 |
## Performance
|
43 |
|
@@ -54,7 +68,6 @@ For better training reproducibility, we follow the minimalist design and data ef
|
|
54 |
| LLaVA−NEXT−34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
|
55 |
| InternVL−Chat<br>−V1-2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1687/489 | 83.3 | 88.0 | 72.5 | 75.6 | 60.0 | 64.0 |
|
56 |
|
57 |
-
- Note that we use the [official evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator) to test the MMVet scores, with `GPT-4-0613` serving as the judge model. Using different versions of GPT-4 as the judge can result in significant score variations.
|
58 |
- In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
|
59 |
|
60 |
Here, we have conducted only a simple performance comparison. For more detailed performance information and additional evaluation metrics, please refer to our performance summary table.
|
@@ -65,15 +78,15 @@ Here, we have conducted only a simple performance comparison. For more detailed
|
|
65 |
|
66 |
Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2, using approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. At a high level, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.
|
67 |
|
68 |
-
For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
|
69 |
|
70 |
-
### Training (Supervised
|
71 |
|
72 |
-
We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/
|
73 |
|
74 |
-
For more details about training, please see [here](https://
|
75 |
|
76 |
-
The hyperparameters used for
|
77 |
|
78 |
| Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|
79 |
| ---------------------- | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
|
|
|
1 |
---
|
2 |
license: mit
|
3 |
pipeline_tag: image-text-to-text
|
4 |
+
library_name: transformers
|
5 |
+
base_model:
|
6 |
+
- OpenGVLab/InternViT-6B-448px-V1-2
|
7 |
+
- NousResearch/Nous-Hermes-2-Yi-34B
|
8 |
+
base_model_relation: merge
|
9 |
+
language:
|
10 |
+
- multilingual
|
11 |
+
tags:
|
12 |
+
- internvl
|
13 |
+
- vision
|
14 |
+
- ocr
|
15 |
+
- multi-image
|
16 |
+
- video
|
17 |
+
- custom_code
|
18 |
---
|
19 |
|
20 |
# InternVL-Chat-V1-2
|
|
|
45 |
|
46 |
- **Training Strategy:**
|
47 |
|
48 |
+
- Pre-training Stage
|
49 |
- Learnable Component: ViT + MLP
|
50 |
+
- Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
|
51 |
+
- Note: In this stage, we first load the pre-trained weights of [InternViT-6B-448px-V1-0](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) and connect it to Nous-Hermes-2-Yi-34B. After pre-training, the extracted ViT is published as [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). To reduce the number of visual tokens, we apply a pixel shuffle that cuts the count from 1024 to 256.
|
52 |
+
- Supervised Fine-tuning Stage
|
53 |
- Learnable Component: ViT + MLP + LLM
|
54 |
+
- Data: A simplified, fully open-source dataset, containing approximately 1.2 million samples. You can download it from [here](https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data).
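The token reduction mentioned in the pre-training note can be illustrated with a minimal pure-Python sketch. It assumes the 1024 visual tokens form a 32x32 grid and that each 2x2 block of neighbouring tokens is merged into one token with four times as many channels. The function names, the channel width of 4, and the order in which channels are concatenated are hypothetical choices for illustration, not the model's actual implementation.

```python
def pixel_unshuffle_tokens(grid, factor=2):
    """Merge each factor x factor block of neighbouring tokens into one token.

    grid: H x W nested list, where each cell is a list of C channel values.
    Returns an (H // factor) x (W // factor) grid whose cells hold
    C * factor * factor channel values.
    """
    height, width = len(grid), len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid sides must be divisible by factor")
    merged = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            cell = []
            # Concatenate the channels of the block in row-major order
            # (an assumed ordering, chosen for readability).
            for dy in range(factor):
                for dx in range(factor):
                    cell.extend(grid[top + dy][left + dx])
            row.append(cell)
        merged.append(row)
    return merged


def count_tokens(grid):
    """Number of spatial positions (tokens) in the grid."""
    return sum(len(row) for row in grid)


# Assumed layout: 32 x 32 = 1024 tokens; 4 channels is a toy width.
side, channels = 32, 4
tokens = [[[float(y * side + x)] * channels for x in range(side)]
          for y in range(side)]
reduced = pixel_unshuffle_tokens(tokens, factor=2)

print(count_tokens(tokens), "->", count_tokens(reduced))   # 1024 -> 256
print(len(tokens[0][0]), "->", len(reduced[0][0]))         # 4 -> 16
```

No values are discarded in this sketch: the spatial grid shrinks by a factor of four while each remaining token carries the channels of the four tokens it replaced.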
|
55 |
|
56 |
## Performance
|
57 |
|
|
|
68 |
| LLaVA−NEXT−34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
|
69 |
| InternVL−Chat<br>−V1-2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1687/489 | 83.3 | 88.0 | 72.5 | 75.6 | 60.0 | 64.0 |
|
70 |
|
|
|
71 |
- In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
|
72 |
|
73 |
Here, we have conducted only a simple performance comparison. For more detailed performance information and additional evaluation metrics, please refer to our performance summary table.
|
|
|
78 |
|
79 |
Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2, using approximately 1.2M visual instruction tuning samples in total, all of which are fully open-source. At a high level, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.
|
80 |
|
81 |
+
You can now download these datasets directly from [Hugging Face](https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data). For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
|
82 |
|
83 |
+
### Training (Supervised Fine-tuning)
|
84 |
|
85 |
+
We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internvl1.2/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_res_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.
|
86 |
|
87 |
+
For more details about training, please see [here](https://internvl.readthedocs.io/en/latest/internvl1.2/reproduce.html).
|
88 |
|
89 |
+
The hyperparameters used for fine-tuning are listed in the following table.
|
90 |
|
91 |
| Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
|
92 |
| ---------------------- | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
|