czczup committed on
Commit
ce69c4e
·
verified ·
1 Parent(s): 6f26090

Upload folder using huggingface_hub

  1. README.md +24 -11
README.md CHANGED
@@ -1,6 +1,20 @@
1
  ---
2
  license: mit
3
  pipeline_tag: image-text-to-text
4
  ---
5
 
6
  # InternVL-Chat-V1-2
@@ -31,13 +45,13 @@ For better training reproducibility, we follow the minimalist design and data ef
31
 
32
  - **Training Strategy:**
33
 
34
- - Pretraining Stage
35
  - Learnable Component: ViT + MLP
36
- - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR-related datasets.
37
- - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
38
- - Supervised Finetuning Stage
39
  - Learnable Component: ViT + MLP + LLM
40
- - Data: A simplified, fully open-source dataset, containing approximately 1.2 million samples.
41
 
42
  ## Performance
43
 
@@ -54,7 +68,6 @@ For better training reproducibility, we follow the minimalist design and data ef
54
  | LLaVA−NEXT−34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
55
  | InternVL−Chat<br>−V1-2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1687/489 | 83.3 | 88.0 | 72.5 | 75.6 | 60.0 | 64.0 |
56
 
57
- - Note that we use the [official evaluation server](https://huggingface.co/spaces/whyu/MM-Vet_Evaluator) to test the MMVet scores, with `GPT-4-0613` serving as the judge model. Using different versions of GPT-4 as the judge can result in significant score variations.
58
  - In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
59
 
60
  Here, we have conducted only a simple performance comparison. For more detailed performance information and additional evaluation metrics, please refer to our performance summary table.
@@ -65,15 +78,15 @@ Here, we have conducted only a simple performance comparison. For more detailed
65
 
66
  Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2, utilizing approximately 1.2M of visual instruction tuning samples in total, all of which are fully open-source. In a macro sense, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.
67
 
68
- For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
69
 
70
- ### Training (Supervised Finetuning)
71
 
72
- We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat/shell/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.
73
 
74
- For more details about training, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#start-training).
75
 
76
- The hyperparameters used for finetuning are listed in the following table.
77
 
78
  | Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
79
  | ---------------------- | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |
 
1
  ---
2
  license: mit
3
  pipeline_tag: image-text-to-text
4
+ library_name: transformers
5
+ base_model:
6
+ - OpenGVLab/InternViT-6B-448px-V1-2
7
+ - NousResearch/Nous-Hermes-2-Yi-34B
8
+ base_model_relation: merge
9
+ language:
10
+ - multilingual
11
+ tags:
12
+ - internvl
13
+ - vision
14
+ - ocr
15
+ - multi-image
16
+ - video
17
+ - custom_code
18
  ---
19
 
20
  # InternVL-Chat-V1-2
 
45
 
46
  - **Training Strategy:**
47
 
48
+ - Pre-training Stage
49
  - Learnable Component: ViT + MLP
50
+ - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR data.
51
+ - Note: In this stage, we first load the pre-trained weights of [InternViT-6B-448px-V1-0](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-0) and connect it to Nous-Hermes-2-Yi-34B. After pre-training, the extracted ViT is published as [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
52
+ - Supervised Fine-tuning Stage
53
  - Learnable Component: ViT + MLP + LLM
54
+ - Data: A simplified, fully open-source dataset, containing approximately 1.2 million samples. You can download it from [here](https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data).
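
The two figures in the pre-training bullets above (1024 tokens reduced to 256, and 8192x4800=39.3M samples) can be checked with a small sketch. This is not the model's actual implementation: it assumes the 1024 visual tokens form a 32 x 32 grid (448 px divided by an assumed patch size of 14), and it realizes the "pixel shuffle" as a space-to-depth merge of each 2 x 2 block of neighbouring tokens. The function name `pixel_unshuffle`, the channel width of 8, and the ordering of merged channels are illustrative assumptions.

```python
def pixel_unshuffle(grid, factor=2):
    """Fold each factor x factor block of tokens into one wider token.

    grid: H x W nested list of token vectors (each a list of C floats).
    Returns an (H / factor) x (W / factor) grid whose token vectors
    have C * factor**2 entries, so no values are discarded.
    """
    height, width = len(grid), len(grid[0])
    if height % factor or width % factor:
        raise ValueError("grid size must be divisible by factor")
    merged = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            token = []
            # Concatenate the block's tokens in row-major order (an assumed ordering).
            for dy in range(factor):
                for dx in range(factor):
                    token.extend(grid[top + dy][left + dx])
            row.append(token)
        merged.append(row)
    return merged


# Hypothetical sizes: a 32 x 32 token grid with 8 channels per token.
side, channels = 32, 8
grid = [[[float(y * side + x)] * channels for x in range(side)] for y in range(side)]
merged = pixel_unshuffle(grid, factor=2)

tokens_before = side * side                  # 1024 tokens entering the merge
tokens_after = len(merged) * len(merged[0])  # 256 tokens leaving it
merged_channels = len(merged[0][0])          # each token is 4x wider

# Sample count quoted in the Data bullet: 8192 x 4800.
pretrain_samples = 8192 * 4800
print(tokens_before, tokens_after, merged_channels, pretrain_samples)
```

The merge trades sequence length for channel width: the LLM sees a quarter as many visual tokens, each of which carries the features of four neighbouring patches.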
55
 
56
  ## Performance
57
 
 
68
  | LLaVA−NEXT−34B | 672x672 | 51.1 | 44.7 | 46.5 | 79.3 | 79.0 | - | 1631/397 | 81.8 | 87.7 | 69.5 | 75.9 | 63.8 | 67.1 |
69
  | InternVL−Chat<br>−V1-2 | 448x448 | 51.6 | 46.2 | 47.7 | 82.2 | 81.2 | 56.7 | 1687/489 | 83.3 | 88.0 | 72.5 | 75.6 | 60.0 | 64.0 |
70
 
 
71
  - In most benchmarks, InternVL-Chat-V1-2 achieves better performance than LLaVA-NeXT-34B.
72
 
73
  Here, we have conducted only a simple performance comparison. For more detailed performance information and additional evaluation metrics, please refer to our performance summary table.
 
78
 
79
  Inspired by LLaVA-NeXT, we adopted a data-efficient SFT strategy to train InternVL-Chat-V1-2, utilizing approximately 1.2M of visual instruction tuning samples in total, all of which are fully open-source. In a macro sense, we build upon [ShareGPT-4V](https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md#prepare-images) and additionally integrate [LLaVA-ZH](https://huggingface.co/datasets/openbmb/llava_zh), [DVQA](https://github.com/kushalkafle/DVQA_dataset), [ChartQA](https://github.com/vis-nlp/ChartQA), [AI2D](https://allenai.org/data/diagrams), [DocVQA](https://www.docvqa.org/datasets), [GeoQA+](https://github.com/SCNU203/GeoQA-Plus), and [SynthDoG-EN](https://huggingface.co/datasets/naver-clova-ix/synthdog-en). Most of the data remains consistent with LLaVA-NeXT.
80
 
81
+ Now, you can download these datasets directly from [HuggingFace](https://huggingface.co/datasets/OpenGVLab/InternVL-Chat-V1-2-SFT-Data). For more details about data preparation, please see [here](https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets).
82
 
83
+ ### Training (Supervised Fine-tuning)
84
 
85
+ We provide [slurm scripts](https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/shell/internvl1.2/hermes2_yi34b/internvl_chat_v1_2_hermes2_yi34b_448_res_finetune.sh) for multi-node multi-GPU training. You can use either 32 or 64 GPUs to train this model. If you use 64 GPUs, training will take approximately 18 hours.
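
As a rough guide to the compute implied by the figures above, the sketch below converts "64 GPUs for approximately 18 hours" into GPU-hours and estimates the wall-clock time on 32 GPUs. The 32-GPU estimate assumes ideal linear scaling, which the source does not state; real runs are usually somewhat slower. The helper names are hypothetical.

```python
def gpu_hours(num_gpus, wall_clock_hours):
    """Total compute budget in GPU-hours."""
    return num_gpus * wall_clock_hours


def estimated_hours(total_gpu_hours, num_gpus):
    """Wall-clock estimate, assuming perfect linear scaling (an assumption)."""
    return total_gpu_hours / num_gpus


budget = gpu_hours(64, 18)                 # figures quoted in the paragraph above
hours_on_32 = estimated_hours(budget, 32)  # idealized estimate, not a measurement
print(budget, hours_on_32)
```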
86
 
87
+ For more details about training, please see [here](https://internvl.readthedocs.io/en/latest/internvl1.2/reproduce.html).
88
 
89
+ The hyperparameters used for fine-tuning are listed in the following table.
90
 
91
  | Hyperparameter | Trainable Param | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |
92
  | ---------------------- | ---------------- | ----------------- | ------------- | ------ | ---------- | ------------ |