Update README.md
README.md
CHANGED
@@ -84,7 +84,7 @@ The hyperparameters used for finetuning are listed in the following table.
 - **Training Strategy:**
   - Pretraining Stage
     - Learnable Component: MLP
-    - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR
+    - Data: Trained on 8192x4800=39.3M samples, including COYO, LAION, CC12M, CC3M, SBU, Wukong, GRIT, Objects365, OpenImages, and OCR-related datasets.
     - Note: In this stage, we load the pretrained weights of [InternViT-6B-448px-V1-2](https://huggingface.co/OpenGVLab/InternViT-6B-448px-V1-2). Moreover, in order to reduce the number of visual tokens, we use a pixel shuffle to reduce 1024 tokens to 256 tokens.
   - Supervised Finetuning Stage
     - Learnable Component: ViT + MLP + LLM
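The Note line in the hunk above mentions using a pixel shuffle to reduce 1024 visual tokens to 256. A minimal sketch of that reduction is given below, assuming the ViT emits a square 32x32 grid of patch tokens and that a 2x2 space-to-depth (pixel unshuffle) folds each group of four neighboring tokens into one; the function name and tensor dimensions are illustrative, not the repository's actual API.

```python
import torch

def pixel_shuffle_tokens(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Fold each scale x scale block of visual tokens into a single token.

    x: (batch, h*w, c) patch tokens laid out on an h x w grid.
    Returns: (batch, (h//scale)*(w//scale), c*scale*scale).
    """
    b, n, c = x.shape
    h = w = int(n ** 0.5)                        # assume a square token grid
    x = x.view(b, h, w, c)
    # split the grid into scale x scale neighborhoods
    x = x.view(b, h // scale, scale, w // scale, scale, c)
    # move the neighborhood dims next to the channel dim, then merge them into it
    x = x.permute(0, 1, 3, 2, 4, 5).contiguous()
    return x.view(b, (h // scale) * (w // scale), c * scale * scale)

tokens = torch.randn(1, 1024, 3200)              # 32x32 grid from the ViT (hypothetical width)
reduced = pixel_shuffle_tokens(tokens, scale=2)
print(reduced.shape)                             # torch.Size([1, 256, 12800])
```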
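The Learnable Component lines state which modules receive gradients in each stage: only the MLP projector during pretraining, and ViT + MLP + LLM during supervised finetuning. A hedged sketch of how such stage-wise freezing might be configured is shown below; the attribute names `vision_model`, `mlp1`, and `language_model` are assumptions for illustration, not the project's confirmed training code.

```python
import torch.nn as nn

def set_trainable(model: nn.Module, stage: str) -> None:
    """Freeze/unfreeze modules per training stage (module names assumed for illustration).

    "pretrain": only the MLP projector is trainable.
    "finetune": ViT + MLP + LLM are all trainable.
    """
    full = stage == "finetune"
    for p in model.vision_model.parameters():    # ViT encoder
        p.requires_grad = full
    for p in model.language_model.parameters():  # LLM
        p.requires_grad = full
    for p in model.mlp1.parameters():            # MLP projector, trained in both stages
        p.requires_grad = True
```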