## Data preparation ### data for training - The images pretraining dataset is from [LLaVA](https://github.com/haotian-liu/LLaVA). - The images tuning dataset is from [LLaVA](https://github.com/haotian-liu/LLaVA). - The videos pretraining dataset is from [Valley](https://github.com/RupertLuo/Valley). - The videos tuning dataset is from [Video-ChatGPT](https://github.com/mbzuai-oryx/Video-ChatGPT). - Download the training annotations. You can download from [Baidu Disk](https://pan.baidu.com/s/1BipI3_f--GRWqaWTGYp-Jg?pwd=wkl0), [Google Disk](https://drive.google.com/file/d/11-1NBXNeiNQE2wPbue1dFph_Na_EHRYG/view?usp=drive_link) or [Peking University Disk](https://disk.pku.edu.cn:443/link/84783AB54553DFA150C1C5E82C16EB29) We also provide the processed data as follows.
DatasetsBaidu Disk
Image pretrainingLink
Image tuningLink
Video pretrainingLink
Video tuningLink
After downloading all of them, organize the data as follows in ```DATA_ROOT```. ```Shell DATA_ROOT ├── llava_image ├── llava_image_tune ├── valley └── videochatgpt_tune ``` ### data for validating - For image, follow LLaVA's instructions. ***You MUST first download [eval.zip](https://drive.google.com/file/d/1atZSBBrAX54yYpxtVVW33zFvcnaHeFPy/view?usp=sharing)**. It contains custom annotations, scripts, and the prediction files with LLaVA v1.5. Extract to `eval`. This also provides a general structure for all datasets.* - For video, videos and annotations can be downloaded from Video-ChatGPT. We also provide the processed data as follows.
DatasetsBaidu DiskGoogle DiskPeking University Disk
Activitynet_Zero_Shot_QALink--
MSRVTT_Zero_Shot_QALinkLink-
MSVD_Zero_Shot_QALinkLinkLink
TGIF_Zero_Shot_QALinkLinkLink
After downloading all of them, organize the data as follows in `eval`. ```Shell eval ├── GPT_Zero_Shot_QA │   ├── Activitynet_Zero_Shot_QA │   ├── MSRVTT_Zero_Shot_QA │   ├── MSVD_Zero_Shot_QA │   └── TGIF_Zero_Shot_QA ├── gqa │   ├── answers │   ├── data │   └── llava_gqa_testdev_balanced.jsonl ├── llava-bench-in-the-wild │   ├── answers │   ├── answers_gpt4.jsonl │   ├── bard_0718.jsonl │   ├── bing_chat_0629.jsonl │   ├── context.jsonl │   ├── images │   ├── questions.jsonl │   ├── README.md │   └── reviews ├── mmbench │   ├── answers │   ├── answers_upload │   ├── mmbench_dev_20230712.tsv │   └── mmbench_dev_en_20231003.tsv ├── MME │   ├── answers │   ├── convert_answer_to_mme.py │   └── llava_mme.jsonl ├── mm-vet │   ├── answers │   ├── bard_set.json │   ├── convert_answers.py │   ├── images │   ├── llava-mm-vet.jsonl │   ├── mm-vet.json │   └── results ├── pope │   ├── answers │   ├── coco │   ├── llava_pope_test.jsonl │   └── val2014 ├── scienceqa │   ├── answers │   ├── images │   ├── llava_test_CQM-A.json │   ├── pid_splits.json │   └── problems.json ├── seed_bench │   ├── answers │   ├── answers_upload │   ├── extract_video_frames.py │   └── llava-seed-bench.jsonl ├── textvqa │   ├── answers │   ├── llava_textvqa_val_v051_ocr.jsonl │   ├── TextVQA_0.5.1_val.json │   └── train_images ├── vizwiz │   ├── answers │   ├── answers_upload │   ├── llava_test.jsonl │   ├── test │   ├── test.json │   ├── train.json │   └── val.json └── vqav2 ├── answers ├── answers_upload ├── llava_vqav2_mscoco_test2015.jsonl ├── llava_vqav2_mscoco_test-dev2015.jsonl └── test2015 ``` ## Training Specify your `DATA_ROOT` according to the data preparation. - Stage 1 pretraining script: [pretrain.sh](scripts/v1_5/pretrain.sh). - Stage 2 tuning script: [finetune.sh](scripts/v1_5/finetune.sh). ## Validating Our image validation code comes from LLaVA and our video validation code comes from Video-ChatGPT, thanks for their contribution! You can refer to the official repository for validation, but we also provide [off-the-shelf](scripts/v1_5/eval) scripts. ### MSRVTT-QA 1. Inference to get the result. ```Shell CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msrvtt.sh ``` 2. GPT-Assistant evaluation. ```Shell bash scripts/v1_5/eval/eval_qa_msrvtt.sh ``` ### MSVD-QA 1. Inference to get the result. ```Shell CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_msvd.sh ``` 2. GPT-Assistant evaluation. ```Shell bash scripts/v1_5/eval/eval_qa_msvd.sh ``` ### TGIF-QA 1. Inference to get the result. ```Shell CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_tgif.sh ``` 2. GPT-Assistant evaluation. ```Shell bash scripts/v1_5/eval/eval_qa_tgif.sh ``` ### ActivityNet-QA 1. Inference to get the result. ```Shell CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/run_qa_activitynet.sh ``` 2. GPT-Assistant evaluation. ```Shell bash scripts/v1_5/eval/eval_qa_activitynet.sh ``` ### VQAv2 1. Download [`test2015`](http://images.cocodataset.org/zips/test2015.zip) and put it under `eval/vqav2`. 2. Multi-GPU inference. ```Shell CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_vqav2.sh ``` 3. Submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/830/my-submission): `eval/vqav2/answers_upload`. ### GQA 1. Download the data following the official instructions [here](https://cs.stanford.edu/people/dorarad/gqa/download.html) and put under `eval/gqa/data`. 2. Multi-GPU inference. ```Shell CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash scripts/v1_5/eval/eval_image_gqa.sh ``` ### VisWiz 1. Download [`test.json`](https://vizwiz.cs.colorado.edu/VizWiz_final/vqa_data/Annotations.zip) and extract [`test.zip`](https://vizwiz.cs.colorado.edu/VizWiz_final/images/test.zip) to `test`. Put them under `eval/vizwiz`. 2. Single-GPU inference. ```Shell CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_vizwiz.sh ``` 3. Submit the results to the [evaluation server](https://eval.ai/web/challenges/challenge-page/1911/my-submission): `eval/vizwiz/answers_upload`. ### ScienceQA 1. Under `eval/scienceqa`, download `images`, `pid_splits.json`, `problems.json` from the `data/scienceqa` folder of the ScienceQA [repo](https://github.com/lupantech/ScienceQA). 2. Single-GPU inference and evaluate. ```Shell CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_sqa.sh ``` ### TextVQA 1. Download [`TextVQA_0.5.1_val.json`](https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json) and [images](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip) and extract to `eval/textvqa`. 2. Single-GPU inference and evaluate. ```Shell CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_textvqa.sh ``` ### POPE 1. Download `coco` from [POPE](https://github.com/AoiDragon/POPE/tree/e3e39262c85a6a83f26cf5094022a782cb0df58d/output/coco) and put under `eval/pope`. 2. Single-GPU inference and evaluate. ```Shell CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_pope.sh ``` ### MMBench 1. Download [`mmbench_dev_20230712.tsv`](https://download.openmmlab.com/mmclassification/datasets/mmbench/mmbench_dev_20230712.tsv) and put under `eval/mmbench`. 2. Single-GPU inference. ```Shell CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmbench.sh ``` 3. Submit the results to the [evaluation server](https://opencompass.org.cn/leaderboard-multimodal): `eval/mmbench/answers_upload/mmbench_dev_20230712`. ### LLaVA-Bench-in-the-Wild 1. Extract contents of [`llava-bench-in-the-wild`](https://huggingface.co/datasets/liuhaotian/llava-bench-in-the-wild) to `eval/llava-bench-in-the-wild`. 2. Single-GPU inference and evaluate. ```Shell CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_llavabench.sh ``` ### MM-Vet 1. Extract [`mm-vet.zip`](https://github.com/yuweihao/MM-Vet/releases/download/v1/mm-vet.zip) to `eval/mmvet`. 2. Single-GPU inference. ```Shell CUDA_VISIBLE_DEVICES=0 bash scripts/v1_5/eval/eval_image_mmvet.sh ```