Update README.md
README.md
Our code is based on [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT); before running, please install it:
```shell
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
```

**Error Handling**
<details>
<summary>Click to unfold</summary>

* **Common error case 1:**
```shell
Exception: data did not match any variant of untagged enum ModelWrapper at line 757272 column 3
```
This is caused by the installed version of `transformers`; try updating it:
```shell
pip install -U transformers
```
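If the error persists after upgrading, it can help to confirm which `transformers` version is actually active in the current environment, for example:
```python
import transformers
print(transformers.__version__)  # verify the upgrade took effect in this environment
```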

* **Common error case 2:**
```shell
RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([729, 1152]) from checkpoint, the shape in current model is torch.Size([730, 1152]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
```
This is a logical error encountered when loading the vision tower from a local path. To fix it, prepare the environment in either of the following ways.

**Option 1: Install from our fork of LLaVA-NeXT:**

```shell
pip install git+https://github.com/inst-it/LLaVA-NeXT.git
```

**Option 2: Install from LLaVA-NeXT and manually modify its code:**
* step 1: clone the source code
```shell
git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
```
* step 2: before installing LLaVA-NeXT, you need to modify `line 17` of [llava/model/multimodal_encoder/builder.py](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/llava/model/multimodal_encoder/builder.py#L17).
```python
# Before modification:
if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:

# After modification:
if "clip" in vision_tower or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
```
* step 3: install LLaVA-NeXT from source:
```shell
cd LLaVA-NeXT
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
```

We recommend the first option because it is simpler.

</details>

**Load Model**
```python
from llava.model.builder import load_pretrained_model
from llava.constants import DEFAULT_IMAGE_TOKEN
```
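The loading step then unpacks `tokenizer, model, image_processor, max_length` from `load_pretrained_model`. A minimal sketch of that call is shown below; the checkpoint path and keyword arguments are placeholders, so follow the full example in this repository for the exact values:
```python
from llava.mm_utils import get_model_name_from_path

# Placeholder checkpoint location -- substitute the released Inst-IT model path or hub id.
pretrained = "/path/to/Inst-IT/checkpoint"

# load_pretrained_model returns (tokenizer, model, image_processor, max_length).
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,                            # model path or Hugging Face id
    None,                                  # model_base: None when loading a full checkpoint
    get_model_name_from_path(pretrained),  # derive the model name from the path
    device_map="auto",                     # place weights across available devices
)
model.eval()
```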

Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).

```python
import torch
import requests
from PIL import Image
```
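As a hedged sketch of how a plain-image query can be assembled and run with the standard LLaVA-NeXT utilities, reusing the `tokenizer`, `model`, and `image_processor` loaded above (the image URL, conversation template, and question are placeholder assumptions, not the exact code of this repository):
```python
import copy
import torch
import requests
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token

# Placeholder image URL -- replace with your own image.
url = "https://example.com/demo.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Preprocess the image with the processor returned by load_pretrained_model.
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(dtype=torch.float16, device=model.device) for t in image_tensor]

# Build the prompt with an assumed conversation template ("vicuna_v1" here).
conv = copy.deepcopy(conv_templates["vicuna_v1"])
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this image in detail.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=256,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```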

You can refer to the instances that you are interested in using their IDs.
Compared to the previous inference code, the following code has no modifications except for the input image, which is visually prompted with Set-of-Marks.
Refer to [this link](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image.

```python
import torch
import requests
from PIL import Image
```
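Only the input image and the wording of the question change relative to the plain-image example above. A hedged illustration follows; the URL and the way instances are referenced are placeholders, so check the full example for the exact ID format expected by the model:
```python
# Hypothetical SoM-annotated image: each instance carries a visible numeric ID mark.
som_url = "https://example.com/demo_with_som.jpg"
image = Image.open(requests.get(som_url, stream=True).raw).convert("RGB")

# The question can now point to instances via the IDs drawn on the image.
question = DEFAULT_IMAGE_TOKEN + "\nWhat is the object marked with ID 2 doing?"
```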

For the video, we organize each frame into a list. You can use the format \<t\> to refer to a specific frame.
Our model can perform inference on videos without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).

```python
import torch
import requests
from PIL import Image
```
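A small sketch of the frame-list organization mentioned above; the frame directory and sampling are placeholders, and the full example shows how the resulting list is fed to the model:
```python
import glob
from PIL import Image

# Hypothetical directory of frames extracted from the video, ordered by filename.
frame_paths = sorted(glob.glob("frames/*.jpg"))
video_frames = [Image.open(p).convert("RGB") for p in frame_paths]  # the video as a list of frames
```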

You can refer to the instances that you are interested in using their IDs.
Compared to the previous inference code, the following code has no modifications except for the input video, which is visually prompted with Set-of-Marks.
Refer to [SAM2](https://github.com/facebookresearch/sam2) and [SoM](https://github.com/microsoft/SoM) to learn how to generate SoMs for a video.

```python
import torch
import requests
from PIL import Image
```

Feel free to contact us if you have any questions or suggestions:
- Email (Lingchen Meng): [email protected]

## Citation
```bibtex
@article{peng2024inst,
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
  year={2024}
}
```