wjpoom committed
Commit 477a2cb · verified · 1 Parent(s): 07b9dc4

Update README.md

Files changed (1)
  1. README.md +26 -18
README.md CHANGED
@@ -176,33 +176,39 @@ Our code is based on LLaVA-NeXT, before running, please install the LLaVA-NeXT t
176
  pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
177
  ```
178
  **Error Handling**
179
 
180
- You might encounter an error when loading a checkpoint from the local disk:
181
- ```shell
182
  RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
183
  size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([729, 1152]) from checkpoint, the shape in current model is torch.Size([730, 1152]).
184
  You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
185
  ```
186
- If you encounter this error, you can fix it by following the guidelines below:
187
-
188
- <details>
189
- <summary>Error handling guideline</summary>
190
-
191
- This error occurs in the model-loading logic when the vision tower is loaded from a local path. To fix it, you can prepare the environment in either of the following ways.
192
 
193
  **Option 1: Install from our fork of LLaVA-NeXT:**
194
 
195
- ```shell
196
  pip install git+https://github.com/inst-it/LLaVA-NeXT.git
197
  ```
198
 
199
  **Option 2: Install from LLaVA-NeXT and manually modify its code:**
200
  * step 1: clone source code
201
- ```shell
202
  git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
203
  ```
204
  * step 2: before installing LLaVA-NeXT, you need to modify `line 17` of [llava/model/multimodal_encoder/builder.py](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/llava/model/multimodal_encoder/builder.py#L17).
205
- ```python
206
  # Before modification:
207
  if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
208
 
@@ -210,7 +216,7 @@ if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.
210
  if "clip" in vision_tower or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
211
  ```
212
  * step 3: install LLaVA-NeXT from source:
213
- ```shell
214
  cd LLaVA-NeXT
215
  pip install --upgrade pip # Enable PEP 660 support.
216
  pip install -e ".[train]"
@@ -219,8 +225,10 @@ pip install -e ".[train]"
219
  We recommend the first option because it is simpler.
220
  </details>
221
 
 
 
222
  **Load Model**
223
- ```python
224
  from llava.model.builder import load_pretrained_model
225
  from llava.constants import DEFAULT_IMAGE_TOKEN
226
 
@@ -258,7 +266,7 @@ tokenizer, model, image_processor, max_length = load_pretrained_model(
258
 
259
  Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
260
 
261
- ```python
262
  import torch
263
  import requests
264
  from PIL import Image
@@ -311,7 +319,7 @@ You can refer to the instances that you are interested in using their IDs.
311
  Compared to the previous inference code, the following code has no modifications except for the input image, which is visually prompted with Set-of-Marks.
312
  Refer to [this link](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image.
313
 
314
- ```python
315
  import torch
316
  import requests
317
  from PIL import Image
@@ -366,7 +374,7 @@ For the video, we organize each frame into a list. You can use the format \<t\>
366
 
367
  Our model can perform inference on videos without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
368
 
369
- ```python
370
  import torch
371
  import requests
372
  from PIL import Image
@@ -429,7 +437,7 @@ You can refer to the instances that you are interested in using their IDs.
429
  Compared to the previous inference code, the following code has no modifications except for the input video, which is visually prompted with Set-of-Marks.
430
  Refer to [SAM2](https://github.com/facebookresearch/sam2) and [SoM](https://github.com/microsoft/SoM) to learn how to generate SoMs for a video.
431
 
432
- ```python
433
  import torch
434
  import requests
435
  from PIL import Image
@@ -490,7 +498,7 @@ Feel free to contact us if you have any questions or suggestions
490
  - Email (Lingchen Meng): [email protected]
491
 
492
  ## Citation
493
- ```bibtex
494
  @article{peng2024inst,
495
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
496
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
 
176
  pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
177
  ```
178
  **Error Handling**
179
+ <details>
180
+ <summary>Click to expand</summary>
181
+
182
+ * **Common error case 1:**
183
+ ``` shell
184
+ Exception: data did not match any variant of untagged enum ModelWrapper at line 757272 column 3
185
+ ```
186
+ This is caused by the installed version of `transformers`; try updating it:
187
+ ``` shell
188
+ pip install -U transformers
189
+ ```
190
 
191
+ * **Common error case 2:**
192
+ ``` shell
193
  RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
194
  size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([729, 1152]) from checkpoint, the shape in current model is torch.Size([730, 1152]).
195
  You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
196
  ```
197
+ This error occurs in the model-loading logic when the vision tower is loaded from a local path. To fix it, you can prepare the environment in either of the following ways.
198
 
199
  **Option 1: Install from our fork of LLaVA-NeXT:**
200
 
201
+ ``` shell
202
  pip install git+https://github.com/inst-it/LLaVA-NeXT.git
203
  ```
204
 
205
  **Option 2: Install from LLaVA-NeXT and manually modify its code:**
206
  * step 1: clone source code
207
+ ``` shell
208
  git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
209
  ```
210
  * step 2: before installing LLaVA-NeXT, you need to modify `line 17` of [llava/model/multimodal_encoder/builder.py](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/llava/model/multimodal_encoder/builder.py#L17).
211
+ ``` python
212
  # Before modification:
213
  if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
214
 
 
216
  if "clip" in vision_tower or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
217
  ```
218
  * step 3: install LLaVA-NeXT from source:
219
+ ``` shell
220
  cd LLaVA-NeXT
221
  pip install --upgrade pip # Enable PEP 660 support.
222
  pip install -e ".[train]"
 
225
  We recommend the first option because it is simpler; either way, the quick check sketched after this section can confirm the setup.
226
  </details>
227
 
228
+ </details>
229
+
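After installing with either option, a quick sanity check can confirm the environment before loading the model. This is a minimal sketch (assuming LLaVA-NeXT was installed as described above); it reuses the imports shown in the Load Model step below and prints the `transformers` version relevant to common error case 1.

```python
# Minimal post-install sanity check (a sketch, not specific to any checkpoint):
# verifies that LLaVA-NeXT is importable and shows the installed transformers version.
import transformers
from llava.model.builder import load_pretrained_model  # noqa: F401
from llava.constants import DEFAULT_IMAGE_TOKEN  # noqa: F401

print("LLaVA-NeXT import OK, transformers", transformers.__version__)
```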
230
  **Load Model**
231
+ ``` python
232
  from llava.model.builder import load_pretrained_model
233
  from llava.constants import DEFAULT_IMAGE_TOKEN
234
 
 
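# The hunk above is truncated after the imports. A minimal sketch of the loading call,
# assuming the standard LLaVA-NeXT loader API; the checkpoint path below is a
# placeholder, not a value taken from this README.
from llava.mm_utils import get_model_name_from_path

model_path = "/path/to/your/Inst-IT/checkpoint"  # placeholder path
tokenizer, model, image_processor, max_length = load_pretrained_model(
    model_path, None, get_model_name_from_path(model_path), device_map="auto"
)
model.eval()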
266
 
267
  Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
268
 
269
+ ``` python
270
  import torch
271
  import requests
272
  from PIL import Image
 
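# Sketch of what typically follows these imports in LLaVA-style code: load an image
# and prepend the image token to the question. The URL and question are placeholders
# assumed for illustration; follow the full README example for the exact conversation template.
url = "https://example.com/your_image.jpg"  # placeholder URL
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "Describe this image in detail."
prompt = DEFAULT_IMAGE_TOKEN + "\n" + question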
319
  Compared to the previous inference code, the following code has no modifications except for the input image, which is visually prompted with Set-of-Marks.
320
  Refer to [this link](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image.
321
 
322
+ ``` python
323
  import torch
324
  import requests
325
  from PIL import Image
 
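# Sketch for the Set-of-Marks case: the code is the same as above, only the input
# image already carries SoM ID numbers, so the question can refer to instances by ID.
# The file name and ID phrasing are assumptions, not taken from this README.
som_image = Image.open("image_with_som_marks.jpg").convert("RGB")  # placeholder path
question = "What is the instance marked with ID 1 doing?"
prompt = DEFAULT_IMAGE_TOKEN + "\n" + question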
374
 
375
  Our model can perform inference on videos without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).
376
 
377
+ ``` python
378
  import torch
379
  import requests
380
  from PIL import Image
 
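# Sketch for video input: as noted earlier in the README, the frames of a video are
# organized into a list (shown here as one PIL image per frame, which is an assumption;
# the <t> format described in the README refers to individual frames in the prompt).
frame_paths = ["frame_00.jpg", "frame_01.jpg", "frame_02.jpg"]  # placeholder paths
frames = [Image.open(p).convert("RGB") for p in frame_paths]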
437
  Compared to the previous inference code, the following code has no modifications except for the input video, which is visually prompted with Set-of-Marks.
438
  Refer to [SAM2](https://github.com/facebookresearch/sam2) and [SoM](https://github.com/microsoft/SoM) to learn how to generate SoMs for a video.
439
 
440
+ ``` python
441
  import torch
442
  import requests
443
  from PIL import Image
 
498
  - Email (Lingchen Meng): [email protected]
499
 
500
  ## Citation
501
+ ``` bibtex
502
  @article{peng2024inst,
503
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
504
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},