Update README.md
README.md
Our code is based on [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT); before running, please install it:
```shell
pip install git+https://github.com/LLaVA-VL/LLaVA-NeXT.git
```

**Error Handling**
<details>
<summary>Click to unfold</summary>

* **Common error case 1:**
```shell
Exception: data did not match any variant of untagged enum ModelWrapper at line 757272 column 3
```
This is caused by the installed version of `transformers`; try updating it:
```shell
pip install -U transformers
```
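If the error persists after upgrading, it can help to confirm which `transformers` version is actually active in the current environment, for example:
```python
import transformers
print(transformers.__version__)  # verify the upgrade took effect in this environment
```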

* **Common error case 2:**
```shell
RuntimeError: Error(s) in loading state_dict for CLIPVisionModel:
size mismatch for vision_model.embeddings.position_embedding.weight: copying a param with shape torch.Size([729, 1152]) from checkpoint, the shape in current model is torch.Size([730, 1152]).
You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
```
This is a logical error encountered when loading the vision tower from a local path. To fix it, prepare the environment in either of the following ways.

**Option 1: Install from our fork of LLaVA-NeXT:**

```shell
pip install git+https://github.com/inst-it/LLaVA-NeXT.git
```

**Option 2: Install from LLaVA-NeXT and manually modify its code:**
* step 1: clone the source code
```shell
git clone https://github.com/LLaVA-VL/LLaVA-NeXT.git
```
* step 2: before installing LLaVA-NeXT, you need to modify `line 17` of [llava/model/multimodal_encoder/builder.py](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/llava/model/multimodal_encoder/builder.py#L17).
```python
# Before modification:
if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:

# After modification:
if "clip" in vision_tower or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
```
* step 3: install LLaVA-NeXT from source:
```shell
cd LLaVA-NeXT
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
```

We recommend the first option because it is simpler.

</details>

**Load Model**
```python
from llava.model.builder import load_pretrained_model
from llava.constants import DEFAULT_IMAGE_TOKEN
```
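The loading step then unpacks `tokenizer, model, image_processor, max_length` from `load_pretrained_model`. A minimal sketch of that call is shown below; the checkpoint path and keyword arguments are placeholders, so follow the full example in this repository for the exact values:
```python
from llava.mm_utils import get_model_name_from_path

# Placeholder checkpoint location -- substitute the released Inst-IT model path or hub id.
pretrained = "/path/to/Inst-IT/checkpoint"

# load_pretrained_model returns (tokenizer, model, image_processor, max_length).
tokenizer, model, image_processor, max_length = load_pretrained_model(
    pretrained,                            # model path or Hugging Face id
    None,                                  # model_base: None when loading a full checkpoint
    get_model_name_from_path(pretrained),  # derive the model name from the path
    device_map="auto",                     # place weights across available devices
)
model.eval()
```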

Our model can perform inference on images without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).

```python
import torch
import requests
from PIL import Image
```
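As a hedged sketch of how a plain-image query can be assembled and run with the standard LLaVA-NeXT utilities, reusing the `tokenizer`, `model`, and `image_processor` loaded above (the image URL, conversation template, and question are placeholder assumptions, not the exact code of this repository):
```python
import copy
import torch
import requests
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import process_images, tokenizer_image_token

# Placeholder image URL -- replace with your own image.
url = "https://example.com/demo.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# Preprocess the image with the processor returned by load_pretrained_model.
image_tensor = process_images([image], image_processor, model.config)
image_tensor = [t.to(dtype=torch.float16, device=model.device) for t in image_tensor]

# Build the prompt with an assumed conversation template ("vicuna_v1" here).
conv = copy.deepcopy(conv_templates["vicuna_v1"])
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\nDescribe this image in detail.")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.unsqueeze(0).to(model.device)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        image_sizes=[image.size],
        do_sample=False,
        max_new_tokens=256,
    )
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip())
```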

You can refer to the instances that you are interested in using their IDs.
Compared to the previous inference code, the following code has no modifications except for the input image, which is visually prompted with Set-of-Marks.
Refer to [this link](https://github.com/microsoft/SoM) to learn how to generate SoMs for an image.

```python
import torch
import requests
from PIL import Image
```
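Only the input image and the wording of the question change relative to the plain-image example above. A hedged illustration follows; the URL and the way instances are referenced are placeholders, so check the full example for the exact ID format expected by the model:
```python
# Hypothetical SoM-annotated image: each instance carries a visible numeric ID mark.
som_url = "https://example.com/demo_with_som.jpg"
image = Image.open(requests.get(som_url, stream=True).raw).convert("RGB")

# The question can now point to instances via the IDs drawn on the image.
question = DEFAULT_IMAGE_TOKEN + "\nWhat is the object marked with ID 2 doing?"
```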

For the video, we organize each frame into a list. You can use the format \<t\> to refer to a specific frame.
Our model can perform inference on videos without [Set-of-Marks](https://arxiv.org/abs/2310.11441) visual prompts; in this case, it can be used in the same way as its base model, [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT).

```python
import torch
import requests
from PIL import Image
```
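A small sketch of the frame-list organization mentioned above; the frame directory and sampling are placeholders, and the full example shows how the resulting list is fed to the model:
```python
import glob
from PIL import Image

# Hypothetical directory of frames extracted from the video, ordered by filename.
frame_paths = sorted(glob.glob("frames/*.jpg"))
video_frames = [Image.open(p).convert("RGB") for p in frame_paths]  # the video as a list of frames
```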

You can refer to the instances that you are interested in using their IDs.
Compared to the previous inference code, the following code has no modifications except for the input video, which is visually prompted with Set-of-Marks.
Refer to [SAM2](https://github.com/facebookresearch/sam2) and [SoM](https://github.com/microsoft/SoM) to learn how to generate SoMs for a video.

```python
import torch
import requests
from PIL import Image
```

Feel free to contact us if you have any questions or suggestions:
- Email (Lingchen Meng): [email protected]

## Citation
```bibtex
@article{peng2024inst,
  title={Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning},
  author={Peng, Wujian and Meng, Lingchen and Chen, Yitong and Xie, Yiweng and Liu, Yang and Gui, Tao and Xu, Hang and Qiu, Xipeng and Wu, Zuxuan and Jiang, Yu-Gang},
  year={2024}
}
```