---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- liuhaotian/LLaVA-Instruct-150K
- FreedomIntelligence/ALLaVA-4V-Chinese
- shareAI/ShareGPT-Chinese-English-90k
language:
- zh
- en
pipeline_tag: visual-question-answering
---
<br>
<br>

# Model Card for IAA: Inner-Adaptor Architecture

**GitHub**: https://github.com/360CVGroup/Inner-Adaptor-Architecture

**[IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities](https://www.arxiv.org/abs/2408.12902)**

<br>
Bin Wang*, Chunyu Xie*, Dawei Leng†, Yuhui Yin (*Equal Contribution, ✝Corresponding Author)
<br>
[![arXiv](https://img.shields.io/badge/arXiv-2408.12902-b31b1b.svg)](https://www.arxiv.org/abs/2408.12902)

We propose an MLLM based on the Inner-Adaptor Architecture (IAA). IAA demonstrates that training with a frozen language model can surpass models with fine-tuned LLMs on both multimodal comprehension and visual grounding tasks. Moreover, after deployment, our approach supports multiple workflows, thereby preserving the NLP proficiency of the language model. With a single download, the model can be fine-tuned to cater to various task specifications. Enjoy the seamless experience of using our IAA model.

<p align="center">
 <img src="https://github.com/360CVGroup/Inner-Adaptor-Architecture/iaa/overview.png" width=80%/>
</p>

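As a rough mental model: lightweight trainable adaptor modules sit inside an otherwise frozen LLM, and only the multimodal and grounding workflows route hidden states through them, so the text-only workflow sees the unmodified frozen model. The block below is a minimal conceptual sketch, not the released implementation; the class name, bottleneck size, and residual gating are illustrative assumptions.

```python
import torch
import torch.nn as nn

class InnerAdaptorBlock(nn.Module):
    """Conceptual sketch: a frozen LLM layer paired with a small trainable inner adaptor."""

    def __init__(self, llm_layer: nn.Module, hidden_size: int, bottleneck: int = 256):
        super().__init__()
        self.llm_layer = llm_layer
        for p in self.llm_layer.parameters():
            p.requires_grad = False  # the language model stays frozen
        # Small trainable adaptor (bottleneck MLP); shape and size are illustrative.
        self.adaptor = nn.Sequential(
            nn.Linear(hidden_size, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, hidden_size),
        )

    def forward(self, hidden_states: torch.Tensor, task_type: str = "Text") -> torch.Tensor:
        hidden_states = self.llm_layer(hidden_states)
        if task_type in ("MM", "G"):
            # Multimodal / grounding workflows pass through the trainable adaptor.
            hidden_states = hidden_states + self.adaptor(hidden_states)
        # The "Text" workflow returns the untouched frozen-LLM output.
        return hidden_states
```

At inference time, this routing is what the `task_type` argument in the Quick Start snippets below selects between.
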
## Model Performance
### Main Results on General Multimodal Benchmarks

<p align="center">
 <img src="https://github.com/360CVGroup/Inner-Adaptor-Architecture/iaa/mmresult.png" width=90%/>
</p>

### Results on Visual Grounding Benchmarks

<p align="center">
 <img src="https://github.com/360CVGroup/Inner-Adaptor-Architecture/iaa/grounding_re.png" width=90%/>
</p>

### Comparison on Text-only Question Answering

<p align="center">
 <img src="https://github.com/360CVGroup/Inner-Adaptor-Architecture/iaa/NLPresult.png" width=90%/>
</p>

## Quick Start 🤗
### First, load the model
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from PIL import Image

checkpoint = "qihoo360/iaa-14-hf"

# Load the model and tokenizer; trust_remote_code is required for the custom IAA code.
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype=torch.float16, device_map='cuda', trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Load the vision tower and move it to the GPU in half precision.
vision_tower = model.get_vision_tower()
vision_tower.load_model()
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
tokenizer.pad_token = tokenizer.eos_token

# Stop generation at the Llama 3 end-of-turn token.
terminators = [
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]
```

### Multimodal Workflow: task_type="MM"
```python
# General visual question answering over an input image.
image = Image.open("readpanda.jpg").convert('RGB')
query = "What animal is in the picture?"

inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

output_ids = model.generate(
    input_ids,
    task_type="MM",
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```

### Grounding Workflow: task_type="G"
```python
# Grounding: ask for the bounding box of a referring expression.
image = Image.open("COCO_train2014_000000014502.jpg").convert('RGB')
query = "Please provide the bounding box coordinate of the region this sentence describes: dude with black shirt says circa."

inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = inputs["image"].to(dtype=torch.float16, device='cuda', non_blocking=True)

output_ids = model.generate(
    input_ids,
    task_type="G",
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```

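To visualize a grounding result, you can draw the predicted box back onto the image. The sketch below is a hypothetical helper, not part of the released API: it assumes the reply contains a bracketed box like `[x1, y1, x2, y2]` and that coordinates may be normalized to the range 0 to 1; verify both assumptions against your own model outputs.

```python
import re
from PIL import ImageDraw

def draw_box(image, reply, outfile="grounded.jpg"):
    # Parse the first "[x1, y1, x2, y2]" group from the model reply (assumed format).
    match = re.search(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]", reply)
    if match is None:
        raise ValueError(f"No bounding box found in reply: {reply!r}")
    x1, y1, x2, y2 = (float(v) for v in match.groups())
    w, h = image.size
    if max(x1, y1, x2, y2) <= 1.0:  # assumed: coordinates normalized to [0, 1]
        x1, y1, x2, y2 = x1 * w, y1 * h, x2 * w, y2 * h
    boxed = image.copy()
    ImageDraw.Draw(boxed).rectangle([x1, y1, x2, y2], outline="red", width=3)
    boxed.save(outfile)
    return (x1, y1, x2, y2)

# Example: draw_box(image, outputs)
```
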
### Text-only Workflow: task_type="Text"

```python
# Plain text question answering; no image is passed.
query = "What is the approximate weight of an adult red panda?"
inputs = model.build_conversation_input_ids(tokenizer, query=query)

input_ids = inputs["input_ids"].to(device='cuda', non_blocking=True)
images = None

output_ids = model.generate(
    input_ids,
    task_type="Text",
    images=images,
    do_sample=False,
    eos_token_id=terminators,
    num_beams=1,
    max_new_tokens=512,
    use_cache=True)

# Decode only the newly generated tokens.
input_token_len = input_ids.shape[1]
outputs = tokenizer.batch_decode(output_ids[:, input_token_len:], skip_special_tokens=True)[0]
outputs = outputs.strip()
print(outputs)
```

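Since all three workflows share the same `generate` call and differ only in `task_type` and whether an image is passed, it can be convenient to wrap them in a single helper. The sketch below simply refactors the snippets above; `run_iaa` is a name introduced here, not part of the released API, and it reuses the `terminators` list from the loading step.

```python
def run_iaa(model, tokenizer, image_processor, query, image=None, task_type="Text", max_new_tokens=512):
    # Build inputs with or without an image, mirroring the workflow snippets above.
    if image is not None:
        inputs = model.build_conversation_input_ids(tokenizer, query=query, image=image, image_processor=image_processor)
        images = inputs["image"].to(dtype=torch.float16, device="cuda", non_blocking=True)
    else:
        inputs = model.build_conversation_input_ids(tokenizer, query=query)
        images = None
    input_ids = inputs["input_ids"].to(device="cuda", non_blocking=True)

    output_ids = model.generate(
        input_ids,
        task_type=task_type,  # "MM", "G", or "Text"
        images=images,
        do_sample=False,
        eos_token_id=terminators,
        num_beams=1,
        max_new_tokens=max_new_tokens,
        use_cache=True)
    # Decode only the newly generated tokens.
    return tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)[0].strip()

# Hypothetical usage:
# print(run_iaa(model, tokenizer, image_processor, "What animal is in the picture?", image=image, task_type="MM"))
# print(run_iaa(model, tokenizer, image_processor, "What is the approximate weight of an adult red panda?"))
```
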
## We Are Hiring
We are seeking academic interns in the multimodal field. If interested, please send your resume to [email protected].

## Citation
If you find IAA useful for your research and applications, please cite it using this BibTeX:

```bibtex
@article{Wang2024IAA,
  title={IAA: Inner-Adaptor Architecture Empowers Frozen Large Language Model with Multimodal Capabilities},
  author={Bin Wang and Chunyu Xie and Dawei Leng and Yuhui Yin},
  journal={arXiv preprint arXiv:2408.12902},
  year={2024},
}
```

## License
This project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.
The content of this project itself is licensed under the Apache License 2.0.

**Where to send questions or comments about the model:**
https://github.com/360CVGroup/Inner-Adaptor-Architecture

## Related Projects
This work wouldn't be possible without the incredible open-source code of these projects. Huge thanks!
- [Meta Llama 3](https://github.com/meta-llama/llama3)
- [LLaVA: Large Language and Vision Assistant](https://github.com/haotian-liu/LLaVA)
- [360VL](https://github.com/360CVGroup/360VL)