Ubuntu committed on
Commit
ce47433
·
1 Parent(s): 19e72b3
Files changed (4)
  1. .mdl +0 -0
  2. .msc +0 -0
  3. .mv +0 -1
  4. README.md +172 -0
.mdl DELETED
Binary file (62 Bytes)
 
.msc DELETED
Binary file (1.77 kB)
 
.mv DELETED
@@ -1 +0,0 @@
- Revision:master,CreatedAt:1715951796
 
 
README.md ADDED
@@ -0,0 +1,172 @@
+ ---
+ license: other
+ license_name: cogvlm2
+ license_link: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B/blob/main/LICENSE
+
+ language:
+ - en
+ - chinese
+
+ pipeline_tag: text-generation
+ tags:
+ - chat
+ - cogvlm2
+
+ inference: false
+ ---
+
+ # CogVLM2
+
+ <div align="center">
+ <img src=https://github.com/THUDM/CogVLM2/blob/main/resources/logo.svg width="40%"/>
+ </div>
+ <p align="center">
+ 👋 Join us on <a href="https://github.com/THUDM/CogVLM2/blob/main/resources/WECHAT.md" target="_blank">WeChat</a>
+ </p>
+ <p align="center">
+ 📍Experience the larger-scale CogVLM model on the <a href="https://open.bigmodel.cn/dev/api#super-humanoid">ZhipuAI Open Platform</a>.
+ </p>
+
+
+ ## Model introduction
+
+ We launch a new generation of the **CogVLM2** series and open-source two models built on [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct). Compared with the previous generation of open-source CogVLM models, the CogVLM2 series offers the following improvements:
+
+ 1. Significant improvements on many benchmarks, such as `TextVQA` and `DocVQA`.
+ 2. Support for **8K** context length.
+ 3. Support for image resolutions up to **1344 * 1344**.
+ 4. An open-source model version that supports both **Chinese and English**.
+
+ The table below lists the details of the CogVLM2 family of open-source models:
+
+ | Model name | cogvlm2-llama3-chat-19B | cogvlm2-llama3-chinese-chat-19B |
+ |------------------|-------------------------------------|-------------------------------------|
+ | Base Model | Meta-Llama-3-8B-Instruct | Meta-Llama-3-8B-Instruct |
+ | Language | English | Chinese, English |
+ | Model size | 19B | 19B |
+ | Task | Image understanding, dialogue model | Image understanding, dialogue model |
+ | Text length | 8K | 8K |
+ | Image resolution | 1344 * 1344 | 1344 * 1344 |
+
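The 1344 * 1344 limit listed above can be turned into a small pre-check on input images. The following is only a sketch: `fit_within` and `MAX_SIDE` are hypothetical names that do not come from the CogVLM2 code, and this README does not say whether the model's own preprocessing already resizes inputs, so the step may be unnecessary.

```python
from typing import Tuple

MAX_SIDE = 1344  # maximum side length, taken from the table above


def fit_within(width: int, height: int, max_side: int = MAX_SIDE) -> Tuple[int, int]:
    """Scale (width, height) down, keeping the aspect ratio, so both sides are <= max_side.

    Images that already fit are returned unchanged (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = min(1.0, max_side / width, max_side / height)
    # Clamp after rounding so a side never drops below 1 or exceeds max_side.
    new_width = max(1, min(max_side, round(width * scale)))
    new_height = max(1, min(max_side, round(height * scale)))
    return new_width, new_height


if __name__ == "__main__":
    print(fit_within(4032, 3024))  # a large 4:3 photo is scaled down
    print(fit_within(800, 600))    # a small image is left as is
```

The returned size could be passed to an image library's resize call (for example Pillow's `Image.resize`) before handing the image to the model.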
+ ## Benchmark
+
+ Our open-source models achieve good results on many benchmarks compared with the previous generation of open-source CogVLM models, and their performance is competitive with some closed-source models, as shown in the table below:
+
+ | Model | Open Source | LLM Size | TextVQA | DocVQA | ChartQA | OCRbench | MMMU | MMVet | MMBench |
+ |--------------------------------|-------------|----------|----------|----------|----------|----------|----------|----------|----------|
+ | LLaVA-1.5 | ✅ | 13B | 61.3 | - | - | 337 | 37.0 | 35.4 | 67.7 |
+ | Mini-Gemini | ✅ | 34B | 74.1 | - | - | - | 48.0 | 59.3 | 80.6 |
+ | LLaVA-NeXT-LLaMA3 | ✅ | 8B | - | 78.2 | 69.5 | - | 41.7 | - | 72.1 |
+ | LLaVA-NeXT-110B | ✅ | 110B | - | 85.7 | 79.7 | - | 49.1 | - | 80.5 |
+ | InternVL-1.5 | ✅ | 20B | 80.6 | 90.9 | **83.8** | 720 | 46.8 | 55.4 | **82.3** |
+ | QwenVL-Plus | ❌ | - | 78.9 | 91.4 | 78.1 | 726 | 51.4 | 55.7 | 67.0 |
+ | Claude3-Opus | ❌ | - | - | 89.3 | 80.8 | 694 | **59.4** | 51.7 | 63.3 |
+ | Gemini Pro 1.5 | ❌ | - | 73.5 | 86.5 | 81.3 | - | 58.5 | - | - |
+ | GPT-4V | ❌ | - | 78.0 | 88.4 | 78.5 | 656 | 56.8 | **67.7** | 75.0 |
+ | CogVLM1.1 (Ours) | ✅ | 7B | 69.7 | - | 68.3 | 590 | 37.3 | 52.0 | 65.8 |
+ | CogVLM2-LLaMA3 (Ours) | ✅ | 8B | 84.2 | **92.3** | 81.0 | 756 | 44.3 | 60.4 | 80.5 |
+ | CogVLM2-LLaMA3-Chinese (Ours) | ✅ | 8B | **85.0** | 88.4 | 74.7 | **780** | 42.8 | 60.5 | 78.9 |
+
+ All results were obtained without using any external OCR tools ("pixel only").
+ ## Quick Start
+
+ Here is a simple example of how to chat with the CogVLM2 model.
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ MODEL_PATH = "THUDM/cogvlm2-llama3-chinese-chat-19B"
+ DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
+ TORCH_TYPE = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8 else torch.float16
+
+ tokenizer = AutoTokenizer.from_pretrained(
+     MODEL_PATH,
+     trust_remote_code=True
+ )
+ model = AutoModelForCausalLM.from_pretrained(
+     MODEL_PATH,
+     torch_dtype=TORCH_TYPE,
+     trust_remote_code=True,
+ ).to(DEVICE).eval()
+
+ text_only_template = "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: {} ASSISTANT:"
+
+ while True:
+     image_path = input("image path >>>>> ")
+     if image_path == '':
+         print('You did not enter image path, the following will be a plain text conversation.')
+         image = None
+         text_only_first_query = True
+     else:
+         image = Image.open(image_path).convert('RGB')
+
+     history = []
+
+     while True:
+         query = input("Human:")
+         if query == "clear":
+             break
+
+         if image is None:
+             if text_only_first_query:
+                 query = text_only_template.format(query)
+                 text_only_first_query = False
+             else:
+                 old_prompt = ''
+                 for _, (old_query, response) in enumerate(history):
+                     old_prompt += old_query + " " + response + "\n"
+                 query = old_prompt + "USER: {} ASSISTANT:".format(query)
+         if image is None:
+             input_by_model = model.build_conversation_input_ids(
+                 tokenizer,
+                 query=query,
+                 history=history,
+                 template_version='chat'
+             )
+         else:
+             input_by_model = model.build_conversation_input_ids(
+                 tokenizer,
+                 query=query,
+                 history=history,
+                 images=[image],
+                 template_version='chat'
+             )
+         inputs = {
+             'input_ids': input_by_model['input_ids'].unsqueeze(0).to(DEVICE),
+             'token_type_ids': input_by_model['token_type_ids'].unsqueeze(0).to(DEVICE),
+             'attention_mask': input_by_model['attention_mask'].unsqueeze(0).to(DEVICE),
+             'images': [[input_by_model['images'][0].to(DEVICE).to(TORCH_TYPE)]] if image is not None else None,
+         }
+         gen_kwargs = {
+             "max_new_tokens": 2048,
+             "pad_token_id": 128002,
+         }
+         with torch.no_grad():
+             outputs = model.generate(**inputs, **gen_kwargs)
+             outputs = outputs[:, inputs['input_ids'].shape[1]:]
+             response = tokenizer.decode(outputs[0])
+             response = response.split("<|end_of_text|>")[0]
+             print("\nCogVLM2:", response)
+         history.append((query, response))
+ ```
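The text-only branch of the script above can be read in isolation as a small prompt-building step. The sketch below restates that branch as a standalone function so it can be tried without loading the model. `build_text_only_query` is a hypothetical name introduced here for illustration, and using an empty `history` in place of the script's `text_only_first_query` flag is an assumption that holds because the script resets `history` at the start of each conversation.

```python
from typing import List, Tuple

# Same template string as `text_only_template` in the script above.
TEXT_ONLY_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: {} ASSISTANT:"
)


def build_text_only_query(user_text: str, history: List[Tuple[str, str]]) -> str:
    """Build the query string for a text-only turn (hypothetical helper).

    First turn: wrap the user text in the template.
    Later turns: replay every stored (query, response) pair, then append the new turn.
    """
    if not history:
        return TEXT_ONLY_TEMPLATE.format(user_text)
    old_prompt = "".join(
        old_query + " " + response + "\n" for old_query, response in history
    )
    return old_prompt + "USER: {} ASSISTANT:".format(user_text)


if __name__ == "__main__":
    history: List[Tuple[str, str]] = []
    first = build_text_only_query("What is OCR?", history)
    history.append((first, "OCR stands for optical character recognition."))
    second = build_text_only_query("Give one use case.", history)
    print(second)
```

Because the script stores the already expanded query in `history`, earlier turns appear more than once in the prompt from the third turn onward. The sketch reproduces that behavior rather than changing it.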
+
+
+ ## License
+
+ This model is released under the CogVLM2 [LICENSE](LICENSE). For models built with Meta Llama 3, please also adhere to the [LLAMA3_LICENSE](LLAMA3_LICENSE).
+
+ ## Citation
+
+ If you find our work helpful, please consider citing the following paper:
+
+ ```
+ @misc{wang2023cogvlm,
+   title={CogVLM: Visual Expert for Pretrained Language Models},
+   author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},
+   year={2023},
+   eprint={2311.03079},
+   archivePrefix={arXiv},
+   primaryClass={cs.CV}
+ }
+ ```