---
license: apache-2.0
---

# syntheticbot/Qwen-VL-7B-ocr

## Introduction

syntheticbot/Qwen-VL-7B-ocr is a fine-tuned model for Optical Character Recognition (OCR) tasks, derived from the base model [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). This model is engineered for high accuracy in extracting text from images, including documents and scenes containing text.

#### Key Enhancements for OCR

* **Enhanced Text Recognition Accuracy**: Superior accuracy across diverse text fonts, styles, sizes, and orientations.
* **Robustness to Document Variations**: Specifically trained to handle document complexities such as varied layouts, noise, and distortions.
* **Structured Output Generation**: Produces structured output formats (JSON, CSV) for recognized text and layout in document images such as invoices and tables.
* **Text Localization**: Provides accurate localization of text regions and bounding boxes for text elements within images.
* **Improved Handling of Text in Visuals**: Maintains proficiency in analyzing charts and graphics, with enhanced recognition of embedded text.

#### Model Architecture Updates

* **Dynamic Resolution and Frame Rate Training for Video Understanding**:
<p align="center">
    <img src="https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-VL/qwen2.5vl_arc.jpeg" width="80%"/>
</p>

* **Streamlined and Efficient Vision Encoder**

This repository provides the instruction-tuned and OCR-optimized 7B Qwen-VL-7B-ocr model. For comprehensive details about the foundational model architecture, please refer to the [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) repository, as well as the [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL) pages for Qwen2.5-VL.

## Evaluation

### OCR Benchmarks

| Benchmark | Qwen2-VL-7B | syntheticbot/Qwen-VL-7B-ocr | Improvement | Notes |
| :--- | :---: | :---: | :---: | :--- |
| DocVQA<sub>test</sub> | 94.5 | **96.5** | +2.0 | Document VQA, OCR accuracy relevant |
| InfoVQA<sub>test</sub> | 76.5 | **84.5** | +8.0 | Information seeking VQA, OCR accuracy crucial |
| ChartQA<sub>test</sub> | 83.0 | **89.0** | +6.0 | Chart understanding with text, OCR accuracy important |
| TextVQA<sub>val</sub> | 84.3 | **86.3** | +2.0 | Text-based VQA, direct OCR relevance |
| OCRBench | 845 | **885** | +40 | Direct OCR benchmark |
| CC_OCR | 61.6 | **81.8** | +20.2 | Chinese Character OCR benchmark |
| MMStar (Text Reading Focus) | 60.7 | **65.9** | +5.2 | MMStar with focus on text reading tasks |
| **Average OCR-Related Score** | **77.8** | **84.9** | **+7.1** | Approximate average across OCR-focused benchmarks |

## Requirements

For optimal performance and access to OCR-specific features, it is recommended to install 🤗 Transformers from source:

```bash
pip install git+https://github.com/huggingface/transformers accelerate
```
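
To confirm that the source install exposes the Qwen2.5-VL model class used in the examples below, a quick sanity check such as the following can help (a minimal sketch):

```python
# Minimal sanity check: a sufficiently recent transformers build should expose
# the Qwen2.5-VL model class used throughout this README.
import transformers
from transformers import Qwen2_5_VLForConditionalGeneration  # raises ImportError on older releases

print("transformers version:", transformers.__version__)
```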

## Quickstart

The following examples illustrate the use of syntheticbot/Qwen-VL-7B-ocr with 🤗 Transformers and `qwen_vl_utils` for OCR applications.

Install 🤗 Transformers from source together with `accelerate`:

```bash
pip install git+https://github.com/huggingface/transformers accelerate
```

Install the toolkit for streamlined visual input processing:

```bash
pip install qwen-vl-utils[decord]==0.0.8
```

### Using 🤗 Transformers for OCR

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the OCR fine-tune with automatic dtype selection and device placement.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "syntheticbot/Qwen-VL-7B-ocr",
    torch_dtype="auto",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained("syntheticbot/Qwen-VL-7B-ocr")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/document_image.jpg",
            },
            {"type": "text", "text": "Extract the text from this image."},
        ],
    }
]

# Build the chat prompt and preprocess the visual inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")

# Generate, then strip the prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Extracted Text:", output_text[0])
```
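
The text localization capability mentioned above can be exercised with the same pipeline by changing only the prompt. The following is a minimal sketch that reuses `model`, `processor`, and `process_vision_info` from the example above; the exact JSON schema of the returned boxes follows the base Qwen2.5-VL grounding behavior and may vary.

```python
# Minimal sketch: ask for text regions plus bounding boxes, reusing the objects
# loaded in the previous example. The output schema follows the base model's
# grounding behavior.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/document_image.jpg"},
            {
                "type": "text",
                "text": "Detect every text region in this image and report its "
                        "bounding box coordinates and transcription as JSON.",
            },
        ],
    }
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to("cuda" if torch.cuda.is_available() else "cpu")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```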

<details>
<summary>Example for Structured Output (JSON for Table Extraction)</summary>

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
import json

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "syntheticbot/Qwen-VL-7B-ocr",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("syntheticbot/Qwen-VL-7B-ocr")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/table_image.jpg",
            },
            {"type": "text", "text": "Extract the table from this image and output it as JSON."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Extracted Table (JSON):\n", output_text[0])

# The model may wrap the JSON in prose, so parsing is attempted best-effort.
try:
    json_output = json.loads(output_text[0])
    print("\nParsed JSON Output:\n", json.dumps(json_output, indent=2))
except json.JSONDecodeError:
    print("\nCould not parse output as JSON. Output is plain text.")
```
</details>

<details>
<summary>Batch inference for OCR</summary>

```python
# Reuses `model`, `processor`, `process_vision_info`, and `torch` from the
# example above.
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image1.jpg"},
            {"type": "text", "text": "Extract text from this image."},
        ],
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image2.jpg"},
            {"type": "text", "text": "Read the text in this document."},
        ],
    }
]
messages = [messages1, messages2]

texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")

generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Extracted Texts (Batch):\n", output_texts)
```
</details>

### 🤖 ModelScope

For users in mainland China, ModelScope is recommended; its `snapshot_download` utility handles checkpoint downloads. Use the model name `syntheticbot/Qwen-VL-7B-ocr` in ModelScope-based code, as in the sketch below.
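
A minimal sketch, assuming the checkpoint is mirrored on ModelScope under the same repository name (adjust the identifier if the ModelScope listing differs):

```python
# Minimal sketch: download the checkpoint via ModelScope, then load it locally
# with transformers. Assumes the repo is mirrored as "syntheticbot/Qwen-VL-7B-ocr".
from modelscope import snapshot_download
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

model_dir = snapshot_download("syntheticbot/Qwen-VL-7B-ocr")
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)
```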

### More Usage Tips for OCR

Input images can be supplied as local file paths, URLs, or base64-encoded data.

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/document_image.jpg"},
            {"type": "text", "text": "Extract the text from this image URL."},
        ],
    }
]
```
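
For base64 input, the base model's `data:image;base64,...` URI convention can be used; the following is a minimal sketch that encodes a local file (the path is a placeholder):

```python
import base64

# Encode a local image as a base64 data URI (the path is a placeholder).
with open("path/to/your/document_image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": f"data:image;base64,{image_b64}"},
            {"type": "text", "text": "Extract the text from this image."},
        ],
    }
]
```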

#### Image Resolution for OCR Accuracy

Higher-resolution images typically improve OCR accuracy, especially for small text. Adjust the overall pixel budget with the `min_pixels` and `max_pixels` processor arguments, or set `resized_height` and `resized_width` per image in the message.

```python
min_pixels = 512 * 28 * 28
max_pixels = 2048 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "syntheticbot/Qwen-VL-7B-ocr",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```

Control resizing dimensions directly:

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/document_image.jpg",
                "resized_height": 600,
                "resized_width": 800,
            },
            {"type": "text", "text": "Extract the text."},
        ],
    }
]
```

## Citation

If you use syntheticbot/Qwen-VL-7B-ocr, please cite the base Qwen2.5-VL models:

```bibtex
@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
    title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
    author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
    journal={arXiv preprint arXiv:2409.12191},
    year={2024}
}

@article{Qwen-VL,
    title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
    author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
    journal={arXiv preprint arXiv:2308.12966},
    year={2023}
}
```