---
license: apache-2.0
---

# syntheticbot/Qwen-VL-7B-ocr

## Introduction

syntheticbot/Qwen-VL-7B-ocr is a fine-tuned model for Optical Character Recognition (OCR) tasks, derived from the base model [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct). This model is engineered for high accuracy in extracting text from images, including documents and scenes containing text.

#### Key Enhancements for OCR:

* **Enhanced Text Recognition Accuracy**: Superior accuracy across diverse text fonts, styles, sizes, and orientations.
* **Robustness to Document Variations**: Specifically trained to handle document complexities such as varied layouts, noise, and distortions.
* **Structured Output Generation**: Enables structured output formats (JSON, CSV) for recognized text and layout in document images such as invoices and tables.
* **Text Localization**: Provides accurate localization of text regions and bounding boxes for text elements within images.
* **Improved Handling of Text in Visuals**: Maintains proficiency in analyzing charts and graphics, with enhanced recognition of embedded text.

#### Model Architecture Updates:

* **Dynamic Resolution and Frame Rate Training for Video Understanding**
* **Streamlined and Efficient Vision Encoder**
This repository provides the instruction-tuned and OCR-optimized 7B Qwen-VL-7B-ocr model. For comprehensive details about the foundational model architecture, please refer to the [Qwen/Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) repository, as well as the [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL) pages for Qwen2.5-VL.
## Requirements
This model relies on Qwen2.5-VL support in 🤗 Transformers, so building `transformers` from source (together with `accelerate`) is recommended:
```bash
pip install git+https://github.com/huggingface/transformers accelerate
```
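To confirm that a suitable build is active, you can check the installed version. As a rough guide (the exact minimum version is an assumption here), Qwen2.5-VL classes only appear in recent releases or dev builds:

```python
import transformers

# If Qwen2_5_VLForConditionalGeneration cannot be imported later on,
# the installed transformers version is most likely too old.
print(transformers.__version__)
```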
## Quickstart
The following examples illustrate the use of syntheticbot/Qwen-VL-7B-ocr with 🤗 Transformers and `qwen_vl_utils` for OCR applications.
Install the toolkit for streamlined visual input processing:
```bash
pip install "qwen-vl-utils[decord]==0.0.8"
```
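The `image` field used in the examples below does not have to be a local path. For the base model, `qwen-vl-utils` also accepts `file://` paths, HTTP(S) URLs, and base64 data URIs, and the same is assumed to hold for this fine-tune; the values below are placeholders:

```python
# Local file path (optionally prefixed with file://), remote URL, or base64 data URI.
local_image = {"type": "image", "image": "file:///path/to/document_image.jpg"}
remote_image = {"type": "image", "image": "https://example.com/sample_invoice.jpg"}
base64_image = {"type": "image", "image": "data:image;base64,/9j/..."}
```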
### Using 🤗 Transformers for OCR
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

# Load the OCR-tuned model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "syntheticbot/Qwen-VL-7B-ocr",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("syntheticbot/Qwen-VL-7B-ocr")

# Build a chat-style request containing the image and the OCR instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/document_image.jpg",
            },
            {"type": "text", "text": "Extract the text from this image."},
        ],
    }
]

# Render the chat template and collect the visual inputs referenced in the messages.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")

# Generate and decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Extracted Text:", output_text[0])
```
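The Text Localization capability listed in the introduction can be exercised with the same pipeline by changing only the prompt. The wording and the expected output shape below are illustrative assumptions, not a fixed API:

```python
# Ask for per-line text together with bounding boxes; the exact output format
# depends on how the prompt is phrased.
prompt = (
    "Extract all text from this image and return a JSON list where each item "
    "has the fields 'text' and 'bbox' ([x1, y1, x2, y2] pixel coordinates)."
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/document_image.jpg"},
            {"type": "text", "text": prompt},
        ],
    }
]
# The remaining steps (apply_chat_template, process_vision_info, generate, decode)
# are identical to the example above.
```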
#### Structured Output Example (JSON Table Extraction)
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch
import json

# Load the OCR-tuned model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "syntheticbot/Qwen-VL-7B-ocr",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("syntheticbot/Qwen-VL-7B-ocr")

# Ask for the table contents as JSON.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "path/to/your/table_image.jpg",
            },
            {"type": "text", "text": "Extract the table from this image and output it as JSON."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")

generated_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Extracted Table (JSON):\n", output_text[0])

# The model may wrap the JSON in prose or markdown; fall back to plain text if parsing fails.
try:
    json_output = json.loads(output_text[0])
    print("\nParsed JSON Output:\n", json.dumps(json_output, indent=2))
except json.JSONDecodeError:
    print("\nCould not parse output as JSON. Output is plain text.")
```
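Since the key enhancements also mention CSV, the parsed JSON can be flattened to CSV once parsing succeeds. This sketch assumes the model returned a list of row objects sharing the same keys, which is not guaranteed; adapt it to the structure your prompt actually elicits:

```python
import csv

# Assumes json_output is a list of dicts, e.g. [{"Item": "Paper", "Qty": "2"}, ...].
if isinstance(json_output, list) and json_output and isinstance(json_output[0], dict):
    with open("table.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(json_output[0].keys()))
        writer.writeheader()
        writer.writerows(json_output)
```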
#### Batch Inference for OCR
```python
# Reuses the model, processor, and imports loaded in the examples above.
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image1.jpg"},
            {"type": "text", "text": "Extract text from this image."},
        ],
    }
]
messages2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image2.jpg"},
            {"type": "text", "text": "Read the text in this document."},
        ],
    }
]
messages = [messages1, messages2]

# Render one prompt per conversation and batch them through the processor.
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda" if torch.cuda.is_available() else "cpu")

generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Extracted Texts (Batch):\n", output_texts)
```
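For large document scans it can help to bound the visual token budget. The base Qwen2.5-VL processor accepts `min_pixels` and `max_pixels` for this purpose; the values below follow the base model's documented example and are assumed to carry over to this fine-tune:

```python
from transformers import AutoProcessor

# Trade accuracy against memory and speed by bounding the number of image pixels
# (and therefore visual tokens) the processor feeds to the model.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "syntheticbot/Qwen-VL-7B-ocr",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```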