---
license: apache-2.0
language:
- vi
- en
- zh
base_model:
- Qwen/Qwen2-VL-2B-Instruct
library_name: transformers
tags:
- erax
- multimodal
- erax-vl-2B
- insurance
- ocr
- vietnamese
- bcg
pipeline_tag: visual-question-answering

widget:
- src: images/photo-1-16505057982762025719470.webp
  example_title: Test 1
- src: images/vt-don-thuoc-f0-7417.jpeg
  example_title: Test 2
---
<p align="left">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/63d8d8879dfcfa941d4d7cd9/GsQKdaTyn2FFx_cZvVHk3.png" alt="Logo">
</p>

# EraX-VL-2B-V1.5

## Introduction 🎉

We are excited to introduce **EraX-VL-2B-V1.5**, a robust multimodal model for **OCR (optical character recognition)** and **VQA (visual question answering)** that performs well across several languages 🌍, with a particular focus on **Vietnamese 🇻🇳**. The `EraX-VL-2B` model stands out for its precise recognition across a range of documents 📝, including medical forms 🩺, invoices 🧾, bills of sale 💳, quotes 📄, and medical records 💊. This functionality is expected to be highly beneficial for hospitals 🏥, clinics 💉, insurance companies 🛡️, and similar applications 📋. `EraX-VL-2B` is fine-tuned from [Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)[1], which we found to be of high quality and fluent in Vietnamese. We plan to keep improving the model and releasing new versions for free, and to share performance benchmarks in the near future.

One standout feature of **EraX-VL-2B-V1.5** is its ability to handle multi-turn Q&A with reasonable reasoning capability despite its small size of just over 2 billion parameters.
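
As a rough illustration of multi-turn use, the sketch below keeps a conversation history in the chat-message format used by the Quickstart code and appends the model's previous answer plus a follow-up question. The helper name `add_turn`, the placeholder image path, and the follow-up question are assumptions made for this example, not part of the released code.

```python
from typing import Any

Message = dict[str, Any]


def add_turn(messages: list[Message], assistant_reply: str, next_question: str) -> list[Message]:
    """Return a new history with the model's reply and a follow-up question appended."""
    return messages + [
        {"role": "assistant", "content": [{"type": "text", "text": assistant_reply}]},
        {"role": "user", "content": [{"type": "text", "text": next_question}]},
    ]


# First turn: one image plus an extraction request ("image.jpg" is a placeholder path).
history = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "image.jpg"},
            {"type": "text", "text": "Trích xuất thông tin nội dung từ hình ảnh được cung cấp."},
        ],
    }
]

# Second turn: feed the first answer back and ask a follow-up question.
history = add_turn(history, "<model output from the first turn>", "Ngày cấp là ngày nào?")
```

The extended `history` can then be passed through the same chat-template and generation steps as a single-turn request.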

***NOTA BENE***: EraX-VL-2B-V1.5 is NOT a typical OCR-only tool like Tesseract; it is a multimodal LLM-based model. To use it effectively, you may have to **tweak your prompt carefully** depending on your task.
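
One possible way to adapt the prompt to a task is to name the fields you want back. The sketch below builds such a prompt in Vietnamese; the function name `build_extraction_prompt`, the wording of the prompt, and the field names are illustrative assumptions, not a prompt recommended by the model authors.

```python
def build_extraction_prompt(fields: list[str]) -> str:
    """Build a Vietnamese prompt asking for specific fields as JSON (illustrative wording)."""
    field_list = ", ".join(f'"{name}"' for name in fields)
    return (
        "Trích xuất thông tin từ hình ảnh và trả về JSON "
        f"với các trường sau: {field_list}. "
        "Nếu một trường không có trong ảnh, hãy để giá trị rỗng."
    )


# Example: ask only for the name and date of birth on an ID card.
prompt = build_extraction_prompt(["Họ và tên", "Ngày sinh"])
print(prompt)
```

The resulting string can replace the generic `"text"` entry in the user message; whether a given wording improves results has to be checked on your own documents.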

**EraX-VL-2B-V1.5** is a young and tiny member of our **EraX's LànhGPT** collection of LLMs.

- **Model type:** Multimodal Transformer with over 2B parameters
- **Languages (NLP):** Primarily Vietnamese with multilingual capabilities
- **License:** Apache 2.0
- **Fine-tuned from:** [Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)

## Benchmarks 📊

### 🏆 Leaderboard

<table style="width:75%;">
  <tr>
    <th align="middle" width="300">Models</th>
    <td align="middle" width="150"><b>Open-Source</b></td>
    <td align="middle" width="300"><b>VI-MTVQA</b></td>
  </tr>
  <tr>
    <th align="middle">EraX-VL-7B-V1.5 🥇 </th>
    <td align="middle">(soon)</td>
    <td align="middle">47.2 </td>
  </tr>
  <tr>
    <th align="middle">Qwen2-VL 72B 🥈 </th>
    <td align="middle">✘</td>
    <td align="middle">41.6 </td>
  </tr>
  <tr>
    <th align="middle">ViGPT-VL 🥉 </th>
    <td align="middle">✘</td>
    <td align="middle">39.1 </td>
  </tr>
  <tr>
    <th align="middle"><font color=darkred>EraX-VL-2B-V1.5</font></th>
    <td align="middle"> ✅ </td>
    <td align="middle">38.2 </td>
  </tr>
  <tr>
    <th align="middle"><font color=darkred>EraX-VL-7B-V1 </font></th>
    <td align="middle"> ✅ </td>
    <td align="middle">37.6 </td>
  </tr>
  <tr>
    <th align="middle"><font color=darkred>Vintern-1B-V2</font></th>
    <td align="middle"> ✅ </td>
    <td align="middle">37.4 </td>
  </tr>
  <tr>
    <th align="middle"><font color=darkred>Qwen2-VL 7B </font></th>
    <td align="middle"> ✅ </td>
    <td align="middle">30.0 </td>
  </tr>
  <tr>
    <th align="middle">Claude3 Opus</th>
    <td align="middle">✘</td>
    <td align="middle">29.1 </td>
  </tr>
  <tr>
    <th align="middle">GPT-4o mini </th>
    <td align="middle"> ✘ </td>
    <td align="middle">29.1 </td>
  </tr>
  <tr>
    <th align="middle">GPT-4V</th>
    <td align="middle">✘</td>
    <td align="middle">28.9 </td>
  </tr>
  <tr>
    <th align="middle">Gemini Ultra</th>
    <td align="middle">✘</td>
    <td align="middle">28.6 </td>
  </tr>
  <tr>
    <th align="middle"><font color=darkred>InternVL2 76B</font></th>
    <td align="middle"> ✅ </td>
    <td align="middle">26.9 </td>
  </tr>
  <tr>
    <th align="middle">QwenVL Max</th>
    <td align="middle">✘</td>
    <td align="middle">23.5 </td>
  </tr>
  <tr>
    <th align="middle">Claude3 Sonnet</th>
    <td align="middle">✘</td>
    <td align="middle">20.8 </td>
  </tr>
  <tr>
    <th align="middle">QwenVL Plus</th>
    <td align="middle">✘</td>
    <td align="middle">18.1 </td>
  </tr>
  <tr>
    <th align="middle"><font color=darkred>MiniCPM-V2.5</font></th>
    <td align="middle">✅</td>
    <td align="middle">15.3 </td>
  </tr>
</table>

**The test code for evaluating models in the paper can be found in**: <b><a href="https://github.com/EraX-JS-Company/EraX-MTVQA-Benchmark" target="_blank">EraX-JS-Company/EraX-MTVQA-Benchmark</a></b>

## API trial 🎉

Please contact **[email protected]** for API access inquiries.

## Examples 🧩

### 1. OCR - Optical Character Recognition for Multiple Images
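
Several of the examples in this section show a front and a back image of the same document. A minimal sketch of how more than one image might be sent in a single request is given here, reusing the base64 data-URI encoding from the Quickstart code. The helper names `to_data_uri` and `build_multi_image_messages` and the file names are hypothetical.

```python
import base64
from pathlib import Path


def to_data_uri(path: str) -> str:
    """Encode an image file as a base64 data URI, as done in the Quickstart snippet."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:image;base64,{encoded}"


def build_multi_image_messages(image_paths: list[str], prompt: str) -> list[dict]:
    """Put every image, followed by the text prompt, into one user message."""
    content = [{"type": "image", "image": to_data_uri(p)} for p in image_paths]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Usage (file names are placeholders):
# messages = build_multi_image_messages(
#     ["card_front.jpg", "card_back.jpg"],
#     "Trích xuất thông tin nội dung từ hình ảnh được cung cấp.",
# )
```

The returned list has the same shape as `messages` in the Quickstart code, so the remaining steps stay unchanged.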

**Example 01: Citizen identification card**

<div style="display: flex; flex-direction: row; align-items: center; justify-content: center;">
  <div style="text-align: center; margin: 0 10px;">
    <img src="images/trinhquangduy_front.jpg" width="500" alt="Front View" />
    <p>Front View</p>
  </div>
  <div style="text-align: center; margin: 0 10px;">
    <img src="images/trinhquangduy_back.jpg" width="500" alt="Back View" />
    <p>Back View</p>
  </div>
</div>
<p style="text-align: center; font-size: 12px; color: gray; margin-top: 10px;">
  Source: <a href="https://support.google.com/google-ads/thread/270967947/t%C3%B4i-%C4%91%C3%A3-g%E1%BB%ADi-h%C3%ACnh-%E1%BA%A3nh-c%C4%83n-c%C6%B0%E1%BB%9Bc-c%C3%B4ng-d%C3%A2n-c%E1%BB%A7a-ch%C3%ADnh-t%C3%B4i-%C4%91%E1%BB%83-x%C3%A1c-minh-danh-t%C3%ADnh?hl=vi" target="_blank">Google Support</a>
</p>

```
{
  "Số thẻ":"037094012351"
  "Họ và tên":"TRỊNH QUANG DUY"
  "Ngày sinh":"04/09/1994"
  "Giới tính":"Nam"
  "Quốc tịch":"Việt Nam"
  "Quê quán / Place of origin":"Tân Thành, Kim Sơn, Ninh Bình"
  "Nơi thường trú / Place of residence":"Xóm 6 Tân Thành, Kim Sơn, Ninh Bình"
  "Có giá trị đến":"04/09/2034"
  "Đặc điểm nhân dạng / Personal identification":"seo chấm c:1cm trên đuôi mắt trái"
  "Cục trưởng cục cảnh sát quản lý hành chính về trật tự xã hội":"Nguyễn Quốc Hùng"
  "Ngày cấp":"10/12/2022"
}
```

**Example 01: Identity Card**

<div style="display: flex; flex-direction: row; align-items: center; justify-content: center;">
  <div style="text-align: center; margin: 0 10px;">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/63d8d8879dfcfa941d4d7cd9/4RD71oI0p04n1hAvLnqCR.jpeg" width="500" alt="Front View" />
    <p>Front View</p>
  </div>
  <div style="text-align: center; margin: 0 10px;">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/63d8d8879dfcfa941d4d7cd9/zNOWjwAoS9zEH1wUt6Fh6.jpeg" width="500" alt="Back View" />
    <p>Back View</p>
  </div>
</div>
<p style="text-align: center; font-size: 12px; color: gray; margin-top: 10px;">
  Source: <a href="https://lamgiaynhanh.com/lam-giay-chung-minh-nhan-dan-gia-nhanh/" target="_blank">Internet</a>
</p>

```
{
  "Số":"272737384"
  "Họ tên":"PHẠM NHẬT TRƯỜNG"
  "Sinh ngày":"08-08-2000"
  "Nguyên quán":"Tiền Giang"
  "Nơi ĐKHK thường trú":"393, Tân Xuân, Bảo Bình, Cẩm Mỹ, Đồng Nai"
  "Dân tộc":"Kinh"
  "Tôn giáo":"Không"
  "Đặc điểm nhận dạng":"Nốt ruồi c.3,5cm trên sau cánh mũi phải."
  "Ngày cấp":"30 tháng 01 năm 2018"
  "Giám đốc CA":"T.BÌNH ĐỊNH"
}
```

**Example 02: Driver's License**

<div style="display: flex; flex-direction: row; align-items: center; justify-content: center;">
  <div style="text-align: center; margin: 0 10px;">
    <img src="images/nguyenvandung_front.png" width="500" alt="Front View" />
    <p>Front View</p>
  </div>
  <div style="text-align: center; margin: 0 10px;">
    <img src="images/nguyenvandung_back.png" width="500" alt="Back View" />
    <p>Back View</p>
  </div>
</div>
<p style="text-align: center; font-size: 12px; color: gray; margin-top: 10px;">
  Source: <a href="https://baophapluat.vn/khoi-to-tai-xe-len-mang-mua-giay-phep-lai-xe-gia-de-chay-xe-post481047.html" target="_blank">Báo Pháp luật</a>
</p>

```
{
  "No.":"400116012313"
  "Fullname":"NGUYỄN VĂN DŨNG"
  "Date_of_birth":"08/06/1979"
  "Nationality":"VIỆT NAM"
  "Address":"X. Quỳnh Hầu, H. Quỳnh Lưu, T. Nghệ An
Nghệ An, ngày/date 23 tháng/month 04 năm/year 2022"
  "Hang_Class":"FC"
  "Expires":"23/04/2027"
  "Place_of_issue":"Nghệ An"
  "Date_of_issue":"ngày/date 23 tháng/month 04 năm/year 2022"
  "Signer":"Trần Anh Tuấn"
  "Các loại xe được phép":"Ô tô hạng C kéo rơmoóc, đầu kéo kéo sơmi rơmoóc và xe hạng B1, B2, C, FB2 (Motor vehicle of class C with a trailer, semi-trailer truck and vehicles of classes B1, B2, C, FB2)"
  "Mã số":""
}
```

**Example 03: Vehicle Registration Certificate**

<div style="display: flex; flex-direction: row; align-items: center; justify-content: center;">
  <div style="text-align: center; margin: 0 10px;">
    <img src="images/nguyentonnhuan.jpg" width="500"/>
  </div>
</div>
<p style="text-align: center; font-size: 12px; color: gray; margin-top: 10px;">
  Source: <a href="https://vietnamnet.vn/phan-biet-cac-loai-giay-dang-ky-xe-khi-mua-moto-da-qua-su-dung-541341.html" target="_blank">Báo Vietnamnet</a>
</p>

```
{
  "Tên chủ xe":"NGUYỄN TÔN NHUẬN"
  "Địa chỉ":"KE27 Kp3 P.TTTây Q7"
  "Nhãn hiệu":"HONDA"
  "Số loại":"DYLAN"
  "Màu sơn":"Trắng"
  "Số người được phép chở":"02"
  "Nguồn gốc":"Xe nhập mới"
  "Biển số đăng ký":"59V1-498.89"
  "Đăng ký lần đầu ngày":"08/06/2004"
  "Số máy":"F03E-0057735"
  "Số khung":"5A04F-070410"
  "Dung tích":"152"
  "Quản lý":"TRƯỞNG CA QUẬN"
  "Thượng tá":"Trần Văn Hiểu"
}
```

**Example 04: Birth Certificate**

<div style="display: flex; flex-direction: row; align-items: center; justify-content: center;">
  <div style="text-align: center; margin: 0 10px;">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/63d8d8879dfcfa941d4d7cd9/nVy1v3bwBl5lP9fZIIeux.jpeg" width="500"/>
  </div>
</div>
<p style="text-align: center; font-size: 12px; color: gray; margin-top: 10px;">
  Source: <a href="https://congchung247.com.vn/giay-khai-sinh-ban-chinh-co-the-lam-lai-duoc-khong/" target="_blank">https://congchung247.com.vn</a>
</p>

```
{
  "name": "NGUYỄN NAM PHƯƠNG",
  "gender": "Nữ",
  "date_of_birth": "08/6/2011",
  "place_of_birth": "Bệnh viện Việt - Pháp Hà Nội",
  "nationality": "Việt Nam",
  "father_name": "Nguyễn Ninh Hồng Quang",
  "father_dob": "1980",
  "father_address": "309 nhà E2 Bạch Khoa - Hai Bà Trưng - Hà Nội",
  "mother_name": "Phạm Thùy Trang",
  "mother_dob": "1984",
  "mother_address": "309 nhà E2 Bạch Khoa - Hai Bà Trưng - Hà Nội",
  "registration_place": "UBND phường Bạch Khoa - Quận Hai Bà Trưng - Hà Nội",
  "registration_date": "05/8/2011",
  "registration_ralation": "cha",
  "notes": None,
  "certified_by": "Nguyễn Thị Kim Hoa"
}
```
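
The sample outputs above are not always strict JSON: some omit the commas between pairs, and one uses the Python literal `None`. The following sketch shows one tolerant way to turn such text into a dictionary. The function name `parse_model_output` and the fallback strategy are assumptions made for illustration; the line-by-line fallback only handles one `"key":"value"` pair per line and will skip values that span several lines.

```python
import ast
import json
import re

# Matches a single line of the form "key":"value" with an optional trailing comma.
PAIR = re.compile(r'^\s*"(?P<key>[^"]+)"\s*:\s*"(?P<value>[^"]*)"\s*,?\s*$')


def parse_model_output(text: str) -> dict:
    """Parse model output as JSON, then as a Python literal, then line by line."""
    text = text.strip()
    for loader in (json.loads, ast.literal_eval):
        try:
            parsed = loader(text)
        except (ValueError, SyntaxError):
            continue
        if isinstance(parsed, dict):
            return parsed
    # Fallback: collect "key":"value" pairs one line at a time.
    result = {}
    for line in text.splitlines():
        match = PAIR.match(line)
        if match:
            result[match["key"]] = match["value"]
    return result


loose = '{\n"Số":"272737384"\n"Họ tên":"PHẠM NHẬT TRƯỜNG"\n}'
print(parse_model_output(loose))
```

For production use, asking the model for strict JSON in the prompt and validating the result is likely more robust than relying on this kind of recovery.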

## Quickstart 🎮

Install the necessary packages:

```bash
python -m pip install git+https://github.com/huggingface/transformers accelerate
python -m pip install qwen-vl-utils
# Optional: only needed when using attn_implementation="flash_attention_2"
python -m pip install flash-attn --no-build-isolation
```

Then you can use `EraX-VL-2B-V1.5` like this:

```python
import base64

import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_path = "erax/EraX-VL-2B-V1.5"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    # Replace with "flash_attention_2" if your GPU is Ampere or newer and flash-attn is installed
    attn_implementation="eager",
    device_map="auto",
)

# Bound the number of pixels each image is resized to before encoding
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

image_path = "image.jpg"

# Encode the image as a base64 data URI
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
decoded_image_text = encoded_image.decode("utf-8")
base64_data = f"data:image;base64,{decoded_image_text}"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": base64_data,
            },
            {
                "type": "text",
                "text": "Trích xuất thông tin nội dung từ hình ảnh được cung cấp.",
            },
        ],
    }
]

# Prepare prompt (the chat template returns text here, since tokenize=False)
prompt_text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[prompt_text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the device chosen by device_map="auto"
inputs = inputs.to(model.device)

# Generation configs
generation_config = model.generation_config
generation_config.do_sample = True
generation_config.temperature = 1.0
# With top_k = 1 only the most likely token is kept, so decoding is effectively greedy
generation_config.top_k = 1
generation_config.top_p = 0.9
generation_config.min_p = 0.1
# The original snippet also set best_of = 5; that is not a documented
# transformers generation option, so it is left out here.
generation_config.max_new_tokens = 2048
generation_config.repetition_penalty = 1.06

# Inference
generated_ids = model.generate(**inputs, generation_config=generation_config)
# Drop the prompt tokens so that only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
```
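
The `min_pixels` and `max_pixels` values in the snippet above are written as multiples of `28 * 28`. Assuming, as documented for the Qwen2-VL base model, that each 28×28 pixel block corresponds to roughly one visual token, they can be read as a budget of about 256 to 1280 visual tokens per image. The sketch below makes that arithmetic explicit; the helper names are hypothetical, and the estimate ignores the aspect-ratio rounding performed by the real processor.

```python
PATCH = 28  # assumed side length, in pixels, of the block behind one visual token


def pixel_budget(tokens: int) -> int:
    """Convert a visual-token budget into a pixel count."""
    return tokens * PATCH * PATCH


def estimate_visual_tokens(width: int, height: int, min_pixels: int, max_pixels: int) -> int:
    """Roughly estimate visual tokens after clamping the image area to the pixel bounds."""
    pixels = min(max(width * height, min_pixels), max_pixels)
    return round(pixels / (PATCH * PATCH))


min_pixels = pixel_budget(256)   # 200704, same value as 256 * 28 * 28
max_pixels = pixel_budget(1280)  # 1003520, same value as 1280 * 28 * 28

# A 4000x3000 scan exceeds max_pixels, so it is capped at the upper budget.
print(estimate_visual_tokens(4000, 3000, min_pixels, max_pixels))  # 1280
```

Raising `max_pixels` may help with dense documents at the cost of more memory and slower generation; lowering it has the opposite effect.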

## References 📑

[1] Qwen team. Qwen2-VL. 2024.

[2] Bai, Jinze, et al. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv preprint arXiv:2308.12966 (2023).

[3] Yang, An, et al. "Qwen2 technical report." arXiv preprint arXiv:2407.10671 (2024).

[4] Chen, Zhe, et al. "Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[5] Chen, Zhe, et al. "How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites." arXiv preprint arXiv:2404.16821 (2024).

[6] Tran, Chi, and Huong Le Thanh. "LaVy: Vietnamese Multimodal Large Language Model." arXiv preprint arXiv:2404.07922 (2024).

## Contact 🤝

- For correspondence regarding this work or inquiries about the API trial, please contact Nguyễn Anh Nguyên at [[email protected]]([email protected]).
- Follow us on <b><a href="https://github.com/EraX-JS-Company" target="_blank">EraX GitHub</a></b>