---
license: apache-2.0
language:
- vi
- en
- zh
base_model:
- Qwen/Qwen2-VL-2B-Instruct
library_name: transformers
tags:
- erax
- multimodal
- erax-vl-2B
- insurance
- ocr
- vietnamese
- bcg
- image-to-text
- image-text-to-text
pipeline_tag: visual-question-answering

widget:
- src: images/photo-1-16505057982762025719470.webp
  example_title: Test 1
- src: images/vt-don-thuoc-f0-7417.jpeg
  example_title: Test 2
---
	<p align="left">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/63d8d8879dfcfa941d4d7cd9/GsQKdaTyn2FFx_cZvVHk3.png" alt="Logo">
	</p>


	# EraX-VL-2B-V1.5
	## Introduction 🎉

Hot on the heels of the popular <a href="https://huggingface.co/erax-ai/EraX-VL-7B-V1.0" target="_blank">EraX-VL-7B-V1.0 model</a>, we proudly present EraX-VL-2B-V1.5. This enhanced multimodal model offers robust OCR and VQA capabilities across diverse languages 🌍, with a significant advantage in processing Vietnamese 🇻🇳. The `EraX-VL-2B` model stands out for its precise recognition across a range of documents 📝, including medical forms 🩺, invoices 🧾, bills of sale 💳, quotes 📄, and medical records 💊. This is expected to be highly beneficial for hospitals 🏥, clinics 💉, insurance companies 🛡️, and similar applications 📋. `EraX-VL-2B` is fine-tuned from [Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)[1], which we found to be of high quality and fluent in Vietnamese. We plan to keep improving the model and releasing new versions for free, and to share performance benchmarks in the near future.

One standout feature of EraX-VL-2B-V1.5 is its ability to handle multi-turn Q&A with reasonable reasoning capability despite its small size of just over 2 billion parameters.
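
A multi-turn exchange can be expressed by appending turns to the same `messages` list that is passed to the chat template. The sketch below is only an illustration, assuming the Qwen2-VL message format used in the Quickstart; the helpers `user_turn` and `assistant_turn`, the file name, and the follow-up question are hypothetical.

```python
from typing import Optional


def user_turn(text: str, image: Optional[str] = None) -> dict:
    """Build one user message; `image` may be a path, URL, or base64 data URI."""
    content = []
    if image is not None:
        content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": text})
    return {"role": "user", "content": content}


def assistant_turn(text: str) -> dict:
    """Wrap a previous model reply so it can be fed back as conversation history."""
    return {"role": "assistant", "content": [{"type": "text", "text": text}]}


# Turn 1: "Extract the content information from the provided image."
messages = [
    user_turn("Trích xuất thông tin nội dung từ hình ảnh được cung cấp.", image="invoice.jpg")
]

# After generating a reply, append it together with the next question.
messages.append(assistant_turn("<model reply from turn 1>"))
# Turn 2: "What is the total amount on the invoice?"
messages.append(user_turn("Tổng số tiền trên hóa đơn là bao nhiêu?"))
```

Each new turn reuses the whole list, so the model sees the image and the earlier answers when it generates the next reply.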

	*NOTA BENE*:
- EraX-VL-2B-V1.5 is NOT a typical OCR-only tool like Tesseract; it is a multimodal LLM-based model. To use it effectively, you may have to tune your prompt carefully for each task.
- This model has NOT been fine-tuned on medical (X-ray) or car accident datasets (yet). Stay tuned for an updated version coming sometime in 2025.
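
Because the model follows instructions rather than running a fixed OCR pipeline, it usually helps to name the fields and the output format you expect. The following is a sketch only: `build_extraction_prompt` and its wording are hypothetical and are not prompts published by the authors.

```python
from typing import Sequence


def build_extraction_prompt(fields: Sequence[str]) -> str:
    """Compose an instruction that asks for the given fields as a JSON object.

    The wording is illustrative; adjust it to the document type at hand.
    """
    if not fields:
        raise ValueError("at least one field name is required")
    field_list = ", ".join(f'"{name}"' for name in fields)
    return (
        "Extract the following fields from the provided image and "
        f"return them as a JSON object with exactly these keys: {field_list}. "
        "Use an empty string for any field that is not visible."
    )


# Field names: full name, date of birth, card number.
prompt = build_extraction_prompt(["Họ và tên", "Ngày sinh", "Số thẻ"])
print(prompt)
```

The resulting string can be used as the `text` entry of a user message. Whether an English or a Vietnamese instruction works better for a given document is something to check empirically.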

EraX-VL-2B-V1.5 is a young and tiny member of EraX's LànhGPT collection of LLMs.

- Developed by:
  - Nguyễn Anh Nguyên ([email protected])
  - Nguyễn Hồ Nam (BCG)
  - Phạm Huỳnh Nhật ([email protected])
  - Phạm Đình Thục ([email protected])
	- Funded by: [Bamboo Capital Group](https://bamboocap.com.vn) and EraX
	- Model type: Multimodal Transformer with over 2B parameters
	- Languages (NLP): Primarily Vietnamese with multilingual capabilities
	- License: Apache 2.0
	- Fine-tuned from: [Qwen/Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct)
- Prompt examples: <b><a href="https://github.com/EraX-JS-Company/erax-vl-7b-v1/blob/main/prompts/Vietnam_popular_prompts.txt" target="_blank">Some popular prompt examples.</a></b>

	## Benchmarks 📊

### 🏆 Leaderboard

	<table style="width:75%;">
	<tr>
	<th align="middle" width="300">Models</th>
	<td align="middle" width="150"><b>Open-Source</b></td>
	<td align="middle" width="300"><b>VI-MTVQA</b></td>
	</tr>
	<tr>
	<th align="middle"><font color=darkred>EraX-VL-7B-V1.5 🥇 </font></th>
	<td align="middle"> ✅ </td>
	<td align="middle">47.2 </td>
	</tr>
	<tr>
	<th align="middle">Qwen2-VL 72B 🥈 </th>
	<td align="middle">✘</td>
	<td align="middle">41.6 </td>
	</tr>
	<tr>
	<th align="middle">ViGPT-VL 🥉 </th>
	<td align="middle">✘</td>
	<td align="middle">39.1 </td>
	</tr>
	<tr>
	<th align="middle"><font color=darkred>EraX-VL-2B-V1.5</font></th>
	<td align="middle"> ✅ </td>
	<td align="middle">38.2 </td>
	</tr>
	<tr>
	<th align="middle"><font color=darkred>EraX-VL-7B-V1 </font></th>
	<td align="middle"> ✅ </td>
	<td align="middle">37.6 </td>
	</tr>
	<tr>
	<th align="middle"><font color=darkred>Vintern-1B-V2</font></th>
	<td align="middle"> ✅ </td>
	<td align="middle">37.4 </td>
	</tr>
	<tr>
	<th align="middle"><font color=darkred>Qwen2-VL 7B </font></th>
	<td align="middle"> ✅ </td>
	<td align="middle">30.0 </td>
	</tr>
	<tr>
	<th align="middle">Claude3 Opus</th>
	<td align="middle">✘</td>
	<td align="middle">29.1 </td>
	</tr>
	<tr>
	<th align="middle">GPT-4o mini </th>
	<td align="middle"> ✘ </td>
	<td align="middle">29.1 </td>
	</tr>
	<tr>
	<th align="middle">GPT-4V</th>
	<td align="middle">✘</td>
	<td align="middle">28.9 </td>
	</tr>
	<tr>
	<th align="middle">Gemini Ultra</th>
	<td align="middle">✘</td>
	<td align="middle">28.6 </td>
	</tr>
	<tr>
	<th align="middle"><font color=darkred>InternVL2 76B</font></th>
	<td align="middle"> ✅ </td>
	<td align="middle">26.9 </td>
	</tr>
	<tr>
	<th align="middle">QwenVL Max</th>
	<td align="middle">✘</td>
	<td align="middle">23.5 </td>
	</tr>
	<tr>
	<th align="middle">Claude3 Sonnet</th>
	<td align="middle">✘</td>
	<td align="middle">20.8 </td>
	</tr>
	<tr>
	<th align="middle">QwenVL Plus</th>
	<td align="middle">✘</td>
	<td align="middle">18.1 </td>
	</tr>
	<tr>
	<th align="middle"><font color=darkred>MiniCPM-V2.5</font></th>
	<td align="middle">✅</td>
	<td align="middle">15.3 </td>
	</tr>
	</table>

The test code for evaluating the models in the paper can be found at: <b><a href="https://github.com/EraX-JS-Company/EraX-MTVQA-Benchmark" target="_blank">EraX-JS-Company/EraX-MTVQA-Benchmark</a></b>

	## API trial 🎉
Please contact [email protected] for API access inquiries.

	## Examples 🧩

### 1. OCR - Optical Character Recognition for Multiple Images

	Example 01: Citizen identification card

	<div style="display: flex; flex-direction: row; align-items: center; justify-content: center;">
	<div style="text-align: center; margin: 0 10px;">
	<img src="images/trinhquangduy_front.jpg" width="500" alt="Front View" />
	<p>Front View</p>
	</div>
	<div style="text-align: center; margin: 0 10px;">
	<img src="images/trinhquangduy_back.jpg" width="500" alt="Back View" />
	<p>Back View</p>
	</div>
	</div>
	<p style="text-align: center; font-size: 12px; color: gray; margin-top: 10px;">
	Source: <a href="https://support.google.com/google-ads/thread/270967947/t%C3%B4i-%C4%91%C3%A3-g%E1%BB%ADi-h%C3%ACnh-%E1%BA%A3nh-c%C4%83n-c%C6%B0%E1%BB%9Bc-c%C3%B4ng-d%C3%A2n-c%E1%BB%A7a-ch%C3%ADnh-t%C3%B4i-%C4%91%E1%BB%83-x%C3%A1c-minh-danh-t%C3%ADnh?hl=vi" target="_blank">Google Support</a>
	</p>

	```
	{
	"Số thẻ":"037094012351"
	"Họ và tên":"TRỊNH QUANG DUY"
	"Ngày sinh":"04/09/1994"
	"Giới tính":"Nam"
	"Quốc tịch":"Việt Nam"
	"Quê quán / Place of origin":"Tân Thành, Kim Sơn, Ninh Bình"
	"Nơi thường trú / Place of residence":"Xóm 6 Tân Thành, Kim Sơn, Ninh Bình"
	"Có giá trị đến":"04/09/2034"
	"Đặc điểm nhân dạng / Personal identification":"seo chấm c:1cm trên đuôi mắt trái"
	"Cục trưởng cục cảnh sát quản lý hành chính về trật tự xã hội":"Nguyễn Quốc Hùng"
	"Ngày cấp":"10/12/2022"
	}
	```
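
The example above appears to combine the front and back images in a single request. A minimal sketch of how such a message could be assembled, assuming the same base64 data-URI convention as the Quickstart; `image_to_data_uri`, `build_multi_image_message`, and the file names mentioned afterwards are hypothetical.

```python
import base64
from pathlib import Path
from typing import List


def image_to_data_uri(path: str) -> str:
    """Read an image file and encode it as a base64 data URI."""
    encoded = base64.b64encode(Path(path).read_bytes()).decode("utf-8")
    return f"data:image;base64,{encoded}"


def build_multi_image_message(image_paths: List[str], prompt: str) -> List[dict]:
    """Return a one-turn `messages` list holding several images and one prompt."""
    content = [{"type": "image", "image": image_to_data_uri(p)} for p in image_paths]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]
```

For instance, `build_multi_image_message(["front.jpg", "back.jpg"], prompt)` yields a single user turn with two image entries followed by the text instruction, which can then go through the chat template as usual.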

	Example 01: Identity Card


	<div style="display: flex; flex-direction: row; align-items: center; justify-content: center;">
	<div style="text-align: center; margin: 0 10px;">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/63d8d8879dfcfa941d4d7cd9/4RD71oI0p04n1hAvLnqCR.jpeg" width="500" alt="Front View" />
	<p>Front View</p>
	</div>
	<div style="text-align: center; margin: 0 10px;">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/63d8d8879dfcfa941d4d7cd9/zNOWjwAoS9zEH1wUt6Fh6.jpeg" width="500" alt="Back View" />
	<p>Back View</p>
	</div>
	</div>
	<p style="text-align: center; font-size: 12px; color: gray; margin-top: 10px;">
	Source: <a href="https://lamgiaynhanh.com/lam-giay-chung-minh-nhan-dan-gia-nhanh/" target="_blank">Internet</a>
	</p>

	```
	{
	"Số":"272737384"
	"Họ tên":"PHẠM NHẬT TRƯỜNG"
	"Sinh ngày":"08-08-2000"
	"Nguyên quán":"Tiền Giang"
	"Nơi ĐKHK thường trú":"393, Tân Xuân, Bảo Bình, Cẩm Mỹ, Đồng Nai"
	"Dân tộc":"Kinh"
	"Tôn giáo":"Không"
	"Đặc điểm nhận dạng":"Nốt ruồi c.3,5cm trên sau cánh mũi phải."
	"Ngày cấp":"30 tháng 01 năm 2018"
	"Giám đốc CA":"T.BÌNH ĐỊNH"
	}
	```

	Example 02: Driver's License

	<div style="display: flex; flex-direction: row; align-items: center; justify-content: center;">
	<div style="text-align: center; margin: 0 10px;">
	<img src="images/nguyenvandung_front.png" width="500" alt="Front View" />
	<p>Front View</p>
	</div>
	<div style="text-align: center; margin: 0 10px;">
	<img src="images/nguyenvandung_back.png" width="500" alt="Back View" />
	<p>Back View</p>
	</div>
	</div>
	<p style="text-align: center; font-size: 12px; color: gray; margin-top: 10px;">
	Source: <a href="https://baophapluat.vn/khoi-to-tai-xe-len-mang-mua-giay-phep-lai-xe-gia-de-chay-xe-post481047.html" target="_blank">Báo Pháp luật</a>
	</p>

	```
	{
	"No.":"400116012313"
	"Fullname":"NGUYỄN VĂN DŨNG"
	"Date_of_birth":"08/06/1979"
	"Nationality":"VIỆT NAM"
	"Address":"X. Quỳnh Hầu, H. Quỳnh Lưu, T. Nghệ An
	Nghệ An, ngày/date 23 tháng/month 04 năm/year 2022"
	"Hang_Class":"FC"
	"Expires":"23/04/2027"
	"Place_of_issue":"Nghệ An"
	"Date_of_issue":"ngày/date 23 tháng/month 04 năm/year 2022"
	"Signer":"Trần Anh Tuấn"
	"Các loại xe được phép":"Ô tô hạng C kéo rơmoóc, đầu kéo kéo sơmi rơmoóc và xe hạng B1, B2, C, FB2 (Motor vehicle of class C with a trailer, semi-trailer truck and vehicles of classes B1, B2, C, FB2)"
	"Mã số":""
	}
	```

	Example 03: Vehicle Registration Certificate

	<div style="display: flex; flex-direction: row; align-items: center; justify-content: center;">
	<div style="text-align: center; margin: 0 10px;">
	<img src="images/nguyentonnhuan.jpg" width="500"/>
	</div>
	</div>
	<p style="text-align: center; font-size: 12px; color: gray; margin-top: 10px;">
	Source: <a href="https://vietnamnet.vn/phan-biet-cac-loai-giay-dang-ky-xe-khi-mua-moto-da-qua-su-dung-541341.html" target="_blank">Báo Vietnamnet</a>
	</p>

	```
	{
	"Tên chủ xe":"NGUYỄN TÔN NHUẬN"
	"Địa chỉ":"KE27 Kp3 P.TTTây Q7"
	"Nhãn hiệu":"HONDA"
	"Số loại":"DYLAN"
	"Màu sơn":"Trắng"
	"Số người được phép chở":"02"
	"Nguồn gốc":"Xe nhập mới"
	"Biển số đăng ký":"59V1-498.89"
	"Đăng ký lần đầu ngày":"08/06/2004"
	"Số máy":"F03E-0057735"
	"Số khung":"5A04F-070410"
	"Dung tích":"152"
	"Quản lý":"TRƯỞNG CA QUẬN"
	"Thượng tá":"Trần Văn Hiểu"
	}
	```

	Example 04: Birth Certificate

	<div style="display: flex; flex-direction: row; align-items: center; justify-content: center;">
	<div style="text-align: center; margin: 0 10px;">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/63d8d8879dfcfa941d4d7cd9/nVy1v3bwBl5lP9fZIIeux.jpeg" width="500"/>
	</div>
	</div>
	<p style="text-align: center; font-size: 12px; color: gray; margin-top: 10px;">
	Source: <a href="https://congchung247.com.vn/giay-khai-sinh-ban-chinh-co-the-lam-lai-duoc-khong/" target="_blank">https://congchung247.com.vn</a>
	</p>

	```
	{
	"name": "NGUYỄN NAM PHƯƠNG",
	"gender": "Nữ",
	"date_of_birth": "08/6/2011",
	"place_of_birth": "Bệnh viện Việt - Pháp Hà Nội",
	"nationality": "Việt Nam",
	"father_name": "Nguyễn Ninh Hồng Quang",
	"father_dob": "1980",
	"father_address": "309 nhà E2 Bạch Khoa - Hai Bà Trưng - Hà Nội",
	"mother_name": "Phạm Thùy Trang",
	"mother_dob": "1984",
	"mother_address": "309 nhà E2 Bạch Khoa - Hai Bà Trưng - Hà Nội",
	"registration_place": "UBND phường Bạch Khoa - Quận Hai Bà Trưng - Hà Nội",
	"registration_date": "05/8/2011",
	"registration_ralation": "cha",
	"notes": None,
	"certified_by": "Nguyễn Thị Kim Hoa"
	}
	```
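
The sample outputs above are JSON-like but not always strict JSON (some omit the commas between entries, and one uses `None`), so `json.loads` may reject them. A tolerant line-based parser is one way to handle this. The sketch below assumes one `"key":"value"` pair per line; `parse_key_value_output` is a hypothetical helper, and values that spill over several lines are skipped rather than repaired.

```python
import re
from typing import Dict, Optional

# One `"key": "value"` pair per line; the value may also be a bare None/null.
_PAIR = re.compile(
    r'^\s*"(?P<key>[^"]+)"\s*:\s*(?:"(?P<value>.*)"|(?P<empty>None|null))\s*,?\s*$'
)


def parse_key_value_output(text: str) -> Dict[str, Optional[str]]:
    """Collect key/value pairs from model output, ignoring braces and other lines."""
    result: Dict[str, Optional[str]] = {}
    for line in text.splitlines():
        match = _PAIR.match(line)
        if match:
            # `value` is None when the line held a bare None/null.
            result[match.group("key")] = match.group("value")
    return result


sample = '{\n"Số thẻ":"037094012351"\n"Họ và tên":"TRỊNH QUANG DUY"\n"notes": None,\n}'
print(parse_key_value_output(sample))
```

Keeping the parser line-based means a single malformed entry does not discard the rest of the record.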

	## Quickstart 🎮

	Install the necessary packages:
```bash
	python -m pip install git+https://github.com/huggingface/transformers accelerate
	python -m pip install qwen-vl-utils
	pip install flash-attn --no-build-isolation
	```
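
The `flash-attn` package is only useful on GPUs that support FlashAttention-2 (Ampere, i.e. compute capability 8.0, or newer). The sketch below shows one way to choose the `attn_implementation` value at runtime. The helper names are hypothetical, and falling back to `"eager"` simply mirrors the default used in this README.

```python
import importlib.util
from typing import Tuple


def pick_attn_implementation(compute_capability: Tuple[int, int], flash_attn_installed: bool) -> str:
    """Return "flash_attention_2" only for Ampere-or-newer GPUs with flash-attn installed."""
    major, _minor = compute_capability
    if flash_attn_installed and major >= 8:
        return "flash_attention_2"
    return "eager"


def detect_attn_implementation() -> str:
    """Probe the local machine; falls back to "eager" without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return "eager"
    if not torch.cuda.is_available():
        return "eager"
    has_flash = importlib.util.find_spec("flash_attn") is not None
    return pick_attn_implementation(torch.cuda.get_device_capability(), has_flash)


print(detect_attn_implementation())
```

The returned string can be passed as `attn_implementation=` when loading the model.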

	Then you can use `EraX-VL-2B-V1.5` like this:
```python
import base64

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Hugging Face repo id
model_path = "erax-ai/EraX-VL-2B-V1.5"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # replace with "flash_attention_2" if your GPU is Ampere or newer
    device_map="auto",
)

# Bound the number of pixels (and hence visual tokens) used per image
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

image_path = "image.jpg"

# Encode the image as a base64 data URI
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
decoded_image_text = encoded_image.decode("utf-8")
base64_data = f"data:image;base64,{decoded_image_text}"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": base64_data,
            },
            {
                "type": "text",
                # "Extract the content information from the provided image."
                "text": "Trích xuất thông tin nội dung từ hình ảnh được cung cấp.",
            },
        ],
    }
]

# Prepare prompt (the chat template returns text, not token ids)
prompt_text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[prompt_text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generation configs
generation_config = model.generation_config
generation_config.do_sample = True
generation_config.temperature = 1.0
generation_config.top_k = 1  # keeps only the most likely token, so decoding is effectively greedy
generation_config.top_p = 0.9
generation_config.min_p = 0.1
generation_config.max_new_tokens = 2048
generation_config.repetition_penalty = 1.06

# Inference
generated_ids = model.generate(**inputs, generation_config=generation_config)

# Strip the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
```
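
The `min_pixels` and `max_pixels` values bound how many visual tokens each image can consume. Assuming, as described in the Qwen2-VL documentation, that one visual token covers roughly a 28×28 pixel area, the settings above correspond to about 256 to 1280 tokens per image. The helper `estimate_visual_tokens` below is hypothetical and gives only a rough estimate, since the real processor also rounds image dimensions to multiples of 28.

```python
PIXELS_PER_TOKEN = 28 * 28  # assumption: one visual token per 28x28 pixel area


def estimate_visual_tokens(
    width: int,
    height: int,
    min_pixels: int = 256 * 28 * 28,
    max_pixels: int = 1280 * 28 * 28,
) -> int:
    """Clamp the image area to [min_pixels, max_pixels], then divide by the per-token area."""
    area = min(max(width * height, min_pixels), max_pixels)
    return area // PIXELS_PER_TOKEN


for size in [(100, 100), (1000, 800), (4000, 3000)]:
    print(size, estimate_visual_tokens(*size))
```

Raising `max_pixels` lets dense documents keep more detail at the cost of longer sequences and more memory; lowering it does the opposite.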

	## References 📑
	[1] Qwen team. Qwen2-VL. 2024.

	[2] Bai, Jinze, et al. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv preprint arXiv:2308.12966 (2023).

	[4] Yang, An, et al. "Qwen2 technical report." arXiv preprint arXiv:2407.10671 (2024).

[5] Chen, Zhe, et al. "InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[6] Chen, Zhe, et al. "How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites." arXiv preprint arXiv:2404.16821 (2024).

	[7] Tran, Chi, and Huong Le Thanh. "LaVy: Vietnamese Multimodal Large Language Model." arXiv preprint arXiv:2404.07922 (2024).


	## Contact 🤝
- For correspondence regarding this work or to inquire about an API trial, please contact Nguyễn Anh Nguyên at [[email protected]]([email protected]).
- Follow us on <b><a href="https://github.com/EraX-JS-Company" target="_blank">EraX GitHub</a></b>