Thirawarit committed
Commit 44203fd · Parent(s): 96ea63a
Update README.md

README.md CHANGED
## Training Data
The model was fine-tuned on several datasets:
- **Thai Image Caption**: Data sourced from image captioning competitions on Kaggle.
- **Thai Shorthand Dataset**: Data related to the Thai language.
- **ShareGPT-4o (translated into Thai)**: Data translated from GPT-4o-mini outputs into Thai.
- **Small-Thai-Wikipedia-location**: Articles in Thai from Wikipedia about geographic locations.

## Evaluation Results

| Type | Encoder | Decoder | Sentence SacreBLEU <br>(test) | Unique Tokens |
|------|---------|---------|-------------------------------|---------------|
| Idefic3-8B-Llama3 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 0.02657 | 12990 |
| Pathumma-llm-vision-beta-0.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 13.45412 | 1148 |
| Pathumma-llm-vision-1.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 17.66370 | 1312 |

- **Accuracy on Manual-VQA Tasks**: 30.34%

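The caption metric reported above is sentence-level SacreBLEU on the test split. As a rough illustration only (not the actual evaluation script, and with placeholder captions rather than items from the real test set), such a score can be computed with the `sacrebleu` package:

```python
# Minimal sketch: sentence-level SacreBLEU for generated Thai captions,
# averaged over a set of (hypothesis, reference) pairs.
import sacrebleu

# Placeholder caption pairs (hypothesis from the model, human reference).
pairs = [
    ("ลูกฮิปโปแคระกำลังยืนอยู่ข้างแม่ฮิปโปแคระ", "ลูกฮิปโปแคระยืนอยู่ข้างแม่ของมัน"),
]

scores = [sacrebleu.sentence_bleu(hyp, [ref]).score for hyp, ref in pairs]
print(f"Sentence SacreBLEU: {sum(scores) / len(scores):.5f}")
```
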
Install the `transformers` build with Idefics3 support:

```
pip install git+https://github.com/andimarafioti/transformers.git@idefics3
```

## Usage
We provide an [inference tutorial](https://colab.research.google.com/drive/1TakNg4v6hHFXLih-SFcibxzYBTs2-EFn?usp=sharing).
To use the model with the Hugging Face `transformers` library:

```python
import io
import os
import time
import random
import requests
import shutil
from IPython.display import display, Markdown
from IPython.display import clear_output as cls

import numpy as np
import pandas as pd
from PIL import Image

import torch

import transformers
from transformers import (
    Idefics3ForConditionalGeneration,
    AutoProcessor,
    BitsAndBytesConfig,
)
```

```python
# Prefer CUDA, then Apple Silicon (MPS), and fall back to CPU.
DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(DEVICE)
if DEVICE == 'cuda': display(torch.cuda.device_count())

N = 5

revision = "quantized8bit"
processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    revision=revision,  # Optional
    do_image_splitting=False,
    # size={"longest_edge": N*364},  # Optional
    # size={"height": N*364, "width": N*364},  # Optional
)

model = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    revision=revision,  # Optional
    torch_dtype=torch.float16,
    device_map=DEVICE
)
```
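The imports include `BitsAndBytesConfig`, and the checkpoint ships a `quantized8bit` revision. If you would rather quantize the full-precision weights at load time instead of using that revision, the standard `transformers`/`bitsandbytes` pattern looks roughly like this (an illustration, not part of the Pathumma tutorial; it requires a CUDA GPU with `bitsandbytes` installed):

```python
# Hypothetical alternative: quantize to 8-bit while loading the weights.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    quantization_config=bnb_config,
    device_map="auto",
)
```
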
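The intermediate cells that load an image, build the chat prompt, and run generation are not included in this excerpt (see the inference tutorial above). As a rough sketch only, the image URL, the Thai question, and the generation settings below are placeholders, chosen so that `generated_text` and `latency_time` are defined for the output-processing snippet that follows:

```python
# Illustrative bridge (not from the README): load an image, build a chat
# prompt, and generate a Thai caption, timing the call.
url = "https://example.com/hippo.jpg"  # placeholder image URL
image = Image.open(io.BytesIO(requests.get(url).content)).convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "อธิบายภาพนี้ให้หน่อย"},  # "Please describe this image."
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

start_time = time.time()
with torch.inference_mode():
    generated_ids = model.generate(**inputs, max_new_tokens=128)
latency_time = time.time() - start_time

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```
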
```python
answer_prompt = generated_text.split('Assistant:')[1].strip()

# Output processing (depends on task requirements)
print(answer_prompt)
print(f"latency_time: {latency_time:.3f} sec.")

# >>> output:
# >>> ลูกฮิปโปแคระกำลังยืนอยู่ข้างแม่ฮิปโปแคระที่กำลังอาบน้ำ
#     (English: "A baby pygmy hippo is standing next to its mother, who is bathing.")
# >>> latency_time: 7.642 sec.
```

## Limitations and Biases

If you use this model, please cite it as follows:

```bibtex
@misc{PathummaVision,
  author = {Thirawarit Pitiphiphat and NECTEC Team},
  title  = {nectec/Pathumma-llm-vision-1.0.0},
  year   = {2024},
  url    = {https://huggingface.co/nectec/Pathumma-llm-vision-1.0.0}
}
```

```bibtex
@misc{laurençon2024building,
  title         = {Building and better understanding vision-language models: insights and future directions},
  author        = {Hugo Laurençon and Andrés Marafioti and Victor Sanh and Léo Tronchon},
  year          = {2024},
  eprint        = {2408.12637},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
```

## Contact
For questions or support, please contact us on Discord: **https://discord.gg/3WJwJjZt7r**.