Thirawarit committed on
Commit 44203fd
1 Parent(s): 96ea63a

Update README.md

Files changed (1)
  1. README.md +51 -9
README.md CHANGED
@@ -33,7 +33,7 @@ Pathumma-llm-vision-1.0.0 is designed to perform multi-modal tasks by integratin
 
 ## Training Data
 The model was fine-tuned on several datasets:
-- **Image Caption Competition (Kaggle)**: Data sourced from image captioning competitions on Kaggle.
+- **Thai Image Caption**: Data sourced from image captioning competitions on Kaggle.
 - **Thai Shorthand Dataset**: Data related to the Thai language.
 - **ShareGPT-4o (translated into Thai)**: Data translated from GPT-4o-mini outputs into Thai.
 - **Small-Thai-Wikipedia-location**: Articles in Thai from Wikipedia about geographic locations.
@@ -53,11 +53,11 @@ The model was fine-tuned on several datasets:
 
 ## Evaluation Results
 
-| Type | Encoder | Decoder | Learning Rate | Sentence SacreBLEU | Unique Tokens |
-|------|---------|---------|---------------|--------------------|---------------|
-| Idefics3-8B-Llama3 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | - | 0.02657 | 12990 |
-| Pathumma-llm-vision-beta-0.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 1e-4 | 13.45412 | 1148 |
-| Pathumma-llm-vision-1.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 1e-4 | 17.66370 | 1312 |
+| Type | Encoder | Decoder | Sentence SacreBLEU <br>(test) | Unique Tokens |
+|------|---------|---------|-------------------------------|---------------|
+| Idefics3-8B-Llama3 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 0.02657 | 12990 |
+| Pathumma-llm-vision-beta-0.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 13.45412 | 1148 |
+| Pathumma-llm-vision-1.0.0 | siglip-so400m-patch14-384 | Meta-Llama-3.1-8B-Instruct | 17.66370 | 1312 |
 
 
 - **Accuracy on Manual-VQA Tasks**: 30.34%
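For reference, the sentence-level SacreBLEU scores in the table can be reproduced with the `sacrebleu` package (`pip install sacrebleu`). A minimal sketch follows; the caption pair is an illustrative stand-in, not an item from the actual test set:

```python
# Minimal sketch: sentence-level SacreBLEU for one hypothesis/reference pair.
# Both captions are made-up examples, not data from the evaluation set.
import sacrebleu

hypothesis = "a baby hippo stands next to its mother"
reference = "a baby pygmy hippo is standing beside its bathing mother"

score = sacrebleu.sentence_bleu(hypothesis, [reference])
print(f"Sentence SacreBLEU: {score.score:.5f}")
```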
@@ -71,10 +71,34 @@ pip install git+https://github.com/andimarafioti/transformers.git@idefics3
 ```
 
 ## Usage
+We provide an [inference tutorial](https://colab.research.google.com/drive/1TakNg4v6hHFXLih-SFcibxzYBTs2-EFn?usp=sharing).
 To use the model with the Hugging Face `transformers` library:
 
 ```python
-from transformers import AutoProcessor, Idefics3ForConditionalGeneration
+import io
+import os
+import time
+import random
+import requests
+import shutil
+from IPython.display import display, Markdown
+from IPython.display import clear_output as cls
+
+import numpy as np
+import pandas as pd
+from PIL import Image
+
+import torch
+
+import transformers
+from transformers import (
+    Idefics3ForConditionalGeneration,
+    AutoProcessor,
+    BitsAndBytesConfig,
+)
+```
+
+```python
 
 DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
 print(DEVICE)
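The import block above pulls in `io`, `requests`, and `PIL.Image`, which the captioning steps below rely on. Here is a minimal sketch of fetching a test image; the URL is a placeholder, so substitute any publicly reachable image:

```python
# Sketch: download a sample image for the captioning steps below.
# The URL is a placeholder; any publicly reachable RGB image will do.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
image = Image.open(io.BytesIO(requests.get(url, timeout=30).content)).convert("RGB")
display(image)
```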
@@ -82,8 +106,10 @@ if DEVICE == 'cuda': display(torch.cuda.device_count())
 
 N = 5
 
+revision = "quantized8bit"
 processor = AutoProcessor.from_pretrained(
     "nectec/Pathumma-llm-vision-1.0.0",
+    revision=revision,  # Optional
     do_image_splitting=False,
     # size={"longest_edge": N*364},  # Optional
     # size={"height": N*364, "width": N*364},  # Optional
@@ -91,6 +117,7 @@ processor = AutoProcessor.from_pretrained(
 
 model = Idefics3ForConditionalGeneration.from_pretrained(
     "nectec/Pathumma-llm-vision-1.0.0",
+    revision=revision,  # Optional
     torch_dtype=torch.float16,
     device_map=DEVICE
 )
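`BitsAndBytesConfig` is imported above but unused in this snippet, since `revision="quantized8bit"` pulls a branch with pre-quantized weights. As an alternative sketch, the fp16 weights on the main branch could be quantized at load time instead (assumes a CUDA GPU plus `pip install bitsandbytes accelerate`):

```python
# Alternative sketch: quantize to 8-bit at load time with bitsandbytes,
# rather than downloading the pre-quantized "quantized8bit" branch.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model_8bit = Idefics3ForConditionalGeneration.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    quantization_config=quant_config,
    device_map="auto",  # bitsandbytes loading expects accelerate-style placement
)
```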
@@ -152,7 +179,11 @@ answer_prompt = generated_text.split('Assistant:')[1].strip()
 
 # Output processing (depends on task requirements)
 print(answer_prompt)
-print(latency_time)
+print(f"latency_time: {latency_time:.3f} sec.")
+
+# >>> output:
+# >>> ลูกฮิปโปแคระกำลังยืนอยู่ข้างแม่ฮิปโปแคระที่กำลังอาบน้ำ ("The baby pygmy hippo is standing beside its mother, who is bathing.")
+# >>> latency_time: 7.642 sec.
 ```
 
 ## Limitations and Biases
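The hunk above uses `generated_text` and `latency_time`, but the diff elides the generation step that produces them. Below is a minimal sketch of that step with the Idefics3 chat-template API, assuming the `processor`, `model`, `DEVICE`, and `image` objects from the earlier snippets; the Thai question is illustrative:

```python
# Sketch of the elided generation step: chat-formatted prompt, timed generate().
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "อธิบายภาพนี้"},  # "Describe this image."
    ]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

start_time = time.time()
generated_ids = model.generate(**inputs, max_new_tokens=128)
latency_time = time.time() - start_time

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```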
@@ -168,13 +199,24 @@ If you use this model, please cite it as follows:
 
 ```bibtex
 @misc{PathummaVision,
-  author = {NECTEC Team},
+  author = {Thirawarit Pitiphiphat and NECTEC Team},
   title = {nectec/Pathumma-llm-vision-1.0.0},
   year = {2024},
   url = {https://huggingface.co/nectec/Pathumma-llm-vision-1.0.0}
 }
 ```
 
+```bibtex
+@misc{laurencon2024building,
+  title = {Building and better understanding vision-language models: insights and future directions},
+  author = {Hugo Laurençon and Andrés Marafioti and Victor Sanh and Léo Tronchon},
+  year = {2024},
+  eprint = {2408.12637},
+  archivePrefix = {arXiv},
+  primaryClass = {cs.CV}
+}
+```
+
 ## Contact
 For questions or support, please contact **https://discord.gg/3WJwJjZt7r**.
 