metadata

language:
  - th
  - en
metrics:
  - sacrebleu
base_model:
  - HuggingFaceM4/Idefics3-8B-Llama3
pipeline_tag: visual-question-answering

Pathumma-llm-vision-1.0.0

Model Overview

Pathumma-llm-vision-1.0.0 is a multi-modal language model fine-tuned for Visual Question Answering (VQA) and Image Captioning tasks. It contains 8 billion parameters and leverages both image and text processing to understand and generate multi-modal content.

Model Name: Pathumma-llm-vision-1.0.0
Base Model: HuggingFaceM4/Idefics3-8B-Llama3
Architecture: Multi-modal LLM (Visual Language Model)
Parameters: 8 Billion
Organization: NECTEC
License: [Specify License]

Intended Use

Primary Use Cases:
- Visual Question Answering (VQA)
- Image Captioning
Intended Users: Developers, researchers, and AI practitioners working on multi-modal tasks.
Possible Applications: Educational tools, accessibility applications, interactive visual content generation.

Model Description

Pathumma-llm-vision-1.0.0 is designed to perform multi-modal tasks by integrating both visual and textual information. The model is fine-tuned with diverse datasets to improve its ability to understand and generate content that aligns with both image and text inputs.

Training Data

The model was fine-tuned on several datasets:

Thai Image Caption: Data sourced from image captioning competitions on Kaggle.
Thai Shorthand Dataset: Data related to the Thai language.
ShareGPT-4o (translated into Thai): Data translated from GPT-4o-mini outputs into Thai.
Small-Thai-Wikipedia-location: Articles in Thai from Wikipedia about geographic locations.
Synthetic Data: Additional synthetic data generated to increase dataset diversity.

Dataset Size

Training Dataset Size: 112,768 examples
Validation Dataset Size: 9,036 examples
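The two counts above imply a split of roughly 93% training to 7% validation. The short sketch below only restates that arithmetic; the variable names are illustrative.

```python
# Example counts from the Dataset Size section above
train_examples = 112_768
validation_examples = 9_036

total = train_examples + validation_examples
train_share = train_examples / total
validation_share = validation_examples / total

print(f"Total examples: {total}")
print(f"Train share: {train_share:.2%}")
print(f"Validation share: {validation_share:.2%}")
```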

Training Details

Hardware Used:
- HPC Cluster: Lanta
- Number of Nodes: 16 Nodes
- GPUs per Node: 4 GPUs
- Total GPUs Used: 64 GPUs
Fine-tuning Duration: 3 hours, 18 minutes, and 11 seconds (excluding evaluation)
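
As a rough sanity check, the cluster size and wall-clock duration above can be combined into a total GPU-hour estimate. The sketch below uses only the figures listed in this section. The variable names are illustrative, and the result excludes evaluation time, as the reported duration does.

```python
from datetime import timedelta

# Figures taken from the Training Details section above
NODES = 16
GPUS_PER_NODE = 4
DURATION = timedelta(hours=3, minutes=18, seconds=11)  # excludes evaluation

total_gpus = NODES * GPUS_PER_NODE
gpu_hours = total_gpus * DURATION.total_seconds() / 3600

print(f"Total GPUs: {total_gpus}")
print(f"Approximate GPU-hours: {gpu_hours:.1f}")
```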

Evaluation Results

Type	Encoder	Decoder	IPU24-dataset (test) (Sentence SacreBLEU)
Idefics3-8B-Llama3	siglip-so400m-patch14-384	Meta-Llama-3.1-8B-Instruct	0.02657
Pathumma-llm-vision-beta-0.0.0	siglip-so400m-patch14-384	Meta-Llama-3.1-8B-Instruct	13.45412
Pathumma-llm-vision-1.0.0	siglip-so400m-patch14-384	Meta-Llama-3.1-8B-Instruct	17.66370
llama-3-typhoon-v1.5-8b-vision-preview	siglip-so400m-patch14-384	Llama-3-Typhoon-1.5-8B-instruct	8.288626

Note: For models that were not specifically fine-tuned on the IPU24 dataset, these scores may be less representative of their performance.

VQA accuracy on a private test dataset: 30.34%
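
To put the SacreBLEU scores in the table above in perspective, the gaps between models can be computed directly. This is a small sketch that only restates the reported numbers. The dictionary layout and the helper name `gain_over` are illustrative choices, not part of any evaluation tooling.

```python
# Sentence SacreBLEU scores on the IPU24 test set, copied from the table above
scores = {
    "Idefics3-8B-Llama3": 0.02657,
    "Pathumma-llm-vision-beta-0.0.0": 13.45412,
    "Pathumma-llm-vision-1.0.0": 17.66370,
    "llama-3-typhoon-v1.5-8b-vision-preview": 8.288626,
}


def gain_over(baseline: str, model: str = "Pathumma-llm-vision-1.0.0") -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent) of model over baseline."""
    absolute = scores[model] - scores[baseline]
    relative = 100.0 * absolute / scores[baseline]
    return absolute, relative


for name in scores:
    if name == "Pathumma-llm-vision-1.0.0":
        continue  # skip comparing the model with itself
    absolute, relative = gain_over(name)
    print(f"vs {name}: +{absolute:.2f} points ({relative:+.1f}%)")
```

Relative gains over a near-zero baseline, such as the base model here, become very large and are best read alongside the absolute difference.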

Required Libraries

Before you start, ensure you have the following libraries installed:

pip install git+https://github.com/andimarafioti/transformers.git@idefics3
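
After installation, it can help to confirm that the installed transformers build exposes the Idefics3 class used by this model. The helper below is a minimal sketch: `has_idefics3` is a hypothetical name, and it returns False when transformers is missing, fails to import, or does not provide the class.

```python
import importlib
import importlib.util


def has_idefics3() -> bool:
    """Return True if the installed transformers exposes Idefics3ForConditionalGeneration."""
    if importlib.util.find_spec("transformers") is None:
        return False  # transformers is not installed at all
    try:
        transformers = importlib.import_module("transformers")
        return hasattr(transformers, "Idefics3ForConditionalGeneration")
    except Exception:
        # A broken or partially installed package is treated as "not available"
        return False


if __name__ == "__main__":
    print("Idefics3 support available:", has_idefics3())
```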

Usage

The following inference tutorial shows how to use the model with the Hugging Face transformers library:

import io
import time

import requests
from PIL import Image

import torch
from transformers import (
    Idefics3ForConditionalGeneration,
    AutoProcessor,
)


# Prefer CUDA, then Apple Silicon (MPS), and fall back to CPU
DEVICE = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu"
print(DEVICE)
if DEVICE == "cuda":
    print(torch.cuda.device_count())

N = 5

revision = "quantized8bit"
processor = AutoProcessor.from_pretrained(
    "nectec/Pathumma-llm-vision-1.0.0",
    revision=revision,                         # Optional
    do_image_splitting=False,
    # size={"longest_edge": N*364},            # Optional
    # size={"height": N*364, "width": N*364},  # Optional
)

model = Idefics3ForConditionalGeneration.from_pretrained(
        "nectec/Pathumma-llm-vision-1.0.0",
        revision=revision,                     # Optional
        torch_dtype=torch.float16,
        device_map=DEVICE
    )

print(processor.image_processor.size)

url_path = None
local_path = "./path/picture.jpg" if not url_path else io.BytesIO(requests.get(url_path).content)
image = Image.open(local_path)

question = "รายละเอียดของรูปภาพนี้"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."},
            {"type": "image"},
            {"type": "text", "text": question}
        ]
    }
]

text = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
)

encoding = processor(
    images=image,
    text=text.strip(),
    # padding='max_length',
    # truncation=True,
    # max_length=,
    return_tensors="pt"
)

# Move every input tensor to the same device as the model
inputs = {k: v.to(DEVICE) for k, v in encoding.items()}

# Example: run inference on the image and question
start_time = time.time()
model.eval()
with torch.inference_mode():
    # Generate
    generated_ids = model.generate(
        **inputs, 
        max_new_tokens=128, 
        # temperature=.5, 
        # repetition_penalty=1.,
        # # top_k=1.,
        # top_p=1,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
end_time = time.time()

# Compute the latency
latency_time = end_time - start_time

answer_prompt = generated_text.split('Assistant:')[1].strip()

# Output processing (depends on task requirements)
print(answer_prompt)
print(f"latency_time: {latency_time:.3f} sec.")

# >>> output:
# >>> ลูกฮิปโปแคระกำลังยืนอยู่ข้างแม่ฮิปโปแคระที่กำลังอาบน้ำ
# >>> latency_time: 7.642 sec.
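
The answer extraction step in the example above indexes into the result of `split('Assistant:')`, which raises an IndexError when the marker is absent. As a small defensive sketch, the hypothetical helper `extract_answer` below returns the text after the first marker and falls back to the whole string otherwise. The `Assistant:` marker is taken from the example above; the sample strings are made up for illustration.

```python
def extract_answer(generated_text: str, marker: str = "Assistant:") -> str:
    """Return the text after the first marker, or the whole text if the marker is missing."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


# Illustrative strings, not real model output
print(extract_answer("User: What is in the picture?\nAssistant: A small hippo."))
print(extract_answer("A reply without the marker"))
```

Unlike `split(...)[1]`, this keeps everything after the first marker, so an answer that itself contains the text `Assistant:` is not cut short.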

Limitations and Biases

The model may exhibit biases due to the training data, which might not be fully representative of all contexts.
Performance may degrade on unfamiliar images or non-standard question formats.

Ethical Considerations

The model should not be used to generate misleading information or in ways that violate privacy.
Consider fairness and minimize bias when using the model for language and image processing tasks.

Citation

If you use this model, please cite it as follows:

@misc{PathummaVision,
  author = {Thirawarit Pitiphiphat and NECTEC Team},
  title = {nectec/Pathumma-llm-vision-1.0.0},
  year = {2024},
  url = {https://huggingface.co/nectec/Pathumma-llm-vision-1.0.0}
}

@misc{laurençon2024building,
      title={Building and better understanding vision-language models: insights and future directions.}, 
      author={Hugo Laurençon and Andrés Marafioti and Victor Sanh and Léo Tronchon},
      year={2024},
      eprint={2408.12637},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Contributor Contact

LLM Team
Pakawat Phasook
Jessada Pranee
Arnon Saeoung
Kun Kerdthaisong
Kittisak Sukhantharat
Chaianun Damrongrat
Sarawoot Kongyoung

Audio Team
Pattara Tipaksorn
Wayupuk Sommuang
Oatsada Chatthong
Kwanchiva Thangthai

Vision Team
Thirawarit Pitiphiphat
Peerapas Ngokpon
Theerasit Issaranon

Contact

For questions or support, please contact us on Discord: https://discord.gg/3WJwJjZt7r
