ThaiT5-Instruct

Model Description

ThaiT5-Instruct is a fine-tuned version of kobkrit/thai-t5-base, trained on the WangchanX Seed-Free Synthetic Instruct Thai 120k dataset. This model supports various NLP tasks, including:

Conversation
Multiple Choice Reasoning
Brainstorming
Question Answering
Summarization

The model was trained for 13 epochs. Validation loss was still decreasing at the last logged epoch, so further training with more compute may improve it.

Training Details

Base Model: kobkrit/thai-t5-base
Epochs: 13
Batch Size per Device: 32
Gradient Accumulation Steps: 2
Optimizer: AdamW
Hardware Used: A100
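
The per-device batch size and gradient accumulation steps above together determine how many examples contribute to each optimizer update. A minimal sketch of that arithmetic follows. The card does not state how many GPUs were used, so `num_devices` is a hypothetical parameter and its default of 1 is an assumption.

```python
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of training examples consumed per optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

# Values from the training details above; num_devices=1 is an assumption.
print(effective_batch_size(32, 2))  # 64
```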

Training Loss per Epoch:

[2.2463, 1.7010, 1.5261, 1.4626, 1.4085, 1.3844, 1.3647, 1.3442, 1.3373, 1.3182, 1.3169, 1.3016]

Validation Loss per Epoch:

[1.4781, 1.3761, 1.3131, 1.2775, 1.2549, 1.2364, 1.2226, 1.2141, 1.2043, 1.1995, 1.1954, 1.1929]
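
To make the trend in these numbers easier to read, the sketch below computes the epoch-to-epoch change in validation loss and the gap between training and validation loss. It uses only the values listed above. Each list holds 12 values, so the 1-based epoch numbering in the code is an assumption about how the entries map to epochs.

```python
train_loss = [2.2463, 1.7010, 1.5261, 1.4626, 1.4085, 1.3844,
              1.3647, 1.3442, 1.3373, 1.3182, 1.3169, 1.3016]
val_loss = [1.4781, 1.3761, 1.3131, 1.2775, 1.2549, 1.2364,
            1.2226, 1.2141, 1.2043, 1.1995, 1.1954, 1.1929]

# Change in validation loss between consecutive logged epochs (negative = improvement).
val_deltas = [round(b - a, 4) for a, b in zip(val_loss, val_loss[1:])]

# Training loss minus validation loss at each logged epoch.
gaps = [round(t - v, 4) for t, v in zip(train_loss, val_loss)]

# Logged epoch with the lowest validation loss (1-based; numbering is assumed).
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

print(f"logged epochs: {len(val_loss)}")
print(f"lowest validation loss: {min(val_loss)} at logged epoch {best_epoch}")
print(f"validation deltas: {val_deltas}")
print(f"train minus validation: {gaps}")
```

In these numbers the validation loss decreases at every step, and the training loss stays above the validation loss throughout.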

Evaluation Results

The model was evaluated using several NLP metrics, with the following results:

Metric	Score
ROUGE-1	0.0617
ROUGE-2	0.0291
ROUGE-L	0.061
BLEU	0.0093
Exact Match	0.2516
F1 Score	27.8984
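
The card does not say how Exact Match and F1 were computed. For orientation, the sketch below shows one common way to compute them for question answering, using token overlap in the style of SQuAD evaluation. It is an assumption, not the author's evaluation script: tokens are split on whitespace, which is a simplification for Thai, the function names are hypothetical, and the example strings are made up.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings are equal after collapsing whitespace, else 0.0."""
    return float(" ".join(prediction.split()) == " ".join(reference.split()))

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall over whitespace tokens."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a match; one empty string counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Made-up strings for illustration only.
print(exact_match("ฮัก หมายถึง ความรัก", "ฮัก หมายถึง ความรัก"))
print(token_f1("ฮัก หมายถึง ความรัก", "ฮัก คือ ความรัก"))
```

Under this definition, a reported F1 of 27.8984 would be a mean score scaled by 100, while the Exact Match value of 0.2516 reads like a fraction between 0 and 1. The card does not confirm either scale.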

Usage

Basic Inference (Without Context)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub
model = AutoModelForSeq2SeqLM.from_pretrained("Peenipat/ThaiT5-Instruct", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Peenipat/ThaiT5-Instruct")

input_text = "หวัดดี"  # informal Thai greeting ("Hi")

# Tokenize the prompt, generate a response, and decode it back to text
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"])
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output_text)

Example:

input_text = "คำว่า ฮัก หมายถึงอะไร"  # "What does the word ฮัก mean?"

inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(input_ids=inputs["input_ids"])
output_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(output_text)

Output:

"ฮัก หมายถึง ภาษา สันสกฤต ภาษา สันสกฤต "
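
The output above repeats a phrase, which can happen with default decoding settings. One option is to reuse the decoding settings from the question answering example in this card. The sketch below collects them in a helper. `decoding_kwargs` is a hypothetical name, and whether these settings improve this particular answer is untested.

```python
def decoding_kwargs(**overrides):
    """Return generate() keyword arguments based on the QA example's settings."""
    settings = {
        "max_length": 60,
        "min_length": 20,
        "no_repeat_ngram_size": 3,  # block any 3-gram from appearing twice
        "num_beams": 5,             # beam search instead of greedy decoding
        "early_stopping": True,
    }
    settings.update(overrides)  # caller-supplied values take precedence
    return settings

# Usage with the model and inputs from the snippet above (not executed here):
# outputs = model.generate(**inputs, **decoding_kwargs(min_length=1))
print(decoding_kwargs(min_length=1))
```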

Question Answering (With Context)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, pipeline

model = AutoModelForSeq2SeqLM.from_pretrained("Peenipat/ThaiT5-Instruct", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Peenipat/ThaiT5-Instruct")

model.eval()
qa_pipeline = pipeline("text2text-generation", model=model, tokenizer=tokenizer)

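def ask_question():
    """Read a context and a question from stdin and print the model's answer."""
    context = input("Input Context: ")
    question = input("Input Question: ")
    # Join the context and question into a single prompt
    input_text = f"Context: {context} Question: {question}"
    # Beam search with a 3-gram repetition block; output length is 20 to 60 tokens
    output = qa_pipeline(input_text,
                         max_length=60,
                         min_length=20,
                         no_repeat_ngram_size=3,
                         num_beams=5,
                         early_stopping=True)
    output_text = output[0]['generated_text']
    print("\nOutput:")
    print(output_text)

# Run the interactive prompt
ask_question()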

Example:

Input Context: ฮัก คือความรู้สึกผูกพันและห่วงใยที่เกิดขึ้นระหว่างคนที่มีความสำคัญต่อกัน ไม่ว่าจะเป็นฮักหนุ่มสาว ฮักพ่อแม่ลูก หรือฮักพี่น้อง ฮักบ่ได้หมายถึงแค่ความสุข แต่ยังรวมถึงความเข้าใจ การอดทน และการเสียสละเพื่อกันและกัน คนอีสานมักแสดงความฮักผ่านการกระทำมากกว่าคำพูด เช่น การดูแลเอาใจใส่ และการอยู่เคียงข้างยามทุกข์ยาก ฮักแท้คือฮักที่มั่นคง บ่เปลี่ยนแปลงตามกาลเวลา และเต็มไปด้วยความจริงใจ
Input Question: คำว่า ฮัก หมายถึงอะไร

Output:
ฮัก ความรู้สึกผูกพันและห่วงใย เกิดขึ้นระหว่างคนมีความสําคัญต่อกัน ฮักบ่ได้หมายถึงความสุข ความเข้าใจ การอดทน เสียสละเพื่อกันและกัน คนอีสานมักแสดงความฮักผ่านการกระทํามากกว่าคําพูด ดูแลเอาใจใส่ ที่อยู่เคียงข้างยามทุกข์
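
The question answering example builds its input by joining the context and question with the `Context:` and `Question:` labels. A minimal sketch of that step as a reusable function follows, so the prompt can be built without interactive `input()` calls. `build_qa_prompt` is a hypothetical helper name, and the whitespace trimming and empty-input check are added assumptions.

```python
def build_qa_prompt(context: str, question: str) -> str:
    """Format a context and question the same way as the QA example's f-string."""
    context = context.strip()
    question = question.strip()
    if not context or not question:
        raise ValueError("both context and question must be non-empty")
    return f"Context: {context} Question: {question}"

# Shortened context taken from the example above.
prompt = build_qa_prompt("ฮัก คือความรู้สึกผูกพันและห่วงใย", "คำว่า ฮัก หมายถึงอะไร")
print(prompt)
# The result can be passed to the pipeline from the example, e.g. qa_pipeline(prompt, num_beams=5)
```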

Limitations & Future Improvements

Output quality is limited by the training budget; additional training resources would likely improve it.
Complex reasoning tasks may require further fine-tuning on domain-specific datasets.
The model is not a general-purpose assistant comparable to ChatGPT, Gemini, or other large models. It is better at extracting answers from a supplied context than at answering from its own knowledge.

Citation

If you use this model, please cite it as follows:

@misc{PeenipatThaiT5Instruct,
  title={ThaiT5-Instruct},
  author={Peenipat},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Peenipat/ThaiT5-Instruct}
}
