Model Card for Shrav20/colloquial-tamil-mt

📌 Model Summary

This model is a Machine Translation (MT) model designed for converting English to colloquial Tamil and vice versa. Unlike traditional Tamil MT models, which focus on formal Tamil, this model generates translations in natural spoken Tamil commonly used in everyday conversations.

📊 Model Details

Developed by: Shrav20
Funded by: Independent
Shared by: Shrav20
Model Type: Sequence-to-Sequence (Seq2Seq) Translation
Architecture: Based on M2M100 (Facebook’s Multilingual MT Model), finetuned for colloquial Tamil.
Languages Supported:
- English → Tamil (Colloquial)
- Tamil (Colloquial) → English
License: MIT
Finetuned from: facebook/m2m100_418M

🛠 Model Usage

🔹 Direct Use

You can use this model for colloquial Tamil translation in conversational AI, subtitles, and chatbots.

Example Code:

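from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shrav20/colloquial-tamil-mt")
model = AutoModelForSeq2SeqLM.from_pretrained("Shrav20/colloquial-tamil-mt")

def translate(text, src_lang="en", tgt_lang="ta"):
    # M2M100 tokenizers prepend a source-language token, so set it before encoding.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Force the first generated token to the target-language ID (standard M2M100 usage;
    # assumes this checkpoint keeps the base model's language tokens).
    output = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example translation
print(translate("The pharmacy is near the bus stop."))  # Expected output: "Bus stop pakkathula pharmacy iruku."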
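
The summary states that the model also translates colloquial Tamil back to English. Below is a minimal sketch of batched translation in either direction. It assumes the checkpoint loads an M2M100 tokenizer, so that `src_lang` and `get_lang_id` are available. The names `batched` and `translate_batch` and the default batch size of 8 are illustrative choices, not part of the model's API.

```python
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_batch(
    texts: List[str],
    src_lang: str,
    tgt_lang: str,
    batch_size: int = 8,  # arbitrary default, not taken from the model card
    model_id: str = "Shrav20/colloquial-tamil-mt",
) -> List[str]:
    """Translate a list of sentences from src_lang to tgt_lang ("en" or "ta")."""
    # Imported lazily so the batching helper works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

    # Assumption: the tokenizer is an M2M100 tokenizer exposing these attributes.
    tokenizer.src_lang = src_lang
    target_id = tokenizer.get_lang_id(tgt_lang)

    results: List[str] = []
    for chunk in batched(texts, batch_size):
        inputs = tokenizer(chunk, return_tensors="pt", padding=True, truncation=True)
        generated = model.generate(**inputs, forced_bos_token_id=target_id)
        results.extend(tokenizer.batch_decode(generated, skip_special_tokens=True))
    return results
```

For the reverse direction, a call such as `translate_batch(["Bus stop pakkathula pharmacy iruku."], src_lang="ta", tgt_lang="en")` would be the natural starting point. Loading the model once and reusing it across batches avoids repeated downloads and initialization.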

📖 Training Details

📌 Training Dataset

This model is finetuned on the Shrav20/colloquial-tamil dataset.
Sources:
- sangeethat/colloquial
- AI-generated data
- Internet-scraped content
- Manually verified colloquial sentences

🛠 Training Hyperparameters

Batch Size: 16
Learning Rate: 5e-5
Epochs: 3
Optimizer: AdamW
Precision: fp16 (mixed precision)
LoRA Adapters: Enabled for efficient fine-tuning

📊 Evaluation

📌 Testing Data & Metrics

Dataset: 5,000 colloquial Tamil-English sentence pairs
Evaluation Metrics:
- BLEU Score: 28.5
- METEOR Score: 34.1
- TER: 41.2
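
To make the TER figure easier to interpret: TER divides the number of word-level edits needed to turn a hypothesis into its reference by the number of reference words, so lower is better. The sketch below computes a simplified, shift-free version in pure Python. It is not the script used to produce the scores above, and because it ignores the phrase shifts that full TER allows, it can only overestimate the true TER. The function names are illustrative.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Levenshtein distance over tokens (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        curr = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            curr.append(min(
                prev[j] + 1,         # delete a hypothesis word
                curr[j - 1] + 1,     # insert a reference word
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


def shift_free_ter(hypotheses: List[str], references: List[str]) -> float:
    """Corpus-level edit rate in percent: total edits / total reference words."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    edits = sum(
        word_edit_distance(h.split(), r.split())
        for h, r in zip(hypotheses, references)
    )
    ref_words = sum(len(r.split()) for r in references)
    if ref_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * edits / ref_words


# Made-up hypothesis for illustration: one substituted word out of five.
score = shift_free_ter(
    ["Bus stop kitta pharmacy iruku."],
    ["Bus stop pakkathula pharmacy iruku."],
)
print(score)  # 20.0
```

For reported results, an established implementation such as the `sacrebleu` package is preferable to this sketch, since it handles shifts and tokenization consistently.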

📌 Example Outputs

English	Tamil (Colloquial)
The pharmacy is near the bus stop.	Bus stop pakkathula pharmacy iruku.
Take this medicine after food.	Food saptadhukku apram intha medicine eduthukungo.
Train tickets for tomorrow are available.	Naalaikku train tickets available iruku.

🚨 Bias, Risks, and Limitations

Dialectal Bias: The model is trained on a specific style of spoken Tamil and may not generalize to all Tamil dialects.
Data Noise: Some AI-generated content may not be fully accurate.
Context Sensitivity: The model struggles with complex sentence structures and ambiguous meanings.

💡 How to Contribute

If you find issues or have improvements, feel free to open a GitHub issue or contribute data via Hugging Face!

📩 Contact: Shrav20 via Hugging Face discussions.

📝 Citation

If you use this model, please cite:

@misc{shrav20colloquial,
  author = {Shrav20},
  title = {Colloquial Tamil Machine Translation Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Shrav20/colloquial-tamil-mt}
}

🌱 Future Improvements

- More diverse datasets
- Better handling of Tamil-English code-mixing
- Improved sentence fluency with RLHF (Reinforcement Learning from Human Feedback)
