Model Card for Shrav20/colloquial-tamil-mt
Model Summary
This is a machine translation (MT) model that translates between English and colloquial Tamil in both directions. Unlike traditional Tamil MT models, which focus on formal Tamil, it generates the natural spoken Tamil used in everyday conversation.
Model Details
- Developed by: Shrav20
- Funded by: Independent
- Shared by: Shrav20
- Model Type: Sequence-to-Sequence (Seq2Seq) Translation
- Architecture: Based on M2M100 (Facebook's Multilingual MT Model), finetuned for colloquial Tamil.
- Languages Supported:
  - English → Tamil (Colloquial)
  - Tamil (Colloquial) → English
- License: MIT
- Finetuned from: facebook/m2m100_418M
Model Usage
Direct Use
You can use this model for colloquial Tamil translation in conversational AI, subtitles, and chatbots.
Example Code:
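from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shrav20/colloquial-tamil-mt")
model = AutoModelForSeq2SeqLM.from_pretrained("Shrav20/colloquial-tamil-mt")

def translate(text, src_lang="en", tgt_lang="ta"):
    # M2M100 tokenizers read the source language from `src_lang`
    # (assumes this checkpoint ships the M2M100 tokenizer of its base model).
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Force the target-language token as the first generated token.
    output = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example translation
print(translate("The pharmacy is near the bus stop."))  # Example output: "Bus stop pakkathula pharmacy iruku."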
Training Details
Training Dataset
- This model is finetuned on the Shrav20/colloquial-tamil dataset.
- Sources:
  - sangeethat/colloquial
  - AI-generated data
  - Internet-scraped content
  - Manually verified colloquial sentences
Training Hyperparameters
- Batch Size: 16
- Learning Rate: 5e-5
- Epochs: 3
- Optimizer: AdamW
- Precision: fp16 (mixed precision)
- LoRA Adapters: Enabled for efficient fine-tuning
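The hyperparameters above can be gathered into a small configuration object. The following is a minimal sketch, not the author's training script: `FinetuneConfig` and `to_training_kwargs` are hypothetical names introduced here, and the mapping assumes that the batch size of 16 is per device and that "AdamW" refers to the PyTorch implementation. The card does not state the LoRA rank, alpha, or target modules, so they are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FinetuneConfig:
    """Hyperparameters as listed in this model card (hypothetical container)."""
    base_model: str = "facebook/m2m100_418M"
    batch_size: int = 16
    learning_rate: float = 5e-5
    epochs: int = 3
    optimizer: str = "AdamW"
    fp16: bool = True
    use_lora: bool = True  # LoRA rank/alpha/target modules are not given in the card


def to_training_kwargs(cfg: FinetuneConfig) -> dict:
    """Map the card's values onto Seq2SeqTrainingArguments keyword names.

    Assumption: the batch size is per device; the card does not say.
    """
    return {
        "per_device_train_batch_size": cfg.batch_size,
        "learning_rate": cfg.learning_rate,
        "num_train_epochs": cfg.epochs,
        "optim": "adamw_torch",  # assumed PyTorch AdamW variant
        "fp16": cfg.fp16,
    }


kwargs = to_training_kwargs(FinetuneConfig())
print(kwargs)
```

The resulting dictionary could be passed, together with an output directory, to `Seq2SeqTrainingArguments` from the `transformers` library; the LoRA adapters would be configured separately.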
Evaluation
Testing Data & Metrics
- Dataset: 5,000 colloquial Tamil-English sentence pairs
- Evaluation Metrics:
- BLEU Score: 28.5
- METEOR Score: 34.1
- TER: 41.2
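To make the BLEU figure above easier to interpret, here is a minimal sketch of corpus-level BLEU in plain Python. The card does not say which tool or tokenization produced its scores, so this is illustrative only: `ngrams` and `corpus_bleu` are names introduced here, tokenization is assumed to be whitespace splitting, and one reference per hypothesis is assumed. Results from standard tools will differ in detail.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumption: whitespace tokenization, no smoothing.
    """
    matched = [0] * max_n  # clipped n-gram matches, per order
    total = [0] * max_n    # hypothesis n-gram counts, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())

    # An order with zero matches makes the geometric mean zero.
    if hyp_len == 0 or min(matched) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty: penalize hypotheses shorter than their references.
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


sentence = "Bus stop pakkathula pharmacy iruku."
print(corpus_bleu([sentence], [sentence]))  # 100.0 for an exact match
```

Because there is no smoothing, a corpus with no matching 4-grams scores 0, which is why BLEU is normally reported over a full test set rather than single sentences.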
Example Outputs
English | Tamil (Colloquial) |
---|---|
The pharmacy is near the bus stop. | Bus stop pakkathula pharmacy iruku. |
Take this medicine after food. | Food saptadhukku apram intha medicine eduthukungo. |
Train tickets for tomorrow are available. | Naalaikku train tickets available iruku. |
Bias, Risks, and Limitations
- Dialectal Bias: The model is trained on a specific style of spoken Tamil and may not generalize to all Tamil dialects.
- Data Noise: Some AI-generated content may not be fully accurate.
- Context Sensitivity: The model struggles with complex sentence structures and ambiguous meanings.
How to Contribute
- If you find issues or have improvements, feel free to open a GitHub issue or contribute data via Hugging Face!
Contact: Shrav20 via Hugging Face discussions.
Citation
If you use this model, please cite:
@misc{shrav20colloquial,
author = {Shrav20},
title = {Colloquial Tamil Machine Translation Model},
year = {2025},
publisher = {Hugging Face},
url = {https://huggingface.co/Shrav20/colloquial-tamil-mt}
}
Future Improvements
- More diverse datasets
- Better handling of Tamil-English code-mixing
- Improved sentence fluency with RLHF (Reinforcement Learning from Human Feedback)