Model Card for Shrav20/colloquial-tamil-mt

📌 Model Summary

This is a machine translation (MT) model that translates between English and colloquial Tamil in both directions. Unlike most Tamil MT models, which target formal Tamil, it produces the natural spoken Tamil used in everyday conversation.

📊 Model Details

  • Developed by: Shrav20
  • Funded by: Independent
  • Shared by: Shrav20
  • Model Type: Sequence-to-Sequence (Seq2Seq) Translation
  • Architecture: Based on M2M100 (Facebook’s multilingual MT model), fine-tuned for colloquial Tamil.
  • Languages Supported:
    • English → Tamil (Colloquial)
    • Tamil (Colloquial) → English
  • License: MIT
  • Fine-tuned from: facebook/m2m100_418M

🛠 Model Usage

🔹 Direct Use

You can use this model for colloquial Tamil translation in conversational AI, subtitles, and chatbots.

Example Code:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shrav20/colloquial-tamil-mt")
model = AutoModelForSeq2SeqLM.from_pretrained("Shrav20/colloquial-tamil-mt")

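def translate(text, src_lang="en", tgt_lang="ta"):
    # M2M100 reads the source language from the tokenizer and needs the
    # target language forced as the first generated token.
    tokenizer.src_lang = src_lang
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    output = model.generate(**inputs, forced_bos_token_id=tokenizer.get_lang_id(tgt_lang))
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example Translation
print(translate("The pharmacy is near the bus stop."))  # Output: "Bus stop pakkathula pharmacy iruku."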

📖 Training Details

📌 Training Dataset

  • This model is fine-tuned on the Shrav20/colloquial-tamil dataset.
  • Sources:
    • sangeethat/colloquial
    • AI-generated data
    • Internet-scraped content
    • Manually verified colloquial sentences

🛠 Training Hyperparameters

  • Batch Size: 16
  • Learning Rate: 5e-5
  • Epochs: 3
  • Optimizer: AdamW
  • Precision: fp16 (mixed precision)
  • LoRA Adapters: Enabled for efficient fine-tuning
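As a rough sketch, the settings above could be written as keyword arguments for a Hugging Face training run. The key names follow `transformers.Seq2SeqTrainingArguments`, which is an assumption about the training setup rather than something this card states. The card also does not give the LoRA rank, alpha, or target modules, so they are left out here. The helper `total_optimizer_steps` and the dataset size passed to it are hypothetical and for illustration only.

```python
import math

# Values come from the hyperparameter list above. The key names mirror
# transformers.Seq2SeqTrainingArguments (an assumption, not confirmed by the card).
TRAINING_KWARGS = {
    "per_device_train_batch_size": 16,  # "Batch Size: 16", assumed to be per device
    "learning_rate": 5e-5,
    "num_train_epochs": 3,
    "optim": "adamw_torch",             # AdamW
    "fp16": True,                       # mixed precision; requires a CUDA GPU
}


def total_optimizer_steps(num_examples, batch_size, epochs):
    """Optimizer steps on one device with no gradient accumulation."""
    return math.ceil(num_examples / batch_size) * epochs


# Hypothetical training-set size; the card does not report the real one.
steps = total_optimizer_steps(
    10_000,
    TRAINING_KWARGS["per_device_train_batch_size"],
    TRAINING_KWARGS["num_train_epochs"],
)
print(steps)
```

If the run did use this API, the dictionary could be unpacked into `Seq2SeqTrainingArguments(output_dir=..., **TRAINING_KWARGS)` on a machine with a GPU.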

📊 Evaluation

📌 Testing Data & Metrics

  • Dataset: 5,000 colloquial Tamil-English sentence pairs
  • Evaluation Metrics:
    • BLEU Score: 28.5
    • METEOR Score: 34.1
    • TER: 41.2
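When reading these numbers, note that higher is better for BLEU and METEOR, while lower is better for TER, which counts the edits needed to turn a hypothesis into its reference. The sketch below is a simplified, pure-Python approximation of TER for intuition only: it uses word-level edit distance (insertions, deletions, substitutions) and omits the block-shift operation of full TER. It is not the evaluation script behind the scores above, and the function names are hypothetical.

```python
def word_edit_distance(hyp_tokens, ref_tokens):
    """Levenshtein distance over tokens (insert, delete, substitute)."""
    prev = list(range(len(ref_tokens) + 1))
    for i, hyp_tok in enumerate(hyp_tokens, start=1):
        curr = [i]
        for j, ref_tok in enumerate(ref_tokens, start=1):
            cost = 0 if hyp_tok == ref_tok else 1
            curr.append(min(
                prev[j] + 1,         # drop a hypothesis token
                curr[j - 1] + 1,     # add a missing reference token
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def simplified_ter(hypothesis, reference):
    """Edits per reference word, as a percentage. No block shifts (unlike full TER)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not ref:
        raise ValueError("reference must contain at least one token")
    return 100.0 * word_edit_distance(hyp, ref) / len(ref)


# One missing word against a five-word reference: 1 edit / 5 words.
print(simplified_ter(
    "Naalaikku train tickets iruku.",
    "Naalaikku train tickets available iruku.",
))  # → 20.0
```

For reportable scores, an established implementation such as sacreBLEU is the safer choice, since tokenization and shift handling affect the result.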

📌 Example Outputs

| English | Tamil (Colloquial) |
|---|---|
| The pharmacy is near the bus stop. | Bus stop pakkathula pharmacy iruku. |
| Take this medicine after food. | Food saptadhukku apram intha medicine eduthukungo. |
| Train tickets for tomorrow are available. | Naalaikku train tickets available iruku. |

🚨 Bias, Risks, and Limitations

  • Dialectal Bias: The model is trained on a specific style of spoken Tamil and may not generalize to all Tamil dialects.
  • Data Noise: Some AI-generated content may not be fully accurate.
  • Context Sensitivity: The model struggles with complex sentence structures and ambiguous meanings.

💡 How to Contribute

  • If you find issues or have improvements, feel free to open a GitHub issue or contribute data via Hugging Face!

📩 Contact: Shrav20 via Hugging Face discussions.


πŸ“ Citation

If you use this model, please cite:

@misc{shrav20colloquial,
  author = {Shrav20},
  title = {Colloquial Tamil Machine Translation Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/Shrav20/colloquial-tamil-mt}
}

🌱 Future Improvements

✅ More diverse datasets
✅ Better handling of Tamil-English code-mixing
✅ Improved sentence fluency with RLHF (Reinforcement Learning from Human Feedback)
