
Dhwani - Indic Speech-to-Text Translation


Introduction

Dhwani enables speech-to-text translation for Indic languages. It supports translation from an Indic language (X) → English and vice versa.

Model Summary

The current model is trained using the SALMONN architecture.

Pre-Training

  • Speech Encoder: Utilizes the Whisper model's speech encoder to process speech inputs.
  • Audio Encoder: Employs the BEATs audio encoder for non-speech audio inputs, such as environmental sounds and music.
  • Connection Module: Uses the Window-Level Query Transformer (Q-Former) to bridge the audio encoders and the Large Language Model (LLM); a minimal sketch follows this list.
  • Large Language Model (LLM): The Krutrim LLM receives the processed tokens and reasons over the audio-derived information.
  • Adaptation Mechanism: Low-Rank Adaptation (LoRA) is applied to fine-tune the LLM and align the audio inputs with the model's outputs.
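
The snippet below is a minimal, hypothetical sketch of the window-level Q-Former wiring described above, assuming Whisper-sized features (1280-dim), a 4096-dim LLM embedding space, and one query per window; none of these choices are confirmed details of Dhwani.

    # Hypothetical sketch of the window-level Q-Former connection module.
    # Dimensions, window size, and query count are assumptions, not Dhwani's.
    import torch
    import torch.nn as nn

    class WindowQFormer(nn.Module):
        """Compress each window of encoder frames into fixed query tokens."""
        def __init__(self, d_audio=1280, d_llm=4096, n_queries=1, window=17):
            super().__init__()
            self.window = window
            self.queries = nn.Parameter(torch.randn(n_queries, d_audio) * 0.02)
            self.attn = nn.MultiheadAttention(d_audio, num_heads=8, batch_first=True)
            self.proj = nn.Linear(d_audio, d_llm)    # map into the LLM embedding space

        def forward(self, feats):                    # feats: (B, T, d_audio)
            B, T, D = feats.shape
            pad = (-T) % self.window                 # pad T to a multiple of window
            feats = nn.functional.pad(feats, (0, 0, 0, pad))
            windows = feats.view(-1, self.window, D)
            q = self.queries.unsqueeze(0).expand(windows.size(0), -1, -1)
            out, _ = self.attn(q, windows, windows)  # queries attend within each window
            return self.proj(out).reshape(B, -1, self.proj.out_features)

    # Encoder features (Whisper; SALMONN fuses BEATs features alongside) become
    # LLM-space tokens that are prepended to the text prompt embeddings.
    tokens = WindowQFormer()(torch.randn(1, 100, 1280))
    print(tokens.shape)                              # torch.Size([1, 6, 4096])

A real Q-Former stacks several cross-attention blocks; a single attention layer is used here only to show the window-level token compression.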

Post-Training

To adapt the Q-Former and LoRA weights, we used the techniques described in the IndicST paper. In addition to the IndicST translation dataset, we used in-house translation data to further improve translation quality.
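
For reference, the sketch below shows standard Low-Rank Adaptation applied to one linear layer; the rank, scaling, and target modules are illustrative assumptions, not Dhwani's actual LoRA configuration.

    # Standard LoRA sketch: the pretrained weight stays frozen while the
    # low-rank factors A and B are trained (assumed setup, not Dhwani's).
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False              # freeze pretrained weights
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as a no-op
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(4096, 4096))
    print(layer(torch.randn(2, 4096)).shape)         # torch.Size([2, 4096])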

Model Downloads

  1. Download Whisper Large v2 (a hedged download sketch follows this list).
  2. Download the fine-tuned BEATs_iter3+ (AS2M) (cpt2) checkpoint.
  3. Download the Krutrim LLM.
  4. Download the Dhwani checkpoint (ckpt).
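
The sketch below shows one way to fetch the checkpoints with huggingface_hub; openai/whisper-large-v2 and krutrim-ai-labs/Dhwani are real repositories, while the BEATs and Krutrim LLM locations are not stated here and are left as placeholders.

    # Hedged download sketch using huggingface_hub. Only the two repo IDs below
    # are known to exist; BEATs and the LLM must come from their own sources.
    from huggingface_hub import snapshot_download

    whisper_path = snapshot_download("openai/whisper-large-v2")
    ckpt_path = snapshot_download("krutrim-ai-labs/Dhwani")  # gated: accept the terms first
    # beats_path / llama_path: placeholders, e.g.
    # beats_path = "/path/to/BEATs_iter3+_AS2M_cpt2.pt"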

Evaluation Results

En → Indic (X) BLEU Scores:

Language Pair    BLEU Score
en → hin            57.7
en → guj            44.3
en → mar            43.3
en → ben            49.0
en → tam            47.0
en → tel            40.8
en → mal            39.0
en → kan            47.0
Average             46.0

Indic (X) → En BLEU Scores:

Language Pair    BLEU Score
hin → en            35.7
guj → en            34.6
mar → en            33.2
ben → en            19.2
tam → en            25.4
tel → en            17.4
mal → en            38.9
kan → en            28.0
Average             30.0
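
This card does not state which scorer produced the tables above; as an assumed reference point, corpus-level BLEU is often computed with sacrebleu:

    # Hypothetical BLEU computation with sacreBLEU; the toolkit and tokenizer
    # actually used for the tables above are not specified in this card.
    import sacrebleu

    hyps = ["this is a sample translation"]           # model outputs
    refs = [["this is a sample translation"]]         # one inner list per reference set
    print(sacrebleu.corpus_bleu(hyps, refs).score)    # 100.0 for an exact match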

API Platform

Visit Dhwani Online to access the model via the web interface.

How to Run Inference in the CLI

  1. Create the conda environment: conda create -n dhwani_env python=3.9.17.
  2. Install the remaining requirements into that environment: pip install -r requirements.txt (our environment uses Python 3.9.17).
  3. Add the Whisper Large v2 path to whisper_path.
  4. Add the fine-tuned BEATs_iter3+ (AS2M) (cpt2) path to beats_path.
  5. Add the Krutrim LLM path to llama_path.
  6. Add the Krutrim checkpoint path to ckpt (a hypothetical config excerpt follows this list).
  7. Run python3 cli_inference.py --cfg-path configs/decode_config.yaml on an A100-SXM-80GB. You can then enter a wav_path and a prompt. Enjoy!
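
As referenced in step 6, the excerpt below is a hypothetical configs/decode_config.yaml fragment; the key names come from steps 3–6, but the actual file may nest or name them differently.

    # Hypothetical decode_config.yaml excerpt (key names from steps 3-6;
    # the real file's structure may differ).
    whisper_path: /path/to/whisper-large-v2
    beats_path: /path/to/BEATs_iter3+_AS2M_cpt2.pt
    llama_path: /path/to/krutrim-llm
    ckpt: /path/to/dhwani_ckpt.pth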

How to Infer the Model

  1. Same as steps 1–4 of How to Run Inference in the CLI.
  2. Add the Krutrim checkpoint path to ckpt.
  3. Run python3 infer.py --cfg-path configs/decode_config.yaml on an A100-SXM-80GB.

License

This code repository and the model weights are licensed under the Krutrim Community License.

Citation

@inproceedings{sanket2025indicst,
  title={{IndicST}: Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models},
  author={Shah, Sanket and Saxena, Kavya Ranjan and Bharadwaj, Kancharana Manideep and Adavanne, Sharath and Adiga, Nagaraj},
  booktitle={Proc. ICASSP},
  year={2025}
}

Contact

Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.
