
Dhwani - Indic Speech-to-Text Translation


Introduction

Dhwani enables speech-to-text translation for Indic languages. It supports translation from an Indic language (X) → English and vice versa.

Model Summary

The current model is trained using the SALMONN architecture.

Pre-Training

  • Speech Encoder: Utilizes the Whisper model's speech encoder to process speech inputs.
  • Audio Encoder: Employs the BEATs audio encoder for non-speech audio inputs, such as environmental sounds and music.
  • Connection Module: Uses the Window-Level Query Transformer (Q-Former) to bridge the audio encoders and the Large Language Model (LLM); a minimal sketch follows this list.
  • Large Language Model (LLM): The Krutrim LLM receives the processed tokens and reasons over the audio-derived information.
  • Adaptation Mechanism: Low-Rank Adaptation (LoRA) is applied to fine-tune the LLM and align the audio inputs with the model's outputs.
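
The snippet below is a minimal, hypothetical sketch of the window-level Q-Former wiring described above, assuming Whisper-sized features (1280-dim), a 4096-dim LLM embedding space, and one query per window; none of these choices are confirmed details of Dhwani.

    # Hypothetical sketch of the window-level Q-Former connection module.
    # Dimensions, window size, and query count are assumptions, not Dhwani's.
    import torch
    import torch.nn as nn

    class WindowQFormer(nn.Module):
        """Compress each window of encoder frames into fixed query tokens."""
        def __init__(self, d_audio=1280, d_llm=4096, n_queries=1, window=17):
            super().__init__()
            self.window = window
            self.queries = nn.Parameter(torch.randn(n_queries, d_audio) * 0.02)
            self.attn = nn.MultiheadAttention(d_audio, num_heads=8, batch_first=True)
            self.proj = nn.Linear(d_audio, d_llm)    # map into the LLM embedding space

        def forward(self, feats):                    # feats: (B, T, d_audio)
            B, T, D = feats.shape
            pad = (-T) % self.window                 # pad T to a multiple of window
            feats = nn.functional.pad(feats, (0, 0, 0, pad))
            windows = feats.view(-1, self.window, D)
            q = self.queries.unsqueeze(0).expand(windows.size(0), -1, -1)
            out, _ = self.attn(q, windows, windows)  # queries attend within each window
            return self.proj(out).reshape(B, -1, self.proj.out_features)

    # Encoder features (Whisper; SALMONN fuses BEATs features alongside) become
    # LLM-space tokens that are prepended to the text prompt embeddings.
    tokens = WindowQFormer()(torch.randn(1, 100, 1280))
    print(tokens.shape)                              # torch.Size([1, 6, 4096])

A real Q-Former stacks several cross-attention blocks; a single attention layer is used here only to show the window-level token compression.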

Post-Training

To adapt the Q-Former and LoRA weights, we used the techniques described in the IndicST paper. In addition to the IndicST translation dataset, we used in-house translation data to further improve translation quality.
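
For reference, the sketch below shows standard Low-Rank Adaptation applied to one linear layer; the rank, scaling, and target modules are illustrative assumptions, not Dhwani's actual LoRA configuration.

    # Standard LoRA sketch: the pretrained weight stays frozen while the
    # low-rank factors A and B are trained (assumed setup, not Dhwani's).
    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False              # freeze pretrained weights
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: starts as a no-op
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(4096, 4096))
    print(layer(torch.randn(2, 4096)).shape)         # torch.Size([2, 4096])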

Model Downloads

  1. Download Whisper Large v2 (a hedged download sketch follows this list).
  2. Download the fine-tuned BEATs_iter3+ (AS2M) (cpt2) checkpoint.
  3. Download the Krutrim LLM.
  4. Download the Dhwani checkpoint (ckpt).
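
The sketch below shows one way to fetch the checkpoints with huggingface_hub; openai/whisper-large-v2 and krutrim-ai-labs/Dhwani are real repositories, while the BEATs and Krutrim LLM locations are not stated here and are left as placeholders.

    # Hedged download sketch using huggingface_hub. Only the two repo IDs below
    # are known to exist; BEATs and the LLM must come from their own sources.
    from huggingface_hub import snapshot_download

    whisper_path = snapshot_download("openai/whisper-large-v2")
    ckpt_path = snapshot_download("krutrim-ai-labs/Dhwani")  # gated: accept the terms first
    # beats_path / llama_path: placeholders, e.g.
    # beats_path = "/path/to/BEATs_iter3+_AS2M_cpt2.pt"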

Evaluation Results

En → Indic (X) BLEU Scores:

Language Pair    BLEU Score
en → hin            57.7
en → guj            44.3
en → mar            43.3
en → ben            49.0
en → tam            47.0
en → tel            40.8
en → mal            39.0
en → kan            47.0
Average             46.0

Indic (X) → En BLEU Scores:

Language Pair    BLEU Score
hin → en            35.7
guj → en            34.6
mar → en            33.2
ben → en            19.2
tam → en            25.4
tel → en            17.4
mal → en            38.9
kan → en            28.0
Average             30.0
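
This card does not state which scorer produced the tables above; as an assumed reference point, corpus-level BLEU is often computed with sacrebleu:

    # Hypothetical BLEU computation with sacreBLEU; the toolkit and tokenizer
    # actually used for the tables above are not specified in this card.
    import sacrebleu

    hyps = ["this is a sample translation"]           # model outputs
    refs = [["this is a sample translation"]]         # one inner list per reference set
    print(sacrebleu.corpus_bleu(hyps, refs).score)    # 100.0 for an exact match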

API Platform

Visit Dhwani Online to access the model via the web interface.

How to Run Inference in the CLI

  1. Create the conda environment: conda create -n dhwani_env python=3.9.17.
  2. Install the remaining requirements into that environment: pip install -r requirements.txt (our environment uses Python 3.9.17).
  3. Add the Whisper Large v2 path to whisper_path.
  4. Add the fine-tuned BEATs_iter3+ (AS2M) (cpt2) path to beats_path.
  5. Add the Krutrim LLM path to llama_path.
  6. Add the Krutrim checkpoint path to ckpt (a hypothetical config excerpt follows this list).
  7. Run python3 cli_inference.py --cfg-path configs/decode_config.yaml on an A100-SXM-80GB. You can then enter a wav_path and a prompt. Enjoy!
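
As referenced in step 6, the excerpt below is a hypothetical configs/decode_config.yaml fragment; the key names come from steps 3–6, but the actual file may nest or name them differently.

    # Hypothetical decode_config.yaml excerpt (key names from steps 3-6;
    # the real file's structure may differ).
    whisper_path: /path/to/whisper-large-v2
    beats_path: /path/to/BEATs_iter3+_AS2M_cpt2.pt
    llama_path: /path/to/krutrim-llm
    ckpt: /path/to/dhwani_ckpt.pth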

How to Infer the Model

  1. Same as steps 1–4 of How to Run Inference in the CLI.
  2. Add the Krutrim checkpoint path to ckpt.
  3. Run python3 infer.py --cfg-path configs/decode_config.yaml on an A100-SXM-80GB.

License

This code repository and the model weights are licensed under the Krutrim Community License.

Citation

@inproceedings{sanket2025indicst,
  title={{IndicST}: Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models},
  author={Shah, Sanket and Saxena, Kavya Ranjan and Bharadwaj, Kancharana Manideep and Adavanne, Sharath and Adiga, Nagaraj},
  booktitle={Proc. ICASSP},
  year={2025}
}

Contact

Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.
