Dhwani - Indic Speech-to-Text Translation
Introduction
Dhwani enables speech-to-text translation for Indic languages. It supports translation from an Indic language (X) → English and vice versa.
Model Summary
The current model is trained using the SALMONN architecture.
Pre-Training
- Speech Encoder: Utilizes the Whisper model's speech encoder to process speech inputs.
- Audio Encoder: Employs the BEATs audio encoder for non-speech audio inputs, such as environmental sounds and music.
- Connection Module: Uses the Window-Level Query Transformer (Q-Former) to bridge the audio encoders and the Large Language Model (LLM).
- Large Language Model (LLM): The Krutrim LLM receives the resulting tokens, which carry the audio-derived information.
- Adaptation Mechanism: Low-Rank Adaptation (LoRA) is applied to fine-tune the LLM so that its outputs align with the audio inputs (a minimal sketch of this pipeline follows the list).
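Below is a minimal, illustrative PyTorch sketch of this pipeline. The identity modules stand in for the pretrained Whisper and BEATs encoders, the window size, dimensions, and projection width are assumed values, and the two encoders' features are summed rather than concatenated purely to keep the sketch short; it is not Dhwani's actual implementation.

```python
# Illustrative SALMONN-style pipeline; all names, sizes, and stand-in
# modules here are assumptions for readability.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowQFormer(nn.Module):
    """Window-level Q-Former stand-in: cross-attends a fixed set of learned
    queries to each window of audio frames, compressing the sequence."""
    def __init__(self, d_model=768, n_queries=1, window=17):
        super().__init__()
        self.window = window
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

    def forward(self, feats):                    # feats: (B, T, D)
        B, T, D = feats.shape
        pad = (-T) % self.window                 # pad T to a multiple of window
        feats = F.pad(feats, (0, 0, 0, pad))
        n_win = feats.shape[1] // self.window
        windows = feats.reshape(B * n_win, self.window, D)
        q = self.queries.unsqueeze(0).expand(B * n_win, -1, -1)
        out, _ = self.attn(q, windows, windows)  # queries attend to each window
        return out.reshape(B, -1, D)             # (B, n_win * n_queries, D)

# Frozen pretrained encoders in the real model; identity stand-ins here.
speech_encoder = nn.Identity()                   # e.g. Whisper encoder
audio_encoder = nn.Identity()                    # e.g. BEATs encoder

qformer = WindowQFormer()
to_llm = nn.Linear(768, 4096)                    # project to assumed LLM width

wav_feats = torch.randn(2, 100, 768)             # fake features for two clips
# SALMONN concatenates the two encoders' features; summed here for brevity.
fused = speech_encoder(wav_feats) + audio_encoder(wav_feats)
audio_tokens = to_llm(qformer(fused))            # (2, n_tokens, 4096); these are
print(audio_tokens.shape)                        # prepended to the text embeddings
```

The window-level Q-Former is the key design choice here: instead of compressing an entire clip into a fixed number of tokens, it emits a few tokens per window, so the number of audio tokens given to the LLM scales with audio duration.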
Post-Training
To adapt the Q-Former and LoRA weights, we used the techniques described in the IndicST paper. Along with the IndicST translation dataset, we also used in-house translation data to further improve translation performance.
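As an illustration of the LoRA side of this stage, the sketch below attaches adapters with Hugging Face PEFT; the rank, alpha, dropout, and target modules are placeholder assumptions rather than Dhwani's actual settings.

```python
# Illustrative LoRA setup with Hugging Face PEFT; hyperparameters and
# target modules are assumptions, not Dhwani's actual configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

llm = AutoModelForCausalLM.from_pretrained("krutrim-ai-labs/Krutrim-1-instruct")
lora_cfg = LoraConfig(
    r=8,                                  # assumed adapter rank
    lora_alpha=32,                        # assumed scaling factor
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)       # base weights frozen; adapters train
llm.print_trainable_parameters()
```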
Model Downloads
- Download Whisper large-v2.
- Download the fine-tuned BEATs_iter3+ (AS2M) (cpt2) checkpoint.
- Download the Krutrim LLM.
- Download the Dhwani checkpoint (see the download sketch after this list).
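The sketch below shows one way to fetch the Hub-hosted pieces with `huggingface_hub`. The Whisper repo ID is the standard public one and the Krutrim IDs follow this card's naming, but verify them against the links above; BEATs is distributed outside the Hub.

```python
# Hedged download sketch using huggingface_hub's snapshot_download.
# Verify repo IDs against the links in this section before use.
from huggingface_hub import snapshot_download

whisper_dir = snapshot_download("openai/whisper-large-v2")         # Whisper large-v2
llm_dir = snapshot_download("krutrim-ai-labs/Krutrim-1-instruct")  # Krutrim LLM
dhwani_dir = snapshot_download("krutrim-ai-labs/Dhwani")           # Dhwani checkpoint
# BEATs_iter3+ (AS2M) (cpt2) is released via the BEATs project page,
# not the Hub, so download it manually and note its local path.
```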
Evaluation Results
En → Indic (X) BLEU Scores:

| Language Pair | BLEU Score |
|---|---|
| en → hin | 57.7 |
| en → guj | 44.3 |
| en → mar | 43.3 |
| en → ben | 49.0 |
| en → tam | 47.0 |
| en → tel | 40.8 |
| en → mal | 39.0 |
| en → kan | 47.0 |
| Average | 46.0 |
Indic (X) → En BLEU Scores:

| Language Pair | BLEU Score |
|---|---|
| hin → en | 35.7 |
| guj → en | 34.6 |
| mar → en | 33.2 |
| ben → en | 19.2 |
| tam → en | 25.4 |
| tel → en | 17.4 |
| mal → en | 38.9 |
| kan → en | 28.0 |
| Average | 30.0 |
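The numbers above are corpus-level BLEU scores. The exact scorer and tokenization are not specified on this card; as an illustration of how such scores are typically computed, here is a minimal sacreBLEU sketch with toy strings:

```python
# Minimal BLEU computation with sacreBLEU; the strings are toy examples,
# not items from the Dhwani evaluation set.
import sacrebleu

hyps = ["the weather in mumbai is pleasant today"]    # model translations
refs = [["the weather in mumbai is pleasant today"]]  # one reference stream
bleu = sacrebleu.corpus_bleu(hyps, refs)
print(f"BLEU = {bleu.score:.1f}")                     # corpus-level score
```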
API Platform
Visit Dhwani Online to access the model via the web interface.
How to run inference in the CLI
1. Create the environment with `conda create -n dhwani_env python=3.9.17`. Our environment uses Python 3.9.17; the other required packages can be installed with `pip install -r requirements.txt`.
2. Add the Whisper large-v2 path to `whisper_path`.
3. Add the fine-tuned BEATs_iter3+ (AS2M) (cpt2) path to `beats_path`.
4. Add the Krutrim LLM path to `llama_path`.
5. Add the Krutrim checkpoint path to `ckpt` (see the config sketch after this list).
6. Run `python3 cli_inference.py --cfg-path configs/decode_config.yaml` on an A100-SXM-80GB. You can now input a `wav_path` and a `prompt`. Enjoy yourself!
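For reference, the four paths above live in `configs/decode_config.yaml`. The snippet below is only a sketch of what that file might look like; the key names come from the steps above, while the nesting and placeholder paths are assumptions.

```yaml
# Illustrative sketch of configs/decode_config.yaml; only the four key
# names are from the steps above, nesting and paths are placeholders.
model:
  whisper_path: /models/whisper-large-v2         # Whisper large-v2
  beats_path: /models/BEATs_iter3_plus_AS2M.pt   # fine-tuned BEATs checkpoint
  llama_path: /models/Krutrim-1-instruct         # Krutrim LLM
  ckpt: /models/dhwani.pth                       # Dhwani Q-Former + LoRA weights
```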
How to infer the model
1. Same as steps 1-4 of "How to run inference in the CLI".
2. Add the Krutrim checkpoint path to `ckpt`.
3. Run `python3 infer.py --cfg-path configs/decode_config.yaml` on an A100-SXM-80GB.
License
This code repository and the model weights are licensed under the Krutrim Community License.
Citation
@inproceedings{sanket2025IndicST,
  title={{IndicST}: Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models},
  author={Sanket Shah and Kavya Ranjan Saxena and Kancharana Manideep Bharadwaj and Sharath Adavanne and Nagaraj Adiga},
  booktitle={Proc. ICASSP},
  year={2025}
}
Contact
Contributions are welcome! If you have any improvements or suggestions, feel free to submit a pull request on GitHub.