speechbrain
/

noisy-whisper-rescuespeech

 ---
+language:
+- de
+thumbnail: null
+pipeline_tag: automatic-speech-recognition
+tags:
+- whisper
+- pytorch
+- speechbrain
+- Transformer
 license: apache-2.0
+datasets:
+- RescueSpeech
+metrics:
+- wer
+- sisnri
+- sdri
+- pesq
+- stoi
+model-index:
+- name: noisy-whisper-rescuespeech
+  results:
+  - task:
+      name: Noise Robust Automatic Speech Recognition
+      type: noise-robust-automatic-speech-recognition
+    dataset:
+      name: RescueSpeech
+      type: zenodo.org/record/8077622
+      config: de
+      split: test
+      args:
+        language: de
+    metrics:
+    - name: Test WER
+      type: wer
+      value: '24.20'
+    - name: Test PESQ
+      type: pesq
+      value: '2.085'
+    - name: Test SI-SNRi
+      type: si-snri
+      value: '7.334'
+    - name: Test SI-SDRi
+      type: si-sdri
+      value: '7.871'
 ---
+# Noise-robust speech recognition with jointly trained SepFormer speech enhancement and Whisper ASR on RescueSpeech
+This repository provides all the necessary tools to perform noise-robust automatic speech
+recognition with a simple combination of a speech enhancement model (**SepFormer**) and a speech recognizer (**Whisper**).
+The two models are first fine-tuned individually on the RescueSpeech dataset and then trained jointly so that the combined system handles noise interference effectively. For a better experience, we encourage you to learn more about
+[SpeechBrain](https://speechbrain.github.io).
+The model achieves the following performance on the RescueSpeech test set:
+| Release | SI-SNRi | SDRi | PESQ | STOI | WER | GPUs |
+|:-------------:|:--------------:|:--------------:|:--------:|:--------------:|:--------:|:--------:|
+| 07-11-23 | 7.334 | 7.871 | 2.085 | 0.857 | 24.20 | 1xA100 80 GB |
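
For readers unfamiliar with the WER column, the following minimal sketch shows how a word error rate is typically computed: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are made up for illustration; this is not the evaluation code used to produce the numbers above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent for one reference/hypothesis pair."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Hypothetical example: one of five reference words is dropped.
print(word_error_rate(
    "wir brauchen sofort einen rettungswagen",
    "wir brauchen einen rettungswagen",
))  # 20.0
```
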
+## Pipeline description
+- The enhancement system consists of a SepFormer model.
+  - The model is first trained on the Microsoft-DNS dataset and subsequently fine-tuned on the RescueSpeech dataset.
+  - The enhanced utterances are fed to the ASR model.
+- The ASR system consists of Whisper encoder-decoder blocks:
+  - The pretrained Whisper-large-v2 encoder is frozen.
+  - The pretrained Whisper tokenizer is used.
+  - A pretrained Whisper-large-v2 decoder ([openai/whisper-large-v2](https://huggingface.co/openai/whisper-large-v2)) is fine-tuned on the RescueSpeech dataset.
+  - The final acoustic representation is passed to a greedy decoder.
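
The greedy decoding step mentioned in the list above can be sketched in a few lines. This is only an illustration of the technique, not SpeechBrain's implementation: `step_fn`, `toy_step`, and the token ids are hypothetical stand-ins for the Whisper decoder and tokenizer.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    step_fn: Callable[[List[int]], Sequence[float]],
    bos_id: int,
    eos_id: int,
    max_len: int = 50,
) -> List[int]:
    """Pick the highest-scoring token at every step until EOS or max_len."""
    tokens = [bos_id]
    for _ in range(max_len):
        scores = step_fn(tokens)  # one score per vocabulary entry
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the BOS token


def toy_step(prefix: List[int]) -> List[float]:
    """Hypothetical decoder over a 4-token vocabulary: emits 2, 3, then EOS (1)."""
    script = [2, 3, 1]
    target = script[min(len(prefix) - 1, len(script) - 1)]
    return [1.0 if i == target else 0.0 for i in range(4)]


print(greedy_decode(toy_step, bos_id=0, eos_id=1))  # [2, 3]
```

Greedy decoding keeps only the single best token at each step, which makes it cheap compared with beam search.
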
+The system is trained with recordings sampled at 16 kHz (single channel).
+When you call `transcribe_file`, the code automatically normalizes your audio (resampling and mono channel selection) if needed.
+## Install SpeechBrain
+First, install transformers and SpeechBrain with the following command:
+```bash
+pip install speechbrain transformers==4.28.0
+```
+We encourage you to read our tutorials and learn more about
+[SpeechBrain](https://speechbrain.github.io).
+### Transcribing your own audio files (in German)
+```python
+from speechbrain.pretrained import WhisperASR
+asr_model = WhisperASR.from_hparams(source="speechbrain/rescuespeech_whisper", savedir="pretrained_models/rescuespeech_whisper")
+asr_model.transcribe_file("speechbrain/rescuespeech_whisper/example_de.wav")
+```
+### Inference on GPU
+To perform inference on the GPU, add `run_opts={"device":"cuda"}` when calling the `from_hparams` method.
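
As a sketch of how the `run_opts` argument might be wired in, the snippet below picks `"cuda"` only when a GPU is actually available and falls back to the CPU otherwise. The helpers `build_run_opts` and `load_asr_model` are hypothetical names introduced for illustration; the `source` and `savedir` values are copied from the transcription example in this card.

```python
def build_run_opts(prefer_gpu: bool = True) -> dict:
    """Return run_opts for from_hparams, falling back to the CPU."""
    has_cuda = False
    if prefer_gpu:
        try:
            import torch  # imported lazily so the helper works without torch

            has_cuda = torch.cuda.is_available()
        except ImportError:
            has_cuda = False
    return {"device": "cuda" if has_cuda else "cpu"}


def load_asr_model(prefer_gpu: bool = True):
    """Load the ASR model on the selected device (downloads the model)."""
    from speechbrain.pretrained import WhisperASR

    return WhisperASR.from_hparams(
        source="speechbrain/rescuespeech_whisper",
        savedir="pretrained_models/rescuespeech_whisper",
        run_opts=build_run_opts(prefer_gpu),
    )
```

Calling `load_asr_model()` then returns a model placed on the GPU when one is present.
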
+You can find our training results (models, logs, etc) [here](https://www.dropbox.com/sh/7tryj6n7cfy0poe/AADpl4b8rGRSnoQ5j6LCj9tua?dl=0).
+### Limitations
+The SpeechBrain team does not provide any warranty on the performance achieved by this model when used on other datasets.
+### Referencing SpeechBrain
+```bibtex
+@misc{SB2021,
+    author = {Ravanelli, Mirco and Parcollet, Titouan and Rouhe, Aku and Plantinga, Peter and Rastorgueva, Elena and Lugosch, Loren and Dawalatabad, Nauman and Ju-Chieh, Chou and Heba, Abdel and Grondin, Francois and Aris, William and Liao, Chien-Feng and Cornell, Samuele and Yeh, Sung-Lin and Na, Hwidong and Gao, Yan and Fu, Szu-Wei and Subakan, Cem and De Mori, Renato and Bengio, Yoshua },
+    title = {SpeechBrain},
+    year = {2021},
+    publisher = {GitHub},
+    journal = {GitHub repository},
+    howpublished = {\url{https://github.com/speechbrain/speechbrain}},
+}
+```
+### Referencing RescueSpeech
+```bibtex
+@misc{sagar2023rescuespeech,
+    title={RescueSpeech: A German Corpus for Speech Recognition in Search and Rescue Domain},
+    author={Sangeet Sagar and Mirco Ravanelli and Bernd Kiefer and Ivana Kruijff Korbayova and Josef van Genabith},
+    year={2023},
+    eprint={2306.04054},
+    archivePrefix={arXiv},
+    primaryClass={eess.AS}
+}
+```
+### About SpeechBrain
+SpeechBrain is an open-source and all-in-one speech toolkit. It is designed to be simple, extremely flexible, and user-friendly. Competitive or state-of-the-art performance is obtained in various domains.
+- Website: https://speechbrain.github.io/
+- GitHub: https://github.com/speechbrain/speechbrain
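+```python
+import torch
+from speechbrain.pretrained import SepformerSeparation as Separator
+from speechbrain.pretrained import WhisperASR
+
+# Load the enhancement (SepFormer) and ASR (Whisper) models from the checkpoint directory.
+enh_model = Separator.from_hparams(source="CKPT+2023-06-24+21-49-17+00", savedir="pretrained_models/sepformer_rescuespeech", hparams_file="hyperparams_asr.yaml")
+asr_model = WhisperASR.from_hparams(source="CKPT+2023-06-24+21-49-17+00", savedir="pretrained_models/whisper_rescuespeech", hparams_file="hyperparams_asr.yaml")
+
+# For a custom file, change the path accordingly.
+est_sources = enh_model.separate_file(path="example_rescuespeech16k.wav")
+
+# est_sources has shape [batch, time, sources]; the first source is the enhanced signal.
+# wav_lens holds relative lengths, so 1.0 means the whole signal is used.
+words, tokens = asr_model.transcribe_batch(est_sources[:, :, 0], wav_lens=torch.tensor([1.0]))
+print(words)
+```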

encoder.ckpt ADDED

Binary file (17.3 kB).