---
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- pytorch
- audio
- speech
- automatic-speech-recognition
- whisper
- wav2vec2
model-index:
- name: whisper_medium_fp16_transformers
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (clean)
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (other)
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: mozilla-foundation/common_voice_14_0
      name: Common Voice (14.0) (Hindi)
      config: hi
      split: test
      args:
        language: hi
    metrics:
    - type: wer
      value: 54.97
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 47.86
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 66.83
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 33.16
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 30.23
      name: Test CER
      description: Character Error Rate
widget:
- example_title: Hinglish Sample
  src: https://huggingface.co/devasheeshG/whisper_medium_fp16_transformers/resolve/main/test.wav
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- 'no'
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
---

## Versions:

- CUDA: 12.1
- cuDNN Version: 8.9.2.26_1.0-1_amd64
- tensorflow Version: 2.12.0
- torch Version: 2.1.0.dev20230606+cu12135
- transformers Version: 4.30.2
- accelerate Version: 0.20.3

## Model Benchmarks:

- RAM: 2.8 GB (Original_Model: 5.5 GB)
- VRAM: 1812 MB (Original_Model: 6 GB)
- test.wav: 23 s (multilingual speech, i.e. English+Hindi)

- **Time in seconds for processing on each device**

| Device Name       | float32 (Original) | float16 | CUDA Cores | Tensor Cores |
| ----------------- | ------------------ | ------- | ---------- | ------------ |
| 3060              | 1.7                | 1.1     | 3,584      | 112          |
| 1660 Super        | OOM                | 3.3     | 1,408      | N/A          |
| Colab (Tesla T4)  | 2.8                | 2.2     | 2,560      | 320          |
| Colab (CPU)       | 35                 | N/A     | N/A        | N/A          |
| M1 (CPU)          | -                  | -       | -          | -            |
| M1 (GPU -> 'mps') | -                  | -       | -          | -            |

- **NOTE: Tensor Cores are efficient in mixed-precision calculations**
- **CPU -> `torch.float16` is not supported on CPU (AMD Ryzen 5 3600 or Colab CPU)**
- Punctuation: True

## Model Error Benchmarks:

- **WER: Word Error Rate**
- **MER: Match Error Rate**
- **WIL: Word Information Lost**
- **WIP: Word Information Preserved**
- **CER: Character Error Rate**

### Hindi (test.tsv) [Common Voice 14.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_14_0)

**Test done on RTX 3060 on 2557 Samples**

|                         | WER   | MER   | WIL   | WIP   | CER   |
| ----------------------- | ----- | ----- | ----- | ----- | ----- |
| Original_Model (54 min) | 52.02 | 47.86 | 66.82 | 33.17 | 23.76 |
| This_Model (38 min)     | 54.97 | 47.86 | 66.83 | 33.16 | 30.23 |

### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean)

**Test done on RTX 3060 on __ Samples**

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other)

**Test done on RTX 3060 on __ Samples**

|                | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | -   | -   | -   | -   | -   |
| This_Model     | -   | -   | -   | -   | -   |

- **The 'jiwer' library is used for calculations**

## Code for conversion:

- ### [Will be uploaded to GitHub soon](https://github.com/devasheeshG)

## Usage

This repo contains an `__init__.py` file with all the code needed to use this model.

First, clone this repo and place all the files inside a folder.

### Make sure you have git-lfs installed (https://git-lfs.com)

```bash
git lfs install
git clone https://huggingface.co/devasheeshG/whisper_medium_fp16_transformers
```

**Please try this in a Jupyter notebook**

```python
# Import the model
from whisper_medium_fp16_transformers import Model
```

```python
# Initialise the model
model = Model(
    model_name_or_path='whisper_medium_fp16_transformers',
    cuda_visible_device="0",
    device='cuda',
)
```

```python
# Load audio
audio = model.load_audio('whisper_medium_fp16_transformers/test.wav')
```

```python
# Transcribe (the first transcription takes time)
model.transcribe(audio)
```

## Credits

This is an fp16 version of `openai/whisper-medium`.
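
As a footnote to the Model Error Benchmarks section, the sketch below shows how WER, MER, WIL and WIP relate to word-level hits, substitutions, deletions and insertions. It is a plain-Python illustration of the standard definitions, not the `jiwer` code that produced the tables; the function names `align_counts` and `error_metrics` are made up for this example, and the sample sentences are not taken from any benchmark.

```python
def align_counts(ref_words, hyp_words):
    """Return (hits, substitutions, deletions, insertions) for one
    minimum-edit-distance alignment of hypothesis against reference."""
    n, m = len(ref_words), len(hyp_words)
    # dp[i][j] = (cost, hits, subs, dels, ins) for ref[:i] vs hyp[:j]
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, 0, i, 0)      # only deletions
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, 0, j)      # only insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, h, s, d, k = dp[i - 1][j - 1]
            if ref_words[i - 1] == hyp_words[j - 1]:
                diag = (c, h + 1, s, d, k)          # hit
            else:
                diag = (c + 1, h, s + 1, d, k)      # substitution
            c, h, s, d, k = dp[i - 1][j]
            delete = (c + 1, h, s, d + 1, k)        # reference word dropped
            c, h, s, d, k = dp[i][j - 1]
            insert = (c + 1, h, s, d, k + 1)        # extra hypothesis word
            # Ties are broken in favour of the diagonal move (an assumption;
            # other tools may pick a different optimal alignment).
            dp[i][j] = min(diag, delete, insert, key=lambda t: t[0])
    return dp[n][m][1:]


def error_metrics(reference, hypothesis):
    """Compute WER, MER, WIL and WIP (as fractions, not percentages)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    h, s, d, i = align_counts(ref, hyp)
    errors = s + d + i
    wip = (h / len(ref)) * (h / len(hyp)) if hyp else 0.0
    return {
        "wer": errors / (h + s + d),        # errors per reference word
        "mer": errors / (h + s + d + i),    # errors per aligned position
        "wip": wip,
        "wil": 1.0 - wip,
    }


# 4 hits, 1 substitution (sat -> sit), 1 deletion (the), 0 insertions
metrics = error_metrics("the cat sat on the mat", "the cat sit on mat")
print(metrics)
```

CER follows the same formula as WER with characters in place of words. When several alignments share the minimum edit distance, the split between substitutions and insertion/deletion pairs can differ, so MER, WIL and WIP from this sketch may not match another tool's output exactly.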