---
license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
- pytorch
- audio
- speech
- automatic-speech-recognition
- whisper
- wav2vec2
model-index:
- name: whisper_large_v2_fp16_transformers
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (clean)
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: librispeech_asr
      name: LibriSpeech (other)
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 0
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 0
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 0
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 0
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 0
      name: Test CER
      description: Character Error Rate
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: mozilla-foundation/common_voice_14_0
      name: Common Voice (14.0) (Hindi)
      config: hi
      split: test
      args:
        language: hi
    metrics:
    - type: wer
      value: 44.64
      name: Test WER
      description: Word Error Rate
    - type: mer
      value: 41.69
      name: Test MER
      description: Match Error Rate
    - type: wil
      value: 59.53
      name: Test WIL
      description: Word Information Lost
    - type: wip
      value: 40.46
      name: Test WIP
      description: Word Information Preserved
    - type: cer
      value: 16.80
      name: Test CER
      description: Character Error Rate
widget:
- example_title: Hinglish Sample
  src: https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers/resolve/main/test.wav
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
language:
- en
- zh
- de
- es
- ru
- ko
- fr
- ja
- pt
- tr
- pl
- ca
- nl
- ar
- sv
- it
- id
- hi
- fi
- vi
- he
- uk
- el
- ms
- cs
- ro
- da
- hu
- ta
- "no"
- th
- ur
- hr
- bg
- lt
- la
- mi
- ml
- cy
- sk
- te
- fa
- lv
- bn
- sr
- az
- sl
- kn
- et
- mk
- br
- eu
- is
- hy
- ne
- mn
- bs
- kk
- sq
- sw
- gl
- mr
- pa
- si
- km
- sn
- yo
- so
- af
- oc
- ka
- be
- tg
- sd
- gu
- am
- yi
- lo
- uz
- fo
- ht
- ps
- tk
- nn
- mt
- sa
- lb
- my
- bo
- tl
- mg
- as
- tt
- haw
- ln
- ha
- ba
- jw
- su
---
## Versions:
- CUDA: 12.1
- cuDNN Version: 8.9.2.26_1.0-1_amd64
- tensorflow Version: 2.12.0
- torch Version: 2.1.0.dev20230606+cu12135
- transformers Version: 4.30.2
- accelerate Version: 0.20.3
## Model Benchmarks:
- RAM: 3 GB (Original_Model: 6GB)
- VRAM: 3.7 GB (Original_Model: 11GB)
- test.wav: 23 s (Multilingual Speech i.e. English+Hindi)
- **Time in seconds for Processing by each device**
| Device Name       | float32 (Original) | float16 | CUDA Cores | Tensor Cores |
| ----------------- | ------------------ | ------- | ---------- | ------------ |
| 3060              | 2.2                | 1.3     | 3,584      | 112          |
| 1660 Super        | OOM                | 6       | 1,408      | N/A          |
| Colab (Tesla T4)  | -                  | -       | 2,560      | 320          |
| Colab (CPU)       | -                  | N/A     | N/A        | N/A          |
| M1 (CPU)          | -                  | -       | N/A        | N/A          |
| M1 (GPU -> 'mps') | -                  | -       | N/A        | N/A          |
- **NOTE: TensorCores are efficient in mixed-precision calculations**
- **CPU: torch.float16 is not supported on CPU (tested on an AMD Ryzen 5 3600 and the Colab CPU)**
- Punctuation: sometimes incorrect (the exact cause is not yet known)
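The RAM figures above are roughly what you would expect from halving the bytes per weight. As a back-of-the-envelope sketch (not a measurement), assuming about 1,550M parameters for Whisper large as listed in the upstream Whisper README, and counting weights only (no activations, CUDA context, or framework overhead):

```python
def weight_memory_gb(num_params: int, bytes_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GB (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Assumption: approximate parameter count taken from the upstream Whisper README.
WHISPER_LARGE_PARAMS = 1_550_000_000

fp32_gb = weight_memory_gb(WHISPER_LARGE_PARAMS, 4)  # float32 = 4 bytes per weight
fp16_gb = weight_memory_gb(WHISPER_LARGE_PARAMS, 2)  # float16 = 2 bytes per weight

print(f"float32: {fp32_gb:.1f} GB, float16: {fp16_gb:.1f} GB")
# prints "float32: 6.2 GB, float16: 3.1 GB"
```

The VRAM numbers are higher than this estimate because they also include runtime buffers, which this sketch deliberately ignores.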
## Model Error Benchmarks:
- **WER: Word Error Rate**
- **MER: Match Error Rate**
- **WIL: Word Information Lost**
- **WIP: Word Information Preserved**
- **CER: Character Error Rate**
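To make the five metrics concrete, here is a minimal pure-Python sketch based on their standard definitions in terms of hits, substitutions, deletions, and insertions from a Levenshtein alignment. It is an illustration only: the function names `align_counts`, `error_metrics`, and `cer` are hypothetical, and it does not reproduce any text normalisation that `jiwer` may apply.

```python
def align_counts(ref, hyp):
    """Return (hits, substitutions, deletions, insertions) for one optimal alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # hit or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back from the bottom-right corner, counting each operation.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            hits += 1; i -= 1; j -= 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            subs += 1; i -= 1; j -= 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1; i -= 1
        else:
            ins += 1; j -= 1
    return hits, subs, dels, ins


def error_metrics(reference: str, hypothesis: str) -> dict:
    """Word-level WER, MER, WIP and WIL as fractions (multiply by 100 for percent)."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref or not hyp:
        raise ValueError("reference and hypothesis must both contain at least one word")
    h, s, d, i = align_counts(ref, hyp)
    wip = (h / len(ref)) * (h / len(hyp))
    return {
        "wer": (s + d + i) / len(ref),
        "mer": (s + d + i) / (h + s + d + i),
        "wip": wip,
        "wil": 1 - wip,
    }


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: the same edit-distance ratio, computed over characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    _, s, d, i = align_counts(list(reference), list(hypothesis))
    return (s + d + i) / len(reference)


metrics = error_metrics("the cat sat on the mat", "the cat sit on mat")
```

With these definitions WIL and WIP always sum to 1, which matches the tables below, where each WIL and WIP pair adds up to roughly 100.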
### Hindi to Hindi (test.tsv) [Common Voice 14.0](https://commonvoice.mozilla.org/en/datasets)
**Test done on RTX 3060 on 1000 Samples**
| | WER | MER | WIL | WIP | CER |
| ----------------------- | ----- | ----- | ----- | ----- | ----- |
| Original_Model (30 min) | 43.99 | 41.65 | 59.47 | 40.52 | 16.23 |
| This_Model (20 min) | 44.64 | 41.69 | 59.53 | 40.46 | 16.80 |
### Hindi to English (test.csv) [Custom Dataset](https://huggingface.co/datasets/devasheeshG/common_voices_14_0_hi2en_hi2hi)
**Test done on RTX 3060 on 1000 Samples**
| | WER | MER | WIL | WIP | CER |
| ----------------------- | --- | --- | --- | --- | --- |
| Original_Model (30 min) | - | - | - | - | - |
| This_Model (20 min) | - | - | - | - | - |
### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean)
**Test done on RTX 3060 on \_\_\_ Samples**
| | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | - | - | - | - | - |
| This_Model | - | - | - | - | - |
### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other)
**Test done on RTX 3060 on \_\_\_ Samples**
| | WER | MER | WIL | WIP | CER |
| -------------- | --- | --- | --- | --- | --- |
| Original_Model | - | - | - | - | - |
| This_Model | - | - | - | - | - |
- **'jiwer' library is used for calculations**
## Code for conversion:
- ### [Will soon be uploaded to GitHub](https://github.com/devasheeshG)
## Usage
This repo includes an `__init__.py` file that contains all the code needed to use the model.
First, clone this repo and place all of its files inside a folder.
### Make sure you have git-lfs installed (https://git-lfs.com)
```bash
git lfs install
git clone https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers
```
**Please try this in a Jupyter notebook**
```python
# Import the Model
from whisper_large_v2_fp16_transformers import Model, load_audio, pad_or_trim
```
```python
# Initialise the model
model = Model(
    model_name_or_path='whisper_large_v2_fp16_transformers',
    cuda_visible_device="0",
    device='cuda',
)
```
```python
# Load Audio
audio = load_audio('whisper_large_v2_fp16_transformers/test.wav')
audio = pad_or_trim(audio)
```
```python
# Transcribe (First transcription takes time)
model.transcribe(audio)
```
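Because the first transcription is slower than later ones, any timing comparison should discard at least one warm-up call. The helper below is a minimal sketch of that idea using only the standard library; `time_call` is a hypothetical name and is not part of this repo's `__init__.py`.

```python
import time
from statistics import mean


def time_call(fn, *args, warmup=1, repeats=3, **kwargs):
    """Return the mean wall-clock seconds of fn(*args, **kwargs) after warm-up runs."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # result discarded: the first call pays one-off setup costs
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return mean(timings)


# Usage with the objects created above (commented out so this snippet runs on its own):
# seconds = time_call(model.transcribe, audio)
# print(f"{seconds:.2f} s per transcription")
```

This measures end-to-end wall-clock time, so it assumes `model.transcribe` only returns once the result is fully computed.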
## Credits
This is an fp16 version of ``openai/whisper-large-v2``.