devasheeshG
/

whisper_large_v2_fp16_transformers

@@ -1,330 +0,0 @@
----
-license: apache-2.0
-pipeline_tag: automatic-speech-recognition
-tags:
-  - pytorch
-  - audio
-  - speech
-  - automatic-speech-recognition
-  - whisper
-  - wav2vec2
-model-index:
-  - name: whisper_large_v2_fp16_transformers
-    results:
-      - task:
-          type: automatic-speech-recognition
-          name: Automatic Speech Recognition
-        dataset:
-          type: librispeech_asr
-          name: LibriSpeech (clean)
-          config: clean
-          split: test
-          args:
-            language: en
-        metrics:
-          - type: wer
-            value: 0
-            name: Test WER
-            description: Word Error Rate
-          - type: mer
-            value: 0
-            name: Test MER
-            description: Match Error Rate
-          - type: wil
-            value: 0
-            name: Test WIL
-            description: Word Information Lost
-          - type: wip
-            value: 0
-            name: Test WIP
-            description: Word Information Preserved
-          - type: cer
-            value: 0
-            name: Test CER
-            description: Character Error Rate
-      - task:
-          type: automatic-speech-recognition
-          name: Automatic Speech Recognition
-        dataset:
-          type: librispeech_asr
-          name: LibriSpeech (other)
-          config: other
-          split: test
-          args:
-            language: en
-        metrics:
-          - type: wer
-            value: 0
-            name: Test WER
-            description: Word Error Rate
-          - type: mer
-            value: 0
-            name: Test MER
-            description: Match Error Rate
-          - type: wil
-            value: 0
-            name: Test WIL
-            description: Word Information Lost
-          - type: wip
-            value: 0
-            name: Test WIP
-            description: Word Information Preserved
-          - type: cer
-            value: 0
-            name: Test CER
-            description: Character Error Rate
-      - task:
-          type: automatic-speech-recognition
-          name: Automatic Speech Recognition
-        dataset:
-          type: mozilla-foundation/common_voice_14_0
-          name: Common Voice (14.0) (Hindi)
-          config: hi
-          split: test
-          args:
-            language: hi
-        metrics:
-          - type: wer
-            value: 44.64
-            name: Test WER
-            description: Word Error Rate
-          - type: mer
-            value: 41.69
-            name: Test MER
-            description: Match Error Rate
-          - type: wil
-            value: 59.53
-            name: Test WIL
-            description: Word Information Lost
-          - type: wip
-            value: 40.46
-            name: Test WIP
-            description: Word Information Preserved
-          - type: cer
-            value: 16.80
-            name: Test CER
-            description: Character Error Rate
-widget:
-  - example_title: Hinglish Sample
-    src: https://huggingface.co/devasheeshG/whisper_medium_fp16_transformers/resolve/main/test.wav
-  - example_title: Librispeech sample 1
-    src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
-  - example_title: Librispeech sample 2
-    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
-language:
-  - en
-  - zh
-  - de
-  - es
-  - ru
-  - ko
-  - fr
-  - ja
-  - pt
-  - tr
-  - pl
-  - ca
-  - nl
-  - ar
-  - sv
-  - it
-  - id
-  - hi
-  - fi
-  - vi
-  - he
-  - uk
-  - el
-  - ms
-  - cs
-  - ro
-  - da
-  - hu
-  - ta
-  - "no"
-  - th
-  - ur
-  - hr
-  - bg
-  - lt
-  - la
-  - mi
-  - ml
-  - cy
-  - sk
-  - te
-  - fa
-  - lv
-  - bn
-  - sr
-  - az
-  - sl
-  - kn
-  - et
-  - mk
-  - br
-  - eu
-  - is
-  - hy
-  - ne
-  - mn
-  - bs
-  - kk
-  - sq
-  - sw
-  - gl
-  - mr
-  - pa
-  - si
-  - km
-  - sn
-  - yo
-  - so
-  - af
-  - oc
-  - ka
-  - be
-  - tg
-  - sd
-  - gu
-  - am
-  - yi
-  - lo
-  - uz
-  - fo
-  - ht
-  - ps
-  - tk
-  - nn
-  - mt
-  - sa
-  - lb
-  - my
-  - bo
-  - tl
-  - mg
-  - as
-  - tt
-  - haw
-  - ln
-  - ha
-  - ba
-  - jw
-  - su
----
-## Versions:
-- CUDA: 12.1
-- cuDNN: 8.9.2.26_1.0-1_amd64
-- TensorFlow: 2.12.0
-- torch: 2.1.0.dev20230606+cu12135
-- transformers: 4.30.2
-- accelerate: 0.20.3
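To compare a local setup against the versions listed above, you can query the installed Python packages with the standard library's `importlib.metadata`. This is a minimal sketch. The `EXPECTED` mapping and the `installed_versions`/`report` helpers are hypothetical names, and the check covers only pip-installed packages, not the CUDA or cuDNN system libraries.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional

# Versions listed on this card (hypothetical mapping, copied from the list above).
EXPECTED: Dict[str, str] = {
    "tensorflow": "2.12.0",
    "torch": "2.1.0.dev20230606+cu12135",
    "transformers": "4.30.2",
    "accelerate": "0.20.3",
}


def installed_versions(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Return the installed version of each distribution, or None if it is missing."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def report(expected: Dict[str, str]) -> None:
    """Print expected vs. installed versions side by side."""
    for name, have in installed_versions(expected).items():
        status = "ok" if have == expected[name] else "differs"
        print(f"{name:<14} expected {expected[name]:<28} installed {have} ({status})")


if __name__ == "__main__":
    report(EXPECTED)
```

The script prints one line per package. An installed version of `None` means the package is not present in the current environment.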
-## Model Benchmarks:
-- RAM: 3 GB (Original_Model: 6 GB)
-- VRAM: 3.7 GB (Original_Model: 11 GB)
-- test.wav: 23 s of multilingual speech (English + Hindi)
-  - **Processing time in seconds, per device**
-  | Device Name       | float32 (Original) | float16 | CUDA Cores | Tensor Cores |
-  | ----------------- | ------------------ | ------- | ---------- | ------------ |
-  | RTX 3060          | 2.2                | 1.3     | 3,584      | 112          |
-  | GTX 1660 Super    | OOM                | 6       | 1,408      | N/A          |
-  | Colab (Tesla T4)  | -                  | -       | 2,560      | 320          |
-  | Colab (CPU)       | -                  | N/A     | N/A        | N/A          |
-  | M1 (CPU)          | -                  | -       | N/A        | N/A          |
-  | M1 (GPU -> 'mps') | -                  | -       | N/A        | N/A          |
-  - **NOTE: Tensor Cores accelerate mixed-precision (float16) calculations**
-  - **CPU: torch.float16 is not supported on CPU (tested on an AMD Ryzen 5 3600 and the Colab CPU)**
-- Punctuation: False, i.e. transcriptions are returned without punctuation ('I don't know the exact reason why this is happening :)')
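The roughly halved RAM figure matches what you would expect from storing weights in 2-byte float16 instead of 4-byte float32. Below is a back-of-the-envelope sketch that assumes the ~1,550 M parameter count reported for Whisper large in the Whisper paper (large-v2 uses the same architecture). It estimates weight storage only. Activations, the CUDA context and allocator overhead also add to the measured VRAM, and none of them are included here. `weight_memory_gb` and `speedup` are hypothetical helper names.

```python
# Bytes used to store one parameter in each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

# Assumption: ~1.55 billion parameters, the size reported for Whisper large.
WHISPER_LARGE_PARAMS = 1_550_000_000


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Estimate memory (GB, 1 GB = 1e9 bytes) needed just to hold the weights."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def speedup(baseline_s: float, optimized_s: float) -> float:
    """Ratio of baseline time to optimized time (values > 1 mean faster)."""
    return baseline_s / optimized_s


if __name__ == "__main__":
    for dtype in ("float32", "float16"):
        print(f"{dtype}: ~{weight_memory_gb(WHISPER_LARGE_PARAMS, dtype):.1f} GB of weights")
    # RTX 3060 timings from the table above: 2.2 s (float32) vs 1.3 s (float16).
    print(f"RTX 3060 float16 speedup: {speedup(2.2, 1.3):.2f}x")
```

Under these assumptions, the script prints about 6.2 GB for float32 weights and 3.1 GB for float16 weights, in line with the 6 GB and 3 GB RAM figures. It also reports a float16 speedup of about 1.69x on the RTX 3060.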
-## Model Error Benchmarks:
-- **WER: Word Error Rate**
-- **MER: Match Error Rate**
-- **WIL: Word Information Lost**
-- **WIP: Word Information Preserved**
-- **CER: Character Error Rate**
-### Hindi (test.tsv) [Common Voice 14.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_14_0)
-**Tested on an RTX 3060 with 1000 samples**
-|                         | WER   | MER   | WIL   | WIP   | CER   |
-| ----------------------- | ----- | ----- | ----- | ----- | ----- |
-| Original_Model (30 min) | 43.99 | 41.65 | 59.47 | 40.52 | 16.23 |
-| This_Model (20 min)     | 44.64 | 41.69 | 59.53 | 40.46 | 16.80 |
-### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-clean)
-**Tested on an RTX 3060 with \_\_\_ samples**
-|                | WER | MER | WIL | WIP | CER |
-| -------------- | --- | --- | --- | --- | --- |
-| Original_Model | -   | -   | -   | -   | -   |
-| This_Model     | -   | -   | -   | -   | -   |
-### English ([LibriSpeech](https://huggingface.co/datasets/librispeech_asr) -> test-other)
-**Tested on an RTX 3060 with \_\_\_ samples**
-|                | WER | MER | WIL | WIP | CER |
-| -------------- | --- | --- | --- | --- | --- |
-| Original_Model | -   | -   | -   | -   | -   |
-| This_Model     | -   | -   | -   | -   | -   |
-- **All metrics are calculated with the 'jiwer' library**
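The card computes these metrics with `jiwer`. To make the definitions concrete, here is a dependency-free sketch that derives all five from a minimum edit-distance alignment with H hits, S substitutions, D deletions and I insertions:

- WER = (S+D+I)/(H+S+D)
- MER = (S+D+I)/(H+S+D+I)
- WIP = H/(H+S+D) × H/(H+S+I)
- WIL = 1 - WIP
- CER is the same edit count taken over characters instead of words.

The WIL = 1 - WIP relationship is why WIL and WIP in the Hindi table sum to about 100. This is an illustration, not jiwer's implementation. jiwer applies its own default text transformations first. Also, when several minimum-cost alignments exist, the hit count (and therefore MER, WIL and WIP) can depend on which alignment is chosen. All function names below are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def alignment_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int, int]:
    """Return (hits, substitutions, deletions, insertions) from one minimum-edit alignment."""
    n, m = len(ref), len(hyp)
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(diag, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Walk back from the bottom-right cell, classifying each step of the path.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def word_measures(references: List[str], hypotheses: List[str]) -> Dict[str, float]:
    """Corpus-level WER, MER, WIL and WIP, with counts summed over all sentence pairs."""
    hits = subs = dels = ins = 0
    for ref, hyp in zip(references, hypotheses):
        h, s, d, n_ins = alignment_counts(ref.split(), hyp.split())
        hits, subs, dels, ins = hits + h, subs + s, dels + d, ins + n_ins
    errors = subs + dels + ins
    ref_words = hits + subs + dels  # words in the references
    hyp_words = hits + subs + ins   # words in the hypotheses
    wip = (hits / ref_words) * (hits / hyp_words) if ref_words and hyp_words else 0.0
    return {
        "wer": errors / ref_words,
        "mer": errors / (hits + errors),
        "wil": 1.0 - wip,
        "wip": wip,
    }


def cer(references: List[str], hypotheses: List[str]) -> float:
    """Character error rate: character-level edits divided by reference characters."""
    errors = ref_chars = 0
    for ref, hyp in zip(references, hypotheses):
        h, s, d, n_ins = alignment_counts(list(ref), list(hyp))
        errors += s + d + n_ins
        ref_chars += h + s + d
    return errors / ref_chars


if __name__ == "__main__":
    refs = ["the cat sat on the mat"]
    hyps = ["the cat sit on mat"]
    print(word_measures(refs, hyps))
    print(cer(refs, hyps))
```

For this example pair, the alignment has 4 hits, 1 substitution and 1 deletion. That gives WER = MER = 2/6 ≈ 0.333, WIP = (4/6)(4/5) ≈ 0.533 and WIL ≈ 0.467. The character-level distance is 5 over 22 reference characters, so CER ≈ 0.227.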
-## Code for conversion:
-- [Will be uploaded to GitHub soon](https://github.com/devasheeshG)
-## Usage
-This repo contains an `__init__.py` file with all the code needed to use this model.
-First, clone this repo so that all of its files are inside a single folder.
-### Make sure you have git-lfs installed (https://git-lfs.com)
-```bash
-git lfs install
-git clone https://huggingface.co/devasheeshG/whisper_large_v2_fp16_transformers
-```
-**Please try this in a Jupyter notebook**
-```python
-# Import the Model
-from whisper_large_v2_fp16_transformers import Model
-```
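The import above works because `git clone` creates a folder named `whisper_large_v2_fp16_transformers` that contains `__init__.py`, so Python treats the folder as a package. This only holds if the folder's parent directory is on `sys.path`, which is typically the case when the notebook runs from that parent directory. If the notebook runs elsewhere, you can add the parent directory first, as in the following sketch. The helper `import_model_class` is hypothetical and not part of this repo.

```python
import importlib
import sys
from pathlib import Path
from typing import Any


def import_model_class(repo_dir: str, class_name: str = "Model") -> Any:
    """Import `class_name` from a cloned repo folder that contains an __init__.py."""
    repo_path = Path(repo_dir).resolve()
    if not (repo_path / "__init__.py").is_file():
        raise FileNotFoundError(f"{repo_path} has no __init__.py; is the clone complete?")
    parent = str(repo_path.parent)
    # The folder is importable under its own name once its parent is on sys.path.
    if parent not in sys.path:
        sys.path.insert(0, parent)
    importlib.invalidate_caches()  # pick up folders created after startup
    module = importlib.import_module(repo_path.name)
    return getattr(module, class_name)


# Usage (assumes the clone sits at this relative path):
# Model = import_model_class("whisper_large_v2_fp16_transformers")
```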
-```python
-# Initialise the model
-model = Model(
-    model_name_or_path='whisper_large_v2_fp16_transformers',
-    cuda_visible_device="0",
-    device='cuda',
-)
-```
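The card does not document what `cuda_visible_device="0"` does internally. If it maps onto the standard `CUDA_VISIBLE_DEVICES` environment variable (an assumption), the manual equivalent is the sketch below. The variable must be set before CUDA is initialised, which in practice means before `torch` first touches the GPU. After that, the selected GPU appears to PyTorch as `cuda:0`. Setting `CUDA_DEVICE_ORDER=PCI_BUS_ID` makes the index match the numbering shown by `nvidia-smi`. The helper `select_gpu` is a hypothetical name.

```python
import os


def select_gpu(index: str = "0") -> None:
    """Expose only one physical GPU to CUDA; call before CUDA is initialised."""
    # Number GPUs by PCI bus ID so the index matches what nvidia-smi shows.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # CUDA reads this once at initialisation; later changes have no effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = index


if __name__ == "__main__":
    select_gpu("0")
    # Import torch only after this point; the chosen GPU then appears as cuda:0.
    # import torch
    # print(torch.cuda.device_count())
```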
-```python
-# Load Audio
-audio = model.load_audio('whisper_large_v2_fp16_transformers/test.wav')
-```
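The card does not show how `model.load_audio` decodes the file. Two facts constrain any loader:

- Whisper models expect 16 kHz mono audio.
- By default, the Hugging Face `WhisperFeatureExtractor` pads or truncates each input to a 30-second window of 480,000 samples. The 23 s `test.wav` fits within one window.

Below is a stdlib-only sketch of that preprocessing for 16-bit PCM WAV files. It assumes the file is already sampled at 16 kHz and raises an error instead of resampling. `load_wav_16k` and `pad_or_trim` are hypothetical names, not this repo's API.

```python
import struct
import wave
from typing import BinaryIO, List, Union

SAMPLE_RATE = 16_000          # Whisper's expected sampling rate (Hz)
N_SAMPLES = 30 * SAMPLE_RATE  # one 30-second window = 480,000 samples


def load_wav_16k(source: Union[str, BinaryIO]) -> List[float]:
    """Read a 16-bit PCM WAV file and return mono samples scaled to [-1.0, 1.0)."""
    with wave.open(source, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM is handled in this sketch")
        if wav.getframerate() != SAMPLE_RATE:
            raise ValueError(f"expected {SAMPLE_RATE} Hz, got {wav.getframerate()} Hz; resample first")
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())
    ints = struct.unpack(f"<{len(raw) // 2}h", raw)  # WAV samples are little-endian
    # Average interleaved channels down to mono, then scale int16 to float.
    return [
        sum(ints[i:i + channels]) / channels / 32768.0
        for i in range(0, len(ints), channels)
    ]


def pad_or_trim(samples: List[float], length: int = N_SAMPLES) -> List[float]:
    """Zero-pad or cut the signal to exactly `length` samples (one Whisper window)."""
    return samples[:length] + [0.0] * max(0, length - len(samples))
```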
-```python
-# Transcribe (the first transcription takes longer than later ones)
-model.transcribe(audio)
-```
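The first call is presumably slower because of one-time costs such as CUDA initialisation. For that reason, per-device timings like those in the benchmark table are best measured after a warm-up run. This minimal timing sketch assumes `model.transcribe` can be passed in as a plain callable. `time_calls` is a hypothetical helper.

```python
import statistics
import time
from typing import Any, Callable, Dict, List


def time_calls(fn: Callable[..., Any], *args: Any, runs: int = 3, warmup: int = 1) -> Dict[str, Any]:
    """Call fn(*args) `warmup` times untimed, then `runs` times timed (in seconds)."""
    for _ in range(warmup):
        fn(*args)
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    return {"durations": durations, "median": statistics.median(durations)}


# Usage with the objects created above (hypothetical; needs the model and audio loaded):
# stats = time_calls(model.transcribe, audio, runs=3)
# print(f"median: {stats['median']:.2f} s")
```

Reporting the median of several runs, rather than a single run, makes the figure less sensitive to one slow or fast call.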