metadata

license: apache-2.0
pipeline_tag: automatic-speech-recognition
tags:
  - pytorch
  - audio
  - speech
  - automatic-speech-recognition
  - whisper
  - wav2vec2
model-index:
  - name: whisper_medium_fp16_transformers
    results:
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          type: librispeech_asr
          name: LibriSpeech (clean) (English)
          config: en
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 0
            name: Test WER
            description: Word Error Rate
          - type: mer
            value: 0
            name: Test MER
            description: Match Error Rate
          - type: wil
            value: 0
            name: Test WIL
            description: Word Information Lost
          - type: wip
            value: 0
            name: Test WIP
            description: Word Information Preserved
          - type: cer
            value: 0
            name: Test CER
            description: Character Error Rate
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          type: librispeech_asr
          name: LibriSpeech (other) (English)
          config: en
          split: test
          args:
            language: en
        metrics:
          - type: wer
            value: 0
            name: Test WER
            description: Word Error Rate
          - type: mer
            value: 0
            name: Test MER
            description: Match Error Rate
          - type: wil
            value: 0
            name: Test WIL
            description: Word Information Lost
          - type: wip
            value: 0
            name: Test WIP
            description: Word Information Preserved
          - type: cer
            value: 0
            name: Test CER
            description: Character Error Rate
      - task:
          type: automatic-speech-recognition
          name: Automatic Speech Recognition
        dataset:
          type: common_voice
          name: Common Voice (14.0) (Hindi)
          config: hi
          split: test
          args:
            language: hi
        metrics:
          - type: wer
            value: 54.97
            name: Test WER
            description: Word Error Rate
          - type: mer
            value: 47.86
            name: Test MER
            description: Match Error Rate
          - type: wil
            value: 66.83
            name: Test WIL
            description: Word Information Lost
          - type: wip
            value: 33.16
            name: Test WIP
            description: Word Information Preserved
          - type: cer
            value: 30.23
            name: Test CER
            description: Character Error Rate
widget:
  - example_title: Librispeech sample 1
    src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
  - example_title: Librispeech sample 2
    src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
language:
  - en
  - zh
  - de
  - es
  - ru
  - ko
  - fr
  - ja
  - pt
  - tr
  - pl
  - ca
  - nl
  - ar
  - sv
  - it
  - id
  - hi
  - fi
  - vi
  - he
  - uk
  - el
  - ms
  - cs
  - ro
  - da
  - hu
  - ta
  - 'no'
  - th
  - ur
  - hr
  - bg
  - lt
  - la
  - mi
  - ml
  - cy
  - sk
  - te
  - fa
  - lv
  - bn
  - sr
  - az
  - sl
  - kn
  - et
  - mk
  - br
  - eu
  - is
  - hy
  - ne
  - mn
  - bs
  - kk
  - sq
  - sw
  - gl
  - mr
  - pa
  - si
  - km
  - sn
  - yo
  - so
  - af
  - oc
  - ka
  - be
  - tg
  - sd
  - gu
  - am
  - yi
  - lo
  - uz
  - fo
  - ht
  - ps
  - tk
  - nn
  - mt
  - sa
  - lb
  - my
  - bo
  - tl
  - mg
  - as
  - tt
  - haw
  - ln
  - ha
  - ba
  - jw
  - su

Versions:

CUDA: 12.1
cuDNN Version: 8.9.2.26_1.0-1_amd64

tensorflow Version: 2.12.0
torch Version: 2.1.0.dev20230606+cu12135
transformers Version: 4.30.2
accelerate Version: 0.20.3

Model Benchmarks:

RAM: 2.8 GB (Original_Model: 5.5 GB)
VRAM: 1812 MB (Original_Model: 6 GB)
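
The roughly halved footprint is what one would expect from storing weights as 16-bit rather than 32-bit floats. As a back-of-the-envelope check, the sketch below estimates the size of the weights alone. It assumes the approximately 769 million parameters that OpenAI lists for Whisper medium, a figure that is not stated in this README, and it ignores activations, buffers, and the CUDA context.

```python
# Rough weight-only memory estimate for float32 vs float16 storage.
# Assumption: ~769M parameters (OpenAI's published figure for Whisper medium).
PARAMS = 769_000_000


def weight_mib(n_params: int, bytes_per_param: int) -> float:
    """Return the size of n_params weights in MiB."""
    return n_params * bytes_per_param / 1024**2


fp32_mib = weight_mib(PARAMS, 4)  # float32 uses 4 bytes per parameter
fp16_mib = weight_mib(PARAMS, 2)  # float16 uses 2 bytes per parameter

print(f"float32 weights: {fp32_mib:.0f} MiB")
print(f"float16 weights: {fp16_mib:.0f} MiB")
```

Under that assumption the float16 weights come to roughly 1.5 GB, somewhat below the reported 1812 MB of VRAM, which presumably also covers activations and runtime overhead.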

test.wav: 23 s (multilingual speech, i.e. English + Hindi)

Processing time in seconds on each device

Device Name	float32 (Original)	float16	CUDA Cores	Tensor Cores
3060	1.7	1.1	3,584	112
1660 Super	OOM	3.3	1,408	-
Colab (Tesla T4)	2.8	2.2	2,560	320
Colab (CPU)	35	-	-	-
M1 (CPU)	-	-	-	-
M1 (GPU -> 'mps')	-	-	-	-
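
To read the table as relative gains, the sketch below derives the float16 speedup and the real-time factor (audio length divided by processing time) for the two devices that have both measurements. The timings are copied from the table; the helper names `speedup` and `realtime_factor` are illustrative and not part of this repo.

```python
# Timings in seconds for the 23 s test.wav, copied from the table above.
AUDIO_SECONDS = 23.0
timings = {
    "3060": {"float32": 1.7, "float16": 1.1},
    "Colab (Tesla T4)": {"float32": 2.8, "float16": 2.2},
}


def speedup(fp32_seconds: float, fp16_seconds: float) -> float:
    """How many times faster float16 is than float32."""
    return fp32_seconds / fp16_seconds


def realtime_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio transcribed per second of processing."""
    return audio_seconds / processing_seconds


for device, t in timings.items():
    s = speedup(t["float32"], t["float16"])
    rtf = realtime_factor(AUDIO_SECONDS, t["float16"])
    print(f"{device}: {s:.2f}x faster in float16, {rtf:.1f}x real time")
```

On these numbers float16 is roughly 1.5x faster on the 3060 and roughly 1.3x faster on the T4. A single 23 s clip is a small sample, so the ratios are indicative only.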

NOTE: Tensor cores are efficient at mixed-precision calculations.
CPU: torch.float16 is not supported on CPU (AMD Ryzen 5 3600 or Colab).

Punctuation: True

Model Error Benchmarks:

WER: Word Error Rate
MER: Match Error Rate
WIL: Word Information Lost
WIP: Word Information Preserved
CER: Character Error Rate
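
For reference, the usual definitions of these metrics in terms of alignment counts are sketched below. The symbols are notation introduced here, not taken from this README: $H$ is the number of correctly matched words (hits), and $S$, $D$, $I$ are the numbers of substituted, deleted, and inserted words when the hypothesis is aligned to the reference.

```latex
\begin{aligned}
\mathrm{WER} &= \frac{S + D + I}{H + S + D} \\
\mathrm{MER} &= \frac{S + D + I}{H + S + D + I} \\
\mathrm{WIP} &= \frac{H}{H + S + D} \cdot \frac{H}{H + S + I} \\
\mathrm{WIL} &= 1 - \mathrm{WIP} \\
\mathrm{CER} &= \frac{S_c + D_c + I_c}{H_c + S_c + D_c}
\end{aligned}
```

Here the subscript $c$ marks the same counts taken over characters instead of words. Since $\mathrm{WIL} = 1 - \mathrm{WIP}$, the two values reported for a model should sum to 100 when expressed as percentages, up to rounding.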

Hindi (test.tsv -> 2557 samples used) Common Voice 14.0

	WER	MER	WIL	WIP	CER
Original_Model	-	-	-	-	-
This_Model	54.97	47.86	66.83	33.16	30.23

English (LibriSpeech -> test-clean -> __ samples used) LibriSpeech

	WER	MER	WIL	WIP	CER
Original_Model	-	-	-	-	-
This_Model	-	-	-	-	-

English (LibriSpeech -> test-other -> __ samples used) LibriSpeech

	WER	MER	WIL	WIP	CER
Original_Model	-	-	-	-	-
This_Model	-	-	-	-	-

The 'jiwer' library is used for these calculations.
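
To make these quantities concrete, here is a minimal pure-Python sketch that counts hits, substitutions, deletions, and insertions with a Levenshtein alignment and derives the five metrics from them. It is not jiwer's implementation and not the evaluation script used for the tables: the names `align_counts` and `error_metrics` are hypothetical, no text normalisation is applied, and ties between equally cheap alignments may be broken differently than in jiwer, which can shift MER, WIL, and WIP slightly.

```python
from typing import Dict, Sequence, Tuple


def align_counts(ref: Sequence[str], hyp: Sequence[str]) -> Tuple[int, int, int, int]:
    """Return (hits, substitutions, deletions, insertions) for one minimal-cost alignment."""
    n, m = len(ref), len(hyp)
    # cost[i][j] is the edit distance between ref[:i] and hyp[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back from the bottom-right corner, preferring diagonal moves.
    hits = subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] == hyp[j - 1]:
                hits += 1
            else:
                subs += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            dels += 1
            i -= 1
        else:
            ins += 1
            j -= 1
    return hits, subs, dels, ins


def error_metrics(reference: str, hypothesis: str) -> Dict[str, float]:
    """Compute WER, MER, WIL, WIP, and CER as fractions; the reference must be non-empty."""
    h, s, d, i = align_counts(reference.split(), hypothesis.split())
    wip = (h / (h + s + d)) * (h / (h + s + i)) if h else 0.0
    # CER applies the WER formula to characters (spaces included).
    ch, cs, cd, ci = align_counts(list(reference), list(hypothesis))
    return {
        "wer": (s + d + i) / (h + s + d),
        "mer": (s + d + i) / (h + s + d + i),
        "wil": 1.0 - wip,
        "wip": wip,
        "cer": (cs + cd + ci) / (ch + cs + cd),
    }


metrics = error_metrics("the cat sat on the mat", "the cat sit on mat")
for name, value in metrics.items():
    print(f"{name.upper()}: {100 * value:.2f}")
```

In this toy pair one word is substituted and one is deleted out of six reference words, so the WER is 2/6. The benchmark figures in this section are reported on the same percentage scale.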

Code:

$\textbf{Will soon be uploaded to GitHub}$

Usage

This repo contains an __init__.py file with all the code needed to use this model.

First, clone this repo and place all the files inside a folder.

Make sure you have git-lfs installed (https://git-lfs.com).

git lfs install
git clone https://huggingface.co/devasheeshG/whisper_medium_fp16_transformers

Please try this in a Jupyter notebook.

# Import the model (run from the directory that contains the cloned folder)
from whisper_medium_fp16_transformers import Model

# Initialise the model
model = Model(
    model_name_or_path='whisper_medium_fp16_transformers',
    cuda_visible_device="0",
    device='cuda',
)

# Load Audio
audio = model.load_audio('whisper_medium_fp16_transformers/test.wav')

# Transcribe (the first transcription takes longer than later ones)
model.transcribe(audio)
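
Because the first transcription is slower than later ones, it is probably fairest to time it separately when reproducing the benchmark figures. The sketch below shows one way to do that. The helpers `time_call` and `benchmark` are hypothetical and not part of this repo, and `fake_transcribe` is a stand-in so the example runs without a GPU or the model files.

```python
import time
from statistics import mean
from typing import Any, Callable, Tuple


def time_call(fn: Callable[..., Any], *args: Any) -> float:
    """Return the wall-clock seconds taken by one call to fn(*args)."""
    start = time.perf_counter()
    fn(*args)
    return time.perf_counter() - start


def benchmark(fn: Callable[..., Any], *args: Any, repeats: int = 3) -> Tuple[float, float]:
    """Return (first_call_seconds, mean_seconds_over_the_next_repeats_calls)."""
    first = time_call(fn, *args)  # includes any one-off warm-up cost
    steady = mean(time_call(fn, *args) for _ in range(repeats))
    return first, steady


def fake_transcribe(audio: list) -> str:
    """Stand-in for model.transcribe; joins the fake samples into a string."""
    return "".join(audio)


first, steady = benchmark(fake_transcribe, ["a", "b", "c"], repeats=3)
print(f"first call: {first:.6f} s, later calls: {steady:.6f} s on average")
```

With the model loaded as shown above, calling `benchmark(model.transcribe, audio)` would report both numbers for the real model.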