huseinzol05's picture
Update README.md
060eabb verified
|
raw
history blame
4.96 kB
metadata
language:
  - ms
  - en
  - zh
  - ta

Malaysian Finetune Whisper Small V2

Finetune Whisper Small on Malaysian STT Whisper

WanDB at https://wandb.ai/huseinzol05/malaysian-whisper-small-v2, still on training

Improvement

  1. Distilled from Whisper Large V3 on Malaysian and Science context.
  2. Better translation for Malay, Manglish, Mandarin, Tamil and Science context.
  3. Word level timestamp, introduced <|transcribeprecise|> token, a new task!

how to

Load the model,

import torch
from transformers.models.whisper import tokenization_whisper

tokenization_whisper.TASK_IDS = ["translate", "transcribe", "transcribeprecise"]

from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained(
    'mesolitica/malaysian-whisper-small-v2'
)
tokenizer = processor.tokenizer
model = WhisperForConditionalGeneration.from_pretrained(
    'mesolitica/malaysian-whisper-small-v2', torch_dtype = torch.bfloat16
).cuda().eval()

Transcribe

from datasets import Audio
import requests

sr = 16000
audio = Audio(sampling_rate=sr)

r = requests.get('https://github.com/mesolitica/malaya-speech/raw/master/speech/assembly.mp3')
y = audio.decode_example(audio.encode_example(r.content))['array']

with torch.no_grad():
    p = processor([y], return_tensors='pt')
    p['input_features'] = p['input_features'].to(torch.bfloat16)
    r = model.generate(
        p['input_features'].cuda(),
        output_scores=True,
        return_dict_in_generate=True,
        language='ms',
        return_timestamps=True, task = 'transcribe')

tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(r.sequences[0]))
<|startoftranscript|><|ms|><|transcribe|><|0.02|> Assembly on Aging di Vienna, Australia<|3.78|><|3.78|> yang telah diadakan pada tahun 1982<|6.50|><|6.50|> dan berasaskan unjuran tersebut<|8.82|><|8.82|> maka Jabatan Perangkaan Malaysia<|10.40|><|10.40|> menganggarkan menjelang tahun 2035<|13.72|><|13.72|> sejumlah 15% penduduk kita adalah daripada kalangan warga emas.<|18.72|><|19.28|> Untuk makluman Tuan Yang Pertua dan juga Alia Mbahumat,<|22.12|><|22.26|> pembangunan sistem pendaftaran warga emas<|24.02|><|24.02|> ataupun kita sebutkan event<|25.38|><|25.38|> adalah usaha kerajaan ke arah merealisasikan<|28.40|><|endoftext|>

Transcribe word level timestamp

You must use transcribeprecise for the task, or <|transcribeprecise|> token,

from datasets import Audio
import requests

sr = 16000
audio = Audio(sampling_rate=sr)

r = requests.get('https://github.com/mesolitica/malaya-speech/raw/master/speech/assembly.mp3')
y = audio.decode_example(audio.encode_example(r.content))['array']

with torch.no_grad():
    p = processor([y], return_tensors='pt')
    p['input_features'] = p['input_features'].to(torch.bfloat16)
    r = model.generate(
        p['input_features'].cuda(),
        output_scores=True,
        return_dict_in_generate=True,
        language='ms',
        return_timestamps=True, task = 'transcribeprecise')

tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(r.sequences[0]))
<|startoftranscript|><|ms|><|transcribeprecise|><|0.02|> Assembly<|1.20|><|1.56|> on<|1.64|><|1.74|> Aging<|2.04|><|2.14|> di<|2.22|><|2.26|> Vienna<|2.50|><|2.72|> Australia<|3.12|><|4.26|> yang<|4.38|><|4.42|> telah<|4.58|><|4.62|> diadakan<|5.08|><|5.16|> pada<|5.30|><|5.36|> tahun<|5.60|><|5.62|> 1982<|6.92|><|7.12|> dan<|7.24|><|7.32|> berasaskan<|7.88|><|7.98|> unjuran<|8.36|><|8.42|> tersebut<|8.80|><|8.88|> maka<|9.06|><|9.12|> Jabatan<|9.48|><|9.56|> Perangkaan<|9.98|><|10.04|> Malaysia<|10.36|><|10.84|> menganggarkan<|11.56|><|11.98|> menjelang<|12.34|><|12.40|> tahun<|12.64|><|12.66|> 2035<|14.08|><|14.50|> sejumlah<|14.96|><|14.98|> 15%<|16.14|><|16.26|> penduduk<|16.62|><|16.68|> kita<|16.90|><|17.02|> adalah<|17.30|><|17.40|> daripada<|17.80|><|17.86|> kalangan<|18.16|><|18.22|> warga<|18.40|><|18.46|> emas.<|18.68|><|19.24|> Untuk<|19.40|><|19.46|> makluman<|19.86|><|20.64|> Tuan<|20.76|><|20.82|> Yang<|20.90|><|20.94|> Pertua<|21.14|><|21.20|> dan<|21.28|><|21.34|> juga<|21.50|><|21.58|> Alia<|21.70|><|21.76|> Mbah<|21.88|><|21.92|> Ahmad,<|22.08|><|22.22|> pembangunan<|22.66|><|22.72|> sistem<|23.00|><|23.06|> pendaftaran<|23.48|><|23.54|> warga<|23.72|><|23.78|> emas<|23.98|><|24.06|> ataupun<|24.36|><|24.42|> kita<|24.56|><|24.62|> sebutkan<|24.94|><|25.08|> event<|25.38|><|25.86|> adalah<|26.10|><|26.18|> usaha<|26.46|><|26.60|> kerajaan<|27.06|><|27.16|> kearah<|27.44|><|27.50|> merealisasikan<|28.36|><|28.86|> objektif<|29.36|><|29.42|> yang<|29.52|><|29.56|> telah<|29.72|><|29.76|> digarakan<|30.00|><|endoftext|>

Make sure you already monkey patched tokenization_whisper.TASK_IDS = ["translate", "transcribe", "transcribeprecise"] at starting of your script.