Step by step: how to use a KenLM language model with this model
#1 by huseinzol05
It is actually very simple.
- Install the necessary libraries. I chose pyctcdecode:
pip3 install pyctcdecode==0.1.0 pypi-kenlm==0.1.20210121
The version is very important: if you bump pyctcdecode above 0.1.0, the steps below no longer work.
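Since the pinned versions matter, it can help to confirm what is actually installed before going further. This is only a minimal sketch using the standard library's `importlib.metadata`; the helper name `check_pin` and the `PINNED` dict are my own illustration, not part of pyctcdecode or kenlm.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip3 install command above.
PINNED = {"pyctcdecode": "0.1.0", "pypi-kenlm": "0.1.20210121"}


def check_pin(package: str, expected: str) -> str:
    """Return 'ok', 'missing', or 'mismatch (found X)' for one package.

    Hypothetical helper for illustration only.
    """
    try:
        found = version(package)
    except PackageNotFoundError:
        return "missing"
    return "ok" if found == expected else f"mismatch (found {found})"


if __name__ == "__main__":
    for name, pin in PINNED.items():
        print(f"{name}=={pin}: {check_pin(name, pin)}")
```

Anything other than `ok` for pyctcdecode suggests reinstalling with the exact pin before trying the steps below.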
- Download the language model:
wget https://huggingface.co/huseinzol05/language-model-bahasa-manglish-combined/resolve/main/model.klm
To create your own language model, see https://github.com/huseinzol05/malaya-speech/blob/master/pretrained-model/prepare-lm/build-lm-mixed-combined.ipynb.
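Once `model.klm` is downloaded, a quick sanity check is to score a sentence with it before wiring it into the decoder. The sketch below assumes the `kenlm` Python module's `Model.score(sentence, bos=True, eos=True)`, which returns a total log10 probability. The helper names and the example sentence are hypothetical. `kenlm` is imported lazily so the pure conversion helper can be used without it.

```python
def perplexity_from_log10(log10_prob: float, n_words: int) -> float:
    """Convert a total log10 probability into per-word perplexity.

    n_words should count the end-of-sentence token as one extra word.
    """
    return 10.0 ** (-log10_prob / n_words)


def sanity_check(path: str, sentence: str) -> float:
    """Load a KenLM model and return the perplexity of one sentence.

    Hypothetical helper; needs the kenlm package and the model file.
    """
    import kenlm  # lazy import so the helper above works without kenlm

    model = kenlm.Model(path)
    log10_prob = model.score(sentence, bos=True, eos=True)
    # +1 accounts for the </s> token that is scored when eos=True.
    return perplexity_from_log10(log10_prob, len(sentence.split()) + 1)


# Example call (placeholder sentence, not from the original post):
# print(sanity_check("model.klm", "saya suka makan nasi"))
```

A fluent sentence in the model's language should come out with a noticeably lower perplexity than the same words shuffled, which is a cheap way to confirm the file loaded correctly.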
- Load the model and the language model, then decode:
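from transformers import AutoModelForCTC
from pyctcdecode import build_ctcdecoder
import kenlm

# `unique_vocab`, `tokenizer` and `inputs` are not defined in this snippet.
# They are assumed to be the model's label list, its tokenizer, and a batch
# of preprocessed audio, respectively.
kenlm_model = kenlm.Model('model.klm')
decoder = build_ctcdecoder(
    unique_vocab,
    kenlm_model,
    alpha=0.2,
    beta=1.0,
    # The tokenizer's pad token is used as the CTC blank index.
    ctc_token_idx=tokenizer.pad_token_id
)
model = AutoModelForCTC.from_pretrained(
    'mesolitica/wav2vec2-xls-r-300m-mixed',
)
# Run the acoustic model and move the logits to NumPy for the decoder.
o_pt = model(inputs)
o_pt = o_pt.logits.detach().cpu().numpy()
# Beam-search decode the first item in the batch with the KenLM model.
out = decoder.decode_beams(o_pt[0], prune_history=True)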
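To get plain text out of `out`, something like the sketch below should work. It assumes that `decode_beams` returns a list of beams ordered best first, and that each beam is a tuple whose first element is the decoded text. That matches my understanding of pyctcdecode, but please check it against your installed version. The helper names are hypothetical.

```python
from typing import Any, List, Sequence, Tuple


def best_transcript(beams: Sequence[Tuple[Any, ...]]) -> str:
    """Return the text of the first beam, or '' if there are no beams.

    Assumes beams are sorted best first and beam[0] is the text.
    """
    if not beams:
        return ""
    return beams[0][0]


def top_texts(beams: Sequence[Tuple[Any, ...]], n: int = 3) -> List[str]:
    """Return the text of up to n leading beams, for comparing candidates."""
    return [beam[0] for beam in beams[:n]]


# Usage with the `out` variable from the step above:
# print(best_transcript(out))
```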