NVIDIA FastConformer-Hybrid Large (arm)

This model transcribes Armenian speech into text without punctuation or capitalization. It is the "large" version of the FastConformer Hybrid Transducer-CTC model, with approximately 115M parameters. The hybrid model is trained with two losses: Transducer (the default decoder) and CTC. See the model architecture section and NeMo documentation for complete architecture details.

NVIDIA NeMo: Training

To train, fine-tune or play with the model, you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.

pip install "nemo_toolkit[all]"

How to Use this Model

The model is available for use in the NeMo toolkit and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

Automatically instantiate the model

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModel.from_pretrained(model_name="mheryerznka/stt_arm_fastconformer_hybrid_large_no_pc")

Transcribing using Python

First, let's get a sample:

wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1Np_gMOeSac-Yc8GZ-yrq2xq9wsl7zT1_' -O hy_am-test-26-audio-audio.wav

Then simply do:

asr_model.transcribe(['hy_am-test-26-audio-audio.wav'])

Transcribing many audio files

Using Transducer mode inference:

python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="mheryerznka/stt_arm_fastconformer_hybrid_large_no_pc" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>"

Using CTC mode inference:

python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \
 pretrained_name="mheryerznka/stt_arm_fastconformer_hybrid_large_no_pc" \
 audio_dir="<DIRECTORY CONTAINING AUDIO FILES>" \
 decoder_type="ctc"

Input

This model accepts 16000 Hz Mono-channel Audio (wav files) as input.
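As a minimal sketch of how one might check a file against this input requirement before transcription, the snippet below reads the WAV header with Python's standard `wave` module. The function name `is_16k_mono` is a hypothetical helper, not part of NeMo, and `wave` only reads uncompressed PCM WAV files.

```python
import wave


def is_16k_mono(path, expected_rate=16000, expected_channels=1):
    """Return True if the WAV file at `path` is 16000 Hz, single-channel audio.

    Hypothetical helper for illustration; it only inspects the WAV header.
    """
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    return sample_rate == expected_rate and channels == expected_channels
```

For example, `is_16k_mono('hy_am-test-26-audio-audio.wav')` could be called on the sample file downloaded above; files that fail the check would need to be resampled or downmixed with an audio tool first.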

Output

This model provides transcribed speech as a string for a given audio sample.

Model Architecture

FastConformer is an optimized version of the Conformer model with 8x depthwise-separable convolutional downsampling. The model is trained in a multitask setup with joint Transducer and CTC decoder loss. You may find more information on the details of FastConformer here: Fast-Conformer Model and about Hybrid Transducer-CTC training here: Hybrid Transducer-CTC.
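To give a feel for what 8x downsampling means in practice, here is a rough arithmetic sketch. It assumes a 10 ms feature hop, which is a common default in NeMo configs but is not stated in this card, and it ignores padding, so the frame counts are approximations. The function names are hypothetical.

```python
def encoder_frame_ms(feature_stride_ms=10, subsampling=8):
    """Duration of audio covered by one encoder output frame, in milliseconds.

    feature_stride_ms=10 is an assumption (not stated in this card);
    subsampling=8 is the 8x downsampling described above.
    """
    return feature_stride_ms * subsampling


def approx_encoder_frames(duration_ms, feature_stride_ms=10, subsampling=8):
    """Approximate number of encoder frames for an utterance (padding ignored)."""
    # Ceiling division on integers avoids floating-point rounding issues.
    feature_frames = -(-duration_ms // feature_stride_ms)
    return -(-feature_frames // subsampling)


print(encoder_frame_ms())             # 80
print(approx_encoder_frames(10_000))  # 125
```

Under these assumptions, a 10-second clip is reduced from about 1000 feature frames to about 125 encoder frames, which is where much of the speed-up over a 4x-downsampling Conformer comes from.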

Training

The NeMo toolkit was used to train the model for 50 epochs on A100 GPUs at Yerevan State University. The model was trained with this example script and this base config.

The training process also incorporated slimIPL (slim Iterative Pseudo-Labeling), a self-training technique that uses intermediate pseudo-labels. The slimIPL algorithm iteratively refines the model using pseudo-labels generated from high-confidence unlabeled data from YouTube.

Datasets

The model in this collection is trained on a composite dataset comprising several hundred hours of Armenian speech:

Mozilla Common Voice 17.0
Google Fleurs
145 hours of unlabeled open-source Armenian audio from YouTube (Youtube Audio Processing PL)

Performance

The performance of Automatic Speech Recognition (ASR) models is measured using Word Error Rate (WER). This model was specifically designed to handle the complexities of the Armenian language. The following table summarizes the model's WER with the RNN-Transducer decoder and the CTC decoder.

On data without punctuation and capitalization, with Transducer and CTC decoders

Vocabulary Size	MCV17 TEST RNN-T	MCV17 TEST CTC	GOOGLE FLEURS TEST RNN-T	GOOGLE FLEURS TEST CTC
256	9.03	10.77	7.41	9.09
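For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the evaluation code used to produce the numbers above; `word_error_rate` is a hypothetical name. It returns a fraction, so a table value such as 9.03 would correspond to 0.0903, assuming the table reports percentages.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix so far and hyp_words[:j]
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


# One substitution (cat -> hat) and one deletion (down) over 4 reference words.
print(word_error_rate("the cat sat down", "the hat sat"))  # 0.5
```

Because the model emits text without punctuation or capitalization, reference transcripts would typically be normalized the same way before scoring.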

Limitations

Since this model was trained on publicly available speech datasets, its performance might degrade for speech that includes technical terms or vernacular that the model has not been trained on, especially Western Armenian. The model might also perform worse for accented speech.
