Pruned Stateless Zipformer RNN-T Streaming ID

Pruned Stateless Zipformer RNN-T Streaming ID is an automatic speech recognition model trained on the following datasets:

Instead of being trained to predict sequences of words, this model was trained to predict sequences of phonemes, e.g. ['p', 'ə', 'r', 'b', 'u', 'a', 't', 'a', 'n', 'ɲ', 'a']. The model's vocabulary therefore consists of the IPA phonemes found in g2p ID.
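As an illustration of what a phoneme-level vocabulary looks like, the sketch below maps the example sequence above to integer IDs and back. The vocabulary here is hypothetical: it is built only from the example phonemes, with an assumed blank symbol at ID 0, and the model's real token table may be ordered differently.

```python
# Example phoneme sequence taken from the text above.
phonemes = ['p', 'ə', 'r', 'b', 'u', 'a', 't', 'a', 'n', 'ɲ', 'a']

# Hypothetical vocabulary: an assumed blank symbol at ID 0, followed by the
# distinct phonemes in sorted order. The real model's token table may differ.
vocab = ["<blk>"] + sorted(set(phonemes))
token_to_id = {tok: i for i, tok in enumerate(vocab)}
id_to_token = {i: tok for tok, i in token_to_id.items()}

# Encode the phoneme sequence as integer IDs, then decode it back.
ids = [token_to_id[p] for p in phonemes]
decoded = [id_to_token[i] for i in ids]

print(ids)
print(decoded == phonemes)  # the round trip recovers the original sequence
```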

This model was trained using the icefall framework. All training was done on a Scaleway RENDER-S VM with a Tesla P100 GPU. All scripts used for training can be found in the Files and versions tab, along with the training metrics logged via TensorBoard.

Evaluation Results

Simulated Streaming

for m in greedy_search fast_beam_search modified_beam_search; do
  ./pruned_transducer_stateless7_streaming/decode.py \
    --epoch 30 \
    --avg 9 \
    --exp-dir ./pruned_transducer_stateless7_streaming/exp \
    --max-duration 600 \
    --decode-chunk-len 32 \
    --decoding-method $m
done

With simulated streaming, the model achieves the following phoneme error rates (PER) on each test set:

Decoding	LibriVox	FLEURS	Common Voice
Greedy Search	4.87%	11.45%	14.97%
Modified Beam Search	4.71%	11.25%	14.31%
Fast Beam Search	4.85%	12.55%	14.89%
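
For readers unfamiliar with the metric, phoneme error rate is conventionally computed like word error rate, but over phoneme tokens: the total edit distance between reference and hypothesis sequences, divided by the total number of reference phonemes. The following is a minimal sketch of that definition, not the scoring code used by icefall; the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs, hyps):
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Toy example: one substitution (ŋ -> n) in a five-phoneme reference.
ref = "p u l a ŋ".split()
hyp = "p u l a n".split()
print(f"{phoneme_error_rate([ref], [hyp]):.2%}")  # prints 20.00%
```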

Chunk-wise Streaming

for m in greedy_search fast_beam_search modified_beam_search; do
  ./pruned_transducer_stateless7_streaming/streaming_decode.py \
    --epoch 30 \
    --avg 9 \
    --exp-dir ./pruned_transducer_stateless7_streaming/exp \
    --decoding-method $m \
    --decode-chunk-len 32 \
    --num-decode-streams 1500
done

With chunk-wise streaming, the model achieves the following phoneme error rates (PER) on each test set:

Decoding	LibriVox	FLEURS	Common Voice
Greedy Search	5.12%	12.74%	15.78%
Modified Beam Search	4.78%	11.83%	14.54%
Fast Beam Search	4.81%	12.93%	14.96%

Usage

Download Pre-trained Model

cd egs/bookbot/ASR
mkdir tmp
cd tmp
git lfs install
git clone https://huggingface.co/bookbot/pruned-transducer-stateless7-streaming-id

Inference

To decode with greedy search, run:

./pruned_transducer_stateless7_streaming/jit_pretrained.py \
  --nn-model-filename ./tmp/pruned-transducer-stateless7-streaming-id/exp/cpu_jit.pt \
  --lang-dir ./tmp/pruned-transducer-stateless7-streaming-id/data/lang_phone \
  ./tmp/pruned-transducer-stateless7-streaming-id/test_waves/sample1.wav

Decoding Output

2023-06-21 10:19:18,563 INFO [jit_pretrained.py:217] device: cpu
2023-06-21 10:19:19,231 INFO [lexicon.py:168] Loading pre-compiled tmp/pruned-transducer-stateless7-streaming-id/data/lang_phone/Linv.pt
2023-06-21 10:19:19,232 INFO [jit_pretrained.py:228] Constructing Fbank computer
2023-06-21 10:19:19,233 INFO [jit_pretrained.py:238] Reading sound files: ['./tmp/pruned-transducer-stateless7-streaming-id/test_waves/sample1.wav']
2023-06-21 10:19:19,234 INFO [jit_pretrained.py:244] Decoding started
2023-06-21 10:19:20,090 INFO [jit_pretrained.py:271] 
./tmp/pruned-transducer-stateless7-streaming-id/test_waves/sample1.wav:
p u l a ŋ | s ə k o l a h | p i t ə r i | s a ŋ a t | l a p a r


2023-06-21 10:19:20,090 INFO [jit_pretrained.py:273] Decoding Done
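
In the sample output above, phonemes are separated by spaces and `|` appears to mark word boundaries (an assumption based on this output alone). Under that assumption, a decoded line can be grouped into per-word phoneme lists with a small helper; `split_words` is a hypothetical name, not part of icefall.

```python
def split_words(decoded):
    """Split a decoded line into words, each a list of phonemes.

    Assumes phonemes are space-separated and '|' marks word boundaries.
    """
    return [word.split() for word in decoded.split("|") if word.strip()]


line = "p u l a ŋ | s ə k o l a h | p i t ə r i | s a ŋ a t | l a p a r"
words = split_words(line)

# Join each word's phonemes for a compact view.
print(["".join(w) for w in words])
# prints ['pulaŋ', 'səkolah', 'pitəri', 'saŋat', 'lapar']
```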

Training procedure

Install icefall

git clone https://github.com/bookbot-hive/icefall
cd icefall
export PYTHONPATH="$(pwd):$PYTHONPATH"

Prepare Data

cd egs/bookbot_id/ASR
./prepare.sh

Train

export CUDA_VISIBLE_DEVICES="0"
./pruned_transducer_stateless7_streaming/train.py \
  --num-epochs 30 \
  --use-fp16 1 \
  --max-duration 400
