Introduction
Voice activity detection (VAD) plays an important role in speech recognition systems by detecting where effective speech begins and ends. FunASR provides an efficient VAD model based on the FSMN structure. To improve the model's discrimination, we use monophones as modeling units, since speech carries relatively rich information. During inference, the VAD system applies post-processing, such as threshold settings and sliding windows, to improve robustness.
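The post-processing idea mentioned above (a threshold plus a sliding window) can be sketched as follows. This is only an illustration and not FunASR's actual implementation: the function name `smooth_and_segment`, the default threshold of 0.5, the 5-frame window, and the 10 ms frame shift are all assumptions chosen for the example.

```python
from typing import List, Tuple


def smooth_and_segment(
    speech_probs: List[float],
    threshold: float = 0.5,  # assumed value, not taken from FSMN-VAD's config
    window: int = 5,         # assumed window size in frames
    frame_ms: int = 10,      # assumed frame shift in milliseconds
) -> List[Tuple[int, int]]:
    """Turn per-frame speech probabilities into (start_ms, end_ms) segments."""
    # Step 1: threshold each frame into speech / non-speech.
    raw = [p >= threshold for p in speech_probs]

    # Step 2: majority vote over a window centered on each frame,
    # which removes isolated flips caused by noisy frames.
    half = window // 2
    smoothed = []
    for i in range(len(raw)):
        lo, hi = max(0, i - half), min(len(raw), i + half + 1)
        votes = sum(raw[lo:hi])
        smoothed.append(votes * 2 > (hi - lo))

    # Step 3: merge consecutive speech frames into segments.
    segments: List[Tuple[int, int]] = []
    start = None
    for i, is_speech in enumerate(smoothed):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_ms, i * frame_ms))
            start = None
    if start is not None:
        segments.append((start * frame_ms, len(smoothed) * frame_ms))
    return segments


probs = [0.1, 0.1, 0.9, 0.9, 0.2, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
print(smooth_and_segment(probs, window=3))  # [(20, 80)]
```

In this toy input, the single low-probability frame at index 4 is absorbed into the surrounding speech, so one segment is reported instead of two.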
This repository demonstrates how to use FSMN-VAD with the funasr_onnx runtime. The underlying model comes from FunASR and was trained on a 5,000-hour dataset.
We have released numerous industrial-grade models, including speech recognition, voice activity detection, punctuation restoration, speaker verification, speaker diarization, and timestamp prediction (forced alignment). To learn more about these models, refer to the FunASR documentation. If you want to apply these models in your own speech-related projects, we invite you to explore what FunASR offers.
Install funasr_onnx
pip install -U funasr_onnx
# Users in China can install from a mirror with the command:
# pip install -U funasr_onnx -i https://mirror.sjtu.edu.cn/pypi/web/simple
Download the model
git lfs install
git clone https://huggingface.co/funasr/FSMN-VAD
Inference with runtime
Voice Activity Detection
FSMN-VAD
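from funasr_onnx import Fsmn_vad

# Directory cloned in the download step
model_dir = "./FSMN-VAD"
# quantize=True loads model_quant.onnx instead of model.onnx
model = Fsmn_vad(model_dir, quantize=True)

# Path to an example recording in the cloned directory
wav_path = "./FSMN-VAD/asr_example.wav"
result = model(wav_path)
print(result)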
model_dir: the model path, which contains model.onnx, config.yaml, and am.mvn
batch_size: 1 (default), the batch size during inference
device_id: -1 (default), infer on CPU. To infer on GPU, set it to the GPU ID (make sure that onnxruntime-gpu is installed)
quantize: False (default), load model.onnx from model_dir. If set to True, load model_quant.onnx from model_dir
intra_op_num_threads: 4 (default), sets the number of threads used for intra-op parallelism on CPU
Input: WAV-format audio, supported types: str, np.ndarray, List[str]
Output: List[str], the recognition result
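Before constructing the model, it can help to confirm that model_dir holds the files listed in the parameter description above. The helper below is a minimal sketch using only the standard library. The name `missing_model_files` is hypothetical and is not part of funasr_onnx, and the file list is taken from the description above rather than from the library's source.

```python
from pathlib import Path
from typing import List


def missing_model_files(model_dir: str, quantize: bool = False) -> List[str]:
    """Return the expected files that are absent from model_dir.

    Hypothetical helper: the expected names follow the parameter
    description in this README, not funasr_onnx internals.
    """
    # quantize=True is described as loading model_quant.onnx instead of model.onnx.
    onnx_name = "model_quant.onnx" if quantize else "model.onnx"
    required = [onnx_name, "config.yaml", "am.mvn"]
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("./FSMN-VAD", quantize=True)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Model directory looks complete.")
```

If the ONNX file is reported missing after cloning, one possible cause is that git lfs was not installed before the clone, so only pointer files or a partial checkout were fetched.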
Citations
@inproceedings{gao2022paraformer,
title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
author={Gao, Zhifu and Zhang, Shiliang and McLoughlin, Ian and Yan, Zhijie},
booktitle={INTERSPEECH},
year={2022}
}