CHiME8 DASR NeMo Baseline Models

1. Voice Activity Detection (VAD) Model:

MarbleNet_frame_VAD_chime7_Acrobat.nemo
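The frame-level VAD model emits per-frame speech probabilities. A minimal sketch of one way to turn such probabilities into speech segments; the threshold, frame shift, and minimum-duration values below are illustrative, not the baseline's actual postprocessing parameters:

```python
import numpy as np

def frames_to_segments(speech_probs, threshold=0.5, frame_shift=0.02,
                       min_duration=0.1):
    """Convert per-frame speech probabilities to (start, end) segments in seconds.

    Illustrative only: thresholds the frame probabilities, then drops segments
    shorter than `min_duration`. Real VAD postprocessing typically also
    smooths onsets/offsets and pads segment boundaries.
    """
    is_speech = np.asarray(speech_probs) >= threshold
    segments = []
    start = None
    for i, active in enumerate(is_speech):
        if active and start is None:
            start = i                      # segment onset
        elif not active and start is not None:
            segments.append((start, i))    # segment offset
            start = None
    if start is not None:
        segments.append((start, len(is_speech)))
    # Convert frame indices to seconds and filter out very short segments.
    return [(s * frame_shift, e * frame_shift) for s, e in segments
            if (e - s) * frame_shift >= min_duration]

probs = [0.1, 0.2, 0.9, 0.95, 0.8, 0.9, 0.85, 0.92, 0.1, 0.05]
print(frames_to_segments(probs))  # one segment covering frames 2..8
```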

2. Speaker Diarization Model: Multi-scale Diarization Decoder (MSDD-v2)

MSDD_v2_PALO_100ms_intrpl_3scales.nemo

Our DASR system builds on a speaker diarization pipeline driven by the multi-scale diarization decoder (MSDD).

  • MSDD Reference: Park et al. (2022)
  • The MSDD-v2 speaker diarization system employs a multi-scale embedding approach and uses the TitaNet speaker embedding extractor.
  • Unlike the original MSDD, which uses a multi-layer LSTM architecture, we employ a four-layer Transformer architecture with a hidden size of 384.
  • This neural model generates logits indicating speaker existence.
  • Our diarization model is trained on approximately 3,000 hours of simulated audio mixture data from the same multi-speaker data simulator used in VAD model training, drawing from VoxCeleb1&2 and LibriSpeech datasets.
  • MUSAN noise is added as additive background noise, focusing on music and broadband noise.
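The multi-scale approach above can be sketched as a weighted fusion of per-scale embeddings that have been aligned to the finest scale. Everything below (the scale settings, embedding dimension, fusion weights, and the `multiscale_fusion` helper) is illustrative, not the baseline's implementation:

```python
import numpy as np

# Hypothetical (window, shift) settings in seconds for three scales; the
# actual baseline uses its own scale configuration at 100 ms-level resolution.
SCALES = [(1.5, 0.75), (1.0, 0.5), (0.5, 0.25)]

def multiscale_fusion(embs_per_scale, weights):
    """Fuse per-scale speaker embeddings into one vector per base-scale step.

    embs_per_scale: list of arrays, one per scale, each (num_steps, emb_dim),
    already interpolated in time to the finest scale (the "intrpl" variant
    interpolates coarse-scale embeddings to the base scale).
    weights: per-scale fusion weights; normalized here so they sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * e for w, e in zip(weights, embs_per_scale))

rng = np.random.default_rng(0)
embs = [rng.standard_normal((10, 192)) for _ in SCALES]  # toy embeddings
fused = multiscale_fusion(embs, [0.25, 0.35, 0.4])
print(fused.shape)  # (10, 192)
```

With equal weights this reduces to a plain average over scales; learned weights let the model emphasize finer or coarser temporal context per session.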

3. Automatic Speech Recognition (ASR) Model:

FastConformerXL-RNNT-chime7-GSS-finetuned.nemo

  • This ASR model is based on the NeMo FastConformer XL model.
  • Single-channel audio generated using a multi-channel front-end (Guided Source Separation, GSS) is transcribed using a 0.6B parameter Conformer-based transducer (RNNT) model.
  • The model was initialized using a publicly available NeMo checkpoint.
  • This model was then fine-tuned on the CHiME-7 train and dev sets, which include the CHiME-6 and Mixer6 training subsets, after processing the data through the multi-channel ASR front-end (GSS) with ground-truth diarization.
    • Fine-Tuning Details:
      • Fine-tuning Duration: 35,000 updates
      • Batch Size: 128

4. Language Model for ASR Decoding: KenLM Model

ASR_LM_chime7_only.kenlm

  • This KenLM model is trained solely on the CHiME-7 DASR datasets (Mixer6, CHiME-6, DipCo).
  • We apply a word-piece-level N-gram language model over byte-pair-encoding (BPE) tokens.
  • This approach uses the SentencePiece and KenLM toolkits, trained on the transcriptions of the CHiME-7 train and dev sets.
  • The token sets of our ASR and LM models were matched to ensure consistency.
  • To combine several N-gram models with equal weights, we used the OpenGrm library.
  • Modified adaptive expansion search (MAES) decoding was employed for the transducer, which accelerates the decoding process.
  • As expected, integrating the beam-search decoder with the language model significantly improves performance over the purely end-to-end model.
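Combining the transducer scores with the N-gram LM during beam search (shallow fusion) can be illustrated with a toy scoring function; `fused_score`, its weights, and the candidate scores below are hypothetical, not the baseline's tuned values:

```python
def fused_score(log_p_asr, log_p_lm, num_tokens, lm_weight=0.3, length_bonus=0.5):
    """Toy beam-search shallow-fusion score: the ASR (transducer) log-prob plus
    a weighted N-gram LM log-prob and a length bonus. Weights are illustrative.
    """
    return log_p_asr + lm_weight * log_p_lm + length_bonus * num_tokens

# Two candidate hypotheses with the same ASR score; the LM breaks the tie.
h1 = fused_score(log_p_asr=-4.0, log_p_lm=-2.0, num_tokens=3)
h2 = fused_score(log_p_asr=-4.0, log_p_lm=-6.0, num_tokens=3)
print(h1 > h2)  # True: the hypothesis the LM prefers wins
```

In practice the LM weight and length bonus are tuned on the dev set, and the LM is queried over the same BPE token set as the ASR model.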