How to use

See an example of the inference pipeline for Russian TTS (G2P + FastPitch + HifiGAN) in this notebook, or use this bash script.
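As a rough orientation, the FastPitch + HifiGAN part of such a pipeline can be sketched as below. This is a minimal sketch under assumptions, not the notebook's code: the function names `load_models` and `synthesize` and the checkpoint path arguments are hypothetical, the input text is assumed to have already gone through the G2P step, and the calls `parse`, `generate_spectrogram` and `convert_spectrogram_to_audio` are assumed to behave as in NeMo's standard TTS models. The notebook remains the authoritative version.

```python
def load_models(fastpitch_path, hifigan_path):
    """Restore both models from local .nemo checkpoints (paths are hypothetical)."""
    # Imported lazily so that synthesize() below can be used without NeMo installed.
    from nemo.collections.tts.models import FastPitchModel, HifiGanModel

    spec_generator = FastPitchModel.restore_from(fastpitch_path).eval()
    vocoder = HifiGanModel.restore_from(hifigan_path).eval()
    return spec_generator, vocoder


def synthesize(spec_generator, vocoder, text):
    """Turn (already G2P-processed) text into a batch of audio samples."""
    # Text -> token ids, using the spectrogram generator's own tokenizer.
    tokens = spec_generator.parse(text)
    # Token ids -> batch of mel spectrograms (the input this vocoder expects).
    spec = spec_generator.generate_spectrogram(tokens=tokens)
    # Mel spectrograms -> waveform, sampled at 22050 Hz for this model.
    return vocoder.convert_spectrogram_to_audio(spec=spec)
```

With NeMo installed, usage would look like `audio = synthesize(*load_models("fastpitch.nemo", "hifigan.nemo"), text)`, where both file names are placeholders.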

Input

This model accepts batches of mel spectrograms.

Output

This model outputs audio at a sample rate of 22050 Hz.
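When saving the output, the sample rate has to be written into the file header, otherwise the audio plays back at the wrong speed. The following is a minimal sketch using only the Python standard library. It assumes the waveform has already been converted to a flat sequence of floats in the range [-1, 1] and that the audio is mono; `write_wav` is a hypothetical helper name, not part of NeMo.

```python
import io
import struct
import wave

SAMPLE_RATE = 22050  # output sample rate stated for this model


def write_wav(samples, destination, sample_rate=SAMPLE_RATE):
    """Write float samples in [-1, 1] as 16-bit mono PCM WAV.

    `destination` may be a file path or a writable binary file object.
    """
    pcm = bytearray()
    for s in samples:
        # Clip to the valid range before scaling to signed 16-bit integers.
        s = max(-1.0, min(1.0, float(s)))
        pcm += struct.pack("<h", int(round(s * 32767)))

    with wave.open(destination, "wb") as wav:
        wav.setnchannels(1)          # mono (assumption)
        wav.setsampwidth(2)          # 2 bytes = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
```

For example, `write_wav(audio_samples, "output.wav")` would store the samples in a file that standard players open at 22050 Hz; `audio_samples` and the file name are placeholders.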

Training

The NeMo toolkit [1] was used to train the model for several epochs. The full training script is here.

Datasets

This model was trained on the RUSLAN [2] corpus (single speaker, male voice), sampled at 22050 Hz.

References

  • [1] NVIDIA NeMo Toolkit
  • [2] Gabdrakhmanov L., Garaev R., Razinkov E. (2019) RUSLAN: Russian Spoken Language Corpus for Speech Synthesis. In: Salah A., Karpov A., Potapova R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol 11658. Springer, Cham