How to use

See this notebook for an example of the full inference pipeline for Russian TTS (G2P + FastPitch + HifiGAN), or use this bash script.

Input

This model is intended to be used in a G2P + FastPitch + HifiGAN pipeline (see above). When run on its own, it expects text converted to IPA-like transcriptions. See this g2p model for converting plain Russian words to phonemes, or this newer IPA-compatible G2P tool, which can resolve ambiguity at the sentence level. If you feed plain text directly, this FastPitch model will still run, but the quality will be low.
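The FastPitch and HifiGAN stages of the pipeline can be sketched as below. This is a minimal sketch, not the notebook's code: `synthesize` is a hypothetical helper, the G2P step is assumed to have already produced the phoneme string, and both checkpoint paths in the comments are placeholders. The method names `parse`, `generate_spectrogram`, and `convert_spectrogram_to_audio` are the ones NeMo's TTS models expose.

```python
def synthesize(phoneme_text, spec_generator, vocoder):
    """Run phoneme text through FastPitch and HiFi-GAN and return the waveform.

    `synthesize` is a hypothetical helper; `spec_generator` and `vocoder`
    are expected to be already-loaded NeMo FastPitch and HiFi-GAN models.
    """
    # parse() turns the input string into token ids.
    tokens = spec_generator.parse(phoneme_text)
    # FastPitch predicts a mel spectrogram from the token ids.
    spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
    # HiFi-GAN converts the mel spectrogram into audio samples.
    return vocoder.convert_spectrogram_to_audio(spec=spectrogram)


# Loading the models requires NeMo, so it is left commented out here.
# Both checkpoint paths are placeholders, not real file names:
# from nemo.collections.tts.models import FastPitchModel, HifiGanModel
# spec_generator = FastPitchModel.restore_from("path/to/fastpitch.nemo").eval()
# vocoder = HifiGanModel.restore_from("path/to/hifigan.nemo").eval()
# audio = synthesize("<IPA-like transcription from G2P>", spec_generator, vocoder)
```

Passing the models in as arguments keeps the helper independent of how the checkpoints are obtained.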

Output

This model generates mel spectrograms.

Training

The NeMo toolkit [1] was used to train the model for 1000+ epochs. The full training script is here.

Datasets

This model is trained on the RUSLAN [2] corpus (single speaker, male voice), sampled at 22050 Hz.
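Because the training audio is sampled at 22050 Hz, audio synthesized downstream of this model would normally be saved at the same rate. The snippet below is a minimal sketch using only the Python standard library. `save_wav` is a hypothetical helper, and it assumes the vocoder output is mono float samples in the range [-1.0, 1.0] at 22050 Hz.

```python
import struct
import wave

SAMPLE_RATE = 22050  # the RUSLAN sampling rate stated above


def save_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] to `path` as 16-bit PCM."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
```

With a NeMo vocoder, the returned tensor would first need to be converted to a plain sequence, for example `audio[0].detach().cpu().tolist()`, assuming a batch of one.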

References

  • [1] NVIDIA NeMo Toolkit
  • [2] Gabdrakhmanov L., Garaev R., Razinkov E. (2019) RUSLAN: Russian Spoken Language Corpus for Speech Synthesis. In: Salah A., Karpov A., Potapova R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science, vol 11658. Springer, Cham