Real-Time Voice Cloning v2
What is this?
This project is an improved version of Real-Time-Voice-Cloning. Our emotional voice cloning implementation is here!
Installation
Install ffmpeg. This is necessary for reading audio files.
Create a new conda environment with
conda create -n rtvc python=3.7.13
Install PyTorch. Pick the proposed CUDA version if you have a GPU; otherwise pick CPU. The versions used for this project:
torch=1.9.1+cu111
torchvision=0.10.1+cu111
Install the remaining requirements with
pip install -r requirements.txt
Install the spaCy model en_core_web_sm with
python -m spacy download en_core_web_sm
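After the installation steps above, a quick check like the following can confirm that the prerequisites are visible from the new environment. This is a minimal sketch, not part of the repository: the function name check_environment is hypothetical, and it only tests that ffmpeg is on PATH and that the packages can be located, not that their versions match.

```python
import importlib.util
import shutil
from typing import Dict


def check_environment() -> Dict[str, bool]:
    """Report whether each prerequisite from the installation steps is visible.

    Hypothetical helper for illustration; it does not check versions.
    """
    return {
        # ffmpeg must be on PATH so that audio files can be read.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        # find_spec returns None when a top-level package cannot be located.
        "torch": importlib.util.find_spec("torch") is not None,
        "spacy": importlib.util.find_spec("spacy") is not None,
        # The spaCy model is installed as its own importable package.
        "en_core_web_sm": importlib.util.find_spec("en_core_web_sm") is not None,
    }
```

Calling check_environment() inside the rtvc environment returns a dictionary of booleans; any False entry points at the installation step that still needs attention.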
Training
Encoder
Download dataset:
- LibriSpeech: train-other-500 for training, dev-other for validation (extract as /LibriSpeech/)
- VoxCeleb1: Dev A - D for training, Test for validation, as well as the metadata file vox1_meta.csv (extract as /VoxCeleb1/ and /VoxCeleb1/vox1_meta.csv)
- VoxCeleb2: Dev A - H for training, Test for validation (extract as /VoxCeleb2/)
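Before running the preprocessing script, it can help to verify that the extracted datasets sit where the list above says they should. The sketch below assumes the leading "/" in each extraction path is relative to <datasets_root>; the constant ENCODER_PATHS and the function missing_encoder_paths are hypothetical names, not part of the repository.

```python
from pathlib import Path
from typing import List

# Paths named in the dataset list, taken relative to <datasets_root> (assumption).
ENCODER_PATHS = [
    "LibriSpeech",
    "VoxCeleb1",
    "VoxCeleb1/vox1_meta.csv",
    "VoxCeleb2",
]


def missing_encoder_paths(datasets_root: str) -> List[str]:
    """Return the expected encoder dataset paths that do not exist yet."""
    root = Path(datasets_root)
    return [rel for rel in ENCODER_PATHS if not (root / rel).exists()]
```

An empty list means every path from the list above is present; otherwise the returned entries show what still has to be downloaded or moved.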
Encoder preprocessing:
python encoder_preprocess.py <datasets_root>
Encoder training:
It is recommended to start a visdom server to monitor training with
visdom
Then start training with
python encoder_train.py <model_id> <datasets_root>/SV2TTS/encoder
Synthesizer
Download dataset:
- LibriSpeech: train-clean-100 and train-clean-360 for training, dev-clean for validation (extract as /LibriSpeech/)
- LibriSpeech alignments: merge the directory structure with the LibriSpeech datasets you have downloaded (do not copy alignments for datasets you haven't downloaded, otherwise the scripts will think you have them)
- VCTK: used for training and validation
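The alignment merge described above can be sketched as follows. This is one possible approach under stated assumptions, not the repository's own tooling: the function merge_alignments is hypothetical, and it assumes the alignment archive mirrors the LibriSpeech directory layout. A file is copied only when its target directory already exists, so alignments for subsets you have not downloaded are skipped.

```python
import shutil
from pathlib import Path
from typing import List


def merge_alignments(alignments_root: str, librispeech_root: str) -> List[Path]:
    """Copy alignment files into an existing LibriSpeech tree.

    Hypothetical helper: files whose parent directory is absent on the
    dataset side are skipped, so no new subset directories are created.
    """
    src_root = Path(alignments_root)
    dst_root = Path(librispeech_root)
    copied = []
    for src in sorted(src_root.rglob("*")):
        if not src.is_file():
            continue
        dst = dst_root / src.relative_to(src_root)
        # Skip alignments for data that has not been downloaded.
        if not dst.parent.is_dir():
            continue
        shutil.copy2(str(src), str(dst))
        copied.append(dst)
    return copied
```

The returned list shows which files were merged, which makes it easy to confirm that nothing landed under a subset you do not have.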
Synthesizer preprocessing:
python synthesizer_preprocess_audio.py <datasets_root>
python synthesizer_preprocess_embeds.py <datasets_root>/SV2TTS/synthesizer
Synthesizer training:
python synthesizer_train.py <model_id> <datasets_root>/SV2TTS/synthesizer --use_tb
To monitor the training progress, run
tensorboard --logdir log/vc/synthesizer --host localhost --port 8088
Vocoder
Download dataset:
The same as for the synthesizer. You can skip this step if you have already downloaded the synthesizer training datasets.
Vocoder preprocessing:
python vocoder_preprocess.py <datasets_root>
Vocoder training:
python vocoder_train.py <model_id> <datasets_root> --use_tb
To monitor the training progress, run
tensorboard --logdir log/vc/vocoder --host localhost --port 8080
Note:
Training checkpoints are saved periodically, so rerunning the same training command resumes training from the saved checkpoint when one exists.
Inference
Terminal:
python demo_cli.py
First enter the number of audio files, then the audio file paths, then the text message. The attention alignments and mel spectrograms are stored in syn_results/. The generated audio is stored in out_audios/.
GUI demo:
python demo_toolbox.py
Dimension reduction visualization
Download dataset:
LibriSpeech: test-other (extract as /LibriSpeech/)
Preprocessing:
python encoder_test_preprocess.py <datasets_root>
Visualization:
python encoder_test_visualization.py <model_id> <datasets_root>
The results are saved in dim_reduction_results/.
Pretrained models
You can download the pretrained model from this link and extract it as saved_models/default
Demo results
The audio results are here