ClassicVC

ClassicVC is an any-to-any voice conversion model that lets users design their own original speaker styles by selecting coordinates in continuous latent spaces. The model components are implemented in PyTorch and are fully compatible with ONNX.
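As a rough illustration of what "selecting coordinates" in a continuous latent space can mean, the sketch below blends two style vectors by linear interpolation. It rests on assumptions: that a speaker style can be treated as a fixed-length vector of floats, and that points between two embedded speakers are meaningful styles. The function name `interpolate_style`, the 4-dimensional vectors, and their values are made up for this example and are not taken from the ClassicVC code.

```python
from typing import List, Sequence


def interpolate_style(a: Sequence[float], b: Sequence[float], t: float) -> List[float]:
    """Return the point (1 - t) * a + t * b on the line between two style vectors."""
    if len(a) != len(b):
        raise ValueError("style vectors must have the same length")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


# Two made-up 4-dimensional style vectors (hypothetical values and dimensionality).
speaker_a = [0.0, 1.0, -2.0, 0.5]
speaker_b = [1.0, 0.0, 2.0, 0.5]

# t = 0.25 picks a style that sits closer to speaker_a than to speaker_b.
custom_style = interpolate_style(speaker_a, speaker_b, 0.25)
print(custom_style)
```

Moving `t` from 0 to 1 slides the result from the first speaker to the second; any other way of choosing coordinates (for example, picking them in a GUI) would produce a vector of the same kind.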

MMCXLI provides a dedicated graphical user interface (GUI) for ClassicVC. It runs on wxPython and ONNX Runtime. Users can download the ONNX files and try out speech conversion without installing PyTorch or training a model on their own voice data.

Model Details

Model Description

Developed by: Lyodos (Lyodos the City of the Museum)

Model Sources

Repository: GitHub

Uses

Under the MIT License, users may use the model code and checkpoints for research purposes. The model is provided with no guarantees.

Direct Use

MMCXLI

Out-of-Scope Use

This model was prototyped as a hobbyist's research project on any-to-any voice conversion, and we make no guarantees, especially regarding its reliability or real-time operation.

The MIT License is the only license stated for this model, so as the developer we cannot prohibit its use in settings that involve an unspecified number of people (such as web broadcasting) or in mission-critical applications (including medical, transportation, infrastructure, and weapon systems). However, we do not encourage such use.

Bias, Risks, and Limitations

To make the speaker latent space of ClassicVC's style encoder as inclusive of natural human voices as possible, we used three large-scale speech corpora: LibriSpeech, Samrómur Children 21.09, and VoxCeleb 1 and 2.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

How to Get Started with the Model

Notebook 01 in the ClassicVC repository walks through the procedure for offline (non-real-time) voice conversion.

The MMCXLI repository provides the GUI, which requires a local Python environment.
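Before wiring a downloaded ONNX file into any pipeline, it can help to check which inputs and outputs it declares. The following is a minimal sketch using the ONNX Runtime Python API, not the procedure from the notebooks or from MMCXLI. The helper names `format_io` and `describe_onnx_model` are hypothetical, and the file path in the commented usage line is a placeholder rather than a file name from the repository.

```python
from typing import Dict, Iterable, List


def format_io(nodes: Iterable) -> List[str]:
    """Render each node's name, element type, and shape as one line of text."""
    return [f"{n.name}: {n.type} {n.shape}" for n in nodes]


def describe_onnx_model(path: str) -> Dict[str, List[str]]:
    """Open an ONNX file on the CPU and list its declared inputs and outputs."""
    # Imported here so that format_io stays usable without ONNX Runtime installed.
    import onnxruntime as ort

    session = ort.InferenceSession(path, providers=["CPUExecutionProvider"])
    return {
        "inputs": format_io(session.get_inputs()),
        "outputs": format_io(session.get_outputs()),
    }


# Example usage (placeholder path; substitute a file you actually downloaded):
# info = describe_onnx_model("path/to/downloaded_model.onnx")
# print("\n".join(info["inputs"] + info["outputs"]))
```

The printed names and shapes show what each component expects, which is the information needed to prepare arrays for `session.run`.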

Training Details

Training Data

The model checkpoints provided here were trained on the following three datasets.

LibriSpeech ASR corpus

V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 2015, pp. 5206-5210, doi: 10.1109/ICASSP.2015.7178964.
https://ieeexplore.ieee.org/document/7178964
https://openslr.org/12/

Samrómur Children 21.09

Mena, Carlos; et al., 2021, Samromur Children 21.09, CLARIN-IS, http://hdl.handle.net/20.500.12537/185.
https://repository.clarin.is/repository/xmlui/handle/20.500.12537/185
https://openslr.org/117/

VoxCeleb 1 and 2

A. Nagrani*, J. S. Chung*, A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset", Interspeech 2017
J. S. Chung*, A. Nagrani*, A. Zisserman, "VoxCeleb2: Deep Speaker Recognition", Interspeech 2018
A. Nagrani*, J. S. Chung*, W. Xie, A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild", Computer Speech and Language, 2019
https://huggingface.co/datasets/ProgramComputer/voxceleb/tree/main/vox2

Training Procedure

Notebook 02 in the ClassicVC repository describes the data preparation procedure.

Notebook 03 in the ClassicVC repository provides the training code.