kevinwang676 committed
Commit · 74e658e
Parent(s): 06cf5cf

Update README.md

README.md CHANGED
@@ -1,118 +1,13 @@
Removed content (previous README):

<a href='https://thuhcsi.github.io/NeuCoSVC/'><img src='https://img.shields.io/badge/Demo-green'></a>
<a href='https://arxiv.org/abs/2312.04919'><img src='https://img.shields.io/badge/Paper-red'></a>
[![GitHub](https://img.shields.io/github/stars/thuhcsi/NeuCoSVC?style=social)](https://github.com/thuhcsi/NeuCoSVC)

This is the official implementation of NeuCoSVC, an any-to-any singing voice conversion model from our [paper](https://arxiv.org/abs/2312.04919).
Audio samples are available at [https://thuhcsi.github.io/NeuCoSVC/](https://thuhcsi.github.io/NeuCoSVC/). The trained checkpoints are available from [Google Drive](https://drive.google.com/file/d/1QjoQ6mt7-OZPHF4X20TXbikYdg8NlepR/view?usp=drive_link).

![NeuCoSVC](./img/Architecture.png)

Figure: The structure of the proposed SVC system: (a) the SSL feature extraction and matching module; (b) the neural harmonic signal generator; (c) the audio synthesizer.

## Setup

### Environment

We recommend installing the project's environment using Anaconda. The `requirements.txt` file contains a curated list of dependencies for the development environment (Torch 2.0.1 + cu117). You can use the following commands to set up the environment:

```bash
conda create -n NeuCoSVC python=3.10.6
conda activate NeuCoSVC
pip install -r requirements.txt
```

Additionally, you can find the complete original development environment in the `requirements_all.txt` file.

In addition, [REAPER](https://github.com/google/REAPER) is required for pitch extraction. You need to download and install REAPER, and then modify the path to REAPER in [utils/pitch_ld_extraction.py](utils/pitch_ld_extraction.py).
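The path change itself is usually a one-line edit. A minimal sketch is shown below; the constant name and path are hypothetical placeholders, not the script's actual identifiers:

```python
# utils/pitch_ld_extraction.py -- point the script at your local REAPER binary.
# The variable name below is a hypothetical example; check the script for the real one.
REAPER_PATH = "/path/to/REAPER/build/reaper"
```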
### Checkpoints

The checkpoint for the frozen WavLM Large encoder can be obtained from the [original WavLM repository](https://github.com/microsoft/unilm/tree/master/wavlm).

The trained FastSVC model with neural harmonic filters can be downloaded from [Google Drive](https://drive.google.com/file/d/1QjoQ6mt7-OZPHF4X20TXbikYdg8NlepR/view?usp=drive_link).

Then you need to put the WavLM-Large.pt file and the model.pkl folder in the `pretrained` folder.
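Assuming the default layout (note that `model.pkl` from the download is a folder, not a single file), the `pretrained` directory should end up looking roughly like this:

```
- pretrained/
 |- WavLM-Large.pt
 |- model.pkl/
```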
## Inference

Note that the source waveform must be 24kHz. `--speech_enroll` is recommended when using speech as the reference audio; in that case the pitch of the reference audio is raised to 1.2 times its original value during the pitch shift, to cover the pitch gap between singing and speech.

```bash
python infer.py --src_wav_path src-wav-path --ref_wav_path ref-wav-path --out_path out-path --speech_enroll
```
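For intuition, the enrollment pitch shift amounts to scaling the reference F0 contour; a toy sketch (not the repository's exact code) is:

```python
import numpy as np

# Toy F0 track in Hz; 0 marks unvoiced frames.
f0_ref = np.array([110.0, 0.0, 123.5, 130.8])

# With --speech_enroll, the reference pitch is raised to 1.2x its original value
# to bridge the pitch gap between speech and singing.
f0_shifted = np.where(f0_ref > 0, f0_ref * 1.2, 0.0)
```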
## Training

### Data Preparation

Taking the OpenSinger dataset as an example, the dataset needs to be **resampled to 24kHz** and organized as follows:

```
- OpenSinger_24k
 |- ManRaw/
 |  |- SingerID_SongName/
 |  |  |- SingerID_SongName_SongClipNumber.wav
 |  |  |- ...
 |  |- ...
 |- WomanRaw/
 |  |- 0_光年之外/
 |  |  |- 0_光年之外_0.wav
 |  |  |- ...
 |  |- ...
```

Then perform data preprocessing:

1. Extract pitch and loudness. Specify the directories for pitch and loudness using the `--pitch_dir` and `--ld_dir` parameters respectively. If not specified, the features will be saved in the `pitch`/`loudness` folder under the `dataset-root` directory.

    ```bash
    python -m utils.pitch_ld_extraction --data_root dataset-root --pitch_dir dir-for-pitch --ld_dir dir-for-loudness --n_cpu 8
    ```

2. Extract pre-matching features for each audio piece (see the sketch after this list). The program uses the average of the last five layers' WavLM features for distance calculation and kNN, and replaces and concatenates the corresponding 6th-layer WavLM features for audio synthesis; this configuration has shown improved performance in experiments. If `--out_dir` is not specified, the features will be saved in the `wavlm_features` folder under the `dataset-root` directory.

    ```bash
    python -m dataset.prematch_dataset --data_root dataset-root --out_dir dir-for-wavlm-feats
    ```

3. Split the dataset into train, valid, and test sets, and generate the metadata files. By default, singing audio clips from the 26th and 27th male singers (OpenSinger/ManRaw/26(7)\_\*/\*.wav) and the 46th and 47th female singers (OpenSinger/WomanRaw/46(7)\_\*/\*.wav) are used as the test set. The remaining singers' audio files are randomly divided into the train and valid sets in a 9:1 ratio. Specify the directories for features using the `--wavlm_dir`, `--pitch_dir`, and `--ld_dir` parameters. If not specified, the corresponding features will be read from the `wavlm_features`, `pitch`, and `loudness` folders under the `data_root` directory.

    ```bash
    python dataset/metadata.py --data_root dataset-root
    ```
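The pre-matching in step 2 can be pictured with the following minimal sketch. It is not the repository's implementation: it assumes per-layer WavLM features have already been extracted, uses the mean of the last five layers as the matching representation, and takes the synthesis features from the 6th layer; the value of `k` is illustrative.

```python
import torch

def prematch_features(src_layers, ref_layers, k=4):
    """src_layers / ref_layers: lists of per-layer WavLM features, each (frames, dim)."""
    # Average the last five layers to build the representation used for matching.
    src_query = torch.stack(src_layers[-5:]).mean(dim=0)   # (T_src, D)
    ref_query = torch.stack(ref_layers[-5:]).mean(dim=0)   # (T_ref, D)

    # 6th-layer features (index 5 in a 0-indexed list) feed the synthesizer.
    ref_synth = ref_layers[5]                               # (T_ref, D)

    # k nearest reference frames for every source frame.
    dists = torch.cdist(src_query, ref_query)               # (T_src, T_ref)
    knn_idx = dists.topk(k, largest=False).indices          # (T_src, k)

    # Replace each source frame with the mean of its matched 6th-layer frames.
    return ref_synth[knn_idx].mean(dim=1)                   # (T_src, D)
```

The sketch only covers the matching itself; how the matched features are then combined with (concatenated to) the source's own 6th-layer features follows the description in step 2 above.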
### Train Decoder

```bash
# for single-GPU training:
python start.py --data_root dataset-dir --config configs/config.json --cp_path pretrained

# for distributed multi-GPU training:
torchrun --nnodes=1 --nproc_per_node=4 start.py --data_root dataset-dir --config configs/config.json --cp_path pretrained
```

To modify the training configuration or model parameters, you can edit the `configs/config.json` file.
## Acknowledgements

This work is inspired by [kNN-VC](https://github.com/bshall/knn-vc/tree/master) and built on the [U-net SVC](https://www.isca-speech.org/archive/interspeech_2022/li22da_interspeech.html) framework.

We have incorporated publicly available code from the [kNN-VC](https://github.com/bshall/knn-vc/tree/master) and [WavLM](https://github.com/microsoft/unilm/tree/master/wavlm) projects.

We would like to express our gratitude to the authors of kNN-VC and WavLM for sharing their codebases. Their contributions have been instrumental in the development of our project.

## Citation

If this repo is helpful to your research or projects, please kindly star our repo and cite our paper as follows:

```bibtex
@misc{sha2023neural,
      title={Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion},
      author={Binzhu Sha and Xu Li and Zhiyong Wu and Ying Shan and Helen Meng},
      year={2023},
      eprint={2312.04919},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}
```
Added content (new Space README):

---
title: NeuCoSVC
emoji: 🚀
colorFrom: pink
colorTo: purple
sdk: gradio
sdk_version: 3.36.0
app_file: app.py
pinned: false
license: mit
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference