A Vietnamese TTS
================

Duration model + Acoustic model + HiFiGAN vocoder for Vietnamese text-to-speech applications.
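The three stages compose in sequence: phoneme durations are predicted first, the acoustic model expands phonemes into mel frames, and the vocoder turns mel frames into a waveform. A minimal sketch of that flow (NOT the repo's API — every function is a stand-in, and the 80 mel bins and hop size of 256 are assumptions):

```python
# Stand-in sketch of the duration -> acoustic -> vocoder pipeline.
def text_to_phonemes(text):
    return text.split()  # stand-in: the real system uses a lexicon

def duration_model(phonemes):
    return [1] * len(phonemes)  # predicted frames per phoneme (stand-in)

def acoustic_model(phonemes, durations):
    # one 80-dim mel frame per predicted frame (stand-in)
    return [[0.0] * 80 for d in durations for _ in range(d)]

def vocoder(mel_frames):
    return [0.0] * (len(mel_frames) * 256)  # waveform samples, hop = 256 assumed

phonemes = text_to_phonemes("hôm qua em tới trường")
durations = duration_model(phonemes)
mel = acoustic_model(phonemes, durations)
wav = vocoder(mel)
print(len(wav))  # 5 phonemes -> 5 mel frames -> 1280 samples
```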

Online demo at https://huggingface.co/spaces/ntt123/vietTTS.

A synthesized audio clip: [clip.wav](assets/infore/clip.wav). A Colab notebook: [notebook](https://colab.research.google.com/drive/1oczrWOQOr1Y_qLdgis1twSlNZlfPVXoY?usp=sharing).


🔔 Check out the experimental `multi-speaker` branch (`git checkout multi-speaker`) for multi-speaker support. 🔔

Install
-------

```sh
git clone https://github.com/NTT123/vietTTS.git
cd vietTTS
pip3 install -e .
```


Quick start using pretrained models
-----------------------------------

```sh
bash ./scripts/quick_start.sh
```


Download InfoRe dataset
-----------------------

```sh
python ./scripts/download_aligned_infore_dataset.py
```

**Note**: this is a denoised and aligned version of the original dataset, donated by the InfoRe Technology company (see [here](https://www.facebook.com/groups/j2team.community/permalink/1010834009248719/)). You can download the original dataset (**InfoRe Technology 1**) [here](https://github.com/TensorSpeech/TensorFlowASR/blob/main/README.md#vietnamese).

See `notebooks/denoise_infore_dataset.ipynb` for instructions on how to denoise the dataset. We use the Montreal Forced Aligner (MFA) to align transcripts with speech (TextGrid files).
See `notebooks/align_text_audio_infore_mfa.ipynb` for instructions on how to create the TextGrid files.
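MFA emits alignments in Praat's long TextGrid text format, where each interval records a start time, an end time, and a label. A minimal sketch of pulling those intervals out with the stdlib (the `sample` string below is illustrative, not taken from the dataset):

```python
import re

def parse_intervals(textgrid_str):
    """Extract (xmin, xmax, text) triples from a Praat TextGrid in long text format."""
    pattern = re.compile(
        r'xmin = ([\d.]+)\s*\n\s*xmax = ([\d.]+)\s*\n\s*text = "([^"]*)"'
    )
    return [(float(a), float(b), t) for a, b, t in pattern.findall(textgrid_str)]

sample = '''        intervals [1]:
            xmin = 0.0
            xmax = 0.37
            text = "sil"
        intervals [2]:
            xmin = 0.37
            xmax = 0.62
            text = "hôm"
'''
print(parse_intervals(sample))
# [(0.0, 0.37, 'sil'), (0.37, 0.62, 'hôm')]
```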

Train duration model
--------------------

```sh
python -m vietTTS.nat.duration_trainer
```


Train acoustic model
--------------------

```sh
python -m vietTTS.nat.acoustic_trainer
```


Train HiFiGAN vocoder
---------------------

We use the original implementation from the HiFiGAN authors at https://github.com/jik876/hifi-gan. Use the config file at `assets/hifigan/config.json` to train your model.

```sh
git clone https://github.com/jik876/hifi-gan.git

# create a dataset in hifi-gan format
ln -sf `pwd`/train_data hifi-gan/data
cd hifi-gan/data
ls -1 *.TextGrid | sed -e 's/\.TextGrid$//' > files.txt
cd ..
head -n 100 data/files.txt > val_files.txt
tail -n +101 data/files.txt > train_files.txt
rm data/files.txt

# training
python train.py \
  --config ../assets/hifigan/config.json \
  --input_wavs_dir=data \
  --input_training_file=train_files.txt \
  --input_validation_file=val_files.txt
```
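The `head`/`tail` pair above splits the file list so the first 100 utterances become the validation set and the rest become the training set. A quick Python equivalent of that split (the utterance ids are hypothetical):

```python
# Mirror of the shell split: first 100 ids -> validation, rest -> training.
ids = [f"utt{i:04d}" for i in range(250)]  # hypothetical utterance ids

val_ids = ids[:100]    # head -n 100
train_ids = ids[100:]  # tail -n +101

assert not set(val_ids) & set(train_ids)  # the two lists are disjoint
print(len(val_ids), len(train_ids))  # 100 150
```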

Fine-tune on ground-truth-aligned (GTA) melspectrograms:

```sh
cd /path/to/vietTTS  # go to the vietTTS directory
python -m vietTTS.nat.zero_silence_segments -o train_data  # zero out all [sil, sp, spn] segments
python -m vietTTS.nat.gta -o /path/to/hifi-gan/ft_dataset  # create GTA melspectrograms in the hifi-gan/ft_dataset directory

# turn on fine-tuning
cd /path/to/hifi-gan
python train.py \
  --fine_tuning True \
  --config ../assets/hifigan/config.json \
  --input_wavs_dir=data \
  --input_training_file=train_files.txt \
  --input_validation_file=val_files.txt
```

Then use the following command to convert the PyTorch checkpoint to Haiku format:

```sh
cd ..
python -m vietTTS.hifigan.convert_torch_model_to_haiku \
  --config-file=assets/hifigan/config.json \
  --checkpoint-file=hifi-gan/cp_hifigan/g_[latest_checkpoint]
```

Synthesize speech
-----------------

```sh
python -m vietTTS.synthesizer \
  --lexicon-file=train_data/lexicon.txt \
  --text="hôm qua em tới trường" \
  --output=clip.wav
```
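After synthesis, you can sanity-check the output with Python's stdlib `wave` module. Since the synthesizer's actual sample rate isn't stated here, the sketch below builds a dummy 1-second mono 16-bit 16 kHz clip in memory instead of reading a real `clip.wav`:

```python
import io
import wave

# Build a dummy 1-second silent clip (mono, 16-bit, 16 kHz) in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

# Re-open and compute the duration, as you would for a real clip.wav.
buf.seek(0)
with wave.open(buf, "rb") as w:
    duration = w.getnframes() / w.getframerate()
print(duration)  # 1.0
```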