|
Metadata-Version: 2.1
Name: matcha-tts
Version: 0.0.5.1
Summary: 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
Home-page: https://shivammehta25.github.io/Matcha-TTS
Author: Shivam Mehta
Author-email: [email protected]
Requires-Python: >=3.9.0
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: torch>=2.0.0
Requires-Dist: torchvision>=0.15.0
Requires-Dist: lightning>=2.0.0
Requires-Dist: torchmetrics>=0.11.4
Requires-Dist: hydra-core==1.3.2
Requires-Dist: hydra-colorlog==1.2.0
Requires-Dist: hydra-optuna-sweeper==1.2.0
Requires-Dist: rootutils
Requires-Dist: pre-commit
Requires-Dist: rich
Requires-Dist: pytest
Requires-Dist: phonemizer
Requires-Dist: tensorboard
Requires-Dist: librosa
Requires-Dist: Cython
Requires-Dist: numpy
Requires-Dist: einops
Requires-Dist: inflect
Requires-Dist: Unidecode
Requires-Dist: scipy
Requires-Dist: torchaudio
Requires-Dist: matplotlib
Requires-Dist: pandas
Requires-Dist: conformer==0.3.2
Requires-Dist: diffusers==0.25.0
Requires-Dist: notebook
Requires-Dist: ipywidgets
Requires-Dist: gradio==3.43.2
Requires-Dist: gdown
Requires-Dist: wget
Requires-Dist: seaborn
Requires-Dist: piper_phonemize
|
|
|
<div align="center">

# 🍵 Matcha-TTS: A fast TTS architecture with conditional flow matching
|
|
|
<p style="text-align: center;">
  <img src="https://shivammehta25.github.io/Matcha-TTS/images/logo.png" height="128"/>
</p>

</div>
|
|
|
> This is the official code implementation of 🍵 Matcha-TTS [ICASSP 2024].

We propose 🍵 Matcha-TTS, a new approach to non-autoregressive neural TTS that uses [conditional flow matching](https://arxiv.org/abs/2210.02747) (similar to [rectified flows](https://arxiv.org/abs/2209.03003)) to speed up ODE-based speech synthesis. Our method:

- Is probabilistic
- Has a compact memory footprint
- Sounds highly natural
- Is very fast to synthesise from
|
|
|
Check out our [demo page](https://shivammehta25.github.io/Matcha-TTS) and read [our ICASSP 2024 paper](https://arxiv.org/abs/2309.03199) for more details.

[Pre-trained models](https://drive.google.com/drive/folders/17C_gYgEHOxI5ZypcfE_k1piKCtyR0isJ?usp=sharing) are downloaded automatically by the CLI and the Gradio interface.

You can also [try 🍵 Matcha-TTS in your browser on HuggingFace 🤗 spaces](https://huggingface.co/spaces/shivammehta25/Matcha-TTS).
|
|
|
|
|
|
|
[Watch the demo video](https://youtu.be/xmvJkz3bqw0)
|
|
|
|
|
|
|
## Installation

1. Create an environment (suggested but optional)

```bash
conda create -n matcha-tts python=3.10 -y
conda activate matcha-tts
```
|
|
|
2. Install Matcha-TTS using pip or from source

```bash
pip install matcha-tts
```

or from source:

```bash
git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS
pip install -e .
```
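Either way, a quick sanity check of the installation (assuming pip put the console script on your `PATH`) is to print the CLI help:

```bash
matcha-tts --help
```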
|
|
|
3. Run the CLI, the Gradio app, or the Jupyter notebook

```bash
matcha-tts --text "<INPUT TEXT>"
```

or

```bash
matcha-tts-app
```

or open `synthesis.ipynb` in Jupyter Notebook.
|
|
|
|
|
|
|
## Voice synthesis

- To synthesise from given text, run:

```bash
matcha-tts --text "<INPUT TEXT>"
```
|
|
|
- To synthesise from a file, run:

```bash
matcha-tts --file <PATH TO FILE>
```

- To batch synthesise from a file, run:

```bash
matcha-tts --file <PATH TO FILE> --batched
```
|
|
|
Additional arguments:

- Speaking rate

```bash
matcha-tts --text "<INPUT TEXT>" --speaking_rate 1.0
```

- Sampling temperature

```bash
matcha-tts --text "<INPUT TEXT>" --temperature 0.667
```

- Euler ODE solver steps

```bash
matcha-tts --text "<INPUT TEXT>" --steps 10
```
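These arguments can be combined in a single call; the values below are only illustrative:

```bash
matcha-tts --text "<INPUT TEXT>" --speaking_rate 1.1 --temperature 0.667 --steps 10
```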
|
|
|
|
|
|
|
## Train with your own dataset

Let's assume we are training with LJ Speech.

1. Download the dataset from [here](https://keithito.com/LJ-Speech-Dataset/), extract it to `data/LJSpeech-1.1`, and prepare the file lists to point to the extracted data, as in [item 5 of the NVIDIA Tacotron 2 repo setup](https://github.com/NVIDIA/tacotron2#setup).
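In that format, each line of a filelist pairs an audio path with its transcript, separated by a `|` character; a hypothetical line looks like this:

```text
data/LJSpeech-1.1/wavs/LJ001-0001.wav|<TRANSCRIPT OF THE UTTERANCE>
```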
|
|
|
2. Clone and enter the Matcha-TTS repository

```bash
git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS
```

3. Install the package from source

```bash
pip install -e .
```

4. Go to `configs/data/ljspeech.yaml` and change

```yaml
train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt
```

to the paths of your train and validation filelists.
|
|
|
5. Generate normalisation statistics using the dataset configuration yaml file

```bash
matcha-data-stats -i ljspeech.yaml
```

Update these values in `configs/data/ljspeech.yaml` under the `data_statistics` key.

```yaml
data_statistics:
  mel_mean: -5.536622
  mel_std: 2.116101
```
|
|
|
6. Run the training script

```bash
make train-ljspeech
```

or

```bash
python matcha/train.py experiment=ljspeech
```
|
|
|
- For a minimum-memory run:

```bash
python matcha/train.py experiment=ljspeech_min_memory
```

- For multi-GPU training, run:

```bash
python matcha/train.py experiment=ljspeech trainer.devices=[0,1]
```
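Since training is configured with Hydra, other settings can be overridden from the command line in the same way. The `data.batch_size` key below is an assumption based on the Lightning-Hydra template layout; check `configs/` for the exact name:

```bash
python matcha/train.py experiment=ljspeech data.batch_size=16
```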
|
|
|
7. Synthesise from the custom trained model

```bash
matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT>
```
|
|
|
|
|
|
|
## ONNX support

> Special thanks to [@mush42](https://github.com/mush42) for implementing ONNX export and inference support.

It is possible to export Matcha checkpoints to [ONNX](https://onnx.ai/) and run inference on the exported ONNX graph.
|
|
|
|
|
|
|
### ONNX export

To export a checkpoint to ONNX, first install ONNX with

```bash
pip install onnx
```

then run the following:

```bash
python3 -m matcha.onnx.export matcha.ckpt model.onnx --n-timesteps 5
```
|
|
|
Optionally, the ONNX exporter accepts **vocoder-name** and **vocoder-checkpoint** arguments. This enables you to embed the vocoder in the exported graph and generate waveforms in a single run (similar to end-to-end TTS systems).
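For example (the exact flag spellings and the HiFi-GAN vocoder name here are assumptions; check `python3 -m matcha.onnx.export --help` for the real interface):

```bash
python3 -m matcha.onnx.export matcha.ckpt model.onnx --n-timesteps 5 \
    --vocoder-name hifigan --vocoder-checkpoint <PATH TO VOCODER CHECKPOINT>
```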
|
|
|
**Note** that `n_timesteps` is treated as a hyper-parameter rather than a model input. This means you should specify it during export (not during inference). If not specified, `n_timesteps` is set to **5**.
|
|
|
**Important**: for now, torch>=2.1.0 is needed for export, since the `scaled_dot_product_attention` operator is not exportable in older versions. Until the final version is released, those who want to export their models must install torch>=2.1.0 manually as a pre-release.
|
|
|
|
|
|
|
### ONNX inference

To run inference on the exported model, first install `onnxruntime` using

```bash
pip install onnxruntime
pip install onnxruntime-gpu
```

then use the following:

```bash
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs
```
|
|
|
You can also control synthesis parameters:

```bash
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --temperature 0.4 --speaking_rate 0.9 --spk 0
```

To run inference on **GPU**, make sure to install the **onnxruntime-gpu** package, and then pass `--gpu` to the inference command:

```bash
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --gpu
```
|
|
|
If you exported only Matcha to ONNX, this will write mel-spectrograms as graphs and `numpy` arrays to the output directory.
If you embedded the vocoder in the exported graph, this will write `.wav` audio files to the output directory.
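If you are unsure whether a given `.onnx` file has the vocoder embedded, a minimal sketch using the `onnx` package (installed above) can list the graph's inputs and outputs:

```python
import onnx

# Load the exported graph and print its interface: a Matcha-only
# graph produces a mel-spectrogram output, while a graph with an
# embedded vocoder produces a waveform output instead.
model = onnx.load("model.onnx")
print("inputs: ", [i.name for i in model.graph.input])
print("outputs:", [o.name for o in model.graph.output])
```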
|
|
|
If you exported only Matcha to ONNX, and you want to run a full TTS pipeline, you can pass a path to a vocoder model in `ONNX` format:

```bash
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --vocoder hifigan.small.onnx
```

This will write `.wav` audio files to the output directory.
|
|
|
|
|
|
|
## Citation

If you use our code or otherwise find this work useful, please cite our paper:

```bibtex
@inproceedings{mehta2024matcha,
  title={Matcha-{TTS}: A fast {TTS} architecture with conditional flow matching},
  author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
  booktitle={Proc. ICASSP},
  year={2024}
}
```
|
|
|
|
|
|
|
## Acknowledgements

Since this code uses [Lightning-Hydra-Template](https://github.com/ashleve/lightning-hydra-template), you have all the powers that come with it.

Other source code we would like to acknowledge:

- [Coqui-TTS](https://github.com/coqui-ai/TTS/tree/dev): For helping me figure out how to make Cython binaries pip-installable, and for the encouragement
- [Hugging Face Diffusers](https://huggingface.co/): For their awesome diffusers library and its components
- [Grad-TTS](https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS): For the monotonic alignment search source code
- [torchdyn](https://github.com/DiffEqML/torchdyn): Useful for trying other ODE solvers during research and development
- [labml.ai](https://nn.labml.ai/transformers/rope/index.html): For the RoPE implementation
|
|