---
language:
- ca
license:
- apache-2.0
tags:
- matcha-tts
- acoustic modelling
- speech
- multispeaker
pipeline_tag: text-to-speech
datasets:
- projecte-aina/festcat_trimmed_denoised
- projecte-aina/openslr-slr69-ca-trimmed-denoised
---

# Matcha-TTS Catalan Multispeaker

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [How to use](#how-to-use)
- [Training](#training)
- [Evaluation](#evaluation)
- [Citation](#citation)
- [Additional information](#additional-information)

</details>

## Model description

**Matcha-TTS** is an encoder-decoder architecture designed for fast acoustic modelling in TTS. The encoder predicts phoneme durations and their average acoustic features.
The decoder is essentially a U-Net inspired by [Grad-TTS](https://arxiv.org/pdf/2105.06337.pdf) that is based on the Transformer architecture but uses 1D instead of 2D CNNs,
which greatly reduces memory consumption and speeds up synthesis.

**Matcha-TTS** is a non-autoregressive model trained using optimal-transport conditional flow matching (OT-CFM).
This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching.
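
To make the ODE-based sampling idea concrete, here is a minimal, self-contained sketch in plain Python. It is not the model's decoder: `toy_field` is a hypothetical closed-form vector field that stands in for the trained network (it assumes a single target point and `sigma_min = 0`), and `ot_cfm_pair`, `euler_sample` and `TARGET` are illustrative names. `ot_cfm_pair` shows the straight-line conditional path and constant target velocity used by OT-CFM, and `euler_sample` shows how a sample is obtained by integrating the vector field from noise at `t = 0` to data at `t = 1` in a handful of Euler steps.

```python
import random


def ot_cfm_pair(x0, x1, t, sigma_min=1e-4):
    """Point on the OT conditional path at time t and its constant target velocity."""
    x_t = [(1.0 - (1.0 - sigma_min) * t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def euler_sample(vector_field, x0, n_steps):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        v = vector_field(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


TARGET = [1.0, -2.0, 0.5]  # toy "data" point, chosen arbitrarily


def toy_field(x, t):
    # Hypothetical stand-in for the trained decoder: the exact flow towards
    # a single target point (assumes sigma_min = 0).
    return [(b - xi) / (1.0 - t) for xi, b in zip(x, TARGET)]


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in TARGET]  # starting point at t=0
sample = euler_sample(toy_field, noise, n_steps=4)
print(sample)  # ends up numerically close to TARGET
```

Because the paths in this toy case are straight lines, even a few Euler steps reach the target, which is the intuition behind needing fewer synthesis steps. A real decoder replaces `toy_field` with a learned network conditioned on the encoder output.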

## Intended uses and limitations

This model is intended to serve as an acoustic feature generator for multispeaker text-to-speech systems for the Catalan language.
It has been fine-tuned using a Catalan phonemizer, so if it is used for other languages it may not produce intelligible samples after its output is converted
into a speech waveform.
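
Since the model outputs acoustic features rather than audio, a vocoder has to turn them into a waveform. As a rough illustration of how the two relate, the sketch below converts a number of mel-spectrogram frames into an approximate audio duration. The sample rate of 22050 Hz and the hop length of 256 samples are assumptions (neither is stated in this card), and `frames_to_seconds` is a hypothetical helper, not part of any library.

```python
def frames_to_seconds(n_frames: int, hop_length: int = 256, sample_rate: int = 22050) -> float:
    """Approximate duration of the audio a vocoder produces from n_frames mel frames.

    hop_length and sample_rate are assumed defaults; check your vocoder's configuration.
    """
    return n_frames * hop_length / sample_rate


# Example: estimate how long a 500-frame output would be under the assumed settings.
print(f"{frames_to_seconds(500):.2f} s")
```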

The quality of the samples can vary depending on the speaker.
This may be due to the sensitivity of the model in learning specific frequencies and also due to the samples used for each speaker.

## How to use

### Installation

```bash
pip install git+https://github.com/langtech-bsc/vocos.git@matcha
```
You need to install the Catalan phonemizer version of espeak-ng:

```bash
# Clone the Catalan fork of espeak-ng
git clone https://github.com/projecte-aina/espeak-ng.git

# Set PYTHON to the interpreter of your environment
export PYTHON=/path/to/env/<env_name>/bin/python

# Build espeak-ng and install it into the chosen prefix
cd /path/to/espeak-ng
./autogen.sh
./configure --prefix=/path/to/espeak-ng
make
make install

# Install the additional Python dependencies
pip cache purge
pip install mecab-python3
pip install unidic-lite
```
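
After installing, it can help to confirm that the phonemizer binary is reachable. The following is a small sketch, not part of the official instructions: `list_catalan_voices` is a hypothetical helper that looks up `espeak-ng` and asks it to list its Catalan voices via the `--voices=ca` option. It assumes the binary is on your `PATH`; if you installed into a custom prefix, pass the full path to the binary instead (with the configure line above, that is typically the `bin/espeak-ng` file under the prefix).

```python
import shutil
import subprocess


def list_catalan_voices(binary="espeak-ng"):
    """Return espeak-ng's Catalan voice listing, or None if the binary is not found."""
    path = shutil.which(binary)
    if path is None:
        return None
    # --voices=<language> prints the voices available for that language.
    result = subprocess.run(
        [path, "--voices=ca"], capture_output=True, text=True, check=False
    )
    return result.stdout


voices = list_catalan_voices()
print(voices if voices is not None else "espeak-ng not found on PATH")
```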

### Generate

## Training

### Adaptation

### Training data

The model was trained on two Catalan speech datasets:

| Dataset   | Language | Hours |
|-----------|----------|-------|
| Festcat   | ca       | 22    |
| OpenSLR69 | ca       | 5     |

### Languages

Data comes from two different datasets: Festcat and OpenSLR69.

### Framework

## Evaluation

### Results

## Citation

If this code contributes to your research, please cite the work:

```
@misc{mehta2024matchatts,
      title={Matcha-TTS: A fast TTS architecture with conditional flow matching},
      author={Shivam Mehta and Ruibo Tu and Jonas Beskow and Éva Székely and Gustav Eje Henter},
      year={2024},
      eprint={2309.03199},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}
```

## Additional information

### Author
The Language Technologies Unit from Barcelona Supercomputing Center.

### Contact
For further information, please send an email to <[email protected]>.

### Copyright
Copyright (c) 2023 by Language Technologies Unit, Barcelona Supercomputing Center.

### License
[Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding
This work has been promoted and financed by the Generalitat de Catalunya through the [Aina project](https://projecteaina.cat/).

### Disclaimer