---
language:
- da
license: mit
base_model: microsoft/speecht5_tts
tags:
- generated_from_trainer
datasets:
- alexandrainst/nst-da
model-index:
- name: speecht5_tts-finetuned-nst-da
  results: []
metrics:
- mse
pipeline_tag: text-to-speech
---

# speecht5_tts-finetuned-nst-da

This model is a fine-tuned version of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts) on the NST Danish ASR Database dataset.
It achieves the following results on the evaluation set (the lowest validation loss in the training results table, reached at epoch 17):
- Loss: 0.3692

## Model description

Given that Danish is a low-resource language, few open-source Danish text-to-speech synthesizers are available online. At the time of writing, the only other implementations available on 🤗 are [facebook/seamless-streaming](https://huggingface.co/facebook/seamless-streaming) and [audo/seamless-m4t-v2-large](https://huggingface.co/audo/seamless-m4t-v2-large). This model was developed to provide a simpler alternative that still performs reasonably well in terms of both output quality and inference time. Additionally, unlike the aforementioned models, this model has an associated Space on 🤗 at [JackismyShephard/danish-speech-synthesis](https://huggingface.co/spaces/JackismyShephard/danish-speech-synthesis), which provides an easy interface for Danish text-to-speech synthesis, as well as optional speech enhancement.

## Intended uses & limitations

The model is intended for Danish text-to-speech synthesis.

The model does not recognize special characters such as "æ", "ø" and "å", as it uses the default tokenizer of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts). The model performs best on short to medium-length input text and expects the input to contain no more than 600 vocabulary tokens. Additionally, for best performance the model should be given a Danish speaker embedding, ideally generated from an audio clip from the training split of [alexandrainst/nst-da](https://huggingface.co/datasets/alexandrainst/nst-da) using [speechbrain/spkrec-xvect-voxceleb](https://huggingface.co/speechbrain/spkrec-xvect-voxceleb).

The output of the model is a log-mel spectrogram, which should be converted to a waveform using [microsoft/speecht5_hifigan](https://huggingface.co/microsoft/speecht5_hifigan). For better output quality, the resulting waveform can be enhanced using [ResembleAI/resemble-enhance](https://huggingface.co/ResembleAI/resemble-enhance).

An example script showing how to use the model for inference can be found [here](https://github.com/JackismyShephard/hugging-face-audio-course/blob/main/notebooks/inference/finetuned-nst-da-inference.ipynb).
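
As a rough illustration of the pipeline described above (tokenize, generate a spectrogram, vocode with HiFi-GAN), the following is a minimal sketch rather than the linked notebook's code. It assumes the checkpoint is published under the repository id `JackismyShephard/speecht5_tts-finetuned-nst-da`, which this card does not state explicitly, and that the caller already has an x-vector speaker embedding of shape `(1, 512)`. The helper name `synthesize` is hypothetical.

```python
# Assumption: the repository id below is inferred from the model name and the
# Space owner; adjust it if the checkpoint lives elsewhere.
MODEL_ID = "JackismyShephard/speecht5_tts-finetuned-nst-da"
BASE_ID = "microsoft/speecht5_tts"         # default tokenizer, as noted in this card
VOCODER_ID = "microsoft/speecht5_hifigan"  # converts the spectrogram to a waveform
MAX_INPUT_TOKENS = 600                     # input length limit stated in this card


def synthesize(text, speaker_embedding, model_id=MODEL_ID):
    """Return a 1-D waveform tensor for `text`.

    `speaker_embedding` is expected to be a float tensor of shape (1, 512),
    e.g. an x-vector from speechbrain/spkrec-xvect-voxceleb.
    """
    # Heavy dependencies are imported here so that importing this module
    # does not require them to be installed.
    import torch
    from transformers import (
        SpeechT5ForTextToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained(BASE_ID)
    model = SpeechT5ForTextToSpeech.from_pretrained(model_id)
    vocoder = SpeechT5HifiGan.from_pretrained(VOCODER_ID)

    inputs = processor(text=text, return_tensors="pt")
    n_tokens = inputs["input_ids"].shape[-1]
    if n_tokens > MAX_INPUT_TOKENS:
        raise ValueError(f"input has {n_tokens} tokens; limit is {MAX_INPUT_TOKENS}")

    # Passing the vocoder makes generate_speech return audio samples
    # instead of the raw log-mel spectrogram.
    with torch.no_grad():
        return model.generate_speech(
            inputs["input_ids"], speaker_embedding, vocoder=vocoder
        )
```

The returned tensor can be written to disk with any audio library at a 16 kHz sample rate, which is what the SpeechT5 HiFi-GAN vocoder produces. Input text containing "æ", "ø" or "å" should be transliterated before calling the function, for the reason given above.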

## Training and evaluation data

The model was trained and evaluated on [alexandrainst/nst-da](https://huggingface.co/datasets/alexandrainst/nst-da) using MSE as both loss and metric. The dataset was pre-processed as follows:
* Special characters such as "æ", "ø" and "å" were transliterated to their Latin equivalents, and examples whose text contains digits were removed, as neither is in the vocabulary of the tokenizer of [microsoft/speecht5_tts](https://huggingface.co/microsoft/speecht5_tts).
* The training split was balanced by excluding speakers with fewer than 280 or more than 327 examples.
* Audio was enhanced using [speechbrain/metricgan-plus-voicebank](https://huggingface.co/speechbrain/metricgan-plus-voicebank) in an attempt to remove unwanted noise.
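
The text and balancing steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the training notebook's code: the transliteration table (`æ` to `ae`, `ø` to `oe`, `å` to `aa`) is a common convention that the card does not spell out, the field names `text` and `speaker_id` are hypothetical, and counting examples per speaker after the digit filter is a guess about the order of operations.

```python
import re
from collections import Counter

# Assumed transliteration table; the card only says "Latin equivalents".
TRANSLITERATION = str.maketrans(
    {"æ": "ae", "ø": "oe", "å": "aa", "Æ": "Ae", "Ø": "Oe", "Å": "Aa"}
)


def clean_text(text):
    """Replace Danish letters that the SpeechT5 tokenizer does not know."""
    return text.translate(TRANSLITERATION)


def has_digits(text):
    """True if the text contains any digit character."""
    return re.search(r"\d", text) is not None


def preprocess(examples, min_count=280, max_count=327):
    """Drop examples with digits, transliterate, then balance by speaker.

    `examples` is a list of dicts with hypothetical keys "text" and
    "speaker_id". Speakers with fewer than `min_count` or more than
    `max_count` remaining examples are excluded.
    """
    cleaned = [
        {**ex, "text": clean_text(ex["text"])}
        for ex in examples
        if not has_digits(ex["text"])
    ]
    counts = Counter(ex["speaker_id"] for ex in cleaned)
    return [
        ex for ex in cleaned
        if min_count <= counts[ex["speaker_id"]] <= max_count
    ]
```

The thresholds are parameters so the same function can be tried on a small sample before running it on the full split.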


## Training procedure

The script used for training the model (and pre-processing its data) can be found [here](https://github.com/JackismyShephard/hugging-face-audio-course/blob/main/notebooks/training/finetuned-nst-da-training.ipynb).

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 20
- mixed_precision_training: Native AMP
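
To make the scheduler settings concrete, the sketch below computes the learning rate implied by a linear schedule with warmup. It assumes the usual shape of such a schedule (linear ramp from 0 to the base rate, then linear decay to 0) and takes the total of 188,580 optimizer steps from the training results table; the function name and the rounding of the warmup length are illustrative choices, not taken from the training script.

```python
def linear_schedule_lr(step, base_lr=1e-5, total_steps=188_580, warmup_ratio=0.1):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    # Assumption: the warmup length is the rounded product of ratio and total.
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


if __name__ == "__main__":
    for step in (0, 9_429, 18_858, 94_290, 188_580):
        print(step, linear_schedule_lr(step))
```

With these values the warmup covers the first 18,858 steps, which is exactly the first two of the 20 epochs.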

### Training results

| Training Loss | Epoch | Step   | Validation Loss |
|:-------------:|:-----:|:------:|:---------------:|
| 0.4445        | 1.0   | 9429   | 0.4100          |
| 0.4169        | 2.0   | 18858  | 0.3955          |
| 0.412         | 3.0   | 28287  | 0.3882          |
| 0.3982        | 4.0   | 37716  | 0.3826          |
| 0.4032        | 5.0   | 47145  | 0.3817          |
| 0.3951        | 6.0   | 56574  | 0.3782          |
| 0.3971        | 7.0   | 66003  | 0.3782          |
| 0.395         | 8.0   | 75432  | 0.3757          |
| 0.3952        | 9.0   | 84861  | 0.3749          |
| 0.3835        | 10.0  | 94290  | 0.3740          |
| 0.3863        | 11.0  | 103719 | 0.3754          |
| 0.3845        | 12.0  | 113148 | 0.3732          |
| 0.3788        | 13.0  | 122577 | 0.3715          |
| 0.3834        | 14.0  | 132006 | 0.3717          |
| 0.3894        | 15.0  | 141435 | 0.3718          |
| 0.3845        | 16.0  | 150864 | 0.3714          |
| 0.3823        | 17.0  | 160293 | 0.3692          |
| 0.3858        | 18.0  | 169722 | 0.3703          |
| 0.3919        | 19.0  | 179151 | 0.3716          |
| 0.3906        | 20.0  | 188580 | 0.3709          |
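
A few quantities can be read off the table with simple arithmetic. The sketch below copies the validation column, finds the best epoch, and estimates the number of training examples seen per epoch. The estimate assumes a single device and no gradient accumulation, neither of which is stated in this card, and it is an upper bound because the last batch of an epoch may be partial.

```python
# (epoch, step, validation loss), copied from the table above.
ROWS = [
    (1, 9429, 0.4100), (2, 18858, 0.3955), (3, 28287, 0.3882),
    (4, 37716, 0.3826), (5, 47145, 0.3817), (6, 56574, 0.3782),
    (7, 66003, 0.3782), (8, 75432, 0.3757), (9, 84861, 0.3749),
    (10, 94290, 0.3740), (11, 103719, 0.3754), (12, 113148, 0.3732),
    (13, 122577, 0.3715), (14, 132006, 0.3717), (15, 141435, 0.3718),
    (16, 150864, 0.3714), (17, 160293, 0.3692), (18, 169722, 0.3703),
    (19, 179151, 0.3716), (20, 188580, 0.3709),
]
TRAIN_BATCH_SIZE = 16  # from the hyperparameter list

# Epoch with the lowest validation loss.
best_epoch, best_step, best_loss = min(ROWS, key=lambda row: row[2])

# Optimizer steps per epoch and the implied upper bound on examples per epoch.
steps_per_epoch = ROWS[0][1]
max_examples_per_epoch = steps_per_epoch * TRAIN_BATCH_SIZE

if __name__ == "__main__":
    print(f"best epoch: {best_epoch} (loss {best_loss})")
    print(f"steps per epoch: {steps_per_epoch}")
    print(f"examples per epoch (upper bound): {max_examples_per_epoch}")
```

The lowest validation loss, 0.3692 at epoch 17, is the figure reported at the top of this card; the last three epochs do not improve on it.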


### Framework versions

- Transformers 4.37.2
- PyTorch 2.1.1+cu121
- Datasets 2.17.0
- Tokenizers 0.15.2