# Umamusume DeBERTA-VITS2 TTS
👌 **Currently, ONLY Japanese is supported.** 👌

💪 **Based on [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2), this work closely follows [Akito/umamusume_bert_vits2](https://huggingface.co/spaces/AkitoP/umamusume_bert_vits2), from which the Japanese text preprocessor is taken.** ❤
✋ **Please do NOT enter a very LONG sentence (or several sentences) in a single row. The model treats each row as one sentence, so splitting your input across multiple rows causes each row to be inferred separately and reduces inference time.** ✋
## Training Details - For those who may be interested
🎈 **This work replaces [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) with [ku-nlp/deberta-v2-base-japanese](https://huggingface.co/ku-nlp/deberta-v2-base-japanese), in the hope of better performance (and, admittedly, just for fun).** 🥰
❤ Thanks to **SUSTech Center for Computational Science and Engineering**. ❤ This model is trained on 2× A100 (40 GB) GPUs with a **total batch size of 32**.

💪 This model has been trained for **1 cycle of 90K steps (= 60 epochs)** so far. 💪
📕 This work uses a linear LR scheduler with warmup (7.5% of total steps) and `max_lr=1e-4`. 📕
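Assuming the common "linear warmup, then linear decay" shape, the schedule can be written as a plain function. The numbers (90K total steps, 7.5% warmup, `max_lr=1e-4`) come from the text above; the function name and the decay-to-zero endpoint are assumptions for illustration:

```python
def linear_warmup_lr(step: int,
                     total_steps: int = 90_000,
                     warmup_frac: float = 0.075,
                     max_lr: float = 1e-4) -> float:
    """LR at a given step: linear ramp from 0 to max_lr over the
    warmup phase (7.5% of total steps = 6750 here), then linear
    decay back to 0 at the final step. Illustrative sketch only."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (total_steps - step) / (total_steps - warmup_steps)
```

The same shape is what `transformers.get_linear_schedule_with_warmup` produces when wrapped around a PyTorch optimizer.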
✂ This work clips gradient values to 10. ✂
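Value clipping (as opposed to norm clipping) bounds each gradient element to [-10, 10] independently; in PyTorch this is `torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=10.0)`. A minimal dependency-free sketch of the element-wise operation, with an illustrative function name:

```python
def clip_grad_values(grads: list[float], clip_value: float = 10.0) -> list[float]:
    """Clamp each gradient element to [-clip_value, clip_value],
    mirroring what torch.nn.utils.clip_grad_value_ does in place."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]


# Large positive/negative elements are clamped; in-range ones pass through.
clip_grad_values([123.4, -0.5, -99.0])  # -> [10.0, -0.5, -10.0]
```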
⚠ Finetuning the model on **single-speaker datasets separately** gives noticeably better results than training one model on a huge dataset comprising many speakers. Sharing a single model across speakers leads to unexpected mixing of the speakers' voice lines. ⚠
### TODO:
📅 Train one more cycle using the text preprocessor provided by [AkitoP](https://huggingface.co/AkitoP), which handles long tones better. 📝