
Umamusume DeBERTA-VITS2 TTS

πŸ‘Œ Currently, ONLY Japanese is supported. πŸ‘Œ

πŸ’ͺ Based on Bert-VITS2, this work closely follows Akito/umamusume_bert_vits2, which provides the Japanese text preprocessor. ❀

βœ‹ Please do NOT enter a really LONG sentence or multiple sentences in a single row. The model treats each row as one sentence; splitting your input into multiple rows lets each row be inferred separately, which reduces inference time. βœ‹

Training Details - For those who may be interested

🎈 This work replaces cl-tohoku/bert-base-japanese-v3 with ku-nlp/deberta-v2-base-japanese, expecting potentially better performance (and just for fun). πŸ₯°

❀ Thanks to the SUSTech Center for Computational Science and Engineering. ❀ This model is trained on A100 (40GB) x 2 with a total batch size of 32.

πŸ’ͺ This model has currently been trained for 1 cycle of 90K steps (60 epochs). πŸ’ͺ

πŸ“• This work uses a linear LR scheduler with warmup (7.5% of total steps) and max_lr=1e-4. πŸ“•
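As a minimal sketch of this schedule, assuming the learning rate warms up linearly to max_lr and then decays linearly to zero (the exact decay shape is an assumption; `lr_at_step` is a hypothetical helper, not the training code):

```python
def lr_at_step(step: int, total_steps: int = 90_000,
               max_lr: float = 1e-4, warmup_frac: float = 0.075) -> float:
    """Linear warmup to max_lr over warmup_frac of training, then linear decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)  # 6,750 steps at 7.5% of 90K
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    return max_lr * (total_steps - step) / (total_steps - warmup_steps)
```

The same shape is commonly built with `torch.optim.lr_scheduler.LambdaLR` by passing a multiplier function of the step.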

βœ‚ This work clips gradient values to 10. βœ‚
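Value clipping (unlike norm clipping) bounds each gradient component independently. A minimal sketch of clipping to Β±10, analogous to what PyTorch's `torch.nn.utils.clip_grad_value_` does in place on parameter gradients:

```python
def clip_grad_values(grads: list[float], clip_value: float = 10.0) -> list[float]:
    """Clamp each gradient component into [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]

print(clip_grad_values([15.0, -42.0, 3.0]))  # [10.0, -10.0, 3.0]
```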

⚠ Finetuning the model separately on single-speaker datasets will definitely reach better results than training on one huge dataset comprising many speakers. Sharing a single model leads to unexpected mixing of the speakers' voice lines. ⚠

TODO:

πŸ“… Train one more cycle using the text preprocessor provided by AkitoP, which has better long-tone processing capacity. πŸ“…