Umamusume DeBERTA-VITS2 TTS
📅 2023.10.19 📅
- Updated the current generator to the 120K-step checkpoint
👌 Currently, ONLY Japanese is supported. 👌
💪 Based on Bert-VITS2, this work closely follows Akito/umamusume_bert_vits2, which provides the Japanese text preprocessor. ❤
✋ Please do NOT enter a really LOOOONG sentence or several sentences in a single row. Each row is inferenced separately, so splitting your input across multiple rows (where it does not break coherence) also reduces inference time. Please avoid completely empty rows, which lead to weird sounds at the corresponding positions in the generated audio. ✋
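For example, here is a minimal sketch of the row handling described above (the helper is illustrative, not this Space's actual code):

```python
# Illustrative sketch: each non-empty row is synthesized on its own.
def split_rows(text: str) -> list[str]:
    # Dropping empty rows avoids the weird sounds they cause in the output.
    return [row.strip() for row in text.splitlines() if row.strip()]

print(split_rows("こんにちは。\n\n今日もいい天気ですね。"))
# -> ['こんにちは。', '今日もいい天気ですね。']
```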
👏 If an error occurs, please check whether your input contains rare, uncommon Chinese characters (kanji), and replace them with hiragana or katakana. 👏
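As an illustration, this is the kind of substitution you can apply yourself before submitting text; the mapping below is a made-up example, not part of this Space:

```python
# Hypothetical example: map rare kanji that trip up the text frontend
# to kana readings before synthesis. The entry below is illustrative.
RARE_KANJI_TO_KANA = {
    "鬱": "うつ",  # rare kanji -> hiragana reading
}

def sanitize(text: str) -> str:
    for kanji, kana in RARE_KANJI_TO_KANA.items():
        text = text.replace(kanji, kana)
    return text

print(sanitize("鬱な気分です"))  # -> "うつな気分です"
```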
Training Details - For those who may be interested
🎈 This work replaces cl-tohoku/bert-base-japanese-v3 with ku-nlp/deberta-v2-base-japanese, expecting potentially better performance (and just for fun). 🥰
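For the curious, a rough sketch of pulling contextual features from the DeBERTa model with 🤗 Transformers; how Bert-VITS2 actually wires these features into the synthesizer is not shown here:

```python
# Sketch: extract hidden states from ku-nlp/deberta-v2-base-japanese.
# Note: ku-nlp models expect input pre-segmented with Juman++; the
# whitespace-segmented string below only keeps this example short.
import torch
from transformers import AutoModel, AutoTokenizer

name = "ku-nlp/deberta-v2-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("今日 は いい 天気 です ね 。", return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
```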
❤ Thanks to the SUSTech Center for Computational Science and Engineering. ❤ This model is trained on A100 (40GB) x 2 with a total batch size of 32.
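A tiny sketch of what that setup implies per GPU; the dataset and sampler below are placeholders, not the actual training code:

```python
# With A100 x 2 and a total batch size of 32, each DDP process
# handles 16 samples per step. Everything below is a stand-in.
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

TOTAL_BATCH, WORLD_SIZE = 32, 2
per_gpu_batch = TOTAL_BATCH // WORLD_SIZE  # 16

dataset = TensorDataset(torch.randn(1024, 80))          # placeholder data
sampler = DistributedSampler(dataset, num_replicas=WORLD_SIZE, rank=0)
loader = DataLoader(dataset, batch_size=per_gpu_batch, sampler=sampler)
```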
💪 So far, this model has been trained for 1 cycle, 90K steps (= 60 epochs). 💪
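(At a total batch size of 32, that works out to roughly 90,000 / 60 = 1,500 steps per epoch, i.e. on the order of 1,500 × 32 = 48,000 training samples per epoch.)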
📕 This work uses a linear-with-warmup LR scheduler (warmup over 7.5% of total steps) with max_lr=1e-4. 📕
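A minimal sketch of such a schedule in PyTorch; the total step count and the linear decay-to-zero after warmup are assumptions, not the run's exact settings:

```python
# Linear warmup over 7.5% of total steps up to max_lr = 1e-4, then linear decay.
import torch

model = torch.nn.Linear(10, 10)                              # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)   # lr here is max_lr

total_steps = 90_000                       # assumed: one 90K-step cycle
warmup_steps = int(0.075 * total_steps)    # 7.5% of total steps

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)                   # ramp 0 -> max_lr
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```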
✂ This work clips gradient values to 10. ✂
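And a one-step sketch of value clipping at 10; the surrounding model, data, and optimizer are placeholders:

```python
# Clip each gradient element to [-10, 10] before the optimizer step.
import torch

model = torch.nn.Linear(10, 1)                               # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=10)
optimizer.step()
```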
⚠ Fine-tuning the model on each single-speaker dataset separately will definitely give better results than training on a huge dataset comprising many speakers. Sharing a single model leads to unexpected mixing of the speakers' voice lines. ⚠
TODO:
(Started) 📅 Train one more cycle using the text preprocessor provided by AkitoP, which handles long tones better. 📅