How can I get the timing of each word?

#141
by alissonlauffer - opened

I found this model, and it's nice what you've done with only 82M parameters.

I'd really like to get the timing of each word so that I can create real-time subtitles for the text (on a word-by-word basis). I tried using split_pattern=" ", but the resulting audio sounded a bit off because the sentence context was lost.

Timestamps were merged in https://github.com/hexgrad/kokoro/pull/46. If you access the tokens in the result, each MToken should have a start_ts and an end_ts: https://github.com/hexgrad/misaki/blob/d4289a30d992ce7ca9da93524ff6cefd3d62adb8/misaki/en.py#L27-L28
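For subtitles, a rough sketch of turning those tokens into word timings might look like the following. The helper name `word_timings` and its `offset` parameter are made up for illustration, and the demo tokens are stand-in objects rather than real MToken instances. It only assumes that each token exposes `text`, `start_ts`, and `end_ts`, and that some tokens may have no timestamp.

```python
from types import SimpleNamespace


def word_timings(tokens, offset=0.0):
    """Return (text, start_seconds, end_seconds) for tokens that carry timestamps.

    `offset` is a hypothetical parameter: it shifts the times, in case the
    timestamps turn out to be relative to the start of each audio chunk.
    """
    if not tokens:  # guard against tokens being None or empty
        return []
    timings = []
    for tok in tokens:
        start = getattr(tok, "start_ts", None)
        end = getattr(tok, "end_ts", None)
        if start is None or end is None:
            continue  # skip tokens without timing information
        timings.append((tok.text, offset + start, offset + end))
    return timings


# Stand-in tokens for demonstration only (not real MToken objects).
demo = [
    SimpleNamespace(text="Hello", start_ts=0.25, end_ts=0.5),
    SimpleNamespace(text="world", start_ts=0.5, end_ts=1.0),
    SimpleNamespace(text="!", start_ts=None, end_ts=None),
]

for text, start, end in word_timings(demo, offset=2.0):
    print(f"{start:.2f}s -> {end:.2f}s  {text}")
```

With the real pipeline you would pass the tokens of each result to the helper. Whether the timestamps restart at zero for every chunk is an assumption worth checking; if they do, keep a running offset based on the duration of the audio already produced.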

hexgrad changed discussion status to closed

I just tried it here, and it seems that tokens is just None for languages other than English :(
