geneing/Kokoro


geneing committed on 9 days ago

Commit

acd8dd4

1 Parent(s): b8db573

Merged from upstream.


Files changed (4)

README.md +11 -3
kokoro.py +2 -2
models.py +2 -220
restoring-sky.md +0 -44

README.md CHANGED

@@ -8,11 +8,13 @@ pipeline_tag: text-to-speech
 ---
 ❤️ Kokoro Discord Server: https://discord.gg/QuGxSWBfQy
 <audio controls><source src="https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/demo/HEARME.wav" type="audio/wav"></audio>
 **Kokoro** is a frontier TTS model for its size of **82 million parameters** (text in/audio out).
-On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. As of 31 Dec 2024, 10 unique Voicepacks have been released.
 In the weeks leading up to its release, Kokoro v0.19 was the #1🥇 ranked model in [TTS Spaces Arena](https://huggingface.co/hexgrad/Kokoro-82M#evaluation). Kokoro had achieved higher Elo in this single-voice Arena setting over other models, using fewer parameters and less data:
 1. **Kokoro v0.19: 82M params, Apache, trained on <100 hours of audio**
@@ -31,6 +33,7 @@ You can find a hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingf
 The following can be run in a single cell on [Google Colab](https://colab.research.google.com/).
 ```py
 # 1️⃣ Install dependencies silently
 !git clone https://huggingface.co/hexgrad/Kokoro-82M
 %cd Kokoro-82M
 !apt-get -qq -y install espeak-ng > /dev/null 2>&1
@@ -63,7 +66,9 @@ from IPython.display import display, Audio
 display(Audio(data=audio, rate=24000, autoplay=True))
 print(out_ps)
 ```
-The inference code was quickly hacked together on Christmas Day. It is not clean code and leaves a lot of room for improvement. If you'd like to contribute, feel free to open a PR.
 ### Model Facts
@@ -88,6 +93,7 @@ No affiliation can be assumed between parties on different lines.
 - 28 Dec 2024: `bf_emma`, `bf_isabella`, `bm_george`, `bm_lewis`
 - 30 Dec 2024: `af_nicole`
 - 31 Dec 2024: `af_sky`
 ### Licenses
 - Apache 2.0 weights in this repository
@@ -150,4 +156,6 @@ Refer to the [Philosophy discussion](https://huggingface.co/hexgrad/Kokoro-82M/d
 `@rzvzn` on Discord. Server invite: https://discord.gg/QuGxSWBfQy
-<img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />

 ---
 ❤️ Kokoro Discord Server: https://discord.gg/QuGxSWBfQy
+📣 Got Synthetic Data? Want Trained Voicepacks? See https://hf.co/posts/hexgrad/418806998707773
 <audio controls><source src="https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/demo/HEARME.wav" type="audio/wav"></audio>
 **Kokoro** is a frontier TTS model for its size of **82 million parameters** (text in/audio out).
+On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. As of 2 Jan 2025, 10 unique Voicepacks have been released, and a `.onnx` version of v0.19 is available.
 In the weeks leading up to its release, Kokoro v0.19 was the #1🥇 ranked model in [TTS Spaces Arena](https://huggingface.co/hexgrad/Kokoro-82M#evaluation). Kokoro had achieved higher Elo in this single-voice Arena setting over other models, using fewer parameters and less data:
 1. **Kokoro v0.19: 82M params, Apache, trained on <100 hours of audio**
 The following can be run in a single cell on [Google Colab](https://colab.research.google.com/).
 ```py
 # 1️⃣ Install dependencies silently
+!git lfs install
 !git clone https://huggingface.co/hexgrad/Kokoro-82M
 %cd Kokoro-82M
 !apt-get -qq -y install espeak-ng > /dev/null 2>&1
 display(Audio(data=audio, rate=24000, autoplay=True))
 print(out_ps)
 ```
+If you have trouble with `espeak-ng`, see this [github issue](https://github.com/bootphon/phonemizer/issues/44#issuecomment-1540885186). [Mac users also see this](https://huggingface.co/hexgrad/Kokoro-82M/discussions/12#677435d3d8ace1de46071489), and [Windows users see this](https://huggingface.co/hexgrad/Kokoro-82M/discussions/12#67742594fdeebf74f001ecfc).
+For ONNX usage, see [#14](https://huggingface.co/hexgrad/Kokoro-82M/discussions/14).
 ### Model Facts
 - 28 Dec 2024: `bf_emma`, `bf_isabella`, `bm_george`, `bm_lewis`
 - 30 Dec 2024: `af_nicole`
 - 31 Dec 2024: `af_sky`
+- 2 Jan 2025: ONNX v0.19 `ebef4245`
 ### Licenses
 - Apache 2.0 weights in this repository
 `@rzvzn` on Discord. Server invite: https://discord.gg/QuGxSWBfQy
+<img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />
+https://terminator.fandom.com/wiki/Kokoro
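Review note: the README lines added above point to troubleshooting threads for `espeak-ng`. Before following them, it can help to confirm whether the executable is visible at all. The snippet below is a minimal sketch of such a check, not part of this commit. The helper name `find_espeak` and the fallback candidate `espeak` are assumptions made for illustration, and a phonemizer backend may locate the espeak library by other means, so a hit on `PATH` is a hint rather than a guarantee.

```python
import shutil


def find_espeak(candidates=("espeak-ng", "espeak")):
    """Return the path of the first candidate executable found on PATH, else None.

    Hypothetical helper for illustration; the candidate names are assumptions.
    """
    for name in candidates:
        path = shutil.which(name)  # stdlib lookup, returns None when not found
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_espeak()
    if found:
        print(f"espeak executable found at: {found}")
    else:
        print("espeak-ng not found on PATH; see the troubleshooting links in the README.")
```

If the check comes back empty on Colab, re-running the `apt-get` install line from the README cell without the output redirection should show the underlying error.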

kokoro.py CHANGED

@@ -135,8 +135,8 @@ def forward(model, tokens, ref_s, speed):
     asr = t_en @ pred_aln_trg.unsqueeze(0).to(device)
     return model.decoder(asr, F0_pred, N_pred, ref_s[:, :128]).squeeze().cpu().numpy()
-def generate(model, text, voicepack, lang='a', speed=1):
-    ps = phonemize(text, lang)
     tokens = tokenize(ps)
     if not tokens:
         return None

     asr = t_en @ pred_aln_trg.unsqueeze(0).to(device)
     return model.decoder(asr, F0_pred, N_pred, ref_s[:, :128]).squeeze().cpu().numpy()
+def generate(model, text, voicepack, lang='a', speed=1, ps=None):
+    ps = ps or phonemize(text, lang)
     tokens = tokenize(ps)
     if not tokens:
         return None
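Review note: the new `ps=None` parameter lets a caller pass phonemes directly and skip `phonemize`. The fallback is written as `ps or phonemize(text, lang)`, so any falsy value, including an empty string, still triggers phonemization. The sketch below isolates that expression. `phonemize_stub` and `resolve_phonemes` are hypothetical stand-ins written for this note, not functions from the repository.

```python
def phonemize_stub(text, lang):
    # Hypothetical stand-in for the repo's phonemize(); it only tags the text.
    return f"<{lang}:{text}>"


def resolve_phonemes(text, lang="a", ps=None):
    # Same fallback expression as the updated generate(): use ps when it is truthy.
    return ps or phonemize_stub(text, lang)


print(resolve_phonemes("hello"))                # no ps given, so the stub phonemizer runs
print(resolve_phonemes("hello", ps="həlˈoʊ"))   # supplied phonemes are used unchanged
print(resolve_phonemes("hello", ps=""))         # empty string is falsy, so the stub runs
```

If an empty phoneme string ever needs to be treated as a deliberate input, an explicit `ps if ps is not None else phonemize(text, lang)` would distinguish the two cases. That is a design alternative, not what this commit does.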

models.py CHANGED

@@ -1,6 +1,5 @@
 # https://github.com/yl4579/StyleTTS2/blob/main/models.py
-from ast import Tuple
-from istftnet import Decoder
 from munch import Munch
 from pathlib import Path
 from plbert import load_plbert
@@ -13,118 +12,6 @@ import torch
 import torch.nn as nn
 import torch.nn.functional as F
-class LearnedDownSample(nn.Module):
-    def __init__(self, layer_type, dim_in):
-        super().__init__()
-        self.layer_type = layer_type
-        if self.layer_type == 'none':
-            self.conv = nn.Identity()
-        elif self.layer_type == 'timepreserve':
-            self.conv = spectral_norm(nn.Conv2d(dim_in, dim_in, kernel_size=(3, 1), stride=(2, 1), groups=dim_in, padding=(1, 0)))
-        elif self.layer_type == 'half':
-            self.conv = spectral_norm(nn.Conv2d(dim_in, dim_in, kernel_size=(3, 3), stride=(2, 2), groups=dim_in, padding=1))
-        else:
-            raise RuntimeError('Got unexpected donwsampletype %s, expected is [none, timepreserve, half]' % self.layer_type)
-    def forward(self, x):
-        return self.conv(x)
-class LearnedUpSample(nn.Module):
-    def __init__(self, layer_type, dim_in):
-        super().__init__()
-        self.layer_type = layer_type
-        if self.layer_type == 'none':
-            self.conv = nn.Identity()
-        elif self.layer_type == 'timepreserve':
-            self.conv = nn.ConvTranspose2d(dim_in, dim_in, kernel_size=(3, 1), stride=(2, 1), groups=dim_in, output_padding=(1, 0), padding=(1, 0))
-        elif self.layer_type == 'half':
-            self.conv = nn.ConvTranspose2d(dim_in, dim_in, kernel_size=(3, 3), stride=(2, 2), groups=dim_in, output_padding=1, padding=1)
-        else:
-            raise RuntimeError('Got unexpected upsampletype %s, expected is [none, timepreserve, half]' % self.layer_type)
-    def forward(self, x):
-        return self.conv(x)
-class DownSample(nn.Module):
-    def __init__(self, layer_type):
-        super().__init__()
-        self.layer_type = layer_type
-    def forward(self, x):
-        if self.layer_type == 'none':
-            return x
-        elif self.layer_type == 'timepreserve':
-            return F.avg_pool2d(x, (2, 1))
-        elif self.layer_type == 'half':
-            if x.shape[-1] % 2 != 0:
-                x = torch.cat([x, x[..., -1].unsqueeze(-1)], dim=-1)
-            return F.avg_pool2d(x, 2)
-        else:
-            raise RuntimeError('Got unexpected donwsampletype %s, expected is [none, timepreserve, half]' % self.layer_type)
-class UpSample(nn.Module):
-    def __init__(self, layer_type):
-        super().__init__()
-        self.layer_type = layer_type
-    def forward(self, x):
-        if self.layer_type == 'none':
-            return x
-        elif self.layer_type == 'timepreserve':
-            return F.interpolate(x, scale_factor=(2, 1), mode='nearest')
-        elif self.layer_type == 'half':
-            return F.interpolate(x, scale_factor=2, mode='nearest')
-        else:
-            raise RuntimeError('Got unexpected upsampletype %s, expected is [none, timepreserve, half]' % self.layer_type)
-class ResBlk(nn.Module):
-    def __init__(self, dim_in, dim_out, actv=nn.LeakyReLU(0.2),
-                 normalize=False, downsample='none'):
-        super().__init__()
-        self.actv = actv
-        self.normalize = normalize
-        self.downsample = DownSample(downsample)
-        self.downsample_res = LearnedDownSample(downsample, dim_in)
-        self.learned_sc = dim_in != dim_out
-        self._build_weights(dim_in, dim_out)
-    def _build_weights(self, dim_in, dim_out):
-        self.conv1 = spectral_norm(nn.Conv2d(dim_in, dim_in, 3, 1, 1))
-        self.conv2 = spectral_norm(nn.Conv2d(dim_in, dim_out, 3, 1, 1))
-        if self.normalize:
-            self.norm1 = nn.InstanceNorm2d(dim_in, affine=True)
-            self.norm2 = nn.InstanceNorm2d(dim_in, affine=True)
-        if self.learned_sc:
-            self.conv1x1 = spectral_norm(nn.Conv2d(dim_in, dim_out, 1, 1, 0, bias=False))
-    def _shortcut(self, x):
-        if self.learned_sc:
-            x = self.conv1x1(x)
-        if self.downsample:
-            x = self.downsample(x)
-        return x
-    def _residual(self, x):
-        if self.normalize:
-            x = self.norm1(x)
-        x = self.actv(x)
-        x = self.conv1(x)
-        x = self.downsample_res(x)
-        if self.normalize:
-            x = self.norm2(x)
-        x = self.actv(x)
-        x = self.conv2(x)
-        return x
-    def forward(self, x):
-        x = self._shortcut(x) + self._residual(x)
-        return x / np.sqrt(2)  # unit variance
 class LinearNorm(torch.nn.Module):
     def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
         super(LinearNorm, self).__init__()
@@ -137,98 +24,6 @@ class LinearNorm(torch.nn.Module):
     def forward(self, x):
         return self.linear_layer(x)
-class Discriminator2d(nn.Module):
-    def __init__(self, dim_in=48, num_domains=1, max_conv_dim=384, repeat_num=4):
-        super().__init__()
-        blocks = []
-        blocks += [spectral_norm(nn.Conv2d(1, dim_in, 3, 1, 1))]
-        for lid in range(repeat_num):
-            dim_out = min(dim_in*2, max_conv_dim)
-            blocks += [ResBlk(dim_in, dim_out, downsample='half')]
-            dim_in = dim_out
-        blocks += [nn.LeakyReLU(0.2)]
-        blocks += [spectral_norm(nn.Conv2d(dim_out, dim_out, 5, 1, 0))]
-        blocks += [nn.LeakyReLU(0.2)]
-        blocks += [nn.AdaptiveAvgPool2d(1)]
-        blocks += [spectral_norm(nn.Conv2d(dim_out, num_domains, 1, 1, 0))]
-        self.main = nn.Sequential(*blocks)
-    def get_feature(self, x):
-        features = []
-        for l in self.main:
-            x = l(x)
-            features.append(x)
-        out = features[-1]
-        out = out.view(out.size(0), -1)  # (batch, num_domains)
-        return out, features
-    def forward(self, x):
-        out, features = self.get_feature(x)
-        out = out.squeeze()  # (batch)
-        return out, features
-class ResBlk1d(nn.Module):
-    def __init__(self, dim_in, dim_out, actv=nn.LeakyReLU(0.2),
-                 normalize=False, downsample='none', dropout_p=0.2):
-        super().__init__()
-        self.actv = actv
-        self.normalize = normalize
-        self.downsample_type = downsample
-        self.learned_sc = dim_in != dim_out
-        self._build_weights(dim_in, dim_out)
-        self.dropout_p = dropout_p
-        if self.downsample_type == 'none':
-            self.pool = nn.Identity()
-        else:
-            self.pool = weight_norm(nn.Conv1d(dim_in, dim_in, kernel_size=3, stride=2, groups=dim_in, padding=1))
-    def _build_weights(self, dim_in, dim_out):
-        self.conv1 = weight_norm(nn.Conv1d(dim_in, dim_in, 3, 1, 1))
-        self.conv2 = weight_norm(nn.Conv1d(dim_in, dim_out, 3, 1, 1))
-        if self.normalize:
-            self.norm1 = nn.InstanceNorm1d(dim_in, affine=True)
-            self.norm2 = nn.InstanceNorm1d(dim_in, affine=True)
-        if self.learned_sc:
-            self.conv1x1 = weight_norm(nn.Conv1d(dim_in, dim_out, 1, 1, 0, bias=False))
-    def downsample(self, x):
-        if self.downsample_type == 'none':
-            return x
-        else:
-            if x.shape[-1] % 2 != 0:
-                x = torch.cat([x, x[..., -1].unsqueeze(-1)], dim=-1)
-            return F.avg_pool1d(x, 2)
-    def _shortcut(self, x):
-        if self.learned_sc:
-            x = self.conv1x1(x)
-        x = self.downsample(x)
-        return x
-    def _residual(self, x):
-        if self.normalize:
-            x = self.norm1(x)
-        x = self.actv(x)
-        x = F.dropout(x, p=self.dropout_p, training=self.training)
-        x = self.conv1(x)
-        x = self.pool(x)
-        if self.normalize:
-            x = self.norm2(x)
-        x = self.actv(x)
-        x = F.dropout(x, p=self.dropout_p, training=self.training)
-        x = self.conv2(x)
-        return x
-    def forward(self, x):
-        x = self._shortcut(x) + self._residual(x)
-        return x / np.sqrt(2)  # unit variance
 class LayerNorm(nn.Module):
     def __init__(self, channels, eps=1e-5):
         super().__init__()
@@ -313,19 +108,6 @@ class TextEncoder(nn.Module):
         return mask
-class AdaIN1d(nn.Module):
-    def __init__(self, style_dim, num_features):
-        super().__init__()
-        self.norm = nn.InstanceNorm1d(num_features, affine=False)
-        self.fc = nn.Linear(style_dim, num_features*2)
-    def forward(self, x, s):
-        h = self.fc(s)
-        h = h.view(h.size(0), h.size(1), 1)
-        gamma, beta = torch.chunk(h, chunks=2, dim=1)
-        return (1 + gamma) * self.norm(x) + beta
 class UpSample1d(nn.Module):
     def __init__(self, layer_type):
         super().__init__()
@@ -484,7 +266,7 @@ class ProsodyPredictor(nn.Module):
         mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
         mask = torch.gt(mask+1, lengths.unsqueeze(1))
         return mask
 class DurationEncoder(nn.Module):
     def __init__(self, sty_dim, d_model, nlayers, dropout=0.1):

 # https://github.com/yl4579/StyleTTS2/blob/main/models.py
+from istftnet import AdaIN1d, Decoder
 from munch import Munch
 from pathlib import Path
 from plbert import load_plbert
 import torch.nn as nn
 import torch.nn.functional as F
 class LinearNorm(torch.nn.Module):
     def __init__(self, in_dim, out_dim, bias=True, w_init_gain='linear'):
         super(LinearNorm, self).__init__()
     def forward(self, x):
         return self.linear_layer(x)
 class LayerNorm(nn.Module):
     def __init__(self, channels, eps=1e-5):
         super().__init__()
         return mask
 class UpSample1d(nn.Module):
     def __init__(self, layer_type):
         super().__init__()
         mask = torch.arange(lengths.max()).unsqueeze(0).expand(lengths.shape[0], -1).type_as(lengths)
         mask = torch.gt(mask+1, lengths.unsqueeze(1))
         return mask
 class DurationEncoder(nn.Module):
     def __init__(self, sty_dim, d_model, nlayers, dropout=0.1):

restoring-sky.md DELETED

@@ -1,44 +0,0 @@
-# Restoring Sky & reflecting on Kokoro
-<img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />
-For those who don't know, [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) is an Apache TTS model that uses a skinny version of the open [StyleTTS 2](https://github.com/yl4579/StyleTTS2/tree/main) architecture.
-Based on leaderboard [Elo rating](https://huggingface.co/hexgrad/Kokoro-82M#evaluation) (prior to getting [review bombed](https://huggingface.co/datasets/Pendrokar/TTS_Arena/discussions/2)), Kokoro appears to do more with less, a theme that is surely [top-of-mind](https://huggingface.co/deepseek-ai/DeepSeek-V3) for many. Its peak performance on specific voices is comparable to or better than that of much larger models, but it has not yet been trained on enough data to effectively zero-shot out of distribution (aka voice cloning).
-Tonight on NYE, `af_sky` joins Kokoro's roster of downloadable voices. This follows last night's quiet release of `af_nicole`, and an additional 8 voices are currently available: 2F 2M voices each for American & British English.
-Nicole in particular was trained on ~10 hours of synthetic data, and demonstrates that you _can_ include unique speaking styles in a general-purpose TTS model without affecting the stock voices (even in a low data small model): a good sign for scalability.
-Sky is interesting because it is the voice that ScarJo [got OpenAI to take down](https://x.com/OpenAI/status/1792443575839678909), so new training data cannot be generated. However, OpenAI did not remove 2023 samples of Sky from their [blog post](https://openai.com/index/chatgpt-can-now-see-hear-and-speak/), and along with a few seconds lying around various other parts of the internet, we can cobble together about 3 minutes of 2023 Sky.
-```sh
-wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/story-sky.mp3
-wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/recipe-sky.mp3
-wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/speech-sky.mp3
-wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/poem-sky.mp3
-wget https://cdn.openai.com/new-voice-and-image-capabilities-in-chatgpt/hd/info-sky.mp3
-```
-To be clear, this is not the first attempt to reconstruct Sky. On X, Benjamin De Kraker posted:
-> Here's the official statement released by Scarlett Johansson, detailing OpenAI's alleged illegal usage of her voice...
->
-> ...read by the Sky AI voice, because irony.
->
-> https://x.com/BenjaminDEKR/status/1792693868497871086
-and in the replies, he [stated](https://x.com/BenjaminDEKR/status/1792714347275501595):
-> It's an ElevenLabs clone I made based on Sky audio before they removed it. Not perfect.
-Here is `Kokoro/af_sky`'s rendition of the same:
-<audio controls><source src="https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/demo/af_sky.wav" type="audio/wav"></audio>
-A crude reconstruction, but the model that produced that voice is Apache FOSS that can be downloaded from HF and run locally. You can reproduce the above by dragging the [text script](https://huggingface.co/hexgrad/Kokoro-82M/blob/main/demo/af_sky.txt) (note a handful of modified chars for better delivery) into the "Long Form" tab of this [hosted demo](https://huggingface.co/spaces/hexgrad/Kokoro-TTS), or you can download the [model weights](https://huggingface.co/hexgrad/Kokoro-82M), install dependencies and DIY.
-Sky shows that it is possible to reconstruct a voice—maybe a shadow of its former self, but a reconstruction nonetheless—from fairly little training data.
-### What's next
-Kokoro is a good start, but I can think of some tricks that might make it better, beginning with better data. More on this in another article.
-Feel free to check out [Kokoro's weights](https://huggingface.co/hexgrad/Kokoro-82M), try out a no-install [hosted demo](https://huggingface.co/spaces/hexgrad/Kokoro-TTS), and/or [join the Discord](https://discord.gg/QuGxSWBfQy).