hexgrad committed on
Commit e8263b3 · verified · 1 Parent(s): 536502c

Delete voices/README.md

  1. voices/README.md +0 -137
voices/README.md DELETED
@@ -1,137 +0,0 @@
1
- ---
2
- license: apache-2.0
3
- language:
4
- - en
5
- base_model:
6
- - yl4579/StyleTTS2-LJSpeech
7
- pipeline_tag: text-to-speech
8
- ---
9
- ❤️ Kokoro Discord Server: https://discord.gg/QuGxSWBfQy
10
-
11
- <audio controls><source src="https://huggingface.co/hexgrad/Kokoro-82M/resolve/main/demo/HEARME.wav" type="audio/wav"></audio>
12
-
13
- **Kokoro** is a frontier TTS model for its size of **82 million parameters** (text in/audio out).
14
-
15
- On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision along with 2 voicepacks (Bella and Sarah), all under an Apache 2.0 license.
16
-
17
- At the time of release, Kokoro v0.19 was the #1🥇 ranked model in [TTS Spaces Arena](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena). With 82M params trained for <20 epochs on <100 total hours of audio, Kokoro achieved a higher Elo rating in this single-voice Arena setting than models such as:
18
- - XTTS v2: 467M, CPML, >10k hours
19
- - Edge TTS: Microsoft, proprietary
20
- - MetaVoice: 1.2B, Apache, 100k hours
21
- - Parler Mini: 880M, Apache, 45k hours
22
- - Fish Speech: ~500M, CC-BY-NC-SA, 1M hours
23
-
24
- Kokoro's ability to top this Elo ladder using relatively low compute and data suggests that the scaling law for traditional TTS models might have a steeper slope than previously expected.
25
-
26
- You can find a hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).
27
-
28
- ### Usage
29
-
30
- The following can be run in a single cell on [Google Colab](https://colab.research.google.com/).
31
- ```py
32
- # 1️⃣ Install dependencies silently
33
- !git clone https://huggingface.co/hexgrad/Kokoro-82M
34
- %cd Kokoro-82M
35
- !apt-get -qq -y install espeak-ng > /dev/null 2>&1
36
- !pip install -q phonemizer torch transformers scipy munch
37
-
38
- # 2️⃣ Build the model and load the default voicepack
39
- from models import build_model
40
- import torch
41
- device = 'cuda' if torch.cuda.is_available() else 'cpu'
42
- MODEL = build_model('kokoro-v0_19.pth', device)
43
- VOICEPACK = torch.load('voices/af.pt', weights_only=True).to(device)
44
-
45
- # 3️⃣ Call generate, which returns a 24 kHz audio waveform and a string of output phonemes
46
- from kokoro import generate
47
- text = "How could I know? It's an unanswerable question. Like asking an unborn child if they'll lead a good life. They haven't even been born."
48
- audio, out_ps = generate(MODEL, text, VOICEPACK)
49
-
50
- # 4️⃣ Display the 24 kHz audio and print the output phonemes
51
- from IPython.display import display, Audio
52
- display(Audio(data=audio, rate=24000, autoplay=True))
53
- print(out_ps)
54
- ```
55
- This inference code was quickly hacked together on Christmas Day. It is not clean code and leaves a lot of room for improvement. If you'd like to contribute, feel free to open a PR.
56
-
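To keep the generated audio outside a notebook, it can be written to a WAV file. The sketch below is not part of the repository's inference code. It uses only the Python standard library and assumes that `audio` is a one-dimensional sequence of float samples in the range [-1.0, 1.0] at 24 kHz; the helper name `save_wav` is hypothetical.

```python
import struct
import wave

def save_wav(samples, path, sample_rate=24000):
    """Write float samples in [-1.0, 1.0] to a mono 16-bit PCM WAV file.

    Assumption: `samples` is a 1-D iterable of floats, such as the
    waveform returned by generate(). Out-of-range values are clipped.
    """
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, float(s)))
        # Scale to signed 16-bit little-endian PCM.
        frames += struct.pack('<h', int(clipped * 32767))
    with wave.open(path, 'wb') as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 2 bytes per sample = 16-bit
        f.setframerate(sample_rate)
        f.writeframes(bytes(frames))
    # Number of samples written.
    return len(frames) // 2
```

After the usage snippet has run, a call such as `save_wav(audio, 'output.wav')` would write the clip to disk, provided the assumptions above hold for the returned waveform.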
57
- ### Model Description
58
-
59
- No affiliation can be assumed between parties on different lines.
60
-
61
- **Architecture:**
62
- - StyleTTS 2: https://arxiv.org/abs/2306.07691
63
- - ISTFTNet: https://arxiv.org/abs/2203.02395
64
- - Decoder only: no diffusion, no encoder release
65
-
66
- **Architected by:** Li et al @ https://github.com/yl4579/StyleTTS2
67
-
68
- **Trained by:** `@rzvzn` on Discord
69
-
70
- **Supported Languages:** English
71
-
72
- **Model SHA256 Hash:** `3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a`
73
-
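A downloaded checkpoint can be checked against the hash above. This is a minimal sketch using Python's standard `hashlib`. It assumes the published hash refers to the weights file `kokoro-v0_19.pth` named in the Usage section; the helper names are hypothetical.

```python
import hashlib

# Hash published in this model card.
EXPECTED_SHA256 = '3b0c392f87508da38fad3a2f9d94c359f1b657ebd2ef79f9d56d69503e470b0a'

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()

def matches_expected(path, expected=EXPECTED_SHA256):
    """Compare a file's digest with the expected hash (case-insensitive)."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected('kokoro-v0_19.pth')` should return `True` for an intact download if the assumption about which file the hash covers is correct.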
74
- **Releases:**
75
- - 25 Dec 2024: Model v0.19, `af_bella`, `af_sarah`
76
- - 26 Dec 2024: `am_adam`, `am_michael`
77
-
78
- **Licenses:**
79
- - Apache 2.0 weights in this repository
80
- - MIT inference code in [spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS) adapted from [yl4579/StyleTTS2](https://github.com/yl4579/StyleTTS2)
81
- - GPLv3 dependency in [espeak-ng](https://github.com/espeak-ng/espeak-ng)
82
-
83
- The inference code was originally MIT licensed by the paper author. Note that this card applies only to this model, Kokoro. Original models published by the paper author can be found at [hf.co/yl4579](https://huggingface.co/yl4579).
84
-
85
- ### Evaluation
86
-
87
- **Metric:** Elo rating
88
-
89
- **Leaderboard:** [hf.co/spaces/Pendrokar/TTS-Spaces-Arena](https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena)
90
-
91
- ![TTS-Spaces-Arena-25-Dec-2024](demo/TTS-Spaces-Arena-25-Dec-2024.png)
92
-
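For readers unfamiliar with the metric, the textbook Elo model can be sketched as follows. This is the standard formula only. How the Arena actually computes its ratings is not described here, and the K-factor of 32 is an illustrative assumption.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, score_a, k=32.0):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k is an assumed K-factor, not a value taken from the Arena.
    """
    ea = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - ea)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - ea))
    return new_a, new_b
```

Under this model a 400-point rating gap corresponds to the higher-rated system being preferred in roughly 10 out of 11 comparisons.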
93
- The voice ranked in the Arena is a 50-50 mix of Bella and Sarah. For your convenience, this mix is included in this repository as `af.pt`, but you can trivially reproduce it like this:
94
-
95
- ```py
96
- import torch
97
- bella = torch.load('voices/af_bella.pt', weights_only=True)
98
- sarah = torch.load('voices/af_sarah.pt', weights_only=True)
99
- af = torch.mean(torch.stack([bella, sarah]), dim=0)
100
- assert torch.equal(af, torch.load('voices/af.pt', weights_only=True))
101
- ```
102
-
103
- ### Training Details
104
-
105
- **Compute:** Kokoro was trained on A100 80 GB VRAM instances rented from [Vast.ai](https://cloud.vast.ai/?ref_id=79907) (referral link). Vast was chosen over other compute providers for its competitive on-demand hourly rates. The average cost of the A100 80 GB VRAM instances used for training was below $1/hr per GPU, around half the rates quoted by other providers at the time.
106
-
107
- **Data:** Kokoro was trained exclusively on **permissive/non-copyrighted audio data** and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
108
- - Public domain audio
109
- - Audio licensed under Apache, MIT, etc.
110
- - Synthetic audio<sup>[1]</sup> generated by closed<sup>[2]</sup> TTS models from large providers<br/>
111
- [1] https://copyright.gov/ai/ai_policy_guidance.pdf<br/>
112
- [2] No synthetic audio from open TTS models or "custom voice clones"
113
-
114
- **Epochs:** Less than **20 epochs**
115
-
116
- **Total Dataset Size:** Less than **100 hours** of audio
117
-
118
- ### Limitations
119
-
120
- Kokoro v0.19 is limited in several ways by its training set and architecture:
121
- - [Data] Lacks voice cloning capability, likely due to small <100h training set
122
- - [Arch] Relies on external g2p (espeak-ng), which introduces a class of g2p failure modes
123
- - [Data] Training dataset is mostly long-form reading and narration, not conversation
124
- - [Arch] At 82M params, Kokoro would almost certainly lose to a well-trained 1B+ param diffusion transformer, or a many-billion-param MLLM like GPT-4o / Gemini 2.0 Flash
125
- - [Data] Multilingual capability is architecturally feasible, but training data is almost entirely English
126
-
127
- **Will the other voicepacks be released?** There is currently no release date scheduled for the other voicepacks, but in the meantime you can try them in the hosted demo at [hf.co/spaces/hexgrad/Kokoro-TTS](https://huggingface.co/spaces/hexgrad/Kokoro-TTS).
128
-
129
- ### Acknowledgements
130
- - [@yl4579](https://huggingface.co/yl4579) for architecting StyleTTS 2
131
- - [@Pendrokar](https://huggingface.co/Pendrokar) for adding Kokoro as a contender in the TTS Spaces Arena
132
-
133
- ### Model Card Contact
134
-
135
- `@rzvzn` on Discord. Server invite: https://discord.gg/QuGxSWBfQy
136
-
137
- <img src="https://static0.gamerantimages.com/wordpress/wp-content/uploads/2024/08/terminator-zero-41-1.jpg" width="400" alt="kokoro" />