import os
import subprocess

# Build espeak-ng 1.52.0 from source at startup; it serves as the fallback phonemizer below.
subprocess.run(['apt-get', 'update'])
subprocess.run(['apt-get', 'install', '-y', 'build-essential', 'gawk', 'libasound2-dev', 'libpulse-dev', 'autoconf', 'automake', 'libtool'])
subprocess.run(['wget', 'https://github.com/espeak-ng/espeak-ng/archive/refs/tags/1.52.0.tar.gz'])
subprocess.run(['tar', 'xf', '1.52.0.tar.gz'])
cwd = 'espeak-ng-1.52.0'
subprocess.run(['./autogen.sh'], cwd=cwd)
subprocess.run(['./configure'], cwd=cwd)
subprocess.run(['make'], cwd=cwd)
subprocess.run(['make', 'install'], cwd=cwd)
del cwd

# Sanity check: run the freshly built binary with the new shared library preloaded.
env = os.environ.copy()
env['LD_PRELOAD'] = '/usr/local/lib/libespeak-ng.so.1'
subprocess.run(['espeak-ng', '--version'], env=env)

# Point phonemizer at the freshly built library before the misaki imports below.
from phonemizer.backend.espeak.wrapper import EspeakWrapper
EspeakWrapper.set_library('/usr/local/lib/libespeak-ng.so.1')
import spaces

# Placeholder function decorated with @spaces.GPU; nothing in this app calls it.
@spaces.GPU
def greet(n):
    return f"Hello {n} Tensor"
from misaki import en, espeak
import gradio as gr
import pprint
import time
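# One espeak fallback per accent, indexed by the `british` flag (False -> 0, True -> 1).
fbs = [espeak.EspeakFallback(british=british) for british in (False, True)]
# g2p[trf][british]: one G2P pipeline per (spaCy transformer, accent) combination.
g2p = [[en.G2P(trf=trf, british=british, fallback=fbs[british]) for british in (False, True)] for trf in (False, True)]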
def predict(text, use_spacy_transformer, british):
    start = time.time()
    ps, tokens = g2p[use_spacy_transformer][british](text)
    # Collect per-token debug info; an entry in `tokens` may itself be a list of tokens.
    debug = []
    for word in tokens:
        if isinstance(word, list):
            debug.append([])
            for t in word:
                debug[-1].append(t.debug_all())
        else:
            debug.append(word.debug_all())
    trace = pprint.pformat(debug)
    # Measured with time.time(), so this is wall-clock seconds despite the name.
    elapsed_cpu_time = time.time() - start
    return ps, len(ps), trace, elapsed_cpu_time
with gr.Blocks() as app:
    gr.Markdown('''
Misaki is an experimental G2P engine designed to power future versions of Kokoro models.
This English-only preview is primarily intended for researchers and linguists. It may be deeply uninteresting to most people.
    ''', container=True)
    gr.Interface(fn=predict, inputs=[gr.Text(), gr.Checkbox(), gr.Checkbox()],
        outputs=[gr.Text(label='phonemes'), gr.Number(label='token_count <= 510 fits in Kokoro context length'), gr.Text(label='trace'), gr.Number(label='elapsed_cpu_time')])
    gr.Markdown('''
### Examples
```md
American: [Misaki](/misˈɑki/) is an experimental G2P engine designed to power future versions of [Kokoro](/kˈOkəɹO/) models.
British: [Misaki](/misˈɑːki/) is an experimental G2P engine designed to power future versions of [Kokoro](/kˈQkəɹQ/) models.
But I am the Chosen One. But I [am](+1) the Chosen [One](-1). But I [am](+2) the Chosen [One](-2).
1002. [1002](#a#). [1002](#an#). [1002](#a&#). 2025. 2,025. $45.67 billion trillion.
```
    ''', container=True)
    gr.Markdown('''
### Token-Level Trace
```py
# 1. Text. Can be useful for aligning text to phonemes, e.g. highlighting text during audio playback.
# 2. Tag. See a full list of tags from spaCy:
# https://github.com/explosion/spaCy/blob/master/spacy/glossary.py
# 3. Whitespace. Whether or not a token has trailing whitespace (string => bool for this demo).
whitespace = True if whitespace else False
# 4. Phonemes. For this demo, the question mark means UNK, the ninja emoji means empty string.
phonemes = '❓' if phonemes is None else ('🥷' if phonemes == '' else phonemes)
# 5. Rating. Star rating for the estimated quality of this token's phonemes.
ratings = dict(
user_override = '💎(5/5)',
gold = '🏆(4/5)',
silver = '🥈(3/5)',
bronze = '🥉(2/5)',
unk = '❓(UNK)',
)
```
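The fragment above relies on variables defined elsewhere. A self-contained sketch of the same display rules follows. The function name `display_fields`, its arguments, and the idea of looking ratings up by key name are assumptions made for this example, not Misaki's actual code.

```python
# Display rules from the fragment above; `display_fields` is a hypothetical helper.
RATINGS = dict(
    user_override='💎(5/5)',
    gold='🏆(4/5)',
    silver='🥈(3/5)',
    bronze='🥉(2/5)',
    unk='❓(UNK)',
)

def display_fields(whitespace, phonemes, rating):
    """Return (whitespace flag, phonemes as shown, rating label) for one token."""
    has_whitespace = bool(whitespace)  # string => bool
    if phonemes is None:
        shown = '❓'  # UNK
    elif phonemes == '':
        shown = '🥷'  # empty string
    else:
        shown = phonemes
    return has_whitespace, shown, RATINGS[rating]

print(display_fields(' ', 'ɹˈɛd', 'gold'))
print(display_fields('', None, 'unk'))
```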
    ''', container=True)
    gr.Markdown('''
### Notes
- For English, Misaki uses a gold dictionary with 80k words and a similarly sized silver dictionary.
- There are separate dictionaries for American & British English.
- Users can override the dictionary and/or individual tokens with custom pronunciations.
- `espeak-ng` is used as the fallback for OOD words, and the token is rated "bronze" in this case.
- Raw token objects are returned, with phonemes aligned at the per-token level.
- UNKs are easy to detect when `token.phonemes is None`.
- The entire implementation of Misaki (English) is <1000 lines of Python, excluding dictionary files.
- POS disambiguation should be live, e.g. to wound someone vs wound up.
- Enabling `use_spacy_transformer` should deliver more reliable POS tags.
- Non-POS-based disambiguation, like graph axes vs throwing axes, is still a TODO.
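
As a sketch of the UNK check mentioned in the notes, the example below scans a token list for entries whose `phonemes` is `None`. `Token` here is a minimal stand-in dataclass with assumed fields, not Misaki's real token class, and the sample tokens are invented for illustration. Nested lists are flattened because a token entry may itself be a list of tokens.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:  # stand-in for illustration; not Misaki's real token class
    text: str
    phonemes: Optional[str]

def find_unks(tokens):
    """Return the text of every token whose phonemes is None, flattening nested lists."""
    unks = []
    for item in tokens:
        group = item if isinstance(item, list) else [item]
        unks.extend(t.text for t in group if t.phonemes is None)
    return unks

# Invented sample: one nested group, one UNK, one token with empty phonemes.
tokens = [
    Token('red', 'ɹˈɛd'),
    [Token('sun', 'sˈʌn'), Token('xqzt', None)],
    Token(',', ''),
]
print(find_unks(tokens))  # prints ['xqzt']
```

An empty string is not treated as UNK here: only `None` counts, matching the distinction drawn in the token-level trace.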
    ''', container=True)

with gr.Blocks() as info:
    gr.Markdown('''
# Misaki English Phonemes
For English, Misaki currently uses 49 total phonemes. Of these, 41 are shared by both Americans and Brits, 4 are American-only, and 4 are British-only.
Disclaimer: Author is an ML researcher, not a linguist, and may have butchered or reappropriated the traditional meaning of some symbols. These symbols are intended as input tokens for neural networks to yield optimal performance.
### 🤝 Shared (41)
**Stress Marks (2)**
- `ˈ`: Primary stress, visually looks similar to an apostrophe.
- `ˌ`: Secondary stress.
**IPA Consonants (22)**
- `bdfhjklmnpstvwz`: 15 alpha consonants taken from IPA. They mostly sound as you'd expect, but `j` actually represents the "y" sound, like `yes => jˈɛs`.
- `ɡ`: Hard "g" sound, like `get => ɡɛt`. Visually looks like the lowercase letter g, but it's actually `U+0261`.
- `ŋ`: The "ng" sound, like `sung => sˈʌŋ`.
- `ɹ`: Upside-down r is just an "r" sound, like `red => ɹˈɛd`.
- `ʃ`: The "sh" sound, like `shin => ʃˈɪn`.
- `ʒ`: The "zh" sound, like `Asia => ˈAʒə`.
- `ð`: Soft (voiced) "th" sound, like `than => ðən`.
- `θ`: Hard (voiceless) "th" sound, like `thin => θˈɪn`.
**Consonant Clusters (2)**
- `ʤ`: A "j" or "dg" sound, merges `dʒ`, like `jump => ʤˈʌmp` or `lunge => lˈʌnʤ`.
- `ʧ`: The "ch" sound, merges `tʃ`, like `chump => ʧˈʌmp` or `lunch => lˈʌnʧ`.
**IPA Vowels (10)**
- `ə`: The schwa is a common, unstressed vowel sound, like `a 🍌 => ə 🍌`.
- `i`: As in `easy => ˈizi`.
- `u`: As in `flu => flˈu`.
- `ɑ`: As in `spa => spˈɑ`.
- `ɔ`: As in `all => ˈɔl`.
- `ɛ`: As in `hair => hˈɛɹ` or `bed => bˈɛd`. Possibly dubious, because those vowel sounds do not sound similar to my ear.
- `ɜ`: As in `her => hɜɹ`. Easy to confuse with `ɛ` above.
- `ɪ`: As in `brick => bɹˈɪk`.
- `ʊ`: As in `wood => wˈʊd`.
- `ʌ`: As in `sun => sˈʌn`.
**Diphthong Vowels (4)**
- `A`: The "eh" vowel sound, like `hey => hˈA`. Expands to `eɪ` in IPA.
- `I`: The "eye" vowel sound, like `high => hˈI`. Expands to `aɪ` in IPA.
- `W`: The "ow" vowel sound, like `how => hˌW`. Expands to `aʊ` in IPA.
- `Y`: The "oy" vowel sound, like `soy => sˈY`. Expands to `ɔɪ` in IPA.
**Custom Vowel (1)**
- `ᵊ`: Small schwa, muted version of `ə`, like `pixel => pˈɪksᵊl`. I made this one up, so I'm not entirely sure if it's correct.
### 🇺🇸 American-only (4)
**Vowels (3)**
- `æ`: The vowel sound at the start of `ash => ˈæʃ`.
- `O`: Capital letter representing the American "oh" vowel sound. Expands to `oʊ` in IPA.
- `ᵻ`: A sound somewhere in between `ə` and `ɪ`, often used in certain -s suffixes like `boxes => bˈɑksᵻz`.
**Consonant (1)**
- `ɾ`: A sound somewhere in between `t` and `d`, like `butter => bˈʌɾəɹ`.
### 🇬🇧 British-only (4)
**Vowels (3)**
- `a`: The vowel sound at the start of `ash => ˈaʃ`.
- `Q`: Capital letter representing the British "oh" vowel sound. Expands to `əʊ` in IPA.
- `ɒ`: The sound at the start of `on => ˌɒn`. Easy to confuse with `ɑ`, which is a shared phoneme.
**Other (1)**
- `ː`: Vowel extender, visually looks similar to a colon. Possibly dubious, because Americans extend vowels too, but the gold US dictionary somehow lacks these. Often used by the Brits instead of `ɹ`: Americans say `or => ɔɹ`, but Brits say `or => ɔː`.
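
The counts above can be checked mechanically. The sketch below copies the symbols from the lists on this page into Python strings and verifies that they add up to 41 shared, 4 American-only, and 4 British-only phonemes. The names `SHARED`, `AMERICAN_ONLY`, `BRITISH_ONLY`, and `unknown_symbols` are made up for this example and are not part of Misaki.

```python
# Symbols copied from the lists on this page; all names here are hypothetical.
SHARED = (
    'ˈˌ'               # stress marks (2)
    'bdfhjklmnpstvwz'  # alpha consonants (15)
    'ɡŋɹʃʒðθ'          # other IPA consonants (7)
    'ʤʧ'               # consonant clusters (2)
    'əiuɑɔɛɜɪʊʌ'       # IPA vowels (10)
    'AIWY'             # diphthong vowels (4)
    'ᵊ'                # custom vowel (1)
)
AMERICAN_ONLY = 'æOᵻɾ'
BRITISH_ONLY = 'aQɒː'

assert len(set(SHARED)) == 41
assert len(set(AMERICAN_ONLY)) == 4 and len(set(BRITISH_ONLY)) == 4
assert len(set(SHARED + AMERICAN_ONLY + BRITISH_ONLY)) == 49

def unknown_symbols(ps, british=False):
    """Return the symbols in ps that fall outside the chosen inventory (spaces ignored)."""
    allowed = set(SHARED) | set(BRITISH_ONLY if british else AMERICAN_ONLY)
    return sorted(set(ps) - allowed - {' '})

print(unknown_symbols('bˈʌɾəɹ'))                # prints []
print(unknown_symbols('bˈʌɾəɹ', british=True))  # prints ['ɾ']
```

A check like this can flag phoneme strings that mix the American and British inventories before they reach a model.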
### ♻️ Misaki to espeak
```py
def to_espeak(ps):
# Optionally, you can add a tie character in between the 2 replacement characters.
ps = ps.replace('ʤ', 'dʒ').replace('ʧ', 'tʃ')
ps = ps.replace('A', 'eɪ').replace('I', 'aɪ').replace('Y', 'ɔɪ')
ps = ps.replace('O', 'oʊ').replace('Q', 'əʊ').replace('W', 'aʊ')
return ps.replace('ᵊ', 'ə')
```
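
As a quick check of the mapping above, the sketch below restates `to_espeak` so that it runs on its own and applies it to pronunciations listed earlier on this page. The `PAIRS` table and the `tie` parameter are additions for illustration and are not part of the original function: `tie` optionally places a tie character (here U+0361) between the two replacement characters, as the comment above suggests.

```python
# Sketch of to_espeak with an optional tie; `PAIRS` and `tie` are assumptions added here.
PAIRS = {
    'ʤ': 'dʒ', 'ʧ': 'tʃ',
    'A': 'eɪ', 'I': 'aɪ', 'Y': 'ɔɪ',
    'O': 'oʊ', 'Q': 'əʊ', 'W': 'aʊ',
}

def to_espeak(ps, tie=''):
    # Each replacement is exactly two characters, so it unpacks into (first, second).
    for symbol, (first, second) in PAIRS.items():
        ps = ps.replace(symbol, first + tie + second)
    return ps.replace('ᵊ', 'ə')

print(to_espeak('ʤˈʌmp'))    # jump  => dʒˈʌmp
print(to_espeak('hˈA'))      # hey   => hˈeɪ
print(to_espeak('pˈɪksᵊl'))  # pixel => pˈɪksəl
print(to_espeak('hˈI', tie=chr(0x361)))  # high, with a tie between a and ɪ
```

None of the replacement characters is itself a Misaki shorthand symbol, so the order of the replacements does not affect the result.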
    ''')

demo = gr.TabbedInterface(
    [app, info],
    ['🔥 Misaki English', 'ℹ️ Phonemes'],
)
demo.launch()