import os
import subprocess

# Install build dependencies, then build and install espeak-ng 1.52.0 from source.
subprocess.run(['apt-get', 'update'])
subprocess.run(['apt-get', 'install', '-y', 'build-essential', 'gawk', 'libasound2-dev', 'libpulse-dev', 'autoconf', 'automake', 'libtool'])
subprocess.run(['wget', 'https://github.com/espeak-ng/espeak-ng/archive/refs/tags/1.52.0.tar.gz'])
subprocess.run(['tar', 'xf', '1.52.0.tar.gz'])
cwd = 'espeak-ng-1.52.0'
subprocess.run(['./autogen.sh'], cwd=cwd)
subprocess.run(['./configure'], cwd=cwd)
subprocess.run(['make'], cwd=cwd)
subprocess.run(['make', 'install'], cwd=cwd)
del cwd

# Run a version check with the freshly built shared library preloaded.
env = os.environ.copy()
env['LD_PRELOAD'] = '/usr/local/lib/libespeak-ng.so.1'
subprocess.run(['espeak-ng', '--version'], env=env)

# Point phonemizer at the same library so the espeak fallback uses this build.
from phonemizer.backend.espeak.wrapper import EspeakWrapper
EspeakWrapper.set_library('/usr/local/lib/libespeak-ng.so.1')

import spaces
@spaces.GPU
def greet(n):
    # Placeholder GPU-decorated function; the demo below never calls it.
    return f"Hello {n} Tensor"

from misaki import en, espeak
import gradio as gr
import pprint
import time

fbs = [espeak.EspeakFallback(british=british) for british in (False, True)]
g2p = [[en.G2P(trf=trf, british=british, fallback=fbs[british]) for british in (False, True)] for trf in (False, True)]

def predict(text, use_spacy_transformer, british):
    start = time.time()
    ps, tokens = g2p[use_spacy_transformer][british](text)
    debug = []
    for word in tokens:
        if isinstance(word, list):
            debug.append([])
            for t in word:
                debug[-1].append(t.debug_all())
        else:
            debug.append(word.debug_all())
    trace = pprint.pformat(debug)
    elapsed_cpu_time = time.time() - start
    return ps, len(ps), trace, elapsed_cpu_time

with gr.Blocks() as app:
    gr.Markdown('''
Misaki is an experimental G2P engine designed to power future versions of Kokoro models.

This English-only preview is primarily intended for researchers and linguists. It may be deeply uninteresting to most people.
''', container=True)
    gr.Interface(fn=predict, inputs=[gr.Text(), gr.Checkbox(), gr.Checkbox()],
                 outputs=[gr.Text(label='phonemes'), gr.Number(label='token_count <= 510 fits in Kokoro context length'), gr.Text(label='trace'), gr.Number(label='elapsed_cpu_time')])
    gr.Markdown('''
### Examples
```md
American: [Misaki](/misˈɑki/) is an experimental G2P engine designed to power future versions of [Kokoro](/kˈOkəɹO/) models.

British: [Misaki](/misˈɑːki/) is an experimental G2P engine designed to power future versions of [Kokoro](/kˈQkəɹQ/) models.

But I am the Chosen One. But I [am](+1) the Chosen [One](-1). But I [am](+2) the Chosen [One](-2).

1002. [1002](#a#). [1002](#an#). [1002](#a&#). 2025. 2,025. $45.67 billion trillion.
```
''', container=True)
    gr.Markdown('''
### Token-Level Trace
```py
# 1. Text. Can be useful for aligning text to phonemes, e.g. highlighting text during audio playback.

# 2. Tag. See a full list of tags from spaCy:
# https://github.com/explosion/spaCy/blob/master/spacy/glossary.py

# 3. Whitespace. Whether or not a token has trailing whitespace (string => bool for this demo).
whitespace = True if whitespace else False

# 4. Phonemes. For this demo, the question mark means UNK, the ninja emoji means empty string.
phonemes = '❓' if phonemes is None else ('🥷' if phonemes == '' else phonemes)

# 5. Rating. Star rating for the estimated quality of this token's phonemes.
ratings = dict(
user_override = '💎(5/5)',
gold = '🏆(4/5)',
silver = '🥈(3/5)',
bronze = '🥉(2/5)',
unk = '❓(UNK)',
)
```
''', container=True)
    gr.Markdown('''
### Notes
- For English, Misaki uses a gold dictionary with 80k words and a similarly sized silver dictionary.
- There are separate dictionaries for American & British English.
- Users can override the dictionary and/or individual tokens with custom pronunciations.
- `espeak-ng` is used as the fallback for OOD words, and the token is rated "bronze" in this case.
- Raw token objects are returned, with phonemes aligned at the per-token level.
- UNKs are easy to detect when `token.phonemes is None`.
- The entire implementation of Misaki (English) is <1000 lines of Python, excluding dictionary files.
- POS disambiguation should be live, e.g. to wound someone vs wound up.
- use_spacy_transformer should deliver more reliable POS tags.
- Non-POS-based disambiguation, like graph axes vs throwing axes, is still a TODO.
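
The UNK check mentioned above can be sketched as follows. This is a minimal illustration, not Misaki's own code: `Tok` is a hypothetical stand-in that only carries `text` and `phonemes`, and `find_unks` is a made-up helper name. It assumes tokens may arrive either flat or as nested lists of sub-tokens, as in this demo's trace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tok:
    # Hypothetical stand-in for a Misaki token; only the two fields used here.
    text: str
    phonemes: Optional[str]


def find_unks(tokens):
    # Collect the text of every token whose phonemes are None (UNK).
    # An empty string is a valid (silent) result, so it is not reported.
    unks = []
    for item in tokens:
        group = item if isinstance(item, list) else [item]
        unks.extend(t.text for t in group if t.phonemes is None)
    return unks


tokens = [Tok('Hello', 'həlˈO'), [Tok('xq', None), Tok('zz', '')], Tok('world', 'wˈɜɹld')]
print(find_unks(tokens))  # ['xq']
```

Comparing with `is None` rather than truthiness matters here, because an empty phoneme string is distinct from an unknown token.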
''', container=True)

with gr.Blocks() as info:
    gr.Markdown('''
# Misaki English Phonemes

For English, Misaki currently uses 49 total phonemes. Of these, 41 are shared by both Americans and Brits, 4 are American-only, and 4 are British-only.

Disclaimer: Author is an ML researcher, not a linguist, and may have butchered or reappropriated the traditional meaning of some symbols. These symbols are intended as input tokens for neural networks to yield optimal performance.


### 🤝 Shared (41)

**Stress Marks (2)**
- `ˈ`: Primary stress, visually looks similar to an apostrophe.
- `ˌ`: Secondary stress.

**IPA Consonants (22)**
- `bdfhjklmnpstvwz`: 15 alpha consonants taken from IPA. They mostly sound as you'd expect, but `j` actually represents the "y" sound, like `yes => jˈɛs`.
- `ɡ`: Hard "g" sound, like `get => ɡɛt`. Visually looks like the lowercase letter g, but it's actually `U+0261`.
- `ŋ`: The "ng" sound, like `sung => sˈʌŋ`.
- `ɹ`: Upside-down r is just an "r" sound, like `red => ɹˈɛd`.
- `ʃ`: The "sh" sound, like `shin => ʃˈɪn`.
- `ʒ`: The "zh" sound, like `Asia => ˈAʒə`.
- `ð`: Soft "th" sound, like `than => ðən`.
- `θ`: Hard "th" sound, like `thin => θˈɪn`.

**Consonant Clusters (2)**
- `ʤ`: A "j" or "dg" sound, merges `dʒ`, like `jump => ʤˈʌmp` or `lunge => lˈʌnʤ`.
- `ʧ`: The "ch" sound, merges `tʃ`, like `chump => ʧˈʌmp` or `lunch => lˈʌnʧ`.

**IPA Vowels (10)**
- `ə`: The schwa is a common, unstressed vowel sound, like `a 🍌 => ə 🍌`.
- `i`: As in `easy => ˈizi`.
- `u`: As in `flu => flˈu`.
- `ɑ`: As in `spa => spˈɑ`.
- `ɔ`: As in `all => ˈɔl`.
- `ɛ`: As in `hair => hˈɛɹ` or `bed => bˈɛd`. Possibly dubious, because those vowel sounds do not sound similar to my ear.
- `ɜ`: As in `her => hɜɹ`. Easy to confuse with `ɛ` above.
- `ɪ`: As in `brick => bɹˈɪk`.
- `ʊ`: As in `wood => wˈʊd`.
- `ʌ`: As in `sun => sˈʌn`.

**Diphthong Vowels (4)**
- `A`: The "eh" vowel sound, like `hey => hˈA`. Expands to `eɪ` in IPA.
- `I`: The "eye" vowel sound, like `high => hˈI`. Expands to `aɪ` in IPA.
- `W`: The "ow" vowel sound, like `how => hˌW`. Expands to `aʊ` in IPA.
- `Y`: The "oy" vowel sound, like `soy => sˈY`. Expands to `ɔɪ` in IPA.

**Custom Vowel (1)**
- `ᵊ`: Small schwa, muted version of `ə`, like `pixel => pˈɪksᵊl`. I made this one up, so I'm not entirely sure if it's correct.


### 🇺🇸 American-only (4)

**Vowels (3)**
- `æ`: The vowel sound at the start of `ash => ˈæʃ`.
- `O`: Capital letter representing the American "oh" vowel sound. Expands to `oʊ` in IPA.
- `ᵻ`: A sound somewhere in between `ə` and `ɪ`, often used in certain -s suffixes like `boxes => bˈɑksᵻz`.

**Consonant (1)**
- `ɾ`: A sound somewhere in between `t` and `d`, like `butter => bˈʌɾəɹ`.


### 🇬🇧 British-only (4)

**Vowels (3)**
- `a`: The vowel sound at the start of `ash => ˈaʃ`.
- `Q`: Capital letter representing the British "oh" vowel sound. Expands to `əʊ` in IPA.
- `ɒ`: The sound at the start of `on => ˌɒn`. Easy to confuse with `ɑ`, which is a shared phoneme.

**Other (1)**
- `ː`: Vowel extender, visually looks similar to a colon. Possibly dubious, because Americans extend vowels too, but the gold US dictionary somehow lacks these. Often used by the Brits instead of `ɹ`: Americans say `or => ɔɹ`, but Brits say `or => ɔː`.
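
The counts stated at the top of this page (41 shared, 4 American-only, 4 British-only, 49 total) can be double-checked with a short script. This is only a sanity-check sketch: the constant names are made up for illustration, and the symbol strings are transcribed by hand from the lists above.

```python
# Symbols transcribed from the lists above; names are illustrative only.
SHARED = {
    'stress': 'ˈˌ',
    'consonants': 'bdfhjklmnpstvwz' + 'ɡŋɹʃʒðθ',
    'clusters': 'ʤʧ',
    'vowels': 'əiuɑɔɛɜɪʊʌ',
    'diphthongs': 'AIWY',
    'custom': 'ᵊ',
}
AMERICAN_ONLY = 'æOᵻɾ'
BRITISH_ONLY = 'aQɒː'

shared = ''.join(SHARED.values())
all_symbols = shared + AMERICAN_ONLY + BRITISH_ONLY

# Every symbol should be a distinct code point.
assert len(set(all_symbols)) == len(all_symbols)

print(len(shared), len(AMERICAN_ONLY), len(BRITISH_ONLY), len(all_symbols))  # 41 4 4 49
```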


### ♻️ Misaki to espeak
```py
def to_espeak(ps):
    # Optionally, you can add a tie character in between the 2 replacement characters.
    ps = ps.replace('ʤ', 'dʒ').replace('ʧ', 'tʃ')
    ps = ps.replace('A', 'eɪ').replace('I', 'aɪ').replace('Y', 'ɔɪ')
    ps = ps.replace('O', 'oʊ').replace('Q', 'əʊ').replace('W', 'aʊ')
    return ps.replace('ᵊ', 'ə')
```
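
As an alternative sketch, the same substitutions can be applied in a single pass with `str.translate`. The function name `to_espeak_translate` and the table name are made up for this example; the mapping itself mirrors `to_espeak` above and is not a separate Misaki API.

```python
# One-pass variant of the mapping above, using a translation table.
MISAKI_TO_ESPEAK = str.maketrans({
    'ʤ': 'dʒ', 'ʧ': 'tʃ',
    'A': 'eɪ', 'I': 'aɪ', 'Y': 'ɔɪ',
    'O': 'oʊ', 'Q': 'əʊ', 'W': 'aʊ',
    'ᵊ': 'ə',
})


def to_espeak_translate(ps):
    # Each Misaki symbol is a single code point, so a per-character table is enough.
    return ps.translate(MISAKI_TO_ESPEAK)


print(to_espeak_translate('hˈA'))      # hˈeɪ
print(to_espeak_translate('ʤˈʌmp'))    # dʒˈʌmp
print(to_espeak_translate('pˈɪksᵊl'))  # pˈɪksəl
```

Because none of the replacement strings contain a symbol that is itself a key in the table, the chained `replace` version and this one-pass version should give the same result.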
''')

demo = gr.TabbedInterface(
    [app, info],
    ['🔥 Misaki English', 'ℹ️ Phonemes'],
)

demo.launch()