---
language:
- id
- ms
license: apache-2.0
tags:
- g2p
inference: false
---

# ID G2P LSTM

ID G2P LSTM is a grapheme-to-phoneme model based on the [LSTM](https://doi.org/10.1162/neco.1997.9.8.1735) architecture. It was trained from scratch on a modified [Malay/Indonesian lexicon](https://huggingface.co/datasets/bookbot/id_word2phoneme).

The model was trained with the [Keras](https://keras.io/) framework on Google Colaboratory, adapting the [LSTM training script](https://keras.io/examples/nlp/lstm_seq2seq/) from the official Keras code examples.

## Model

| Model         | #params | Arch. | Training/Validation data |
| ------------- | ------- | ----- | ------------------------ |
| `id-g2p-lstm` | 596K    | LSTM  | Malay/Indonesian Lexicon |

## Training Procedure

<details>
<summary>Model Config</summary>

    latent_dim: 256
    num_encoder_tokens: 28
    num_decoder_tokens: 32
    max_encoder_seq_length: 24
    max_decoder_seq_length: 25

</details>
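
These dimensions match the standard Keras character-level seq2seq layout. As a minimal sketch (an assumption based on the adapted Keras example, not spelled out by this card), the training model could be assembled like so; this wiring yields roughly 596K trainable parameters, consistent with the table above.

```py
import keras
from keras import layers

latent_dim = 256
num_encoder_tokens, num_decoder_tokens = 28, 32

# Encoder: reads one-hot grapheme frames and keeps only its final states.
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)

# Decoder: predicts the next phoneme, initialized from the encoder states.
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=[state_h, state_c])
outputs = layers.Dense(num_decoder_tokens, activation="softmax")(decoder_outputs)

training_model = keras.Model([encoder_inputs, decoder_inputs], outputs)
training_model.summary()  # roughly 596K trainable parameters
```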

<details>
<summary>Training Setting</summary>

    batch_size: 64
    optimizer: "rmsprop"
    loss: "categorical_crossentropy"
    learning_rate: 0.001
    epochs: 100

</details>
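
Continuing the sketch above, these settings translate to a compile-and-fit call along the following lines. The data arrays (`encoder_input_data`, `decoder_input_data`, `decoder_target_data`) are hypothetical placeholders for the one-hot encoded lexicon, and the validation split is an assumption.

```py
# Sketch only: data arrays are placeholders for the one-hot encoded lexicon.
training_model.compile(
    optimizer=keras.optimizers.RMSprop(learning_rate=0.001),
    loss="categorical_crossentropy",
)
training_model.fit(
    [encoder_input_data, decoder_input_data],
    decoder_target_data,
    batch_size=64,
    epochs=100,
    validation_split=0.2,  # assumption; the card does not state the split
)
```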

## How to Use

<details>
<summary>Tokenizers</summary>

    g2id = {
        ' ': 27,
        '-': 0,
        'a': 1,
        'b': 2,
        'c': 3,
        'd': 4,
        'e': 5,
        'f': 6,
        'g': 7,
        'h': 8,
        'i': 9,
        'j': 10,
        'k': 11,
        'l': 12,
        'm': 13,
        'n': 14,
        'o': 15,
        'p': 16,
        'q': 17,
        'r': 18,
        's': 19,
        't': 20,
        'u': 21,
        'v': 22,
        'w': 23,
        'y': 24,
        'z': 25,
        '’': 26
    }

    p2id = {
        '\t': 0,
        '\n': 1,
        ' ': 31,
        '-': 2,
        'a': 3,
        'b': 4,
        'd': 5,
        'e': 6,
        'f': 7,
        'g': 8,
        'h': 9,
        'i': 10,
        'j': 11,
        'k': 12,
        'l': 13,
        'm': 14,
        'n': 15,
        'o': 16,
        'p': 17,
        'r': 18,
        's': 19,
        't': 20,
        'u': 21,
        'v': 22,
        'w': 23,
        'z': 24,
        'ŋ': 25,
        'ə': 26,
        'ɲ': 27,
        'ʃ': 28,
        'ʒ': 29,
        'ʔ': 30
    }

</details>
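
These dictionaries map graphemes (`g2id`) and phonemes (`p2id`) to one-hot indices. A quick illustration of the grapheme mapping:

```py
# Illustrative only: encode a word's graphemes to indices and back.
word = "kucing"
ids = [g2id[c] for c in word]          # [11, 21, 3, 9, 14, 7]
id2g = {v: k for k, v in g2id.items()}
print("".join(id2g[i] for i in ids))   # kucing
```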

```py
import keras
import numpy as np
from huggingface_hub import from_pretrained_keras

latent_dim = 256
bos_token, eos_token, pad_token = "\t", "\n", " "
num_encoder_tokens, num_decoder_tokens = 28, 32
max_encoder_seq_length, max_decoder_seq_length = 24, 25

model = from_pretrained_keras("bookbot/id-g2p-lstm")

# Rebuild the encoder: feed graphemes in, keep only the final LSTM states.
encoder_inputs = model.input[0]
encoder_outputs, state_h_enc, state_c_enc = model.layers[2].output
encoder_states = [state_h_enc, state_c_enc]
encoder_model = keras.Model(encoder_inputs, encoder_states)

# Rebuild the decoder so it can run one step at a time during inference.
decoder_inputs = model.input[1]
decoder_state_input_h = keras.Input(shape=(latent_dim,), name="input_3")
decoder_state_input_c = keras.Input(shape=(latent_dim,), name="input_4")
decoder_states_inputs = [decoder_state_input_h, decoder_state_input_c]
decoder_lstm = model.layers[3]
decoder_outputs, state_h_dec, state_c_dec = decoder_lstm(
    decoder_inputs, initial_state=decoder_states_inputs
)
decoder_states = [state_h_dec, state_c_dec]
decoder_dense = model.layers[4]
decoder_outputs = decoder_dense(decoder_outputs)
decoder_model = keras.Model(
    [decoder_inputs] + decoder_states_inputs, [decoder_outputs] + decoder_states
)

def inference(sequence):
    id2p = {v: k for k, v in p2id.items()}

    # One-hot encode the input graphemes, padding the remainder with spaces.
    input_seq = np.zeros(
        (1, max_encoder_seq_length, num_encoder_tokens), dtype="float32"
    )
    for t, char in enumerate(sequence):
        input_seq[0, t, g2id[char]] = 1.0
    input_seq[0, t + 1 :, g2id[pad_token]] = 1.0

    # Encode the word, then decode phonemes one step at a time.
    states_value = encoder_model.predict(input_seq)

    target_seq = np.zeros((1, 1, num_decoder_tokens))
    target_seq[0, 0, p2id[bos_token]] = 1.0

    stop_condition = False
    decoded_sentence = ""
    while not stop_condition:
        output_tokens, h, c = decoder_model.predict([target_seq] + states_value)

        # Greedily pick the most likely phoneme.
        sampled_token_index = np.argmax(output_tokens[0, -1, :])
        sampled_char = id2p[sampled_token_index]
        decoded_sentence += sampled_char

        # Stop at the end-of-sequence token or the maximum output length.
        if sampled_char == eos_token or len(decoded_sentence) > max_decoder_seq_length:
            stop_condition = True

        # Feed the sampled phoneme back in as the next decoder input.
        target_seq = np.zeros((1, 1, num_decoder_tokens))
        target_seq[0, 0, sampled_token_index] = 1.0
        states_value = [h, c]

    return decoded_sentence.replace(eos_token, "")

inference("mengembangkannya")
```
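
The final call returns the predicted phoneme string built from the `p2id` inventory; for `"mengembangkannya"`, something along the lines of `məŋəmbaŋkanɲa`, though the exact output depends on the trained weights.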

## Authors

ID G2P LSTM was trained and evaluated by [Ananto Joyoadikusumo](https://anantoj.github.io/), [Steven Limcorn](https://stevenlimcorn.github.io/), and [Wilson Wongso](https://w11wo.github.io/). All computation and development were done on AWS SageMaker.

## Framework versions

- Keras 2.8.0
- TensorFlow 2.8.0