Francis0917 committed on

Commit 2045faa
1 Parent(s): 55d46a2

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes. See raw diff.

Files changed (50)
  1. .gitattributes +3 -0
  2. README.md +111 -8
  3. checkpoint_results/checkpoint_gctc_clap/20240725-154258/checkpoint +2 -0
  4. checkpoint_results/checkpoint_gctc_clap/20240725-154258/ckpt-29.data-00000-of-00001 +3 -0
  5. checkpoint_results/checkpoint_gctc_clap/20240725-154258/ckpt-29.index +0 -0
  6. checkpoint_results/checkpoint_guided_ctc/20240725-011006/checkpoint +2 -0
  7. checkpoint_results/checkpoint_guided_ctc/20240725-011006/ckpt-23.data-00000-of-00001 +3 -0
  8. checkpoint_results/checkpoint_guided_ctc/20240725-011006/ckpt-23.index +0 -0
  9. criterion/__pycache__/total.cpython-37.pyc +0 -0
  10. criterion/__pycache__/total_ctc1_clap.cpython-37.pyc +0 -0
  11. criterion/__pycache__/utils.cpython-37.pyc +0 -0
  12. criterion/total.py +69 -0
  13. criterion/total_CLKWS.py +100 -0
  14. criterion/total_ctc1.py +97 -0
  15. criterion/total_ctc1_clap.py +125 -0
  16. criterion/utils.py +32 -0
  17. dataset/__pycache__/dataloader_demo.cpython-37.pyc +0 -0
  18. dataset/__pycache__/dataloader_infe.cpython-37.pyc +0 -0
  19. dataset/__pycache__/google.cpython-37.pyc +0 -0
  20. dataset/__pycache__/google_infe202405.cpython-37.pyc +0 -0
  21. dataset/__pycache__/libriphrase.cpython-37.pyc +0 -0
  22. dataset/__pycache__/libriphrase_ctc1.cpython-37.pyc +0 -0
  23. dataset/__pycache__/qualcomm.cpython-37.pyc +0 -0
  24. dataset/dataloader_demo.py +182 -0
  25. dataset/dataloader_infe.py +164 -0
  26. dataset/g2p/LICENSE.txt +201 -0
  27. dataset/g2p/g2p_en/__init__.py +1 -0
  28. dataset/g2p/g2p_en/__pycache__/__init__.cpython-37.pyc +0 -0
  29. dataset/g2p/g2p_en/__pycache__/expand.cpython-37.pyc +0 -0
  30. dataset/g2p/g2p_en/__pycache__/g2p.cpython-37.pyc +0 -0
  31. dataset/g2p/g2p_en/checkpoint20.npz +3 -0
  32. dataset/g2p/g2p_en/expand.py +79 -0
  33. dataset/g2p/g2p_en/g2p.py +249 -0
  34. dataset/g2p/g2p_en/homographs.en +379 -0
  35. dataset/google.py +188 -0
  36. dataset/google_infe202405.py +192 -0
  37. dataset/libriphrase.py +331 -0
  38. dataset/libriphrase_ctc1.py +346 -0
  39. dataset/qualcomm.py +180 -0
  40. demo.py +168 -0
  41. docker/Dockerfile +25 -0
  42. flagged/Sound/c129aef35ba4cb66620f813cd7268c4be510a66d/ok_google-183000.wav +0 -0
  43. flagged/Sound/d35a5cf80a9403828bc601a0a761a5f88da06f00/realtek_go-183033.wav +0 -0
  44. flagged/log.csv +8 -0
  45. inference.py +141 -0
  46. model/__pycache__/discriminator.cpython-37.pyc +0 -0
  47. model/__pycache__/encoder.cpython-37.pyc +0 -0
  48. model/__pycache__/extractor.cpython-37.pyc +0 -0
  49. model/__pycache__/log_melspectrogram.cpython-37.pyc +0 -0
  50. model/__pycache__/speech_embedding.cpython-37.pyc +0 -0
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+checkpoint_results/checkpoint_gctc_clap/20240725-154258/ckpt-29.data-00000-of-00001 filter=lfs diff=lfs merge=lfs -text
+checkpoint_results/checkpoint_guided_ctc/20240725-011006/ckpt-23.data-00000-of-00001 filter=lfs diff=lfs merge=lfs -text
+model/google_speech_embedding/variables/variables.data-00000-of-00001 filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,12 +1,115 @@
 ---
-title: CL-KWS 202408 V1
-emoji: 📈
-colorFrom: blue
-colorTo: green
+title: CL-KWS_202408_v1
+app_file: demo.py
 sdk: gradio
-sdk_version: 4.44.0
-app_file: app.py
-pinned: false
+sdk_version: 3.34.0
 ---
-
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+### Datasets
+
+* [LibriPhrase]
+  LibriSpeech corpus: https://www.openslr.org/12
+  Recipe for LibriPhrase: https://github.com/gusrud1103/LibriPhrase
+
+* [Google Speech Commands]
+  http://download.tensorflow.org/data/speech_commands_v0.02.tar.gz
+  http://download.tensorflow.org/data/speech_commands_test_set_v0.02.tar.gz
+  https://www.tensorflow.org/datasets/catalog/speech_commands
+
+* [Qualcomm Keyword Speech]
+  https://www.qualcomm.com/developer/software/keyword-speech-dataset
+
+* [MUSAN (noise)]
+  https://www.openslr.org/17/
+
+## Getting started
+
+### Environment
+
+```bash
+# python=3.7
+conda create --name [name] python=3.7
+conda install -c "nvidia/label/cuda-11.6.0" cuda-nvcc
+conda install -c conda-forge cudnn=8.2.1.32
+pip install -r requirements.txt
+pip install numpy==1.18.5
+pip install tensorflow-model-optimization==0.6.0
+cd /miniconda3/envs/[name]/lib
+ln -s libcusolver.so.11 libcusolver.so.10
+# export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/share/homes/yiting/miniconda3/envs/pho/lib
+```
+
+### Training
+
+```bash
+python train_guided_CTC.py \
+    --epoch 23 \
+    --lr 1e-3 \
+    --loss_weight 1.0 1.0 0.2 \
+    --audio_input both \
+    --text_input phoneme \
+    --comment 'user comments for each experiment'
+```
+
+```bash
+python train.py \
+    --epoch 18 \
+    --lr 1e-3 \
+    --loss_weight 1.0 1.0 \
+    --audio_input both \
+    --text_input phoneme \
+    --comment 'user comments for each experiment'
+```
+
+### Fine-tuning
+
+Checkpoint: ./checkpoint_results/checkpoint_guided_ctc/20240725-011006
+
+```bash
+python train_guided_ctc_clap.py \
+    --epoch 5 \
+    --lr 1e-3 \
+    --loss_weight 1.0 1.0 0.01 0.01 \
+    --audio_input both \
+    --text_input phoneme \
+    --load_checkpoint_path '/home/DB/checkpoint_results/checkpoint_guided_ctc/date-time' \
+    --comment 'user comments for each experiment'
+```
+
+```bash
+python train_CLKWS.py \
+    --epoch 4 \
+    --lr 1e-3 \
+    --loss_weight 1.0 1.0 \
+    --audio_input both \
+    --text_input phoneme \
+    --load_checkpoint_path '/home/DB/checkpoint_results/checkpoint/date-time' \
+    --comment 'user comments for each experiment'
+```
+
+### Inference
+
+The keyword list is `target_list` in dataset/google_infe202405.py.
+
+```bash
+python inference.py --audio_input both --text_input phoneme --load_checkpoint_path '/home/DB/checkpoint_results/checkpoint/20240515-111757'
+```
+
+### Demo
+
+Checkpoints: ./checkpoint_results/checkpoint_guided_ctc/20240725-011006 and
+./checkpoint_results/checkpoint_gctc_clap/20240725-154258
+
+```bash
+python demo.py --audio_input both --text_input phoneme --load_checkpoint_path '/home/DB/checkpoint_results/checkpoint_guided_ctc/20240725-011006' --keyword_list_length 8
+```
+
+Demo website: Gradio prints a "Running on public URL" link at startup.
+Upload file format: mono WAV, 256 kbps, 22050 Hz.
+Maximum audio length: `self.maxlen_a = 56000` samples in dataset/dataloader_demo.py (3.5 s at 16 kHz).
+
+### Monitoring
+
+```bash
+tensorboard --logdir ./log/ --bind_all
+```
+
+### Acknowledgements
+
+We acknowledge the following code repository:
+https://github.com/ncsoft/PhonMatchNet
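
The demo expects mono 22050 Hz WAV uploads. As a convenience, here is a minimal sketch of converting an arbitrary recording into that format with pydub (which the demo dataloader already imports); the file names are placeholders, not repository paths:

```python
# Minimal sketch: convert a recording to the mono, 22050 Hz WAV format
# the demo expects. File names are placeholders.
from pydub import AudioSegment

def to_demo_wav(src_path, dst_path):
    audio = AudioSegment.from_file(src_path)
    audio = audio.set_channels(1)        # mono
    audio = audio.set_frame_rate(22050)  # 22050 Hz
    audio.export(dst_path, format="wav")

to_demo_wav("my_recording.m4a", "ok_google.wav")
```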
checkpoint_results/checkpoint_gctc_clap/20240725-154258/checkpoint ADDED
@@ -0,0 +1,2 @@
+model_checkpoint_path: "ckpt-29"
+all_model_checkpoint_paths: "ckpt-29"
checkpoint_results/checkpoint_gctc_clap/20240725-154258/ckpt-29.data-00000-of-00001 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:25da31f91bcff94540bf57296b058d07aaaa804c85ad59d5eaf9bc3f9803c62f
+size 1211835
checkpoint_results/checkpoint_gctc_clap/20240725-154258/ckpt-29.index ADDED
Binary file (2.23 kB)
 
checkpoint_results/checkpoint_guided_ctc/20240725-011006/checkpoint ADDED
@@ -0,0 +1,2 @@
+model_checkpoint_path: "ckpt-23"
+all_model_checkpoint_paths: "ckpt-23"
checkpoint_results/checkpoint_guided_ctc/20240725-011006/ckpt-23.data-00000-of-00001 ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e0228d6d9c71e767409ff8d2a300eda7d8d115185d3b793699cae730715424aa
+size 3630878
checkpoint_results/checkpoint_guided_ctc/20240725-011006/ckpt-23.index ADDED
Binary file (6.37 kB)
 
criterion/__pycache__/total.cpython-37.pyc ADDED
Binary file (2.78 kB)
 
criterion/__pycache__/total_ctc1_clap.cpython-37.pyc ADDED
Binary file (4.29 kB)
 
criterion/__pycache__/utils.cpython-37.pyc ADDED
Binary file (1.52 kB)
 
criterion/total.py ADDED
@@ -0,0 +1,69 @@
+import os, sys
+import tensorflow as tf
+import numpy as np
+from tensorflow.keras.losses import Loss, MeanSquaredError
+
+seed = 42
+tf.random.set_seed(seed)
+np.random.seed(seed)
+
+def sequence_cross_entropy(speech_label, text_label, logits, reduction='sum'):
+    """
+    args
+        speech_label : [B, Ls]
+        text_label : [B, Lt]
+        logits : [B, Lt]
+        logits._keras_mask : [B, Lt]
+    """
+    # Data pre-processing
+    if tf.shape(text_label)[1] > tf.shape(speech_label)[1]:
+        speech_label = tf.pad(speech_label, [[0, 0], [0, tf.shape(text_label)[1] - tf.shape(speech_label)[1]]], 'CONSTANT', constant_values=0)
+    elif tf.shape(text_label)[1] < tf.shape(speech_label)[1]:
+        speech_label = speech_label[:, :text_label.shape[1]]
+
+    # Make paired data between text and speech phonemes
+    paired_label = tf.math.equal(text_label, speech_label)
+    paired_label = tf.cast(tf.math.logical_and(tf.cast(paired_label, tf.bool), tf.cast(logits._keras_mask, tf.bool)), tf.float32)
+    paired_label = tf.reshape(tf.ragged.boolean_mask(paired_label, tf.cast(logits._keras_mask, tf.bool)).flat_values, [-1, 1])
+    logits = tf.reshape(tf.ragged.boolean_mask(logits, tf.cast(logits._keras_mask, tf.bool)).flat_values, [-1, 1])
+
+    # Get BinaryCrossEntropy loss
+    BCE = tf.keras.losses.BinaryCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
+    loss = BCE(paired_label, logits)
+
+    if reduction == 'sum':
+        loss = tf.math.divide_no_nan(loss, tf.cast(tf.shape(logits)[0], loss.dtype))
+        loss = tf.math.multiply_no_nan(loss, tf.cast(tf.shape(speech_label)[0], loss.dtype))
+
+    return loss
+
+def detection_loss(y_true, y_pred):
+    BFC = tf.keras.losses.BinaryCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
+    return BFC(y_true, y_pred)
+
+class TotalLoss(Loss):
+    def __init__(self, weight=1.0):
+        super().__init__()
+        self.weight = weight
+
+    def __call__(self, y_true, y_pred, reduction='sum'):
+        LD = detection_loss(y_true, y_pred)
+        return self.weight * LD, LD
+
+class TotalLoss_SCE(Loss):
+    def __init__(self, weight=[1.0, 1.0]):
+        super().__init__()
+        self.weight = weight
+
+    def __call__(self, y_true, y_pred, speech_label, text_label, logit, reduction='sum'):
+        if self.weight[0] != 0.0:
+            LD = detection_loss(y_true, y_pred)
+        else:
+            LD = 0
+        if self.weight[1] != 0.0:
+            LC = sequence_cross_entropy(speech_label, text_label, logit, reduction=reduction)
+        else:
+            LC = 0
+        return self.weight[0] * LD + self.weight[1] * LC, LD, LC
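
The core of `sequence_cross_entropy` is the pairing step: the speech-side phoneme labels are padded (or truncated) to the text length, and positions where the two sequences agree become positive targets for a masked binary cross-entropy. A toy trace of that step (values invented, mask omitted):

```python
# Toy sketch of the pairing logic inside sequence_cross_entropy.
import tensorflow as tf

text_label   = tf.constant([[3, 5, 0]])   # [B=1, Lt=3]
speech_label = tf.constant([[3, 7]])      # [B=1, Ls=2]

# Pad the speech labels to the text length (Lt > Ls here).
pad = text_label.shape[1] - speech_label.shape[1]
speech_label = tf.pad(speech_label, [[0, 0], [0, pad]])   # [[3, 7, 0]]

# Positions where the two phoneme sequences agree become the BCE targets.
paired = tf.cast(tf.math.equal(text_label, speech_label), tf.float32)
print(paired.numpy())  # [[1. 0. 1.]]
```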
criterion/total_CLKWS.py ADDED
@@ -0,0 +1,100 @@
+import os, sys
+import tensorflow as tf
+import numpy as np
+from tensorflow.keras.losses import Loss, MeanSquaredError
+import math
+
+seed = 42
+tf.random.set_seed(seed)
+np.random.seed(seed)
+
+def sequence_cross_entropy(speech_label, text_label, logits, reduction='sum'):
+    """
+    args
+        speech_label : [B, Ls]
+        text_label : [B, Lt]
+        logits : [B, Lt]
+        logits._keras_mask : [B, Lt]
+    """
+    # Data pre-processing
+    if tf.shape(text_label)[1] > tf.shape(speech_label)[1]:
+        speech_label = tf.pad(speech_label, [[0, 0], [0, tf.shape(text_label)[1] - tf.shape(speech_label)[1]]], 'CONSTANT', constant_values=0)
+    elif tf.shape(text_label)[1] < tf.shape(speech_label)[1]:
+        speech_label = speech_label[:, :text_label.shape[1]]
+
+    # Make paired data between text and speech phonemes
+    paired_label = tf.math.equal(text_label, speech_label)
+    paired_label = tf.cast(tf.math.logical_and(tf.cast(paired_label, tf.bool), tf.cast(logits._keras_mask, tf.bool)), tf.float32)
+    paired_label = tf.reshape(tf.ragged.boolean_mask(paired_label, tf.cast(logits._keras_mask, tf.bool)).flat_values, [-1, 1])
+    logits = tf.reshape(tf.ragged.boolean_mask(logits, tf.cast(logits._keras_mask, tf.bool)).flat_values, [-1, 1])
+
+    # Get BinaryCrossEntropy loss
+    BCE = tf.keras.losses.BinaryCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
+    loss = BCE(paired_label, logits)
+
+    if reduction == 'sum':
+        loss = tf.math.divide_no_nan(loss, tf.cast(tf.shape(logits)[0], loss.dtype))
+        loss = tf.math.multiply_no_nan(loss, tf.cast(tf.shape(speech_label)[0], loss.dtype))
+
+    return loss
+
+def detection_loss(y_true, y_pred):
+    BFC = tf.keras.losses.BinaryCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
+    return BFC(y_true, y_pred)
+
+def matrix_loss_0(y_true, y_pred):
+    MBC_0 = tf.keras.losses.CategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
+    return MBC_0(y_true, y_pred)
+
+def matrix_loss_1(y_true, y_pred):
+    MBC_1 = tf.keras.losses.CategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
+    return MBC_1(y_true, y_pred)
+
+class TotalLoss(Loss):
+    def __init__(self, weight=1.0):
+        super().__init__()
+        self.weight = weight
+
+    def __call__(self, y_true, y_pred, reduction='sum'):
+        LD = detection_loss(y_true, y_pred)
+        return self.weight * LD, LD
+
+class TotalLoss_SCE(Loss):
+    def __init__(self, weight=[1.0, 1.0]):
+        super().__init__()
+        self.weight = weight
+
+    def __call__(self, y_true, y_pred, speech_label, text_label, logit, prob, reduction='sum'):
+        if self.weight[0] != 0.0:
+            LD = detection_loss(y_true, y_pred)
+        else:
+            LD = 0
+        if self.weight[1] != 0.0:
+            LC = sequence_cross_entropy(speech_label, text_label, logit, reduction=reduction)
+        else:
+            LC = 0
+
+        # Symmetric cross-entropy over 5x5 similarity blocks:
+        # once along the audio axis, once along the text axis.
+        number_1 = 5
+        number_2 = int(y_pred.shape[0] // number_1)
+        number_3 = int(y_pred.shape[0] // (number_1 * number_1))
+
+        y_pred_1 = tf.reshape(prob, [number_2, number_1])
+        y_true_1 = tf.reshape(y_true, [number_2, number_1])
+        loss_audio = matrix_loss_0(y_true_1, y_pred_1)
+
+        x = tf.reshape(prob, [number_3, number_1, number_1])
+        x_transposed = tf.transpose(x, perm=[0, 2, 1])
+        y_pred_2 = tf.reshape(x_transposed, [number_2, number_1])
+        y = tf.reshape(y_true, [number_3, number_1, number_1])
+        y_transposed = tf.transpose(y, perm=[0, 2, 1])
+        y_true_2 = tf.reshape(y_transposed, [number_2, number_1])
+        loss_text = matrix_loss_1(y_true_2, y_pred_2)
+        loss = 0.5 * loss_audio + 0.5 * loss_text
+
+        return self.weight[0] * LD + self.weight[1] * LC + loss, LD, LC
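
The `number_1 = 5` block reshapes the flat batch of pair probabilities into 5x5 similarity blocks and applies a categorical cross-entropy along the audio axis and, after transposing, along the text axis, then averages the two — a symmetric (CLAP-style) contrastive objective. A self-contained sketch under the assumption of a single 5x5 block (25 audio-text pairs, matches on the diagonal; values illustrative only):

```python
# Minimal sketch, assuming one 5x5 similarity block of 25 pairs.
import tensorflow as tf

prob   = tf.random.normal([25])          # pairwise similarity logits
y_true = tf.reshape(tf.eye(5), [25])     # 1 where audio i matches text i

ce = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.SUM)

sim = tf.reshape(prob, [5, 5])           # rows: audio, cols: text
lbl = tf.reshape(y_true, [5, 5])
loss_audio = ce(lbl, sim)                               # pick text per audio
loss_text  = ce(tf.transpose(lbl), tf.transpose(sim))   # pick audio per text
loss = 0.5 * loss_audio + 0.5 * loss_text
```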
criterion/total_ctc1.py ADDED
@@ -0,0 +1,97 @@
+import os, sys
+import tensorflow as tf
+import numpy as np
+from tensorflow.keras.losses import Loss, MeanSquaredError
+
+seed = 42
+tf.random.set_seed(seed)
+np.random.seed(seed)
+
+def sequence_cross_entropy(speech_label, text_label, logits, reduction='sum'):
+    """
+    args
+        speech_label : [B, Ls]
+        text_label : [B, Lt]
+        logits : [B, Lt]
+        logits._keras_mask : [B, Lt]
+    """
+    # Data pre-processing
+    if tf.shape(text_label)[1] > tf.shape(speech_label)[1]:
+        speech_label = tf.pad(speech_label, [[0, 0], [0, tf.shape(text_label)[1] - tf.shape(speech_label)[1]]], 'CONSTANT', constant_values=0)
+    elif tf.shape(text_label)[1] < tf.shape(speech_label)[1]:
+        speech_label = speech_label[:, :text_label.shape[1]]
+
+    # Make paired data between text and speech phonemes
+    paired_label = tf.math.equal(text_label, speech_label)
+    paired_label = tf.cast(tf.math.logical_and(tf.cast(paired_label, tf.bool), tf.cast(logits._keras_mask, tf.bool)), tf.float32)
+    paired_label = tf.reshape(tf.ragged.boolean_mask(paired_label, tf.cast(logits._keras_mask, tf.bool)).flat_values, [-1, 1])
+    logits = tf.reshape(tf.ragged.boolean_mask(logits, tf.cast(logits._keras_mask, tf.bool)).flat_values, [-1, 1])
+
+    # Get BinaryCrossEntropy loss
+    BCE = tf.keras.losses.BinaryCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
+    loss = BCE(paired_label, logits)
+
+    if reduction == 'sum':
+        loss = tf.math.divide_no_nan(loss, tf.cast(tf.shape(logits)[0], loss.dtype))
+        loss = tf.math.multiply_no_nan(loss, tf.cast(tf.shape(speech_label)[0], loss.dtype))
+
+    return loss
+
+def detection_loss(y_true, y_pred):
+    BFC = tf.keras.losses.BinaryCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
+    return BFC(y_true, y_pred)
+
+def ctc_loss(affinity_matrix, speech_labels, text_labels, n_speech):
+    # logit_length
+    # n_speech = tf.math.reduce_sum(tf.cast(affinity_matrix._keras_mask, tf.float32), -1)
+
+    # logit
+    transposed_logits = tf.transpose(affinity_matrix, perm=[0, 2, 1])
+    # log_probs = tf.math.log(transposed_logits + 1e-8)
+    # logits_approx = log_probs - tf.reduce_max(log_probs, axis=-1, keepdims=True)
+
+    # label
+    matches = tf.equal(speech_labels, text_labels)
+    indices = tf.range(text_labels.shape[1], dtype=tf.int32)
+    selected_indices = tf.where(matches, indices, tf.fill(tf.shape(text_labels), 0))
+    labels = tf.where(tf.equal(text_labels, 0), text_labels, selected_indices)
+
+    # label_length
+    label_length = tf.math.count_nonzero(labels, axis=1)
+
+    ctc_loss = tf.nn.ctc_loss(labels, transposed_logits, label_length, n_speech,
+                              logits_time_major=False,
+                              unique=None,
+                              blank_index=0,
+                              name=None)
+
+    return ctc_loss
+
+class TotalLoss(Loss):
+    def __init__(self, weight=1.0):
+        super().__init__()
+        self.weight = weight
+
+    def __call__(self, y_true, y_pred, reduction='sum'):
+        LD = detection_loss(y_true, y_pred)
+        return self.weight * LD, LD
+
+class TotalLoss_SCE(Loss):
+    def __init__(self, weight=[1.0, 1.0, 0.2]):
+        super().__init__()
+        self.weight = weight
+
+    def __call__(self, y_true, y_pred, speech_label, text_label, logit, affinity_matrix, n_speech, reduction='sum'):
+        ctc = ctc_loss(affinity_matrix, speech_label, text_label, n_speech)
+
+        if self.weight[0] != 0.0:
+            LD = detection_loss(y_true, y_pred)
+        else:
+            LD = 0
+        if self.weight[1] != 0.0:
+            LC = sequence_cross_entropy(speech_label, text_label, logit, reduction=reduction)
+        else:
+            LC = 0
+        return self.weight[0] * LD + self.weight[1] * LC + self.weight[2] * ctc, LD, LC
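
For orientation, the shapes the `tf.nn.ctc_loss` call above consumes: the logits become `[B, T, C]` after the transpose (`logits_time_major=False`), the dense zero-padded labels ride with `label_length`, and `n_speech` supplies the valid frame count per utterance. A standalone shape check with made-up sizes:

```python
# Standalone shape check for a tf.nn.ctc_loss call like the one above.
# Sizes are made up: B=2 utterances, T=50 frames, C=10 classes
# (index 0 is the blank), padded label length Lt=6.
import tensorflow as tf

B, T, C = 2, 50, 10
logits = tf.random.normal([B, T, C])        # [B, T, C], time not major
labels = tf.constant([[3, 1, 4, 0, 0, 0],
                      [2, 2, 5, 7, 0, 0]])  # zero-padded targets
label_length = tf.cast(tf.math.count_nonzero(labels, axis=1), tf.int32)  # [3, 4]
logit_length = tf.fill([B], T)              # valid frames per utterance

loss = tf.nn.ctc_loss(labels, logits, label_length, logit_length,
                      logits_time_major=False, blank_index=0)
print(loss.shape)  # (2,) -- one loss value per batch element
```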
criterion/total_ctc1_clap.py ADDED
@@ -0,0 +1,125 @@
+import os, sys
+import tensorflow as tf
+import numpy as np
+from tensorflow.keras.losses import Loss, MeanSquaredError
+
+seed = 42
+tf.random.set_seed(seed)
+np.random.seed(seed)
+
+def sequence_cross_entropy(speech_label, text_label, logits, reduction='sum'):
+    """
+    args
+        speech_label : [B, Ls]
+        text_label : [B, Lt]
+        logits : [B, Lt]
+        logits._keras_mask : [B, Lt]
+    """
+    # Data pre-processing
+    if tf.shape(text_label)[1] > tf.shape(speech_label)[1]:
+        speech_label = tf.pad(speech_label, [[0, 0], [0, tf.shape(text_label)[1] - tf.shape(speech_label)[1]]], 'CONSTANT', constant_values=0)
+    elif tf.shape(text_label)[1] < tf.shape(speech_label)[1]:
+        speech_label = speech_label[:, :text_label.shape[1]]
+
+    # Make paired data between text and speech phonemes
+    paired_label = tf.math.equal(text_label, speech_label)
+    paired_label = tf.cast(tf.math.logical_and(tf.cast(paired_label, tf.bool), tf.cast(logits._keras_mask, tf.bool)), tf.float32)
+    paired_label = tf.reshape(tf.ragged.boolean_mask(paired_label, tf.cast(logits._keras_mask, tf.bool)).flat_values, [-1, 1])
+    logits = tf.reshape(tf.ragged.boolean_mask(logits, tf.cast(logits._keras_mask, tf.bool)).flat_values, [-1, 1])
+
+    # Get BinaryCrossEntropy loss
+    BCE = tf.keras.losses.BinaryCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
+    loss = BCE(paired_label, logits)
+
+    if reduction == 'sum':
+        loss = tf.math.divide_no_nan(loss, tf.cast(tf.shape(logits)[0], loss.dtype))
+        loss = tf.math.multiply_no_nan(loss, tf.cast(tf.shape(speech_label)[0], loss.dtype))
+
+    return loss
+
+def matrix_loss_0(y_true, y_pred):
+    MBC_0 = tf.keras.losses.CategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
+    return MBC_0(y_true, y_pred)
+
+def matrix_loss_1(y_true, y_pred):
+    MBC_1 = tf.keras.losses.CategoricalCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
+    return MBC_1(y_true, y_pred)
+
+def detection_loss(y_true, y_pred):
+    BFC = tf.keras.losses.BinaryCrossentropy(from_logits=True, reduction=tf.keras.losses.Reduction.SUM)
+    return BFC(y_true, y_pred)
+
+def ctc_loss(affinity_matrix, speech_labels, text_labels, n_speech):
+    # logit_length
+    # n_speech = tf.math.reduce_sum(tf.cast(affinity_matrix._keras_mask, tf.float32), -1)
+
+    # logit
+    transposed_logits = tf.transpose(affinity_matrix, perm=[0, 2, 1])
+    # log_probs = tf.math.log(transposed_logits + 1e-8)
+    # logits_approx = log_probs - tf.reduce_max(log_probs, axis=-1, keepdims=True)
+
+    # label
+    matches = tf.equal(speech_labels, text_labels)
+    indices = tf.range(text_labels.shape[1], dtype=tf.int32)
+    selected_indices = tf.where(matches, indices, tf.fill(tf.shape(text_labels), 0))
+    labels = tf.where(tf.equal(text_labels, 0), text_labels, selected_indices)
+
+    # label_length
+    label_length = tf.math.count_nonzero(labels, axis=1)
+
+    # mask = tf.not_equal(labels, 0)
+    # # Apply the mask; use tf.ragged.boolean_mask to handle variable-length data
+    # labels = tf.ragged.boolean_mask(labels, mask)
+
+    ctc_loss = tf.nn.ctc_loss(labels, transposed_logits, label_length, n_speech,
+                              logits_time_major=False,
+                              unique=None,
+                              blank_index=0,
+                              name=None)
+
+    return ctc_loss
+
+class TotalLoss(Loss):
+    def __init__(self, weight=1.0):
+        super().__init__()
+        self.weight = weight
+
+    def __call__(self, y_true, y_pred, reduction='sum'):
+        LD = detection_loss(y_true, y_pred)
+        return self.weight * LD, LD
+
+class TotalLoss_SCE(Loss):
+    def __init__(self, weight=[1.0, 1.0, 0.01, 0.01]):
+        super().__init__()
+        self.weight = weight
+
+    def __call__(self, y_true, y_pred, speech_label, text_label, logit, prob, affinity_matrix, n_speech, reduction='sum'):
+        ctc = ctc_loss(affinity_matrix, speech_label, text_label, n_speech)
+
+        # Symmetric cross-entropy over 5x5 similarity blocks
+        number_1 = 5
+        number_2 = int(y_pred.shape[0] // number_1)
+        number_3 = int(y_pred.shape[0] // (number_1 * number_1))
+        y_pred_1 = tf.reshape(prob, [number_2, number_1])
+        y_true_1 = tf.reshape(y_true, [number_2, number_1])
+
+        loss_audio = matrix_loss_0(y_true_1, y_pred_1)
+        x = tf.reshape(prob, [number_3, number_1, number_1])
+        x_transposed = tf.transpose(x, perm=[0, 2, 1])
+        y_pred_2 = tf.reshape(x_transposed, [number_2, number_1])
+        y = tf.reshape(y_true, [number_3, number_1, number_1])
+        y_transposed = tf.transpose(y, perm=[0, 2, 1])
+        y_true_2 = tf.reshape(y_transposed, [number_2, number_1])
+        loss_text = matrix_loss_1(y_true_2, y_pred_2)
+        loss = 0.5 * loss_audio + 0.5 * loss_text
+
+        if self.weight[0] != 0.0:
+            LD = detection_loss(y_true, y_pred)
+        else:
+            LD = 0
+        if self.weight[1] != 0.0:
+            LC = sequence_cross_entropy(speech_label, text_label, logit, reduction=reduction)
+        else:
+            LC = 0
+        return self.weight[0] * LD + self.weight[1] * LC + self.weight[2] * ctc + self.weight[3] * loss, LD, LC
criterion/utils.py ADDED
@@ -0,0 +1,32 @@
+import numpy as np
+import sklearn.metrics
+import tensorflow as tf
+
+def compute_eer(label, pred):
+    # fpr, tpr, fnr, and threshold are all np.array lists
+    fpr, tpr, threshold = sklearn.metrics.roc_curve(label, pred)
+    fnr = 1 - tpr
+
+    # the threshold where fnr == fpr
+    eer_threshold = threshold[np.nanargmin(np.absolute(fnr - fpr))]
+
+    # theoretically the EER from fpr and the EER from fnr should be
+    # identical, but they can differ slightly in practice
+    eer_1 = fpr[np.nanargmin(np.absolute(fnr - fpr))]
+    eer_2 = fnr[np.nanargmin(np.absolute(fnr - fpr))]
+
+    # return the mean of the EER from fpr and from fnr
+    eer = (eer_1 + eer_2) / 2
+    return eer
+
+class eer(tf.keras.metrics.Metric):
+    def __init__(self, name='equal_error_rate', **kwargs):
+        super(eer, self).__init__(name=name, **kwargs)
+        self.score = self.add_weight(name='eer', initializer='zeros')
+        self.count = self.add_weight(name='count', initializer='zeros')
+
+    def update_state(self, y_true, y_pred):
+        self.score.assign_add(tf.reduce_sum(tf.py_function(func=compute_eer, inp=[y_true, y_pred], Tout=tf.float32, name='compute_eer')))
+        self.count.assign_add(1)
+
+    def result(self):
+        return tf.math.divide_no_nan(self.score, self.count)
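
A quick self-contained check of `compute_eer` on invented scores: one positive (0.4) scores below one negative (0.6), so the false-positive and false-negative rates cross at about one third:

```python
# Toy usage of compute_eer; labels and scores are invented.
import numpy as np

labels = np.array([1, 1, 1, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.6, 0.3, 0.1])
print(compute_eer(labels, scores))  # ~0.33
```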
dataset/__pycache__/dataloader_demo.cpython-37.pyc ADDED
Binary file (7.73 kB)
 
dataset/__pycache__/dataloader_infe.cpython-37.pyc ADDED
Binary file (6.73 kB)
 
dataset/__pycache__/google.cpython-37.pyc ADDED
Binary file (8.57 kB)
 
dataset/__pycache__/google_infe202405.cpython-37.pyc ADDED
Binary file (8.64 kB)
 
dataset/__pycache__/libriphrase.cpython-37.pyc ADDED
Binary file (13.6 kB)
 
dataset/__pycache__/libriphrase_ctc1.cpython-37.pyc ADDED
Binary file (14.3 kB)
 
dataset/__pycache__/qualcomm.cpython-37.pyc ADDED
Binary file (8.06 kB)
 
dataset/dataloader_demo.py ADDED
@@ -0,0 +1,182 @@
+import math, os, re, sys
+from pathlib import Path
+import numpy as np
+import pandas as pd
+from multiprocessing import Pool
+from scipy.io import wavfile
+import tensorflow as tf
+from pydub import AudioSegment
+from tensorflow.keras.utils import Sequence, OrderedEnqueuer
+from tensorflow.keras import layers
+from tensorflow.keras.preprocessing.sequence import pad_sequences
+
+sys.path.append(os.path.dirname(__file__))
+from g2p.g2p_en.g2p import G2p
+
+import warnings
+warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
+np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
+
+class GoogleCommandsDataloader(Sequence):
+    def __init__(self,
+                 batch_size,
+                 fs=16000,
+                 keyword=['realtek go', 'ok google', 'vintage', 'hackney', 'crocodile', 'surroundings', 'oversaw', 'northwestern'],
+                 wav_path_or_object='/share/nas165/yiting/recording/ok_google/Default_20240725-183008.wav',
+                 features='g2p_embed',  # phoneme, g2p_embed, both ...
+                 ):
+
+        phonemes = ["<pad>", ] + ['AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', 'AH0', 'AH1', 'AH2', 'AO0',
+                                  'AO1', 'AO2', 'AW0', 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'B', 'CH',
+                                  'D', 'DH', 'EH0', 'EH1', 'EH2', 'ER0', 'ER1', 'ER2', 'EY0', 'EY1',
+                                  'EY2', 'F', 'G', 'HH', 'IH0', 'IH1', 'IH2', 'IY0', 'IY1', 'IY2',
+                                  'JH', 'K', 'L', 'M', 'N', 'NG', 'OW0', 'OW1', 'OW2', 'OY0',
+                                  'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH0', 'UH1',
+                                  'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH',
+                                  ' ']
+
+        self.p2idx = {p: idx for idx, p in enumerate(phonemes)}
+        self.idx2p = {idx: p for idx, p in enumerate(phonemes)}
+
+        self.batch_size = batch_size
+        self.fs = fs
+        self.features = features
+        self.nPhoneme = len(phonemes)
+        self.g2p = G2p()
+        self.keyword = keyword
+        self.wav = wav_path_or_object
+        self.__prep__()
+        self.on_epoch_end()
+
+    def __prep__(self):
+        self.data = pd.DataFrame(columns=['wav', 'text', 'duration', 'label'])
+        anchor = ' '
+        target_dict = {}
+        if isinstance(self.wav, str):
+            anchor = self.wav.split('/')[-2].lower().replace('_', ' ')
+            duration = float(wavfile.read(self.wav)[1].shape[-1]) / self.fs
+        else:
+            duration = float(self.wav[1].shape[-1]) / self.fs
+
+        for i, comparison_text in enumerate(self.keyword):
+            label = 1 if comparison_text == anchor else 0
+            target_dict[i] = {
+                'wav': self.wav,
+                'text': comparison_text,
+                'duration': duration,
+                'label': label
+            }
+
+        print(target_dict)
+        self.data = self.data.append(pd.DataFrame.from_dict(target_dict, 'index'), ignore_index=True)
+        print(self.data)
+        # g2p & p2idx by g2p_en package
+        print(">> Convert word to phoneme")
+        self.data['phoneme'] = self.data['text'].apply(lambda x: self.g2p(re.sub(r"[^a-zA-Z0-9]+", ' ', x)))
+        print(">> Convert phoneme to index")
+        self.data['pIndex'] = self.data['phoneme'].apply(lambda x: [self.p2idx[t] for t in x])
+        print(">> Compute phoneme embedding")
+        self.data['g2p_embed'] = self.data['text'].apply(lambda x: self.g2p.embedding(x))
+
+        # Get longest data
+        self.wav_list = self.data['wav'].values
+        self.idx_list = self.data['pIndex'].values
+        self.emb_list = self.data['g2p_embed'].values
+        self.lab_list = self.data['label'].values
+        self.data = self.data.sort_values(by='duration').reset_index(drop=True)
+
+        # Set dataloader params.
+        self.len = len(self.data)
+        self.maxlen_t = int((int(self.data['text'].apply(lambda x: len(x)).max() / 10) + 1) * 10)
+        self.maxlen_a = 56000
+
+    def __len__(self):
+        # return total batch-wise length
+        return math.ceil(self.len / self.batch_size)
+
+    def _load_wav(self, wav):
+        return np.array(wavfile.read(wav)[1]).astype(np.float32) / 32768.0
+
+    def __getitem__(self, idx):
+        # chunking
+        indices = self.indices[idx * self.batch_size : (idx + 1) * self.batch_size]
+
+        # load inputs
+        if isinstance(self.wav, str):
+            batch_x = [np.array(wavfile.read(self.wav_list[i])[1]).astype(np.float32) / 32768.0 for i in indices]
+        else:
+            batch_x = [np.array((self.wav_list[i])[1]).astype(np.float32) / 32768.0 for i in indices]
+        if self.features == 'both':
+            batch_p = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
+            batch_e = [np.array(self.emb_list[i]).astype(np.float32) for i in indices]
+        else:
+            if self.features == 'phoneme':
+                batch_y = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
+            elif self.features == 'g2p_embed':
+                batch_y = [np.array(self.emb_list[i]).astype(np.float32) for i in indices]
+        # load outputs
+        batch_z = [np.array([self.lab_list[i]]).astype(np.float32) for i in indices]
+
+        # padding and masking
+        pad_batch_x = pad_sequences(np.array(batch_x), maxlen=self.maxlen_a, value=0.0, padding='post', dtype=batch_x[0].dtype)
+        if self.features == 'both':
+            pad_batch_p = pad_sequences(np.array(batch_p), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_p[0].dtype)
+            pad_batch_e = pad_sequences(np.array(batch_e), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_e[0].dtype)
+        else:
+            pad_batch_y = pad_sequences(np.array(batch_y), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_y[0].dtype)
+        pad_batch_z = pad_sequences(np.array(batch_z), value=0.0, padding='post', dtype=batch_z[0].dtype)
+
+        if self.features == 'both':
+            return pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z
+        else:
+            return pad_batch_x, pad_batch_y, pad_batch_z
+
+    def on_epoch_end(self):
+        self.indices = np.arange(self.len)
+        # if self.shuffle == True:
+        #     np.random.shuffle(self.indices)
+
+def convert_sequence_to_dataset(dataloader):
+    def data_generator():
+        for i in range(dataloader.__len__()):
+            if dataloader.features == 'both':
+                pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z = dataloader[i]
+                yield pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z
+            else:
+                pad_batch_x, pad_batch_y, pad_batch_z = dataloader[i]
+                yield pad_batch_x, pad_batch_y, pad_batch_z
+
+    if dataloader.features == 'both':
+        data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
+            tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
+            tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),
+            tf.TensorSpec(shape=(None, dataloader.maxlen_t, 256), dtype=tf.float32),
+            tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
+        )
+    else:
+        data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
+            tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
+            tf.TensorSpec(shape=(None, dataloader.maxlen_t) if dataloader.features == 'phoneme' else (None, dataloader.maxlen_t, 256),
+                          dtype=tf.int32 if dataloader.features == 'phoneme' else tf.float32),
+            tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
+        )
+    # data_dataset = data_dataset.cache()
+    data_dataset = data_dataset.prefetch(1)
+
+    return data_dataset
+
+if __name__ == '__main__':
+    dataloader = GoogleCommandsDataloader(2048, features='g2p_embed')
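
The batching logic above hinges on `pad_sequences` zero-padding variable-length phoneme index lists to `maxlen_t` (and waveforms to `maxlen_a = 56000` samples). A toy trace of the phoneme branch (index values invented):

```python
# Sketch of the padding step in __getitem__; index values are invented.
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

batch_p = [np.array([42, 5, 33], dtype=np.int32),   # keyword 1
           np.array([7, 19], dtype=np.int32)]       # keyword 2
pad_batch_p = pad_sequences(batch_p, maxlen=10, value=0.0,
                            padding='post', dtype='int32')
print(pad_batch_p)
# [[42  5 33  0  0  0  0  0  0  0]
#  [ 7 19  0  0  0  0  0  0  0  0]]
```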
dataset/dataloader_infe.py ADDED
@@ -0,0 +1,164 @@
+import math, os, re, sys
+from pathlib import Path
+import numpy as np
+import pandas as pd
+from multiprocessing import Pool
+from scipy.io import wavfile
+import tensorflow as tf
+
+from tensorflow.keras.utils import Sequence, OrderedEnqueuer
+from tensorflow.keras import layers
+from tensorflow.keras.preprocessing.sequence import pad_sequences
+
+sys.path.append(os.path.dirname(__file__))
+from g2p.g2p_en.g2p import G2p
+
+import warnings
+warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
+np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
+
+def dataloader(fs=16000, keyword='', wav_path_or_object=None, g2p=None,
+               features='both'  # phoneme, g2p_embed, both ...
+               ):
+
+    phonemes = ["<pad>", ] + ['AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', 'AH0', 'AH1', 'AH2', 'AO0',
+                              'AO1', 'AO2', 'AW0', 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'B', 'CH',
+                              'D', 'DH', 'EH0', 'EH1', 'EH2', 'ER0', 'ER1', 'ER2', 'EY0', 'EY1',
+                              'EY2', 'F', 'G', 'HH', 'IH0', 'IH1', 'IH2', 'IY0', 'IY1', 'IY2',
+                              'JH', 'K', 'L', 'M', 'N', 'NG', 'OW0', 'OW1', 'OW2', 'OY0',
+                              'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH0', 'UH1',
+                              'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH',
+                              ' ']
+
+    p2idx = {p: idx for idx, p in enumerate(phonemes)}
+    idx2p = {idx: p for idx, p in enumerate(phonemes)}
+
+    # g2p = G2p()
+
+    data = pd.DataFrame(columns=['wav', 'wav_label', 'text', 'duration', 'label'])
+
+    target_dict = {}
+    idx = 0
+
+    wav = wav_path_or_object
+    if isinstance(wav_path_or_object, str):
+        duration = float(wavfile.read(wav)[1].shape[-1]) / fs
+    else:
+        duration = float(wav_path_or_object.shape[-1]) / fs
+    label = 1
+    anchor_text = wav.split('/')[-2].lower()
+    target_dict[idx] = {
+        'wav': wav,
+        'wav_label': anchor_text,
+        'text': keyword,
+        'duration': duration,
+        'label': label
+    }
+    data = data.append(pd.DataFrame.from_dict(target_dict, 'index'), ignore_index=True)
+
+    # g2p & p2idx by g2p_en package
+    data['phoneme'] = data['text'].apply(lambda x: g2p(re.sub(r"[^a-zA-Z0-9]+", ' ', x)))
+    data['pIndex'] = data['phoneme'].apply(lambda x: [p2idx[t] for t in x])
+    data['g2p_embed'] = data['text'].apply(lambda x: g2p.embedding(x))
+    data['wav_phoneme'] = data['wav_label'].apply(lambda x: g2p(re.sub(r"[^a-zA-Z0-9]+", ' ', x)))
+    data['wav_pIndex'] = data['wav_phoneme'].apply(lambda x: [p2idx[t] for t in x])
+
+    # Get longest data
+    data = data.sort_values(by='duration').reset_index(drop=True)
+    wav_list = data['wav'].values
+    idx_list = data['pIndex'].values
+    emb_list = data['g2p_embed'].values
+    lab_list = data['label'].values
+    sIdx_list = data['wav_pIndex'].values
+
+    # Set dataloader params.
+    maxlen_t = int((int(data['text'].apply(lambda x: len(x)).max() / 10) + 1) * 10)
+    maxlen_a = int((int(data['duration'].values[-1] / 0.5) + 1) * fs / 2)
+    maxlen_l = int((int(data['wav_label'].apply(lambda x: len(x)).max() / 10) + 1) * 10)
+    indices = [0]
+
+    # load inputs
+    if isinstance(wav_path_or_object, str):
+        batch_x = [np.array(wavfile.read(wav_list[i])[1]).astype(np.float32) / 32768.0 for i in indices]
+    else:
+        batch_x = [wav_list[i] / 32768.0 for i in indices]
+    if features == 'both':
+        batch_p = [np.array(idx_list[i]).astype(np.int32) for i in indices]
+        batch_e = [np.array(emb_list[i]).astype(np.float32) for i in indices]
+    else:
+        if features == 'phoneme':
+            batch_y = [np.array(idx_list[i]).astype(np.int32) for i in indices]
+        elif features == 'g2p_embed':
+            batch_y = [np.array(emb_list[i]).astype(np.float32) for i in indices]
+    # load outputs
+    batch_z = [np.array([lab_list[i]]).astype(np.float32) for i in indices]
+    batch_l = [np.array(sIdx_list[i]).astype(np.int32) for i in indices]
+
+    # padding and masking
+    pad_batch_x = pad_sequences(np.array(batch_x), maxlen=maxlen_a, value=0.0, padding='post', dtype=batch_x[0].dtype)
+    if features == 'both':
+        pad_batch_p = pad_sequences(np.array(batch_p), maxlen=maxlen_t, value=0.0, padding='post', dtype=batch_p[0].dtype)
+        pad_batch_e = pad_sequences(np.array(batch_e), maxlen=maxlen_t, value=0.0, padding='post', dtype=batch_e[0].dtype)
+    else:
+        pad_batch_y = pad_sequences(np.array(batch_y), maxlen=maxlen_t, value=0.0, padding='post', dtype=batch_y[0].dtype)
+    pad_batch_z = pad_sequences(np.array(batch_z), value=0.0, padding='post', dtype=batch_z[0].dtype)
+    pad_batch_l = pad_sequences(np.array(batch_l), maxlen=maxlen_l, value=0.0, padding='post', dtype=batch_l[0].dtype)
+
+    if features == 'both':
+        return pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z, batch_l
+    else:
+        return pad_batch_x, pad_batch_y, pad_batch_z, batch_l
+
+# def _load_wav(self, wav):
+#     return np.array(wavfile.read(wav)[1]).astype(np.float32) / 32768.0
+
+def convert_sequence_to_dataset(dataloader, wav, text, features):
+    fs = 16000
+    duration = float(wavfile.read(wav)[1].shape[-1]) / fs
+    maxlen_t = int((int(len(text) / 10) + 1) * 10)
+    maxlen_a = int((int(duration / 0.5) + 1) * fs / 2)
+    wav_label = wav.split('/')[-2].lower()
+
+    def data_generator():
+        if features == 'both':
+            pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_l = dataloader
+            yield pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_l
+        else:
+            pad_batch_x, pad_batch_y, pad_batch_z, pad_batch_l = dataloader
+            yield pad_batch_x, pad_batch_y, pad_batch_z, pad_batch_l
+
+    if features == 'both':
+        data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
+            tf.TensorSpec(shape=(None, maxlen_a), dtype=tf.float32),
+            tf.TensorSpec(shape=(None, maxlen_t), dtype=tf.int32),
+            tf.TensorSpec(shape=(None, maxlen_t, 256), dtype=tf.float32),
+            tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
+            tf.TensorSpec(shape=(None, None), dtype=tf.int32),)
+        )
+    else:
+        data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
+            tf.TensorSpec(shape=(None, maxlen_a), dtype=tf.float32),
+            tf.TensorSpec(shape=(None, maxlen_t) if features == 'phoneme' else (None, maxlen_t, 256),
+                          dtype=tf.int32 if features == 'phoneme' else tf.float32),
+            tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
+            tf.TensorSpec(shape=(None, None), dtype=tf.int32),)
+        )
+    # data_dataset = data_dataset.cache()
+    data_dataset = data_dataset.prefetch(1)
+
+    return data_dataset
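
The `maxlen_a` arithmetic above rounds the clip duration up to the next half-second and converts it to samples; a quick standalone check with a made-up duration:

```python
# Standalone check of the maxlen_a arithmetic; duration is made up.
fs = 16000
duration = 2.3  # seconds
maxlen_a = int((int(duration / 0.5) + 1) * fs / 2)
print(maxlen_a)  # 40000 samples, i.e. 2.5 s at 16 kHz
```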
dataset/g2p/LICENSE.txt ADDED
@@ -0,0 +1,201 @@
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "{}"
+      replaced with your own identifying information. (Don't include
+      the brackets!) The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright {yyyy} {name of copyright owner}
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
dataset/g2p/g2p_en/__init__.py ADDED
@@ -0,0 +1 @@
+from .g2p import G2p
dataset/g2p/g2p_en/__pycache__/__init__.cpython-37.pyc ADDED
Binary file (186 Bytes)
 
dataset/g2p/g2p_en/__pycache__/expand.cpython-37.pyc ADDED
Binary file (2.39 kB)
 
dataset/g2p/g2p_en/__pycache__/g2p.cpython-37.pyc ADDED
Binary file (8.05 kB)
 
dataset/g2p/g2p_en/checkpoint20.npz ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b8af35e4596d8dd5836dfd3fe9b2ba4f97b9c311efe8879544cbcfcbd566d8c6
+size 3342298
dataset/g2p/g2p_en/expand.py ADDED
@@ -0,0 +1,79 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # -*- coding: utf-8 -*-
2
+ #/usr/bin/python2
3
+ '''
4
+ Borrowed
5
+ from https://github.com/keithito/tacotron/blob/master/text/numbers.py
6
+ By kyubyong park. [email protected].
7
+ https://www.github.com/kyubyong/g2p
8
+ '''
9
+ from __future__ import print_function
10
+ import inflect
11
+ import re
12
+
13
+
14
+
15
+ _inflect = inflect.engine()
16
+ _comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])')
17
+ _decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)')
18
+ _pounds_re = re.compile(r'£([0-9\,]*[0-9]+)')
19
+ _dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)')
20
+ _ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)')
21
+ _number_re = re.compile(r'[0-9]+')
22
+
23
+
24
+ def _remove_commas(m):
25
+ return m.group(1).replace(',', '')
26
+
27
+
28
+ def _expand_decimal_point(m):
29
+ return m.group(1).replace('.', ' point ')
30
+
31
+
32
+ def _expand_dollars(m):
33
+ match = m.group(1)
34
+ parts = match.split('.')
35
+ if len(parts) > 2:
36
+ return match + ' dollars' # Unexpected format
37
+ dollars = int(parts[0]) if parts[0] else 0
38
+ cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0
39
+ if dollars and cents:
40
+ dollar_unit = 'dollar' if dollars == 1 else 'dollars'
41
+ cent_unit = 'cent' if cents == 1 else 'cents'
42
+ return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit)
43
+ elif dollars:
44
+ dollar_unit = 'dollar' if dollars == 1 else 'dollars'
45
+ return '%s %s' % (dollars, dollar_unit)
46
+ elif cents:
47
+ cent_unit = 'cent' if cents == 1 else 'cents'
48
+ return '%s %s' % (cents, cent_unit)
49
+ else:
50
+ return 'zero dollars'
51
+
52
+
53
+ def _expand_ordinal(m):
54
+ return _inflect.number_to_words(m.group(0))
55
+
56
+
57
+ def _expand_number(m):
58
+ num = int(m.group(0))
59
+ if num > 1000 and num < 3000:
60
+ if num == 2000:
61
+ return 'two thousand'
62
+ elif num > 2000 and num < 2010:
63
+ return 'two thousand ' + _inflect.number_to_words(num % 100)
64
+ elif num % 100 == 0:
65
+ return _inflect.number_to_words(num // 100) + ' hundred'
66
+ else:
67
+ return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ')
68
+ else:
69
+ return _inflect.number_to_words(num, andword='')
70
+
71
+
72
+ def normalize_numbers(text):
73
+ text = re.sub(_comma_number_re, _remove_commas, text)
74
+ text = re.sub(_pounds_re, r'\1 pounds', text)
75
+ text = re.sub(_dollars_re, _expand_dollars, text)
76
+ text = re.sub(_decimal_number_re, _expand_decimal_point, text)
77
+ text = re.sub(_ordinal_re, _expand_ordinal, text)
78
+ text = re.sub(_number_re, _expand_number, text)
79
+ return text
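A minimal usage sketch for normalize_numbers (illustrative only; the expected outputs are inferred from the regex rules above, not taken from the repository):

    # Hypothetical usage sketch -- not part of the committed file.
    from expand import normalize_numbers

    print(normalize_numbers("I have $250 in my pocket."))  # I have two hundred fifty dollars in my pocket.
    print(normalize_numbers("born in 2008"))               # born in two thousand eight
    print(normalize_numbers("the 3rd of 1,000 entries"))   # the third of one thousand entries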
dataset/g2p/g2p_en/g2p.py ADDED
@@ -0,0 +1,249 @@
1
+ # -*- coding: utf-8 -*-
2
+ # /usr/bin/python
3
+ '''
4
+ By kyubyong park([email protected]) and Jongseok Kim(https://github.com/ozmig77)
5
+ https://www.github.com/kyubyong/g2p
6
+ '''
7
+ from nltk import pos_tag
8
+ from nltk.corpus import cmudict
9
+ import nltk
10
+ from nltk.tokenize import TweetTokenizer
11
+ word_tokenize = TweetTokenizer().tokenize
12
+ import numpy as np
13
+ import codecs
14
+ import re
15
+ import os, sys
16
+ import unicodedata
17
+ from builtins import str as unicode
18
+
19
+ sys.path.append(os.path.dirname(__file__))
20
+ from expand import normalize_numbers
21
+
22
+ try:
23
+ nltk.data.find('taggers/averaged_perceptron_tagger.zip')
24
+ except LookupError:
25
+ nltk.download('averaged_perceptron_tagger')
26
+ try:
27
+ nltk.data.find('corpora/cmudict.zip')
28
+ except LookupError:
29
+ nltk.download('cmudict')
30
+
31
+ dirname = os.path.dirname(__file__)
32
+
33
+ def construct_homograph_dictionary():
34
+ f = os.path.join(dirname,'homographs.en')
35
+ homograph2features = dict()
36
+ for line in codecs.open(f, 'r', 'utf8').read().splitlines():
37
+ if line.startswith("#"): continue # comment
38
+ headword, pron1, pron2, pos1 = line.strip().split("|")
39
+ homograph2features[headword.lower()] = (pron1.split(), pron2.split(), pos1)
40
+ return homograph2features
41
+
42
+ # def segment(text):
43
+ # '''
44
+ # Splits text into `tokens`.
45
+ # :param text: A string.
46
+ # :return: A list of tokens (string).
47
+ # '''
48
+ # print(text)
49
+ # text = re.sub('([.,?!]( |$))', r' \1', text)
50
+ # print(text)
51
+ # return text.split()
52
+
53
+ class G2p(object):
54
+ def __init__(self):
55
+ super().__init__()
56
+ self.graphemes = ["<pad>", "<unk>", "</s>"] + list("abcdefghijklmnopqrstuvwxyz")
57
+ self.phonemes = ["<pad>", "<unk>", "<s>", "</s>"] + ['AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', 'AH0', 'AH1', 'AH2', 'AO0',
58
+ 'AO1', 'AO2', 'AW0', 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'B', 'CH', 'D', 'DH',
59
+ 'EH0', 'EH1', 'EH2', 'ER0', 'ER1', 'ER2', 'EY0', 'EY1',
60
+ 'EY2', 'F', 'G', 'HH',
61
+ 'IH0', 'IH1', 'IH2', 'IY0', 'IY1', 'IY2', 'JH', 'K', 'L',
62
+ 'M', 'N', 'NG', 'OW0', 'OW1',
63
+ 'OW2', 'OY0', 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH',
64
+ 'UH0', 'UH1', 'UH2', 'UW',
65
+ 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH']
66
+ self.g2idx = {g: idx for idx, g in enumerate(self.graphemes)}
67
+ self.idx2g = {idx: g for idx, g in enumerate(self.graphemes)}
68
+
69
+ self.p2idx = {p: idx for idx, p in enumerate(self.phonemes)}
70
+ self.idx2p = {idx: p for idx, p in enumerate(self.phonemes)}
71
+
72
+ self.cmu = cmudict.dict()
73
+ self.load_variables()
74
+ self.homograph2features = construct_homograph_dictionary()
75
+
76
+ def load_variables(self):
77
+ self.variables = np.load(os.path.join(dirname,'checkpoint20.npz'))
78
+ self.enc_emb = self.variables["enc_emb"] # (29, 64). (len(graphemes), emb)
79
+ self.enc_w_ih = self.variables["enc_w_ih"] # (3*128, 64)
80
+ self.enc_w_hh = self.variables["enc_w_hh"] # (3*128, 128)
81
+ self.enc_b_ih = self.variables["enc_b_ih"] # (3*128,)
82
+ self.enc_b_hh = self.variables["enc_b_hh"] # (3*128,)
83
+
84
+ self.dec_emb = self.variables["dec_emb"] # (74, 64). (len(phonemes), emb)
85
+ self.dec_w_ih = self.variables["dec_w_ih"] # (3*128, 64)
86
+ self.dec_w_hh = self.variables["dec_w_hh"] # (3*128, 128)
87
+ self.dec_b_ih = self.variables["dec_b_ih"] # (3*128,)
88
+ self.dec_b_hh = self.variables["dec_b_hh"] # (3*128,)
89
+ self.fc_w = self.variables["fc_w"] # (74, 128)
90
+ self.fc_b = self.variables["fc_b"] # (74,)
91
+
92
+ def sigmoid(self, x):
93
+ return 1 / (1 + np.exp(-x))
94
+
95
+ def grucell(self, x, h, w_ih, w_hh, b_ih, b_hh):
96
+ rzn_ih = np.matmul(x, w_ih.T) + b_ih
97
+ rzn_hh = np.matmul(h, w_hh.T) + b_hh
98
+
99
+ rz_ih, n_ih = rzn_ih[:, :rzn_ih.shape[-1] * 2 // 3], rzn_ih[:, rzn_ih.shape[-1] * 2 // 3:]
100
+ rz_hh, n_hh = rzn_hh[:, :rzn_hh.shape[-1] * 2 // 3], rzn_hh[:, rzn_hh.shape[-1] * 2 // 3:]
101
+
102
+ rz = self.sigmoid(rz_ih + rz_hh)
103
+ r, z = np.split(rz, 2, -1)
104
+
105
+ n = np.tanh(n_ih + r * n_hh)
106
+ h = (1 - z) * n + z * h
107
+
108
+ return h
109
+
110
+ def gru(self, x, steps, w_ih, w_hh, b_ih, b_hh, h0=None):
111
+ if h0 is None:
112
+ h0 = np.zeros((x.shape[0], w_hh.shape[1]), np.float32)
113
+ h = h0 # initial hidden state
114
+ outputs = np.zeros((x.shape[0], steps, w_hh.shape[1]), np.float32)
115
+ for t in range(steps):
116
+ h = self.grucell(x[:, t, :], h, w_ih, w_hh, b_ih, b_hh) # (b, h)
117
+ outputs[:, t, ::] = h
118
+ return outputs
119
+
120
+ def encode(self, word):
121
+ chars = list(word) + ["</s>"]
122
+ x = [self.g2idx.get(char, self.g2idx["<unk>"]) for char in chars]
123
+ x = np.take(self.enc_emb, np.expand_dims(x, 0), axis=0)
124
+
125
+ return x
126
+
127
+ def predict(self, word):
128
+ # encoder
129
+ enc = self.encode(word)
130
+ enc = self.gru(enc, len(word) + 1, self.enc_w_ih, self.enc_w_hh,
131
+ self.enc_b_ih, self.enc_b_hh, h0=np.zeros((1, self.enc_w_hh.shape[-1]), np.float32))
132
+ last_hidden = enc[:, -1, :]
133
+
134
+ # decoder
135
+ dec = np.take(self.dec_emb, [2], axis=0) # 2: <s>
136
+ h = last_hidden
137
+
138
+ preds = []
139
+ for i in range(20):
140
+ h = self.grucell(dec, h, self.dec_w_ih, self.dec_w_hh, self.dec_b_ih, self.dec_b_hh) # (b, h)
141
+ logits = np.matmul(h, self.fc_w.T) + self.fc_b
142
+ pred = logits.argmax()
143
+ if pred == 3: break # 3: </s>
144
+ preds.append(pred)
145
+ dec = np.take(self.dec_emb, [pred], axis=0)
146
+
147
+ preds = [self.idx2p.get(idx, "<unk>") for idx in preds]
148
+
149
+ return preds
150
+
151
+ def __call__(self, text):
152
+ # preprocessing
153
+ text = unicode(text)
154
+ text = normalize_numbers(text)
155
+ text = ''.join(char for char in unicodedata.normalize('NFD', text)
156
+ if unicodedata.category(char) != 'Mn') # Strip accents
157
+ text = text.lower()
158
+ text = text.replace("_", " ")
159
+ text = re.sub(r"[^ a-z'.,?!\-]", "", text)
160
+ text = text.replace("i.e.", "that is")
161
+ text = text.replace("e.g.", "for example")
162
+
163
+ # tokenization
164
+ words = word_tokenize(text)
165
+ tokens = pos_tag(words) # tuples of (word, tag)
166
+
167
+ # steps
168
+ prons = []
169
+ for word in words:
170
+ if re.search("[a-z]", word) is None:
171
+ continue
172
+
173
+ # elif word in self.homograph2features: # Check homograph
174
+ # pron1, pron2, pos1 = self.homograph2features[word]
175
+ # if pos.startswith(pos1):
176
+ # pron = pron1
177
+ # else:
178
+ # pron = pron2
179
+ # elif word in self.cmu: # lookup CMU dict
180
+ # pron = self.cmu[word][0]
181
+ # else: # predict for oov
182
+
183
+ pron = self.predict(word)
184
+
185
+ prons.extend(pron)
186
+ prons.extend([" "])
187
+
188
+ return prons[:-1]
189
+
190
+ def embedding(self, text):
191
+ # preprocessing
192
+ text = unicode(text)
193
+ text = normalize_numbers(text)
194
+ text = ''.join(char for char in unicodedata.normalize('NFD', text)
195
+ if unicodedata.category(char) != 'Mn') # Strip accents
196
+ text = text.lower()
197
+ text = re.sub(r"[^ a-z'.,?!\-]", "", text)
198
+ text = text.replace("i.e.", "that is")
199
+ text = text.replace("e.g.", "for example")
200
+
201
+ # tokenization
202
+ words = word_tokenize(text)
203
+
204
+ # embedding func.
205
+ def _get(self, word):
206
+ # encoder
207
+ enc = self.encode(word)
208
+ enc = self.gru(enc, len(word) + 1, self.enc_w_ih, self.enc_w_hh,
209
+ self.enc_b_ih, self.enc_b_hh, h0=np.zeros((1, self.enc_w_hh.shape[-1]), np.float32))
210
+ last_hidden = enc[:, -1, :]
211
+
212
+ # decoder
213
+ dec = np.take(self.dec_emb, [2], axis=0) # 2: <s>
214
+ h = last_hidden
215
+
216
+ preds = []
217
+ emb = np.empty((0, self.dec_emb[0,:].shape[-1]))
218
+ for i in range(20):
219
+ h = self.grucell(dec, h, self.dec_w_ih, self.dec_w_hh, self.dec_b_ih, self.dec_b_hh) # (b, h)
220
+ logits = np.matmul(h, self.fc_w.T) + self.fc_b
221
+ pred = logits.argmax()
222
+ if pred == 3: break # 3: </s>
223
+ dec = np.take(self.dec_emb, [pred], axis=0)
224
+ emb = np.append(emb, h, axis=0)
225
+
226
+ return emb
227
+
228
+ # steps
229
+ embed = np.empty((0, self.dec_emb[0,:].shape[-1]))
230
+ for word in words:
231
+ if re.search("[a-z]", word) is None:
232
+ continue
233
+ embed = np.append(embed, _get(self, word), axis=0)
234
+ embed = np.append(embed, np.take(self.dec_emb, [0], axis=0), axis=0)
235
+
236
+ return embed[:-1,:]
237
+
238
+ if __name__ == '__main__':
239
+ texts = ['yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go', 'hey_android', 'hey_snapdragon', 'hi_galaxy', 'hi_lumina']
240
+ # "I have $250 in my pocket.", # number -> spell-out
241
+ # "popular pets, e.g. cats and dogs", # e.g. -> for example
242
+ # "I refuse to collect the refuse around here.", # homograph
243
+ # "I'm an activationist."] # newly coined word
244
+ g2p = G2p()
245
+ for text in texts:
246
+ out = g2p(text)
247
+ emb = g2p.embedding(text)
248
+ print(out)
249
+ print(emb.shape)
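For reference, grucell above is a plain-NumPy GRU step: the reset/update gates are r, z = sigmoid(x W_ih^T + b_ih + h W_hh^T + b_hh) (the first two thirds of the gate rows), the candidate is n = tanh(n_ih + r * n_hh), and the new state is h' = (1 - z) * n + z * h. A minimal usage sketch for the class (output values are illustrative; note that every word goes through predict(), since the homograph and CMU-dictionary branches are commented out):

    # Hypothetical usage sketch -- not part of the committed file.
    from g2p import G2p

    g2p = G2p()
    print(g2p("hey snapdragon"))           # e.g. ['HH', 'EY1', ' ', 'S', 'N', 'AE1', 'P', ...]
    emb = g2p.embedding("hey snapdragon")
    print(emb.shape)                       # (steps, D): one decoder hidden state per emitted phoneme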
dataset/g2p/g2p_en/homographs.en ADDED
@@ -0,0 +1,379 @@
1
+ #This is based on http://www.minpairs.talktalk.net/graph.html
2
+ #Each line is formatted as follows:
3
+ #HEADWORD|PRONUNCIATION1|PRONUNCIATION2|POS
4
+ #HEADWORD should have PRONUNCIATION1 only if its part-of-speech is POS
5
+ #Otherwise PRONUNCIATION2 is applied
6
+ #May, 2018
7
+ #Kyubyong Park
8
+ #https://github.com/kyubyong/g2p
9
+ ABSENT|AH1 B S AE1 N T|AE1 B S AH0 N T|V
10
+ ABSTRACT|AE0 B S T R AE1 K T|AE1 B S T R AE2 K T|V
11
+ ABSTRACTS|AE0 B S T R AE1 K T S|AE1 B S T R AE0 K T S|V
12
+ ABUSE|AH0 B Y UW1 Z|AH0 B Y UW1 S|V
13
+ ABUSES|AH0 B Y UW1 Z IH0 Z|AH0 B Y UW1 S IH0 Z|V
14
+ ACCENT|AH0 K S EH1 N T|AE1 K S EH2 N T|V
15
+ ACCENTS|AE1 K S EH0 N T S|AE1 K S EH0 N T S|V
16
+ ADDICT|AH0 D IH1 K T|AE1 D IH2 K T|V
17
+ ADDICTS|AH0 D IH1 K T S|AE1 D IH2 K T S|V
18
+ ADVOCATE|AE1 D V AH0 K EY2 T|AE1 D V AH0 K AH0 T|V
19
+ ADVOCATES|AE1 D V AH0 K EY2 T S|AE1 D V AH0 K AH0 T S|V
20
+ AFFECT|AH0 F EH1 K T|AE1 F EH0 K T|V
21
+ AFFECTS|AH0 F EH1 K T S|AE1 F EH0 K T S|V
22
+ AFFIX|AH0 F IH1 K S|AE1 F IH0 K S|V
23
+ AFFIXES|AH0 F IH1 K S IH0 Z|AE1 F IH0 K S IH0 Z|V
24
+ AGGLOMERATE|AH0 G L AA1 M ER0 EY2 T|AH0 G L AA1 M ER0 AH0 T|V
25
+ AGGREGATE|AE1 G R AH0 G EY0 T|AE1 G R AH0 G AH0 T|V
26
+ AGGREGATES|AE1 G R AH0 G EY2 T S|AE1 G R AH0 G IH0 T S|V
27
+ ALLIES|AH0 L AY1 Z|AE1 L AY0 Z|V
28
+ ALLOY|AH0 L OY1|AE1 L OY2|V
29
+ ALLOYS|AH0 L OY1 Z|AE1 L OY2 Z|V
30
+ ALLY|AH0 L AY1|AE1 L AY0|V
31
+ ALTERNATE|AO1 L T ER0 N EY2 T|AO0 L T ER1 N AH0 T|V
32
+ ANALYSES|AH0 N AE1 L IH0 S IY2 Z|AE1 N AH0 L AY0 Z IH2 Z|V
33
+ ANIMATE|AE1 N AH0 M EY2 T|AE1 N AH0 M AH0 T|V
34
+ ANNEX|AH0 N EH1 K S|AE1 N EH2 K S|V
35
+ ANNEXES|AH0 N EH1 K S IH0 Z|AE1 N EH2 K S IH0 Z|V
36
+ APPROPRIATE|AH0 P R OW1 P R IY0 EY2 T|AH0 P R OW1 P R IY0 AH0 T|V
37
+ APPROXIMATE|AH0 P R AA1 K S AH0 M EY2 T|AH0 P R AA1 K S AH0 M AH0 T|V
38
+ ARTICULATE|AA0 R T IH1 K Y AH0 L AH0 T|AA0 R T IH1 K Y AH0 L EY2 T|V
39
+ ASPIRATE|AE1 S P ER0 EY2 T|AE1 S P ER0 AH0 T|V
40
+ ASPIRATES|AE1 S P ER0 EY2 T S|AE1 S P ER0 AH0 T S|V
41
+ ASSOCIATE|AH0 S OW1 S IY0 EY2 T|AH0 S OW1 S IY0 AH0 T|V
42
+ ASSOCIATES|AH0 S OW1 S IY0 EY2 T S|AH0 S OW1 S IY0 AH0 T S|V
43
+ ATTRIBUTE|AH0 T R IH1 B Y UW2 T|AE1 T R IH0 B Y UW0 T|V
44
+ ATTRIBUTES|AH0 T R IH1 B Y UW2 T S|AE1 T R IH0 B Y UW0 T S|V
45
+ BATHS|B AE1 TH S|B AE1 DH Z|V
46
+ BLESSED|B L EH1 S IH0 D|B L EH1 S T|V
47
+ CERTIFICATE|S ER0 T IH1 F IH0 K AH0 T|S ER0 T IH1 F IH0 K EY2 T|V
48
+ CERTIFICATES|S ER0 T IH1 F IH0 K EY2 T S|S ER0 T IH1 F IH0 K AH0 T S|V
49
+ CLOSE|K L OW1 Z|K L OW1 S|V
50
+ CLOSER|K L OW1 Z ER0|K L OW1 S ER0|N
51
+ CLOSES|K L OW1 Z IH0 Z|K L OW1 S IH0 Z|V
52
+ COLLECT|K AH0 L EH1 K T|K AA1 L EH0 K T|V
53
+ COLLECTS|K AH0 L EH1 K T S|K AA1 L EH0 K T S|V
54
+ COMBAT|K AH0 M B AE1 T|K AA1 M B AE0 T|V
55
+ COMBATS|K AH0 M B AE1 T S|K AH1 M B AE0 T S|V
56
+ COMBINE|K AH0 M B AY1 N|K AA1 M B AY0 N|V
57
+ COMMUNE|K AH0 M Y UW1 N|K AA1 M Y UW0 N|V
58
+ COMMUNES|K AH0 M Y UW1 N Z|K AA1 M Y UW0 N Z|V
59
+ COMPACT|K AH0 M P AE1 K T|K AA1 M P AE0 K T|V
60
+ COMPACTS|K AH0 M P AE1 K T S|K AA1 M P AE0 K T S|V
61
+ COMPLEX|K AH0 M P L EH1 K S| K AA1 M P L EH0 K S|ADJ
62
+ COMPLIMENT|K AA1 M P L AH0 M EH0 N T|K AA1 M P L AH0 M AH0 N T|V
63
+ COMPLIMENTS|K AA1 M P L AH0 M EH0 N T S|K AA1 M P L AH0 M AH0 N T S|V
64
+ COMPOUND|K AH0 M P AW1 N D|K AA1 M P AW0 N D|V
65
+ COMPOUNDS|K AH0 M P AW1 N D Z|K AA1 M P AW0 N D Z|V
66
+ COMPRESS|K AH0 M P R EH1 S|K AA1 M P R EH0 S|V
67
+ COMPRESSES|K AH0 M P R EH1 S IH0 Z|K AA1 M P R EH0 S AH0 Z|V
68
+ CONCERT|K AH0 N S ER1 T|K AA1 N S ER0 T|V
69
+ CONCERTS|K AH0 N S ER1 T S|K AA1 N S ER0 T S|V
70
+ CONDUCT|K AA0 N D AH1 K T|K AA1 N D AH0 K T|V
71
+ CONFEDERATE|K AH0 N F EH1 D ER0 EY2 T|K AH0 N F EH1 D ER0 AH0 T|V
72
+ CONFEDERATES|K AH0 N F EH1 D ER0 EY2 T S|K AH0 N F EH1 D ER0 AH0 T S|V
73
+ CONFINES|K AH0 N F AY1 N Z|K AA1 N F AY2 N Z|V
74
+ CONFLICT|K AH0 N F L IH1 K T|K AA1 N F L IH0 K T|V
75
+ CONFLICTS|K AH0 N F L IH1 K T S|K AA1 N F L IH0 K T S|V
76
+ CONGLOMERATE|K AH0 N G L AA1 M ER0 EY2 T|K AH0 N G L AA1 M ER0 AH0 T|V
77
+ CONGLOMERATES|K AH0 N G L AA1 M ER0 EY2 T S|K AH0 N G L AA1 M ER0 AH0 T S|V
78
+ CONSCRIPT|K AH0 N S K R IH1 P T|K AA1 N S K R IH0 P T|V
79
+ CONSCRIPTS|K AH0 N S K R IH1 P T S|K AA1 N S K R IH0 P T S|V
80
+ CONSOLE|K AH0 N S OW1 L|K AA1 N S OW0 L|V
81
+ CONSOLES|K AH0 N S OW1 L Z|K AA1 N S OW0 L Z|V
82
+ CONSORT|K AH0 N S AO1 R T|K AA1 N S AO0 R T|V
83
+ CONSTRUCT|K AH0 N S T R AH1 K T|K AA1 N S T R AH0 K T|V
84
+ CONSTRUCTS|K AH0 N S T R AH1 K T S|K AA1 N S T R AH0 K T S|V
85
+ CONSUMMATE|K AA1 N S AH0 M EY2 T|K AA0 N S AH1 M AH0 T|V
86
+ CONTENT|K AA1 N T EH0 N T|K AH0 N T EH1 N T|N
87
+ CONTENTS|K AH0 N T EH1 N T S|K AA1 N T EH0 N T S|V
88
+ CONTEST|K AH0 N T EH1 S T|K AA1 N T EH0 S T|V
89
+ CONTESTS|K AH0 N T EH1 S T S|K AA1 N T EH0 S T S|V
90
+ CONTRACT|K AH0 N T R AE1 K T|K AA1 N T R AE2 K T|V
91
+ CONTRACTS|K AH0 N T R AE1 K T S|K AA1 N T R AE2 K T S|V
92
+ CONTRAST|K AH0 N T R AE1 S T|K AA1 N T R AE0 S T|V
93
+ CONTRASTS|K AH0 N T R AE1 S T S|K AA1 N T R AE0 S T S|V
94
+ CONVERSE|K AH0 N V ER1 S|K AA1 N V ER0 S|V
95
+ CONVERT|K AH0 N V ER1 T|K AA1 N V ER0 T|V
96
+ CONVERTS|K AH0 N V ER1 T S|K AA1 N V ER0 T S|V
97
+ CONVICT|K AH0 N V IH1 K T|K AA1 N V IH0 K T|V
98
+ CONVICTS|K AH0 N V IH1 K T S|K AA1 N V IH0 K T S|V
99
+ COORDINATE|K OW0 AO1 R D AH0 N EY2 T|K OW0 AO1 R D AH0 N AH0 T|V
100
+ COORDINATES|K OW0 AO1 R D AH0 N EY2 T S|K OW0 AO1 R D AH0 N AH0 T S|V
101
+ COUNTERBALANCE|K AW1 N T ER0 B AE2 L AH0 N S|K AW2 N T ER0 B AE1 L AH0 N S|V
102
+ COUNTERBALANCES|K AW2 N T ER0 B AE1 L AH0 N S IH0 Z|K AW1 N T ER0 B AE2 L AH0 N S IH0 Z|V
103
+ CRABBED|K R AE1 B D|K R AE1 B IH0 D|V
104
+ CROOKED|K R UH1 K T|K R UH1 K AH0 D|V
105
+ CURATE|K Y UH0 R AH1 T|K Y UH1 R AH0 T|V
106
+ CURSED|K ER1 S T|K ER1 S IH0 D|V
107
+ DECOY|D IY0 K OY1|D IY1 K OY0|V
108
+ DECOYS|D IY0 K OY1 Z|D IY1 K OY0 Z|V
109
+ DECREASE|D IH0 K R IY1 S|D IY1 K R IY2 S|V
110
+ DECREASES|D IH0 K R IY1 S IH0 Z|D IY1 K R IY2 S IH0 Z|V
111
+ DEFECT|D IH0 F EH1 K T|D IY1 F EH0 K T|V
112
+ DEFECTS|D IH0 F EH1 K T S|D IY1 F EH0 K T S|V
113
+ DEGENERATE|D IH0 JH EH1 N ER0 EY2 T|D IH0 JH EH1 N ER0 AH0 T|V
114
+ DEGENERATES|D IH0 JH EH1 N ER0 EY2 T S|D IH0 JH EH1 N ER0 AH0 T S|V
115
+ DELEGATE|D EH1 L AH0 G EY2 T|D EH1 L AH0 G AH0 T|V
116
+ DELEGATES|D EH1 L AH0 G EY2 T S|D EH1 L AH0 G AH0 T S|V
117
+ DELIBERATE|D IH0 L IH1 B ER0 EY2 T|D IH0 L IH1 B ER0 AH0 T|V
118
+ DESERT|D IH0 Z ER1 T|D EH1 Z ER0 T|V
119
+ DESERTS|D IH0 Z ER1 T S|D EH1 Z ER0 T S|V
120
+ DESOLATE|D EH1 S AH0 L EY2 T|D EH1 S AH0 L AH0 T|V
121
+ DIAGNOSES|D AY1 AH0 G N OW2 Z IY0 Z|D AY2 AH0 G N OW1 S IY0 Z|V
122
+ DICTATE|D IH0 K T EY1 T|D IH1 K T EY2 T|V
123
+ DICTATES|D IH0 K T EY1 T S|D IH1 K T EY2 T S|V
124
+ DIFFUSE|D IH0 F Y UW1 Z|D IH0 F Y UW1 S|V
125
+ DIGEST|D AY0 JH EH1 S T|D AY1 JH EH0 S T|V
126
+ DIGESTS|D AY2 JH EH1 S T S|D AY1 JH EH0 S T S|V
127
+ DISCARD|D IH0 S K AA1 R D|D IH1 S K AA0 R D|V
128
+ DISCARDS|D IH0 S K AA1 R D Z|D IH1 S K AA0 R D Z|V
129
+ DISCHARGE|D IH0 S CH AA1 R JH|D IH1 S CH AA2 R JH|V
130
+ DISCHARGES|D IH0 S CH AA1 R JH AH0 Z|D IH1 S CH AA2 R JH AH0 Z|V
131
+ DISCOUNT|D IH0 S K AW1 N T|D IH1 S K AW0 N T|V
132
+ DISCOUNTS|D IH0 S K AW1 N T S|D IH1 S K AW2 N T S|V
133
+ DISCOURSE|D IH0 S K AO1 R S|D IH1 S K AO0 R S|V
134
+ DISCOURSES|D IH0 S K AO1 R S IH0 Z|D IH1 S K AO0 R S IH0 Z|V
135
+ DOCUMENT|D AA1 K Y UW0 M EH0 N T|D AA1 K Y AH0 M AH0 N T|V
136
+ DOCUMENTS|D AA1 K Y UW0 M EH0 N T S|D AA1 K Y AH0 M AH0 N T S|V
137
+ DOGGED|D AO1 G IH0 D|D AO1 G D|V
138
+ DUPLICATE|D UW1 P L AH0 K EY2 T|D UW1 P L AH0 K AH0 T|V
139
+ DUPLICATES|D UW1 P L AH0 K EY2 T S|D UW1 P L AH0 K AH0 T S|V
140
+ EJACULATE|IH0 JH AE1 K Y UW0 L EY2 T|IH0 JH AE1 K Y UW0 L AH0 T|V
141
+ EJACULATES|IH0 JH AE1 K Y UW0 L EY2 T S|IH0 JH AE1 K Y UW0 L AH0 T S|V
142
+ ELABORATE|IH0 L AE1 B ER0 EY2 T|IH0 L AE1 B R AH0 T|V
143
+ ENTRANCE|IH0 N T R AH1 N S|EH1 N T R AH0 N S|V
144
+ ENTRANCES|IH0 N T R AH1 N S AH0 Z|EH1 N T R AH0 N S AH0 Z|V
145
+ ENVELOPE|IH0 N V EH1 L AH0 P|EH1 N V AH0 L OW2 P|V
146
+ ENVELOPES|IH0 N V EH1 L AH0 P S|EH1 N V AH0 L OW2 P S|V
147
+ ESCORT|EH0 S K AO1 R T|EH1 S K AO0 R T|V
148
+ ESCORTS|EH0 S K AO1 R T S|EH1 S K AO0 R T S|V
149
+ ESSAY|EH0 S EY1|EH1 S EY2|V
150
+ ESSAYS|EH0 S EY1 Z|EH1 S EY2 Z|V
151
+ ESTIMATE|EH1 S T AH0 M EY2 T|EH1 S T AH0 M AH0 T|V
152
+ ESTIMATES|EH1 S T AH0 M EY2 T S|EH1 S T AH0 M AH0 T S|V
153
+ EXCESS|IH0 K S EH1 S|EH1 K S EH2 S|V
154
+ EXCISE|EH0 K S AY1 S|EH1 K S AY0 Z|V
155
+ EXCUSE|IH0 K S K Y UW1 Z|IH0 K S K Y UW1 S|V
156
+ EXCUSES|IH0 K S K Y UW1 Z IH0 Z|IH0 K S K Y UW1 S IH0 Z|V
157
+ EXPATRIATE|EH0 K S P EY1 T R IY0 EY2 T|EH0 K S P EY1 T R IY0 AH0 T|V
158
+ EXPATRIATES|EH0 K S P EY1 T R IY0 EY2 T S|EH0 K S P EY1 T R IY0 AH0 T S|V
159
+ EXPLOIT|EH1 K S P L OY2 T|EH2 K S P L OY1 T|V
160
+ EXPLOITS|EH1 K S P L OY2 T S|EH2 K S P L OY1 T S|V
161
+ EXPORT|IH0 K S P AO1 R T|EH1 K S P AO0 R T|V
162
+ EXPORTS|IH0 K S P AO1 R T S|EH1 K S P AO0 R T S|V
163
+ EXTRACT|IH0 K S T R AE1 K T|EH1 K S T R AE2 K T|V
164
+ EXTRACTS|IH0 K S T R AE1 K T S|EH1 K S T R AE2 K T S|V
165
+ FERMENT|F ER0 M EH1 N T|F ER1 M EH0 N T|V
166
+ FERMENTS|F ER0 M EH1 N T S|F ER1 M EH0 N T S|V
167
+ FRAGMENT|F R AE1 G M AH0 N T|F R AE0 G M EH1 N T|V
168
+ FRAGMENTS|F R AE0 G M EH1 N T S|F R AE1 G M AH0 N T S|V
169
+ FREQUENT|F R IY1 K W EH2 N T|F R IY1 K W AH0 N T|V
170
+ GRADUATE|G R AE1 JH AH0 W EY2 T|G R AE1 JH AH0 W AH0 T|V
171
+ GRADUATES|G R AE1 JH AH0 W EY2 T S|G R AE1 JH AH0 W AH0 T S|V
172
+ HOUSE|HH AW1 Z|HH AW1 S|V
173
+ IMPACT|IH2 M P AE1 K T|IH1 M P AE0 K T|V
174
+ IMPACTS|IH2 M P AE1 K T S|IH1 M P AE0 K T S|V
175
+ IMPLANT|IH2 M P L AE1 N T|IH1 M P L AE2 N T|V
176
+ IMPLANTS|IH2 M P L AE1 N T S|IH1 M P L AE2 N T S|V
177
+ IMPLEMENT|IH1 M P L AH0 M EH0 N T|IH1 M P L AH0 M AH0 N T|V
178
+ IMPLEMENTS|IH1 M P L AH0 M EH0 N T S|IH1 M P L AH0 M AH0 N T S|V
179
+ IMPORT|IH2 M P AO1 R T|IH1 M P AO2 R T|V
180
+ IMPORTS|IH2 M P AO1 R T S|IH1 M P AO2 R T S|V
181
+ IMPRESS|IH0 M P R EH1 S|IH1 M P R EH0 S|V
182
+ IMPRINT|IH1 M P R IH0 N T|IH2 M P R IH1 N T|V
183
+ IMPRINTS|IH2 M P R IH1 N T S|IH1 M P R IH0 N T S|V
184
+ INCENSE|IH2 N S EH1 N S|IH1 N S EH2 N S|V
185
+ INCLINE|IH2 N K L AY1 N|IH1 N K L AY0 N|V
186
+ INCLINES|IH2 N K L AY1 N Z|IH1 N K L AY0 N Z|V
187
+ INCORPORATE|IH2 N K AO1 R P ER0 EY2 T|IH2 N K AO1 R P ER0 AH0 T|V
188
+ INCREASE|IH2 N K R IY1 S|IH1 N K R IY2 S|V
189
+ INCREASES|IH2 N K R IY1 S IH0 Z|IH1 N K R IY2 S IH0 Z|V
190
+ INDENT|IH2 N D EH1 N T|IH1 N D EH0 N T|V
191
+ INDENTS|IH2 N D EH1 N T S|IH1 N D EH0 N T S|V
192
+ INEBRIATE|IH2 N EH1 B R IY0 EY2 T|IH2 N EH1 B R IY0 AH0 T|V
193
+ INEBRIATES|IH2 N EH1 B R IY0 EY2 T S|IH2 N EH1 B R IY0 AH0 T S|V
194
+ INITIATE|IH2 N IH1 SH IY0 EY2 T|IH2 N IH1 SH IY0 AH0 T|V
195
+ INITIATES|IH2 N IH1 SH IY0 EY2 T S|IH2 N IH1 SH IY0 AH0 T S|V
196
+ INLAY|IH2 N L EY1|IH1 N L EY2|V
197
+ INLAYS|IH2 N L EY1 Z|IH1 N L EY2 Z|V
198
+ INSERT|IH2 N S ER1 T|IH1 N S ER2 T|V
199
+ INSERTS|IH2 N S ER1 T S|IH1 N S ER2 T S|V
200
+ INSET|IH2 N S EH1 T|IH1 N S EH2 T|V
201
+ INSETS|IH2 N S EH1 T S|IH1 N S EH2 T S|V
202
+ INSTINCT|IH2 N S T IH1 NG K T|IH1 N S T IH0 NG K T|V
203
+ INSULT|IH2 N S AH1 L T|IH1 N S AH2 L T|V
204
+ INSULTS|IH2 N S AH1 L T S|IH1 N S AH2 L T S|V
205
+ INTERCHANGE|IH2 T ER0 CH EY1 N JH|IH1 N T ER0 CH EY2 N JH|V
206
+ INTERCHANGES|IH2 T ER0 CH EY1 N JH IH0 Z|IH1 N T ER0 CH EY2 N JH IH0 Z|V
207
+ INTERDICT|IH2 N T ER0 D IH1 K T|IH1 N T ER0 D IH2 K T|V
208
+ INTERDICTS|IH2 N T ER0 D IH1 K T S|IH1 N T ER0 D IH2 K T S|V
209
+ INTERN|IH0 N T ER1 N|IH1 N T ER0 N|V
210
+ INTERNS|IH0 N T ER1 N Z|IH1 N T ER0 N Z|V
211
+ INTIMATE|IH1 N T IH0 M EY2 T|IH1 N T AH0 M AH0 T|V
212
+ INTIMATES|IH1 N T IH0 M EY2 T S|IH1 N T AH0 M AH0 T S|V
213
+ INTROVERT|IH2 N T R AO0 V ER1 T|IH1 N T R AO0 V ER2 T|V
214
+ INTROVERTS|IH2 N T R AO0 V ER1 T S|IH1 N T R AO0 V ER2 T S|V
215
+ INVERSE|IH1 N V ER0 S|IH2 N V ER1 S|V
216
+ INVITE|IH2 N V AY1 T|IH1 N V AY0 T|V
217
+ INVITES|IH2 N V AY1 T S|IH1 N V AY0 T S|V
218
+ JAGGED|JH AE1 G D|JH AE1 G IH0 D|V
219
+ LEARNED|L ER1 N IH0 D|L ER1 N D|V
220
+ LEGITIMATE|L AH0 JH IH1 T AH0 M EY2 T|L AH0 JH IH1 T AH0 M AH0 T|V
221
+ MANDATE|M AE1 N D EY2 T|M AE2 N D EY1 T|V
222
+ MISCONDUCT|M IH2 S K AA1 N D AH0 K T|M IH2 S K AA0 N D AH1 K T|V
223
+ MISPRINT|M IH2 S P R IH1 N T|M IH1 S P R IH0 N T|V
224
+ MISPRINTS|M IH2 S P R IH1 N T S|M IH1 S P R IH0 N T S|V
225
+ MISUSE|M IH0 S Y UW1 S|M IH0 S Y UW1 Z|V
226
+ MISUSES|M IH0 S Y UW1 Z IH0 Z|M IH0 S Y UW1 S IH0 Z|V
227
+ MODERATE|M AA1 D ER0 EY2 T|M AA1 D ER0 AH0 T|V
228
+ MODERATES|M AA1 D ER0 EY2 T S|M AA1 D ER0 AH0 T S|V
229
+ MOUTH|M AW1 TH|M AW1 DH|V
230
+ MOUTHS|M AW1 DH Z|M AW1 TH S|V
231
+ OBJECT|AA1 B JH EH0 K T|AH0 B JH EH1 K T|V
232
+ OBJECTS|AH0 B JH EH1 K T S|AA1 B JH EH0 K T S|V
233
+ ORNAMENT|AO1 R N AH0 M EH0 N T|AO1 R N AH0 M AH0 N T|V
234
+ ORNAMENTS|AO1 R N AH0 M EH0 N T S|AO1 R N AH0 M AH0 N T S|V
235
+ OVERCHARGE|OW2 V ER0 CH AA1 R JH|OW1 V ER0 CH AA2 R JH|V
236
+ OVERCHARGES|OW2 V ER0 CH AA1 R JH IH0 Z|OW1 V ER0 CH AA2 R JH IH0 Z|V
237
+ OVERFLOW|OW2 V ER0 F L OW1|OW1 V ER0 F L OW2|V
238
+ OVERFLOWS|OW2 V ER0 F L OW1 Z|OW1 V ER0 F L OW2 Z|V
239
+ OVERHANG|OW2 V ER0 HH AE1 NG|OW1 V ER0 HH AE2 NG|V
240
+ OVERHANGS|OW2 V ER0 HH AE1 NG Z|OW1 V ER0 HH AE2 NG Z|V
241
+ OVERHAUL|OW2 V ER0 HH AO1 L|OW1 V ER0 HH AO2 L|V
242
+ OVERHAULS|OW2 V ER0 HH AO1 L Z|OW1 V ER0 HH AO2 L Z|V
243
+ OVERLAP|OW2 V ER0 L AE1 P|OW1 V ER0 L AE2 P|V
244
+ OVERLAPS|OW2 V ER0 L AE1 P S|OW1 V ER0 L AE2 P S|V
245
+ OVERLAY|OW2 V ER0 L EY1|OW1 V ER0 L EY2|V
246
+ OVERLAYS|OW2 V ER0 L EY1 Z|OW1 V ER0 L EY2 Z|V
247
+ OVERWORK|OW2 V ER0 W ER1 K|OW1 V ER0 W ER2 K|V
248
+ PERFECT|P ER0 F EH1 K T|P ER1 F IH2 K T|V
249
+ PERFUME|P ER0 F Y UW1 M|P ER1 F Y UW0 M|V
250
+ PERFUMES|P ER0 F Y UW1 M Z|P ER1 F Y UW0 M Z|V
251
+ PERMIT|P ER0 M IH1 T|P ER1 M IH2 T|V
252
+ PERMITS|P ER0 M IH1 T S|P ER1 M IH2 T S|V
253
+ PERVERT|P ER0 V ER1 T|P ER1 V ER0 T|V
254
+ PERVERTS|P ER0 V ER1 T S|P ER1 V ER0 T S|V
255
+ PONTIFICATE|P AA0 N T IH1 F AH0 K AH0 T|P AA0 N T IH1 F AH0 K EY2 T|V
256
+ PONTIFICATES|P AA0 N T IH1 F AH0 K EY2 T S|P AA0 N T IH1 F AH0 K AH0 T S|V
257
+ PRECIPITATE|P R IH0 S IH1 P IH0 T AH0 T|P R IH0 S IH1 P IH0 T EY2 T|V
258
+ PREDICATE|P R EH1 D IH0 K AH0 T|P R EH1 D AH0 K EY2 T|V
259
+ PREDICATES|P R EH1 D AH0 K EY2 T S|P R EH1 D IH0 K AH0 T S|V
260
+ PREFIX|P R IY2 F IH1 K S|P R IY1 F IH0 K S|V
261
+ PREFIXES|P R IY2 F IH1 K S IH0 JH|P R IY1 F IH0 K S IH0 JH|V
262
+ PRESAGE|P R EH2 S IH1 JH|P R EH1 S IH0 JH|V
263
+ PRESAGES|P R EH2 S IH1 JH IH0 JH|P R EH1 S IH0 JH IH0 JH|V
264
+ PRESENT|P R IY0 Z EH1 N T|P R EH1 Z AH0 N T|V
265
+ PRESENTS|P R IY0 Z EH1 N T S|P R EH1 Z AH0 N T S|V
266
+ PROCEEDS|P R AH0 S IY1 D Z|P R OW1 S IY0 D Z|V
267
+ PROCESS|P R AO2 S EH1 S|P R AA1 S EH2 S|V
268
+ PROCESSES|P R AA1 S EH0 S AH0 Z|P R AO2 S EH1 S AH0 Z|V
269
+ PROCESSING|P R AA0 S EH1 S IH0 NG|P R AA1 S EH0 S IH0 NG|V
270
+ PRODUCE|P R AH0 D UW1 S|P R OW1 D UW0 S|V
271
+ PROGRESS|P R AH0 G R EH1 S|P R AA1 G R EH2 S|V
272
+ PROGRESSES|P R OW0 G R EH1 S AH0 Z|P R AA1 G R EH2 S AH0 Z|V
273
+ PROJECT|P R AA0 JH EH1 K T|P R AA1 JH EH0 K T|V
274
+ PROJECTS|P R AA0 JH EH1 K T S|P R AA1 JH EH0 K T S|V
275
+ PROSPECT|P R AH2 S P EH1 K T|P R AA1 S P EH0 K T|V
276
+ PROSPECTS|P R AH2 S P EH1 K T S|P R AA1 S P EH0 K T S|V
277
+ PROSTRATE|P R AA0 S T R EY1 T|P R AA1 S T R EY0 T|V
278
+ PROTEST|P R AH0 T EH1 S T|P R OW1 T EH2 S T|V
279
+ PROTESTS|P R AH0 T EH1 S T S|P R OW1 T EH2 S T S|V
280
+ PURPORT|P ER0 P AO1 R T|P ER1 P AO2 R T|V
281
+ QUADRUPLE|K W AA1 D R UW0 P AH0 L|K W AA0 D R UW1 P AH0 L|V
282
+ QUADRUPLES|K W AA0 D R UW1 P AH0 L Z|K W AA1 D R UW0 P AH0 L Z|V
283
+ RAGGED|R AE1 G D|R AE1 G AH0 D|V
284
+ RAMPAGE|R AE2 M P EY1 JH|R AE1 M P EY2 JH|V
285
+ RAMPAGES|R AE2 M P EY1 JH IH0 Z|R AE1 M P EY2 JH IH0 Z|V
286
+ READ|R IY1 D|R EH1 D|VBD
287
+ REBEL|R EH1 B AH0 L|R IH0 B EH1 L|V
288
+ REBELS|R IH0 B EH1 L Z|R EH1 B AH0 L Z|V
289
+ REBOUND|R IY0 B AW1 N D|R IY1 B AW0 N D|V
290
+ REBOUNDS|R IY0 B AW1 N D Z|R IY1 B AW0 N D Z|V
291
+ RECALL|R IH0 K AO1 L|R IY1 K AO2 L|V
292
+ RECALLS|R IH0 K AO1 L Z|R IY1 K AO2 L Z|V
293
+ RECAP|R IH0 K AE1 P|R IY1 K AE2 P|V
294
+ RECAPPED|R IH0 K AE1 P T|R IY1 K AE2 P T|V
295
+ RECAPPING|R IH0 K AE1 P IH0 NG|R IY1 K AE2 P IH0 NG|V
296
+ RECAPS|R IH0 K AE1 P S|R IY1 K AE2 P S|V
297
+ RECOUNT|R IY2 K AW1 N T| R IH1 K AW0 N T|V
298
+ RECOUNTS|R IY2 K AW1 N T S| R IH1 K AW0 N T S|V
299
+ RECORD|R IH0 K AO1 R D|R EH1 K ER0 D|V
300
+ RECORDS|R IH0 K AO1 R D Z|R EH1 K ER0 D Z|V
301
+ REFILL|R IY0 F IH1 L|R IY1 F IH0 L|V
302
+ REFILLS|R IY0 F IH1 L Z|R IY1 F IH0 L Z|V
303
+ REFIT|R IY0 F IH1 T|R IY1 F IH0 T|V
304
+ REFITS|R IY0 F IH1 T S|R IY1 F IH0 T S|V
305
+ REFRESH|R IH0 F R EH1 SH|R IH1 F R EH0 SH|V
306
+ REFUND|R IH0 F AH1 N D|R IY1 F AH2 N D|V
307
+ REFUNDS|R IH0 F AH1 N D Z|R IY1 F AH2 N D Z|V
308
+ REFUSE|R IH0 F Y UW1 Z|R EH1 F Y UW2 Z|V
309
+ REGENERATE|R IY0 JH EH1 N ER0 EY2 T|R IY0 JH EH1 N ER0 AH0 T|V
310
+ REHASH|R IY0 HH AE1 SH|R IY1 HH AE0 SH|V
311
+ REHASHES|R IY0 HH AE1 SH IH0 Z|R IY1 HH AE0 SH IH0 Z|V
312
+ REINCARNATE|R IY2 IH0 N K AA1 R N EY2 T|R IY2 IH0 N K AA1 R N AH0 T|V
313
+ REJECT|R IH0 JH EH1 K T|R IY1 JH EH0 K T|V
314
+ REJECTS|R IH0 JH EH1 K T S|R IY1 JH EH0 K T S|V
315
+ RELAY|R IY2 L EY1|R IY1 L EY2|V
316
+ RELAYING|R IY2 L EY1 IH0 NG|R IY1 L EY2 IH0 NG|V
317
+ RELAYS|R IY2 L EY1 Z|R IY1 L EY2 Z|V
318
+ REMAKE|R IY2 M EY1 K|R IY1 M EY0 K|V
319
+ REMAKES|R IY2 M EY1 K S|R IY1 M EY0 K S|V
320
+ REPLAY|R IY0 P L EY1|R IY1 P L EY0|V
321
+ REPLAYS|R IY0 P L EY1 Z|R IY1 P L EY0 Z|V
322
+ REPRINT|R IY0 P R IH1 N T|R IY1 P R IH0 N T|V
323
+ REPRINTS|R IY0 P R IH1 N T S|R IY1 P R IH0 N T S|V
324
+ RERUN|R IY2 R AH1 N|R IY1 R AH0 N|V
325
+ RERUNS|R IY2 R AH1 N Z|R IY1 R AH0 N Z|V
326
+ RESUME|R IY0 Z UW1 M|R EH1 Z AH0 M EY2|V
327
+ RETAKE|R IY0 T EY1 K|R IY1 T EY0 K|V
328
+ RETAKES|R IY0 T EY1 K S|R IY1 T EY0 K S|V
329
+ RETHINK|R IY2 TH IH1 NG K|R IY1 TH IH0 NG K|V
330
+ RETHINKS|R IY2 TH IH1 NG K S|R IY1 TH IH0 NG K S|V
331
+ RETREAD|R IY2 T R EH1 D|R IY1 T R EH0 D|V
332
+ RETREADS|R IY2 T R EH1 D Z|R IY1 T R EH0 D Z|V
333
+ REWRITE|R IY0 R AY1 T|R IY1 R AY2 T|V
334
+ REWRITES|R IY0 R AY1 T S|R IY1 R AY2 T S|V
335
+ SEGMENT|S EH1 G M AH0 N T|S EH2 G M EH1 N T|V
336
+ SEGMENTS|S EH2 G M EH1 N T S|S EH1 G M AH0 N T S|V
337
+ SEPARATE|S EH1 P ER0 EY2 T|S EH1 P ER0 IH0 T|V
338
+ SEPARATES|S EH1 P ER0 EY2 T S|S EH1 P ER0 IH0 T S|V
339
+ SUBCONTRACT|S AH0 B K AA1 N T R AE2 K T|S AH2 B K AA0 N T R AE1 K T|V
340
+ SUBCONTRACTS|S AH2 B K AA0 N T R AE1 K T S|S AH0 B K AA1 N T R AE2 K T S|V
341
+ SUBJECT|S AH0 B JH EH1 K T|S AH1 B JH IH0 K T|V
342
+ SUBJECTS|S AH0 B JH EH1 K T S|S AH1 B JH IH0 K T S|V
343
+ SUBORDINATE|S AH0 B AO1 R D AH0 N EY2 T|S AH0 B AO1 R D AH0 N AH0 T|V
344
+ SUBORDINATES|S AH0 B AO1 R D AH0 N EY2 T S|S AH0 B AO1 R D AH0 N AH0 T S|V
345
+ SUPPLEMENT|S AH1 P L AH0 M EH0 N T|S AH1 P L AH0 M AH0 N T|V
346
+ SUPPLEMENTS|S AH1 P L AH0 M EH0 N T S|S AH1 P L AH0 M AH0 N T S|V
347
+ SURMISE|S ER0 M AY1 Z|S ER1 M AY0 Z|V
348
+ SURMISES|S ER0 M AY1 Z IH0 Z|S ER1 M AY0 Z IH0 Z|V
349
+ SURVEY|S ER0 V EY1|S ER1 V EY2|V
350
+ SURVEYS|S ER0 V EY1 Z|S ER1 V EY2 Z|V
351
+ SUSPECT|S AH0 S P EH1 K T|S AH1 S P EH2 K T|V
352
+ SUSPECTS|S AH0 S P EH1 K T S|S AH1 S P EH2 K T S|V
353
+ SYNDICATE|S IH1 N D AH0 K EY2 T|S IH1 N D IH0 K AH0 T|V
354
+ SYNDICATES|S IH1 N D IH0 K EY2 T S|S IH1 N D IH0 K AH0 T S|V
355
+ TORMENT|T AO1 R M EH2 N T|T AO0 R M EH1 N T|V
356
+ TRANSFER|T R AE0 N S F ER1|T R AE1 N S F ER0|V
357
+ TRANSFERS|T R AE0 N S F ER1 Z|T R AE1 N S F ER0 Z|V
358
+ TRANSPLANT|T R AE0 N S P L AE1 N T|T R AE1 N S P L AE0 N T|V
359
+ TRANSPLANTS|T R AE0 N S P L AE1 N T S|T R AE1 N S P L AE0 N T S|V
360
+ TRANSPORT|T R AE0 N S P AO1 R T|T R AE1 N S P AO0 R T|V
361
+ TRANSPORTS|T R AE0 N S P AO1 R T S|T R AE1 N S P AO0 R T S|V
362
+ TRIPLICATE|T R IH1 P L IH0 K EY2 T|T R IH1 P L IH0 K AH0 T|V
363
+ TRIPLICATES|T R IH1 P L IH0 K EY2 T S|T R IH1 P L IH0 K AH0 T S|V
364
+ UNDERCUT|AH2 N D ER0 K AH1 T|AH1 N D ER0 K AH2 T|V
365
+ UNDERESTIMATE|AH1 N D ER0 EH1 S T AH0 M EY2 T|AH1 N D ER0 EH1 S T AH0 M AH0 T|V
366
+ UNDERESTIMATES|AH1 N D ER0 EH1 S T AH0 M EY2 T S|AH1 N D ER0 EH1 S T AH0 M AH0 T S|V
367
+ UNDERLINE|AH2 N D ER0 L AY1 N|AH1 N D ER0 L AY2 N|V
368
+ UNDERLINES|AH2 N D ER0 L AY1 N Z|AH1 N D ER0 L AY2 N Z|V
369
+ UNDERTAKING|AH2 N D ER0 T EY1 K IH0 NG|AH1 N D ER0 T EY2 K IH0 NG|V
370
+ UNDERTAKINGS|AH2 N D ER0 T EY1 K IH0 NG Z|AH1 N D ER0 T EY2 K IH0 NG Z|V
371
+ UNUSED|AH0 N Y UW1 Z D|AH0 N Y UW1 S T|V
372
+ UPGRADE|AH0 P G R EY1 D|AH1 P G R EY0 D|V
373
+ UPGRADES|AH0 P G R EY1 D Z|AH1 P G R EY0 D Z|V
374
+ UPLIFT|AH2 P L IH1 F T|AH1 P L IH0 F T|V
375
+ UPSET|AH0 P S EH1 T|AH1 P S EH2 T|V
376
+ UPSETS|AH0 P S EH1 T S|AH1 P S EH2 T S|V
377
+ USE|Y UW1 Z|Y UW1 S|V
378
+ USED|Y UW1 Z D|Y UW1 S T|VBN
379
+ USES|Y UW1 Z IH0 Z|Y UW1 S IH0 Z|V
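Each non-comment line above is pipe-delimited as HEADWORD|PRON1|PRON2|POS: PRON1 applies when the word's POS tag starts with POS, otherwise PRON2. This is the convention parsed by construct_homograph_dictionary in g2p.py; a standalone sketch of the same parsing:

    # Minimal sketch mirroring construct_homograph_dictionary -- not part of the committed file.
    import codecs

    homograph2features = {}
    for line in codecs.open('homographs.en', 'r', 'utf8').read().splitlines():
        if line.startswith('#'):
            continue  # skip header comments
        headword, pron1, pron2, pos1 = line.strip().split('|')
        homograph2features[headword.lower()] = (pron1.split(), pron2.split(), pos1)

    print(homograph2features['use'])  # (['Y', 'UW1', 'Z'], ['Y', 'UW1', 'S'], 'V')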
dataset/google.py ADDED
@@ -0,0 +1,188 @@
1
+ import math, os, re, sys
2
+ from pathlib import Path
3
+ import numpy as np
4
+ import pandas as pd
5
+ from multiprocessing import Pool
6
+ from scipy.io import wavfile
7
+ import tensorflow as tf
8
+
9
+ from tensorflow.keras.utils import Sequence, OrderedEnqueuer
10
+ from tensorflow.keras import layers
11
+ from tensorflow.keras.preprocessing.sequence import pad_sequences
12
+
13
+ sys.path.append(os.path.dirname(__file__))
14
+ from g2p.g2p_en.g2p import G2p
15
+
16
+ import warnings
17
+ warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
18
+ np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
19
+
20
+ class GoogleCommandsDataloader(Sequence):
21
+ def __init__(self,
22
+ batch_size,
23
+ fs = 16000,
24
+ wav_dir='/home/DB/google_speech_commands',
25
+ target_list=['yes', 'no', 'up', 'down', 'left', 'right', 'on', 'off', 'stop', 'go'],
26
+ features='g2p_embed', # phoneme, g2p_embed, both ...
27
+ shuffle=True,
28
+ testset_only=False,
29
+ pkl=None,
30
+ ):
31
+
32
+ phonemes = ["<pad>", ] + ['AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', 'AH0', 'AH1', 'AH2', 'AO0',
33
+ 'AO1', 'AO2', 'AW0', 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'B', 'CH',
34
+ 'D', 'DH', 'EH0', 'EH1', 'EH2', 'ER0', 'ER1', 'ER2', 'EY0', 'EY1',
35
+ 'EY2', 'F', 'G', 'HH', 'IH0', 'IH1', 'IH2', 'IY0', 'IY1', 'IY2',
36
+ 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW0', 'OW1', 'OW2', 'OY0',
37
+ 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH0', 'UH1',
38
+ 'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH',
39
+ ' ']
40
+
41
+ self.p2idx = {p: idx for idx, p in enumerate(phonemes)}
42
+ self.idx2p = {idx: p for idx, p in enumerate(phonemes)}
43
+
44
+ self.batch_size = batch_size
45
+ self.fs = fs
46
+ self.wav_dir = wav_dir
47
+ self.target_list = [x.lower() for x in target_list]
48
+ self.testset_only = testset_only
49
+ self.features = features
50
+ self.shuffle = shuffle
51
+ self.pkl = pkl
52
+ self.nPhoneme = len(phonemes)
53
+ self.g2p = G2p()
54
+
55
+ self.__prep__()
56
+ self.on_epoch_end()
57
+
58
+ def __prep__(self):
59
+ self.data = pd.DataFrame(columns=['wav', 'text', 'duration', 'label'])
60
+
61
+ if (self.pkl is not None) and (os.path.isfile(self.pkl)):
62
+ print(">> Load dataset from {}".format(self.pkl))
63
+ self.data = pd.read_pickle(self.pkl)
64
+ else:
65
+ print(">> Make dataset from {}".format(self.wav_dir))
66
+ target_dict = {}
67
+ idx = 0
68
+ for target in self.target_list:
69
+ print(">> Extract from {}".format(target))
70
+ if self.testset_only:
71
+ test_list = os.path.join(self.wav_dir, 'testing_list.txt')
72
+ with open(test_list, "r") as f:
73
+ wav_list = f.readlines()
74
+ wav_list = [os.path.join(self.wav_dir, x.strip()) for x in wav_list]
75
+ wav_list = [x for x in wav_list if target == x.split('/')[-2]]
76
+ else:
77
+ wav_list = [str(x) for x in Path(os.path.join(self.wav_dir, target)).rglob('*.wav')]
78
+ for wav in wav_list:
79
+ anchor_text = wav.split('/')[-2].lower()
80
+ duration = float(wavfile.read(wav)[1].shape[-1])/self.fs
81
+ for comparison_text in self.target_list:
82
+ label = 1 if anchor_text == comparison_text else 0
83
+ target_dict[idx] = {
84
+ 'wav': wav,
85
+ 'text': comparison_text,
86
+ 'duration': duration,
87
+ 'label': label
88
+ }
89
+ idx += 1
90
+ self.data = self.data.append(pd.DataFrame.from_dict(target_dict, 'index'), ignore_index=True)
91
+
92
+ # g2p & p2idx by g2p_en package
93
+ print(">> Convert word to phoneme")
94
+ self.data['phoneme'] = self.data['text'].apply(lambda x: self.g2p(re.sub(r"[^a-zA-Z0-9]+", ' ', x)))
95
+ print(">> Convert phoneme to index")
96
+ self.data['pIndex'] = self.data['phoneme'].apply(lambda x: [self.p2idx[t] for t in x])
97
+ print(">> Compute phoneme embedding")
98
+ self.data['g2p_embed'] = self.data['text'].apply(lambda x: self.g2p.embedding(x))
99
+
100
+ if (self.pkl is not None) and (not os.path.isfile(self.pkl)):
101
+ self.data.to_pickle(self.pkl)
102
+
103
+ # Get longest data
104
+ self.data = self.data.sort_values(by='duration').reset_index(drop=True)
105
+ self.wav_list = self.data['wav'].values
106
+ self.idx_list = self.data['pIndex'].values
107
+ self.emb_list = self.data['g2p_embed'].values
108
+ self.lab_list = self.data['label'].values
109
+
110
+ # Set dataloader params.
111
+ self.len = len(self.data)
112
+ self.maxlen_t = int((int(self.data['text'].apply(lambda x: len(x)).max() / 10) + 1) * 10)
113
+ self.maxlen_a = int((int(self.data['duration'].values[-1] / 0.5) + 1 ) * self.fs / 2)
114
+
115
+ def __len__(self):
116
+ # return total batch-wise length
117
+ return math.ceil(self.len / self.batch_size)
118
+
119
+ def _load_wav(self, wav):
120
+ return np.array(wavfile.read(wav)[1]).astype(np.float32) / 32768.0
121
+
122
+ def __getitem__(self, idx):
123
+ # chunking
124
+ indices = self.indices[idx * self.batch_size : (idx + 1) * self.batch_size]
125
+
126
+ # load inputs
127
+ batch_x = [np.array(wavfile.read(self.wav_list[i])[1]).astype(np.float32) / 32768.0 for i in indices]
128
+ if self.features == 'both':
129
+ batch_p = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
130
+ batch_e = [np.array(self.emb_list[i]).astype(np.float32) for i in indices]
131
+ else:
132
+ if self.features == 'phoneme':
133
+ batch_y = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
134
+ elif self.features == 'g2p_embed':
135
+ batch_y = [np.array(self.emb_list[i]).astype(np.float32) for i in indices]
136
+ # load outputs
137
+ batch_z = [np.array([self.lab_list[i]]).astype(np.float32) for i in indices]
138
+
139
+ # padding and masking
140
+ pad_batch_x = pad_sequences(np.array(batch_x), maxlen=self.maxlen_a, value=0.0, padding='post', dtype=batch_x[0].dtype)
141
+ if self.features == 'both':
142
+ pad_batch_p = pad_sequences(np.array(batch_p), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_p[0].dtype)
143
+ pad_batch_e = pad_sequences(np.array(batch_e), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_e[0].dtype)
144
+ else:
145
+ pad_batch_y = pad_sequences(np.array(batch_y), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_y[0].dtype)
146
+ pad_batch_z = pad_sequences(np.array(batch_z), value=0.0, padding='post', dtype=batch_z[0].dtype)
147
+
148
+ if self.features == 'both':
149
+ return pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z
150
+ else:
151
+ return pad_batch_x, pad_batch_y, pad_batch_z
152
+
153
+ def on_epoch_end(self):
154
+ self.indices = np.arange(self.len)
155
+ if self.shuffle == True:
156
+ np.random.shuffle(self.indices)
157
+
158
+ def convert_sequence_to_dataset(dataloader):
159
+ def data_generator():
160
+ for i in range(dataloader.__len__()):
161
+ if dataloader.features == 'both':
162
+ pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z = dataloader[i]
163
+ yield pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z
164
+ else:
165
+ pad_batch_x, pad_batch_y, pad_batch_z = dataloader[i]
166
+ yield pad_batch_x, pad_batch_y, pad_batch_z
167
+
168
+ if dataloader.features == 'both':
169
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
170
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
171
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),
172
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t, 256), dtype=tf.float32),
173
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
174
+ )
175
+ else:
176
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
177
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
178
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t) if dataloader.features == 'phoneme' else (None, dataloader.maxlen_t, 256),
179
+ dtype=tf.int32 if dataloader.features == 'phoneme' else tf.float32),
180
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
181
+ )
182
+ # data_dataset = data_dataset.cache()
183
+ data_dataset = data_dataset.prefetch(1)
184
+
185
+ return data_dataset
186
+
187
+ if __name__ == '__main__':
188
+ dataloader = GoogleCommandsDataloader(2048, testset_only=True, pkl='/home/DB/google_speech_commands/google_testset.pkl', features='g2p_embed')
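A sketch of driving this loader through convert_sequence_to_dataset (batch size and paths are illustrative; the yielded shapes follow the TensorSpecs above):

    # Hypothetical usage sketch -- not part of the committed file.
    from dataset.google import GoogleCommandsDataloader, convert_sequence_to_dataset

    loader = GoogleCommandsDataloader(batch_size=512, testset_only=True,
                                      wav_dir='/home/DB/google_speech_commands',
                                      features='g2p_embed')
    dataset = convert_sequence_to_dataset(loader)
    for wavs, texts, labels in dataset:
        # wavs: (B, maxlen_a) float32, texts: (B, maxlen_t, 256) float32, labels: (B, 1) float32
        break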
dataset/google_infe202405.py ADDED
@@ -0,0 +1,192 @@
1
+ import math, os, re, sys
2
+ from pathlib import Path
3
+ import numpy as np
4
+ import pandas as pd
5
+ from multiprocessing import Pool
6
+ from scipy.io import wavfile
7
+ import tensorflow as tf
8
+
9
+ from tensorflow.keras.utils import Sequence, OrderedEnqueuer
10
+ from tensorflow.keras import layers
11
+ from tensorflow.keras.preprocessing.sequence import pad_sequences
12
+
13
+ sys.path.append(os.path.dirname(__file__))
14
+ from g2p.g2p_en.g2p import G2p
15
+
16
+ import warnings
17
+ warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
18
+ np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
19
+
20
+ class GoogleCommandsDataloader(Sequence):
21
+ def __init__(self,
22
+ batch_size,
23
+ fs = 16000,
24
+ wav_dir='/home/DB/kws_google/data2',
25
+ target_list=['bed','three','bird','cat','dog','eight','five','four','happy','house','marvin','nine',
26
+ 'one','seven','sheila','six','tree','two','wow','zero'],
27
+ features='g2p_embed', # phoneme, g2p_embed, both ...
28
+ shuffle=True,
29
+ testset_only=False,
30
+ pkl=None,
31
+ ):
32
+
33
+ phonemes = ["<pad>", ] + ['AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', 'AH0', 'AH1', 'AH2', 'AO0',
34
+ 'AO1', 'AO2', 'AW0', 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'B', 'CH',
35
+ 'D', 'DH', 'EH0', 'EH1', 'EH2', 'ER0', 'ER1', 'ER2', 'EY0', 'EY1',
36
+ 'EY2', 'F', 'G', 'HH', 'IH0', 'IH1', 'IH2', 'IY0', 'IY1', 'IY2',
37
+ 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW0', 'OW1', 'OW2', 'OY0',
38
+ 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH0', 'UH1',
39
+ 'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH',
40
+ ' ']
41
+
42
+ self.p2idx = {p: idx for idx, p in enumerate(phonemes)}
43
+ self.idx2p = {idx: p for idx, p in enumerate(phonemes)}
44
+
45
+ self.batch_size = batch_size
46
+ self.fs = fs
47
+ self.wav_dir = wav_dir
48
+ self.target_list = [x.lower() for x in target_list]
49
+ self.testset_only = testset_only
50
+ self.features = features
51
+ self.shuffle = shuffle
52
+ self.pkl = pkl
53
+ self.nPhoneme = len(phonemes)
54
+ self.g2p = G2p()
55
+
56
+ self.__prep__()
57
+ self.on_epoch_end()
58
+
59
+ def __prep__(self):
60
+ self.data = pd.DataFrame(columns=['wav', 'text', 'duration', 'label'])
61
+
62
+ if (self.pkl is not None) and (os.path.isfile(self.pkl)):
63
+ print(">> Load dataset from {}".format(self.pkl))
64
+ self.data = pd.read_pickle(self.pkl)
65
+ else:
66
+ print(">> Make dataset from {}".format(self.wav_dir))
67
+ target_dict = {}
68
+ idx = 0
69
+ for target in self.target_list:
70
+ print(">> Extract from {}".format(target))
71
+ if self.testset_only:
72
+ test_list = os.path.join(self.wav_dir, 'testing_list.txt')
73
+ with open(test_list, "r") as f:
74
+ wav_list = f.readlines()
75
+ wav_list = [os.path.join(self.wav_dir, x.strip()) for x in wav_list]
76
+ wav_list = [x for x in wav_list if target == x.split('/')[-2]]
77
+ else:
78
+ wav_list = [str(x) for x in Path(os.path.join(self.wav_dir, target)).rglob('*.wav')]
79
+
80
+ for wav in wav_list:
81
+ anchor_text = wav.split('/')[-2].lower()
82
+ duration = float(wavfile.read(wav)[1].shape[-1])/self.fs
83
+ for comparison_text in self.target_list:
84
+ label = 1 if anchor_text == comparison_text else 0
85
+ target_dict[idx] = {
86
+ 'wav': wav,
87
+ 'text': comparison_text,
88
+ 'duration': duration,
89
+ 'label': label
90
+ }
91
+ idx += 1
92
+ self.data = self.data.append(pd.DataFrame.from_dict(target_dict, 'index'), ignore_index=True)
93
+
94
+ # g2p & p2idx by g2p_en package
95
+ print(">> Convert word to phoneme")
96
+ self.data['phoneme'] = self.data['text'].apply(lambda x: self.g2p(re.sub(r"[^a-zA-Z0-9]+", ' ', x)))
97
+ print(">> Convert phoneme to index")
98
+ self.data['pIndex'] = self.data['phoneme'].apply(lambda x: [self.p2idx[t] for t in x])
99
+ print(">> Compute phoneme embedding")
100
+ self.data['g2p_embed'] = self.data['text'].apply(lambda x: self.g2p.embedding(x))
101
+
102
+ if (self.pkl is not None) and (not os.path.isfile(self.pkl)):
103
+ self.data.to_pickle(self.pkl)
104
+
105
+
106
+ # Extract lists in the original row order (kept for inference); the sort below only sizes the padding
107
+ self.wav_list = self.data['wav'].values
108
+ self.idx_list = self.data['pIndex'].values
109
+ self.emb_list = self.data['g2p_embed'].values
110
+ self.lab_list = self.data['label'].values
111
+ self.data = self.data.sort_values(by='duration').reset_index(drop=True)
112
+
113
+ # Set dataloader params.
114
+ self.len = len(self.data)
115
+ self.maxlen_t = int((int(self.data['text'].apply(lambda x: len(x)).max() / 10) + 1) * 10)
116
+ self.maxlen_a = int((int(self.data['duration'].values[-1] / 0.5) + 1 ) * self.fs / 2)
117
+
118
+ def __len__(self):
119
+ # return total batch-wise length
120
+ return math.ceil(self.len / self.batch_size)
121
+
122
+ def _load_wav(self, wav):
123
+ return np.array(wavfile.read(wav)[1]).astype(np.float32) / 32768.0
124
+
125
+ def __getitem__(self, idx):
126
+ # chunking
127
+ indices = self.indices[idx * self.batch_size : (idx + 1) * self.batch_size]
128
+
129
+ # load inputs
130
+ batch_x = [np.array(wavfile.read(self.wav_list[i])[1]).astype(np.float32) / 32768.0 for i in indices]
131
+ if self.features == 'both':
132
+ batch_p = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
133
+ batch_e = [np.array(self.emb_list[i]).astype(np.float32) for i in indices]
134
+ else:
135
+ if self.features == 'phoneme':
136
+ batch_y = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
137
+ elif self.features == 'g2p_embed':
138
+ batch_y = [np.array(self.emb_list[i]).astype(np.float32) for i in indices]
139
+ # load outputs
140
+ batch_z = [np.array([self.lab_list[i]]).astype(np.float32) for i in indices]
141
+
142
+ # padding and masking
143
+ pad_batch_x = pad_sequences(np.array(batch_x), maxlen=self.maxlen_a, value=0.0, padding='post', dtype=batch_x[0].dtype)
144
+ if self.features == 'both':
145
+ pad_batch_p = pad_sequences(np.array(batch_p), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_p[0].dtype)
146
+ pad_batch_e = pad_sequences(np.array(batch_e), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_e[0].dtype)
147
+ else:
148
+ pad_batch_y = pad_sequences(np.array(batch_y), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_y[0].dtype)
149
+ pad_batch_z = pad_sequences(np.array(batch_z), value=0.0, padding='post', dtype=batch_z[0].dtype)
150
+
151
+ if self.features == 'both':
152
+ return pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z
153
+ else:
154
+ return pad_batch_x, pad_batch_y, pad_batch_z
155
+
156
+ def on_epoch_end(self):
157
+ self.indices = np.arange(self.len)
158
+ # if self.shuffle == True:
159
+ # np.random.shuffle(self.indices)
160
+
161
+ def convert_sequence_to_dataset(dataloader):
162
+ def data_generator():
163
+ for i in range(dataloader.__len__()):
164
+ if dataloader.features == 'both':
165
+ pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z = dataloader[i]
166
+ yield pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z
167
+ else:
168
+ pad_batch_x, pad_batch_y, pad_batch_z = dataloader[i]
169
+ yield pad_batch_x, pad_batch_y, pad_batch_z
170
+
171
+ if dataloader.features == 'both':
172
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
173
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
174
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),
175
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t, 256), dtype=tf.float32),
176
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
177
+ )
178
+ else:
179
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
180
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
181
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t) if dataloader.features == 'phoneme' else (None, dataloader.maxlen_t, 256),
182
+ dtype=tf.int32 if dataloader.features == 'phoneme' else tf.float32),
183
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
184
+ )
185
+ # data_dataset = data_dataset.cache()
186
+ # data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=output_signature)
187
+ data_dataset = data_dataset.prefetch(1)
188
+
189
+ return data_dataset
190
+
191
+ if __name__ == '__main__':
192
+ dataloader = GoogleCommandsDataloader(2048, testset_only=True, pkl='/home/DB/google_speech_commands/google_testset.pkl', features='g2p_embed')
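For reference, maxlen_a above rounds the longest clip up to the next half second of audio; a worked example with illustrative values:

    # Worked example of the maxlen_a arithmetic -- values are illustrative, not from the repo.
    fs = 16000
    d_max = 1.0                                      # duration (s) of the longest clip
    maxlen_a = int((int(d_max / 0.5) + 1) * fs / 2)  # (2 + 1) * 8000
    assert maxlen_a == 24000                         # i.e. padded to 1.5 s at 16 kHz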
dataset/libriphrase.py ADDED
@@ -0,0 +1,331 @@
1
+ import math, os, re, sys
2
+ from pathlib import Path
3
+ import numpy as np
4
+ import pandas as pd
5
+ import Levenshtein
6
+ from multiprocessing import Pool
7
+ from scipy.io import wavfile
8
+ import tensorflow as tf
9
+
10
+ from tensorflow.keras.utils import Sequence, OrderedEnqueuer
11
+ from tensorflow.keras import layers
12
+ from tensorflow.keras.preprocessing.sequence import pad_sequences
13
+
14
+ sys.path.append(os.path.dirname(__file__))
15
+ from g2p.g2p_en.g2p import G2p
16
+
17
+ import warnings
18
+ warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
19
+ np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
20
+
21
+ class LibriPhraseDataloader(Sequence):
22
+ def __init__(self,
23
+ batch_size,
24
+ fs = 16000,
25
+ wav_dir='/home/DB/LibriPhrase/wav_dir',
26
+ noise_dir='/home/DB/noise',
27
+ csv_dir='/home/DB/LibriPhrase/data',
28
+ train_csv = ['train_100h', 'train_360h'],
29
+ test_csv = ['train_500h',],
30
+ types='both', # easy, hard
31
+ features='g2p_embed', # phoneme, g2p_embed, both ...
32
+ train=True,
33
+ shuffle=True,
34
+ pkl=None,
35
+ edit_dist=False,
36
+ ):
37
+
38
+ phonemes = ["<pad>", ] + ['AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', 'AH0', 'AH1', 'AH2', 'AO0',
39
+ 'AO1', 'AO2', 'AW0', 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'B', 'CH',
40
+ 'D', 'DH', 'EH0', 'EH1', 'EH2', 'ER0', 'ER1', 'ER2', 'EY0', 'EY1',
41
+ 'EY2', 'F', 'G', 'HH', 'IH0', 'IH1', 'IH2', 'IY0', 'IY1', 'IY2',
42
+ 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW0', 'OW1', 'OW2', 'OY0',
43
+ 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH0', 'UH1',
44
+ 'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH',
45
+ ' ']
46
+
47
+ self.p2idx = {p: idx for idx, p in enumerate(phonemes)}
48
+ self.idx2p = {idx: p for idx, p in enumerate(phonemes)}
49
+
50
+ self.batch_size = batch_size
51
+ self.fs = fs
52
+ self.wav_dir = wav_dir
53
+ self.csv_dir = csv_dir
54
+ self.noise_dir = noise_dir
55
+ self.train_csv = train_csv
56
+ self.test_csv = test_csv
57
+ self.types = types
58
+ self.features = features
59
+ self.train = train
60
+ self.shuffle = shuffle
61
+ self.pkl = pkl
62
+ self.edit_dist = edit_dist
63
+ self.nPhoneme = len(phonemes)
64
+ self.g2p = G2p()
65
+
66
+ self.__prep__()
67
+ self.on_epoch_end()
68
+
69
+ def __prep__(self):
70
+ if self.train:
71
+ print(">> Preparing noise DB")
72
+ noise_list = [str(x) for x in Path(self.noise_dir).rglob('*.wav')]
73
+ self.noise = np.array([])
74
+ for noise in noise_list:
75
+ fs, data = wavfile.read(noise)
76
+ assert fs == self.fs, ">> Error : Un-match sampling freq.\n{} -> {}".format(noise, fs)
77
+ data = data.astype(np.float32) / 32768.0
78
+ data = (data / np.max(data)) * 0.5
79
+ self.noise = np.append(self.noise, data)
80
+
81
+ self.data = pd.DataFrame(columns=['wav_label', 'wav', 'text', 'duration', 'label', 'type'])
82
+
83
+ if (self.pkl is not None) and (os.path.isfile(self.pkl)):
84
+ print(">> Load dataset from {}".format(self.pkl))
85
+ self.data = pd.read_pickle(self.pkl)
86
+ else:
87
+ for db in self.train_csv if self.train else self.test_csv:
88
+ csv_list = [str(x) for x in Path(self.csv_dir).rglob('*' + db + '*word*')]
89
+ for n_word in csv_list:
90
+ print(">> processing : {} ".format(n_word))
91
+ df = pd.read_csv(n_word)
92
+ # Split train dataset to match & unmatch case
93
+ anc_pos = df[['anchor_text', 'anchor', 'anchor_text', 'anchor_dur']]
94
+ anc_neg = df[['anchor_text', 'anchor', 'comparison_text', 'anchor_dur', 'target', 'type']]
95
+ com_pos = df[['comparison_text', 'comparison', 'comparison_text', 'comparison_dur']]
96
+ com_neg = df[['comparison_text', 'comparison', 'anchor_text', 'comparison_dur', 'target', 'type']]
97
+ anc_pos.columns = ['wav_label', 'anchor', 'anchor_text', 'anchor_dur']
98
+ com_pos.columns = ['wav_label', 'comparison', 'comparison_text', 'comparison_dur']
99
+ anc_pos['label'] = 1
100
+ anc_pos['type'] = df['type']
101
+ com_pos['label'] = 1
102
+ com_pos['type'] = df['type']
103
+ # Concat
104
+ self.data = self.data.append(anc_pos.rename(columns={y: x for x, y in zip(self.data.columns, anc_pos.columns)}), ignore_index=True)
105
+ self.data = self.data.append(anc_neg.rename(columns={y: x for x, y in zip(self.data.columns, anc_neg.columns)}), ignore_index=True)
106
+ self.data = self.data.append(com_pos.rename(columns={y: x for x, y in zip(self.data.columns, com_pos.columns)}), ignore_index=True)
107
+ self.data = self.data.append(com_neg.rename(columns={y: x for x, y in zip(self.data.columns, com_neg.columns)}), ignore_index=True)
108
+
109
+ # Append wav directory path
110
+ self.data['wav'] = self.data['wav'].apply(lambda x: os.path.join(self.wav_dir, x))
111
+ # g2p & p2idx by g2p_en package
112
+ print(">> Convert word to phoneme")
113
+ self.data['phoneme'] = self.data['text'].apply(lambda x: self.g2p(re.sub(r"[^a-zA-Z0-9]+", ' ', x)))
114
+ print(">> Convert speech word to phoneme")
115
+ self.data['wav_phoneme'] = self.data['wav_label'].apply(lambda x: self.g2p(re.sub(r"[^a-zA-Z0-9]+", ' ', x)))
116
+ print(">> Convert phoneme to index")
117
+ self.data['pIndex'] = self.data['phoneme'].apply(lambda x: [self.p2idx[t] for t in x])
118
+ print(">> Convert speech phoneme to index")
119
+ self.data['wav_pIndex'] = self.data['wav_phoneme'].apply(lambda x: [self.p2idx[t] for t in x])
120
+ print(">> Compute phoneme embedding")
121
+ self.data['g2p_embed'] = self.data['text'].apply(lambda x: self.g2p.embedding(x))
122
+ print(">> Calculate edit distance ratio")
123
+ self.data['dist'] = self.data.apply(lambda x: Levenshtein.ratio(re.sub(r"[^a-zA-Z0-9]+", ' ', x['wav_label']), re.sub(r"[^a-zA-Z0-9]+", ' ', x['text'])), axis=1)
124
+
125
+ if (self.pkl is not None) and (not os.path.isfile(self.pkl)):
126
+ self.data.to_pickle(self.pkl)
127
+
128
+ # Masking dataset type
129
+ if self.types == 'both':
130
+ pass
131
+ elif self.types == 'easy':
132
+ self.data = self.data.loc[self.data['type'] == 'diffspk_easyneg']
133
+ elif self.types == 'hard':
134
+ self.data = self.data.loc[self.data['type'] == 'diffspk_hardneg']
135
+
136
+ # Get longest data
137
+ self.data = self.data.sort_values(by='duration').reset_index(drop=True)
138
+ self.wav_list = self.data['wav'].values
139
+ self.idx_list = self.data['pIndex'].values
140
+ self.sIdx_list = self.data['wav_pIndex'].values
141
+ self.emb_list = self.data['g2p_embed'].values
142
+ self.lab_list = self.data['label'].values
143
+ if self.edit_dist:
144
+ self.dist_list = self.data['dist'].values
145
+
146
+ # Set dataloader params.
147
+ self.len = len(self.data)
148
+ self.maxlen_t = int((int(self.data['text'].apply(lambda x: len(x)).max() / 10) + 1) * 10)
149
+ self.maxlen_a = int((int(self.data['duration'].values[-1] / 0.5) + 1 ) * self.fs / 2)
150
+ self.maxlen_l = int((int(self.data['wav_label'].apply(lambda x: len(x)).max() / 10) + 1) * 10)
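+ # Padding lengths: text/label lengths are rounded up to the next multiple of 10 characters,
+ # and the longest duration up to the next 0.5 s (then converted to samples at self.fs).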
151
+
152
+ def __len__(self):
153
+ # return total batch-wise length
154
+ return math.ceil(self.len / self.batch_size)
155
+
156
+ def _load_wav(self, wav):
157
+ return np.array(wavfile.read(wav)[1]).astype(np.float32) / 32768.0
158
+
159
+ def _mixing_snr(self, clean, snr=[5, 15]):
160
+ def _cal_adjusted_rms(clean_rms, snr):
161
+ a = float(snr) / 20
162
+ noise_rms = clean_rms / (10**a)
163
+ return noise_rms
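+ # Since SNR_dB = 20 * log10(clean_rms / noise_rms) for amplitudes, noise_rms = clean_rms / 10**(SNR_dB / 20).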
164
+
165
+ def _cal_rms(amp):
166
+ return np.sqrt(np.mean(np.square(amp), axis=-1))
167
+
168
+ start = np.random.randint(0, len(self.noise)-len(clean))
169
+ divided_noise = self.noise[start: start + len(clean)]
170
+ clean_rms = _cal_rms(clean)
171
+ noise_rms = _cal_rms(divided_noise)
172
+ adj_noise_rms = _cal_adjusted_rms(clean_rms, np.random.randint(snr[0], snr[1]))
173
+
174
+ adj_noise_amp = divided_noise * (adj_noise_rms / (noise_rms + 1e-7))
175
+ noisy = clean + adj_noise_amp
176
+
177
+ if np.max(noisy) > 1:
178
+ noisy = noisy / np.max(noisy)
179
+
180
+ return noisy
181
+
182
+ def __getitem__(self, idx):
183
+ # chunking
184
+ indices = self.indices[idx * self.batch_size : (idx + 1) * self.batch_size]
185
+
186
+ # load inputs
187
+ batch_x = [np.array(wavfile.read(self.wav_list[i])[1]).astype(np.float32) / 32768.0 for i in indices]
188
+ if self.features == 'both':
189
+ batch_p = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
190
+ batch_e = [np.array(self.emb_list[i]).astype(np.float32) for i in indices]
191
+ else:
192
+ if self.features == 'phoneme':
193
+ batch_y = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
194
+ elif self.features == 'g2p_embed':
195
+ batch_y = [np.array(self.emb_list[i]).astype(np.float32) for i in indices]
196
+ # load outputs
197
+ batch_z = [np.array([self.lab_list[i]]).astype(np.float32) for i in indices]
198
+ batch_l = [np.array(self.sIdx_list[i]).astype(np.int32) for i in indices]
199
+ batch_t = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
200
+ if self.edit_dist:
201
+ batch_d = [np.array([self.dist_list[i]]).astype(np.float32) for i in indices]
202
+
203
+ # padding and masking
204
+ pad_batch_x = pad_sequences(np.array(batch_x), maxlen=self.maxlen_a, value=0.0, padding='post', dtype=batch_x[0].dtype)
205
+ if self.features == 'both':
206
+ pad_batch_p = pad_sequences(np.array(batch_p), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_p[0].dtype)
207
+ pad_batch_e = pad_sequences(np.array(batch_e), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_e[0].dtype)
208
+ else:
209
+ pad_batch_y = pad_sequences(np.array(batch_y), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_y[0].dtype)
210
+ pad_batch_z = pad_sequences(np.array(batch_z), value=0.0, padding='post', dtype=batch_z[0].dtype)
211
+ pad_batch_l = pad_sequences(np.array(batch_l), maxlen=self.maxlen_l, value=0.0, padding='post', dtype=batch_l[0].dtype)
212
+ pad_batch_t = pad_sequences(np.array(batch_t), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_t[0].dtype)
213
+ if self.edit_dist:
214
+ pad_batch_d = pad_sequences(np.array(batch_d), value=0.0, padding='post', dtype=batch_d[0].dtype)
215
+
216
+ # Noisy option
217
+ if self.train:
218
+ batch_x_noisy = [self._mixing_snr(x) for x in batch_x]
219
+ pad_batch_x_noisy = pad_sequences(np.array(batch_x_noisy), maxlen=self.maxlen_a, value=0.0, padding='post', dtype=batch_x_noisy[0].dtype)
220
+
221
+ if self.train:
222
+ if self.features == 'both':
223
+ return pad_batch_x, pad_batch_x_noisy, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_l, pad_batch_t
224
+ else:
225
+ return pad_batch_x, pad_batch_x_noisy, pad_batch_y, pad_batch_z, pad_batch_l, pad_batch_t
226
+ else:
227
+ if self.features == 'both':
228
+ if self.edit_dist:
229
+ return pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_d
230
+ else:
231
+ return pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z
232
+ else:
233
+ if self.edit_dist:
234
+ return pad_batch_x, pad_batch_y, pad_batch_z, pad_batch_d
235
+ else:
236
+ return pad_batch_x, pad_batch_y, pad_batch_z
237
+
238
+ def on_epoch_end(self):
239
+ self.indices = np.arange(self.len)
240
+ if self.shuffle == True:
241
+ np.random.shuffle(self.indices)
242
+
243
+ def convert_sequence_to_dataset(dataloader):
244
+ def data_generator():
245
+ for i in range(dataloader.__len__()):
246
+ if dataloader.train:
247
+ if dataloader.features == 'both':
248
+ pad_batch_x, pad_batch_x_noisy, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_l, pad_batch_t = dataloader[i]
249
+ yield pad_batch_x, pad_batch_x_noisy, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_l, pad_batch_t
250
+ else:
251
+ pad_batch_x, pad_batch_x_noisy, pad_batch_y, pad_batch_z, pad_batch_l, pad_batch_t = dataloader[i]
252
+ yield pad_batch_x, pad_batch_x_noisy, pad_batch_y, pad_batch_z, pad_batch_l, pad_batch_t
253
+ else:
254
+ if dataloader.features == 'both':
255
+ if dataloader.edit_dist:
256
+ pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_d = dataloader[i]
257
+ yield pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_d
258
+ else:
259
+ pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z = dataloader[i]
260
+ yield pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z
261
+ else:
262
+ if dataloader.edit_dist:
263
+ pad_batch_x, pad_batch_y, pad_batch_z, pad_batch_d = dataloader[i]
264
+ yield pad_batch_x, pad_batch_y, pad_batch_z, pad_batch_d
265
+ else:
266
+ pad_batch_x, pad_batch_y, pad_batch_z = dataloader[i]
267
+ yield pad_batch_x, pad_batch_y, pad_batch_z
268
+
269
+ if dataloader.train:
270
+ if dataloader.features == 'both':
271
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
272
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
273
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
274
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),
275
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t, 256), dtype=tf.float32),
276
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
277
+ tf.TensorSpec(shape=(None, dataloader.maxlen_l), dtype=tf.int32),
278
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),)
279
+ )
280
+ else:
281
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
282
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
283
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
284
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t) if dataloader.features == 'phoneme' else (None, dataloader.maxlen_t, 256),
285
+ dtype=tf.int32 if dataloader.features == 'phoneme' else tf.float32),
286
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
287
+ tf.TensorSpec(shape=(None, dataloader.maxlen_l), dtype=tf.int32),
288
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),)
289
+ )
290
+ else:
291
+ if dataloader.features == 'both':
292
+ if dataloader.edit_dist:
293
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
294
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
295
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),
296
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t, 256), dtype=tf.float32),
297
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
298
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
299
+ )
300
+ else:
301
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
302
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
303
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),
304
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t, 256), dtype=tf.float32),
305
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
306
+ )
307
+ else:
308
+ if dataloader.edit_dist:
309
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
310
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
311
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t) if dataloader.features == 'phoneme' else (None, dataloader.maxlen_t, 256),
312
+ dtype=tf.int32 if dataloader.features == 'phoneme' else tf.float32),
313
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
314
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
315
+ )
316
+ else:
317
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
318
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
319
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t) if dataloader.features == 'phoneme' else (None, dataloader.maxlen_t, 256),
320
+ dtype=tf.int32 if dataloader.features == 'phoneme' else tf.float32),
321
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
322
+ )
323
+ # data_dataset = data_dataset.cache()
324
+ data_dataset = data_dataset.prefetch(1)
325
+
326
+ return data_dataset
327
+
328
+ if __name__ == '__main__':
329
+ GLOBAL_BATCH_SIZE = 2048
330
+ train_dataset = LibriPhraseDataloader(batch_size=GLOBAL_BATCH_SIZE, train=True, types='both', shuffle=True, pkl='/home/DB/LibriPhrase/data/train_both.pkl', features='g2p_embed')
331
+ test_dataset = LibriPhraseDataloader(batch_size=GLOBAL_BATCH_SIZE, train=False, edit_dist=True, types='both', shuffle=False, pkl='/home/DB/LibriPhrase/data/test_both.pkl', features='g2p_embed')
dataset/libriphrase_ctc1.py ADDED
@@ -0,0 +1,346 @@
1
+ import math, os, re, sys
2
+ from pathlib import Path
3
+ import numpy as np
4
+ import pandas as pd
5
+ import Levenshtein
6
+ from multiprocessing import Pool
7
+ from scipy.io import wavfile
8
+ import tensorflow as tf
9
+
10
+ from tensorflow.keras.utils import Sequence, OrderedEnqueuer
11
+ from tensorflow.keras import layers
12
+ from tensorflow.keras.preprocessing.sequence import pad_sequences
13
+
14
+ sys.path.append(os.path.dirname(__file__))
15
+ from g2p.g2p_en.g2p import G2p
16
+
17
+ import warnings
18
+ warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
19
+ np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
20
+
21
+ class LibriPhraseDataloader(Sequence):
22
+ def __init__(self,
23
+ batch_size,
24
+ fs = 16000,
25
+ wav_dir='/share/nas165/yiting/LibriPhrase/LibriPhrase_data',
26
+ noise_dir='/share/nas165/yiting/EEND/corpora/JHU/musan/musan/noise/sound-bible',
27
+ csv_dir='/share/nas165/yiting/LibriPhrase/data',
28
+ train_csv = ['train100h','train_360h'],
29
+ test_csv = ['train_500h',],
30
+ types='both', # easy, hard
31
+ features='g2p_embed', # phoneme, g2p_embed, both ...
32
+ train=True,
33
+ shuffle=True,
34
+ pkl=None,
35
+ edit_dist=False,
36
+ ):
37
+
38
+ phonemes = ["<pad>", ] + ['AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', 'AH0', 'AH1', 'AH2', 'AO0',
39
+ 'AO1', 'AO2', 'AW0', 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'B', 'CH',
40
+ 'D', 'DH', 'EH0', 'EH1', 'EH2', 'ER0', 'ER1', 'ER2', 'EY0', 'EY1',
41
+ 'EY2', 'F', 'G', 'HH', 'IH0', 'IH1', 'IH2', 'IY0', 'IY1', 'IY2',
42
+ 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW0', 'OW1', 'OW2', 'OY0',
43
+ 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH0', 'UH1',
44
+ 'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH',
45
+ ' ']
46
+
47
+ self.p2idx = {p: idx for idx, p in enumerate(phonemes)}
48
+ self.idx2p = {idx: p for idx, p in enumerate(phonemes)}
49
+
50
+ self.batch_size = batch_size
51
+ self.fs = fs
52
+ self.wav_dir = wav_dir
53
+ self.csv_dir = csv_dir
54
+ self.noise_dir = noise_dir
55
+ self.train_csv = train_csv
56
+ self.test_csv = test_csv
57
+ self.types = types
58
+ self.features = features
59
+ self.train = train
60
+ self.shuffle = shuffle
61
+ self.pkl = pkl
62
+ self.edit_dist = edit_dist
63
+ self.nPhoneme = len(phonemes)
64
+ self.g2p = G2p()
65
+
66
+ self.__prep__()
67
+ self.on_epoch_end()
68
+
69
+ def __prep__(self):
70
+ if self.train:
71
+ print(">> Preparing noise DB")
72
+ noise_list = [str(x) for x in Path(self.noise_dir).rglob('*.wav')]
73
+ self.noise = np.array([])
74
+ for noise in noise_list:
75
+ fs, data = wavfile.read(noise)
76
+ assert fs == self.fs, ">> Error: mismatched sampling frequency.\n{} -> {}".format(noise, fs)
77
+ data = data.astype(np.float32) / 32768.0
78
+ data = (data / np.max(data)) * 0.5
79
+ self.noise = np.append(self.noise, data)
80
+
81
+ self.data = pd.DataFrame(columns=['wav_label', 'wav', 'text', 'duration', 'label', 'type'])
82
+ # Local text-sanitizing helper (takes no `self`; note it is defined but not used below)
+ def process_text(x):
83
+ if isinstance(x, str):
84
+ # Only apply re.sub if x is a string
85
+ return re.sub(r"[^a-zA-Z0-9]+", ' ', x)
86
+ else:
87
+ # Non-string entries (e.g., NaN from pandas) are converted to strings
88
+ return str(x)
89
+ if (self.pkl is not None) and (os.path.isfile(self.pkl)):
90
+ print(">> Load dataset from {}".format(self.pkl))
91
+ self.data = pd.read_pickle(self.pkl)
92
+ else:
93
+ for db in self.train_csv if self.train else self.test_csv:
94
+ csv_list = [str(x) for x in Path(self.csv_dir).rglob('*' + db + '*word*')]
95
+ for n_word in csv_list:
96
+ print(">> processing : {} ".format(n_word))
97
+ df = pd.read_csv(n_word)
98
+ # Split the dataset into matched & unmatched cases
99
+ anc_pos = df[['anchor_text', 'anchor', 'anchor_text', 'anchor_dur']]
100
+ anc_neg = df[['anchor_text', 'anchor', 'comparison_text', 'anchor_dur', 'target', 'type']]
101
+ com_pos = df[['comparison_text', 'comparison', 'comparison_text', 'comparison_dur']]
102
+ com_neg = df[['comparison_text', 'comparison', 'anchor_text', 'comparison_dur', 'target', 'type']]
103
+ anc_pos.columns = ['wav_label', 'anchor', 'anchor_text', 'anchor_dur']
104
+ com_pos.columns = ['wav_label', 'comparison', 'comparison_text', 'comparison_dur']
105
+ anc_pos['label'] = 1
106
+ anc_pos['type'] = df['type']
107
+ com_pos['label'] = 1
108
+ com_pos['type'] = df['type']
109
+ # Concat
110
+ self.data = self.data.append(anc_pos.rename(columns={y: x for x, y in zip(self.data.columns, anc_pos.columns)}), ignore_index=True)
111
+ self.data = self.data.append(anc_neg.rename(columns={y: x for x, y in zip(self.data.columns, anc_neg.columns)}), ignore_index=True)
112
+ self.data = self.data.append(com_pos.rename(columns={y: x for x, y in zip(self.data.columns, com_pos.columns)}), ignore_index=True)
113
+ self.data = self.data.append(com_neg.rename(columns={y: x for x, y in zip(self.data.columns, com_neg.columns)}), ignore_index=True)
114
+
115
+ # Append wav directory path
116
+ self.data['wav'] = self.data['wav'].apply(lambda x: os.path.join(self.wav_dir, x))
117
+ # g2p & p2idx by g2p_en package
118
+ print(">> Convert word to phoneme")
119
+ self.data['phoneme'] = self.data['text'].apply(lambda x: self.g2p(re.sub(r"[^a-zA-Z0-9]+", ' ', x)))
120
+ print(">> Convert speech word to phoneme")
121
+ self.data['wav_phoneme'] = self.data['wav_label'].apply(lambda x: self.g2p(re.sub(r"[^a-zA-Z0-9]+", ' ', x)))
122
+ print(">> Convert phoneme to index")
123
+ self.data['pIndex'] = self.data['phoneme'].apply(lambda x: [self.p2idx[t] for t in x])
124
+ print(">> Convert speech phoneme to index")
125
+ self.data['wav_pIndex'] = self.data['wav_phoneme'].apply(lambda x: [self.p2idx[t] for t in x])
126
+ print(">> Compute phoneme embedding")
127
+ self.data['g2p_embed'] = self.data['text'].apply(lambda x: self.g2p.embedding(x))
128
+
129
+ print('wav_label', self.data['wav_label'])  # debug: inspect the label column before computing distances
130
+ print('text', self.data['text'])  # debug: inspect the query-text column
131
+
132
+ self.data['dist'] = self.data.apply(lambda x: Levenshtein.ratio(re.sub(r"[^a-zA-Z0-9]+", ' ', x['wav_label']), re.sub(r"[^a-zA-Z0-9]+", ' ', x['text'])), axis=1)
133
+
134
+ # Note: previously commented-out section
135
+ if (self.pkl is not None) and (not os.path.isfile(self.pkl)):
136
+ self.data.to_pickle(self.pkl)
137
+
138
+ # Masking dataset type
139
+ if self.types == 'both':
140
+ pass
141
+ elif self.types == 'easy':
142
+ self.data = self.data.loc[self.data['type'] == 'diffspk_easyneg']
143
+ elif self.types == 'hard':
144
+ self.data = self.data.loc[self.data['type'] == 'diffspk_hardneg']
145
+
146
+ # Get longest data
147
+ self.data = self.data.sort_values(by='duration').reset_index(drop=True)
148
+ self.wav_list = self.data['wav'].values
149
+ self.idx_list = self.data['pIndex'].values
150
+ self.sIdx_list = self.data['wav_pIndex'].values
151
+ self.idx_list = [np.insert(lst, 0, 0) for lst in self.idx_list]
152
+ self.sIdx_list = [np.insert(lst, 0, 0) for lst in self.sIdx_list]
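+ # Index 0 is the <pad> token; prepending it to every phoneme sequence presumably serves
+ # as a leading blank/pad symbol for the CTC-guided objective (assumption, not documented here).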
153
+ self.emb_list = self.data['g2p_embed'].values
154
+ self.lab_list = self.data['label'].values
155
+ if self.edit_dist:
156
+ self.dist_list = self.data['dist'].values
157
+
158
+ # Set dataloader params.
159
+ self.len = len(self.data)
160
+ self.maxlen_t = int((int(self.data['text'].apply(lambda x: len(x)).max() / 10) + 1) * 10)
161
+ self.maxlen_a = int((int(self.data['duration'].values[-1] / 0.5) + 1 ) * self.fs / 2)
162
+ self.maxlen_l = int((int(self.data['wav_label'].apply(lambda x: len(x)).max() / 10) + 1) * 10)
163
+
164
+ def __len__(self):
165
+ # return total batch-wise length
166
+ return math.ceil(self.len / self.batch_size)
167
+
168
+ def _load_wav(self, wav):
169
+ return np.array(wavfile.read(wav)[1]).astype(np.float32) / 32768.0
170
+
171
+ def _mixing_snr(self, clean, snr=[5, 15]):
172
+ def _cal_adjusted_rms(clean_rms, snr):
173
+ a = float(snr) / 20
174
+ noise_rms = clean_rms / (10**a)
175
+ return noise_rms
176
+
177
+ def _cal_rms(amp):
178
+ return np.sqrt(np.mean(np.square(amp), axis=-1))
179
+
180
+ start = np.random.randint(0, len(self.noise)-len(clean))
181
+ divided_noise = self.noise[start: start + len(clean)]
182
+ clean_rms = _cal_rms(clean)
183
+ noise_rms = _cal_rms(divided_noise)
184
+ adj_noise_rms = _cal_adjusted_rms(clean_rms, np.random.randint(snr[0], snr[1]))
185
+
186
+ adj_noise_amp = divided_noise * (adj_noise_rms / (noise_rms + 1e-7))
187
+ noisy = clean + adj_noise_amp
188
+
189
+ if np.max(noisy) > 1:
190
+ noisy = noisy / np.max(noisy)
191
+
192
+ return noisy
193
+
194
+ def __getitem__(self, idx):
195
+ # chunking
196
+ indices = self.indices[idx * self.batch_size : (idx + 1) * self.batch_size]
197
+
198
+ # load inputs
199
+ batch_x = [np.array(wavfile.read(self.wav_list[i])[1]).astype(np.float32) / 32768.0 for i in indices]
200
+ if self.features == 'both':
201
+ batch_p = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
202
+ batch_e = [np.array(self.emb_list[i]).astype(np.float32) for i in indices]
203
+ else:
204
+ if self.features == 'phoneme':
205
+ batch_y = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
206
+ elif self.features == 'g2p_embed':
207
+ batch_y = [np.array(self.emb_list[i]).astype(np.float32) for i in indices]
208
+ # load outputs
209
+ batch_z = [np.array([self.lab_list[i]]).astype(np.float32) for i in indices]
210
+ batch_l = [np.array(self.sIdx_list[i]).astype(np.int32) for i in indices]
211
+ batch_t = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
212
+ if self.edit_dist:
213
+ batch_d = [np.array([self.dist_list[i]]).astype(np.float32) for i in indices]
214
+
215
+ # padding and masking
216
+ pad_batch_x = pad_sequences(np.array(batch_x), maxlen=self.maxlen_a, value=0.0, padding='post', dtype=batch_x[0].dtype)
217
+ if self.features == 'both':
218
+ pad_batch_p = pad_sequences(np.array(batch_p), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_p[0].dtype)
219
+ pad_batch_e = pad_sequences(np.array(batch_e), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_e[0].dtype)
220
+ else:
221
+ pad_batch_y = pad_sequences(np.array(batch_y), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_y[0].dtype)
222
+ pad_batch_z = pad_sequences(np.array(batch_z), value=0.0, padding='post', dtype=batch_z[0].dtype)
223
+ pad_batch_l = pad_sequences(np.array(batch_l), maxlen=self.maxlen_l, value=0.0, padding='post', dtype=batch_l[0].dtype)
224
+ pad_batch_t = pad_sequences(np.array(batch_t), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_t[0].dtype)
225
+
226
+ if self.edit_dist:
227
+ pad_batch_d = pad_sequences(np.array(batch_d), value=0.0, padding='post', dtype=batch_d[0].dtype)
228
+
229
+ # Noisy option
230
+ if self.train:
231
+ batch_x_noisy = [self._mixing_snr(x) for x in batch_x]
232
+ pad_batch_x_noisy = pad_sequences(np.array(batch_x_noisy), maxlen=self.maxlen_a, value=0.0, padding='post', dtype=batch_x_noisy[0].dtype)
233
+
234
+ if self.train:
235
+ if self.features == 'both':
236
+ return pad_batch_x, pad_batch_x_noisy, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_l, pad_batch_t
237
+ else:
238
+ return pad_batch_x, pad_batch_x_noisy, pad_batch_y, pad_batch_z, pad_batch_l, pad_batch_t
239
+ else:
240
+ if self.features == 'both':
241
+ if self.edit_dist:
242
+ return pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_d
243
+ else:
244
+ return pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z
245
+ else:
246
+ if self.edit_dist:
247
+ return pad_batch_x, pad_batch_y, pad_batch_z, pad_batch_d
248
+ else:
249
+ return pad_batch_x, pad_batch_y, pad_batch_z
250
+
251
+ def on_epoch_end(self):
252
+ self.indices = np.arange(self.len)
253
+ if self.shuffle == True:
254
+ np.random.shuffle(self.indices)
255
+
256
+ def convert_sequence_to_dataset(dataloader):
257
+ def data_generator():
258
+ for i in range(dataloader.__len__()):
259
+ if dataloader.train:
260
+ if dataloader.features == 'both':
261
+ pad_batch_x, pad_batch_x_noisy, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_l, pad_batch_t = dataloader[i]
262
+ yield pad_batch_x, pad_batch_x_noisy, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_l, pad_batch_t
263
+ else:
264
+ pad_batch_x, pad_batch_x_noisy, pad_batch_y, pad_batch_z, pad_batch_l, pad_batch_t = dataloader[i]
265
+ yield pad_batch_x, pad_batch_x_noisy, pad_batch_y, pad_batch_z, pad_batch_l, pad_batch_t
266
+ else:
267
+ if dataloader.features == 'both':
268
+ if dataloader.edit_dist:
269
+ pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_d = dataloader[i]
270
+ yield pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z, pad_batch_d
271
+ else:
272
+ pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z = dataloader[i]
273
+ yield pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z
274
+ else:
275
+ if dataloader.edit_dist:
276
+ pad_batch_x, pad_batch_y, pad_batch_z, pad_batch_d = dataloader[i]
277
+ yield pad_batch_x, pad_batch_y, pad_batch_z, pad_batch_d
278
+ else:
279
+ pad_batch_x, pad_batch_y, pad_batch_z = dataloader[i]
280
+ yield pad_batch_x, pad_batch_y, pad_batch_z
281
+
282
+ if dataloader.train:
283
+ if dataloader.features == 'both':
284
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
285
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
286
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
287
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),
288
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t, 256), dtype=tf.float32),
289
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
290
+ tf.TensorSpec(shape=(None, dataloader.maxlen_l), dtype=tf.int32),
291
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),)
292
+ )
293
+ else:
294
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
295
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
296
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
297
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t) if dataloader.features == 'phoneme' else (None, dataloader.maxlen_t, 256),
298
+ dtype=tf.int32 if dataloader.features == 'phoneme' else tf.float32),
299
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
300
+ tf.TensorSpec(shape=(None, dataloader.maxlen_l), dtype=tf.int32),
301
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),)
302
+ )
303
+ else:
304
+ if dataloader.features == 'both':
305
+ if dataloader.edit_dist:
306
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
307
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
308
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),
309
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t, 256), dtype=tf.float32),
310
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
311
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
312
+ )
313
+ else:
314
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
315
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
316
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),
317
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t, 256), dtype=tf.float32),
318
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
319
+ )
320
+ else:
321
+ if dataloader.edit_dist:
322
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
323
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
324
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t) if dataloader.features == 'phoneme' else (None, dataloader.maxlen_t, 256),
325
+ dtype=tf.int32 if dataloader.features == 'phoneme' else tf.float32),
326
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),
327
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
328
+ )
329
+ else:
330
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
331
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
332
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t) if dataloader.features == 'phoneme' else (None, dataloader.maxlen_t, 256),
333
+ dtype=tf.int32 if dataloader.features == 'phoneme' else tf.float32),
334
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
335
+ )
336
+ # data_dataset = data_dataset.cache()
337
+ data_dataset = data_dataset.prefetch(1)
338
+
339
+ return data_dataset
340
+
341
+ if __name__ == '__main__':
342
+ GLOBAL_BATCH_SIZE = 2048
343
+ # Redundant first pass without `pkl` (it would rebuild everything from the CSVs and be
+ # immediately overwritten by the pkl-backed loaders below), kept here commented out:
+ # train_dataset = LibriPhraseDataloader(batch_size=GLOBAL_BATCH_SIZE, train=True, types='both', shuffle=True, features='g2p_embed')
344
+ # test_dataset = LibriPhraseDataloader(batch_size=GLOBAL_BATCH_SIZE, train=False, edit_dist=True, types='both', shuffle=False, features='g2p_embed')
345
+ train_dataset = LibriPhraseDataloader(batch_size=GLOBAL_BATCH_SIZE, train=True, types='both', shuffle=True, pkl='/share/nas165/yiting/PhonMatchNet/data/train_both.pkl', features='g2p_embed')
346
+ test_dataset = LibriPhraseDataloader(batch_size=GLOBAL_BATCH_SIZE, train=False, edit_dist=True, types='both', shuffle=False, pkl='/share/nas165/yiting/PhonMatchNet/data/test_both.pkl', features='g2p_embed')
dataset/qualcomm.py ADDED
@@ -0,0 +1,180 @@
1
+ import math, os, re, sys
2
+ from pathlib import Path
3
+ import numpy as np
4
+ import pandas as pd
5
+ from multiprocessing import Pool
6
+ from scipy.io import wavfile
7
+ import tensorflow as tf
8
+
9
+ from tensorflow.keras.utils import Sequence, OrderedEnqueuer
10
+ from tensorflow.keras import layers
11
+ from tensorflow.keras.preprocessing.sequence import pad_sequences
12
+
13
+ sys.path.append(os.path.dirname(__file__))
14
+ from g2p.g2p_en.g2p import G2p
15
+
16
+ import warnings
17
+ warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
18
+ np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
19
+
20
+ class QualcommKeywordSpeechDataloader(Sequence):
21
+ def __init__(self,
22
+ batch_size,
23
+ fs = 16000,
24
+ wav_dir='/home/DB/qualcomm_keyword_speech_dataset',
25
+ target_list=['hey_android', 'hey_snapdragon', 'hi_galaxy', 'hi_lumina'],
26
+ features='g2p_embed', # phoneme, g2p_embed, both ...
27
+ shuffle=True,
28
+ pkl=None,
29
+ ):
30
+
31
+ phonemes = ["<pad>", ] + ['AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', 'AH0', 'AH1', 'AH2', 'AO0',
32
+ 'AO1', 'AO2', 'AW0', 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'B', 'CH',
33
+ 'D', 'DH', 'EH0', 'EH1', 'EH2', 'ER0', 'ER1', 'ER2', 'EY0', 'EY1',
34
+ 'EY2', 'F', 'G', 'HH', 'IH0', 'IH1', 'IH2', 'IY0', 'IY1', 'IY2',
35
+ 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW0', 'OW1', 'OW2', 'OY0',
36
+ 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH0', 'UH1',
37
+ 'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH',
38
+ ' ']
39
+
40
+ self.p2idx = {p: idx for idx, p in enumerate(phonemes)}
41
+ self.idx2p = {idx: p for idx, p in enumerate(phonemes)}
42
+
43
+ self.batch_size = batch_size
44
+ self.fs = fs
45
+ self.wav_dir = wav_dir
46
+ self.target_list = target_list
47
+ self.features = features
48
+ self.shuffle = shuffle
49
+ self.pkl = pkl
50
+ self.nPhoneme = len(phonemes)
51
+ self.g2p = G2p()
52
+
53
+ self.__prep__()
54
+ self.on_epoch_end()
55
+
56
+ def __prep__(self):
57
+ self.data = pd.DataFrame(columns=['wav', 'text', 'duration', 'label'])
58
+
59
+ if (self.pkl is not None) and (os.path.isfile(self.pkl)):
60
+ print(">> Load dataset from {}".format(self.pkl))
61
+ self.data = pd.read_pickle(self.pkl)
62
+ else:
63
+ print(">> Make dataset from {}".format(self.wav_dir))
64
+ target_dict = {}
65
+ idx = 0
66
+ for target in self.target_list:
67
+ print(">> Extract from {}".format(target))
68
+ wav_list = [str(x) for x in Path(os.path.join(self.wav_dir, target)).rglob('*.wav')]
69
+ for wav in wav_list:
70
+ anchor_text = wav.split('/')[-3].lower().replace('_', ' ')
71
+ duration = float(wavfile.read(wav)[1].shape[-1])/self.fs
72
+ for comparison_text in self.target_list:
73
+ comparison_text = comparison_text.replace('_', ' ')
74
+ label = 1 if anchor_text == comparison_text else 0
75
+ target_dict[idx] = {
76
+ 'wav': wav,
77
+ 'text': comparison_text,
78
+ 'duration': duration,
79
+ 'label': label
80
+ }
81
+ idx += 1
82
+ self.data = self.data.append(pd.DataFrame.from_dict(target_dict, 'index'), ignore_index=True)
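+ # Every wav is paired with every phrase in target_list, giving one positive pair
+ # (its own transcript) and len(target_list) - 1 negatives per utterance.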
83
+
84
+ # g2p & p2idx by g2p_en package
85
+ print(">> Convert word to phoneme")
86
+ self.data['phoneme'] = self.data['text'].apply(lambda x: self.g2p(re.sub(r"[^a-zA-Z0-9]+", ' ', x)))
87
+ print(">> Convert phoneme to index")
88
+ self.data['pIndex'] = self.data['phoneme'].apply(lambda x: [self.p2idx[t] for t in x])
89
+ print(">> Compute phoneme embedding")
90
+ self.data['g2p_embed'] = self.data['text'].apply(lambda x: self.g2p.embedding(x))
91
+
92
+ if (self.pkl is not None) and (not os.path.isfile(self.pkl)):
93
+ self.data.to_pickle(self.pkl)
94
+
95
+ # Get longest data
96
+ self.data = self.data.sort_values(by='duration').reset_index(drop=True)
97
+ self.wav_list = self.data['wav'].values
98
+ self.idx_list = self.data['pIndex'].values
99
+ self.emb_list = self.data['g2p_embed'].values
100
+ self.lab_list = self.data['label'].values
101
+
102
+ # Set dataloader params.
103
+ self.len = len(self.data)
104
+ self.maxlen_t = int((int(self.data['text'].apply(lambda x: len(x)).max() / 10) + 1) * 10)
105
+ self.maxlen_a = int((int(self.data['duration'].values[-1] / 0.5) + 1 ) * self.fs / 2)
106
+
107
+ def __len__(self):
108
+ # return total batch-wise length
109
+ return math.ceil(self.len / self.batch_size)
110
+
111
+ def _load_wav(self, wav):
112
+ return np.array(wavfile.read(wav)[1]).astype(np.float32) / 32768.0
113
+
114
+ def __getitem__(self, idx):
115
+ # chunking
116
+ indices = self.indices[idx * self.batch_size : (idx + 1) * self.batch_size]
117
+
118
+ # load inputs
119
+ batch_x = [np.array(wavfile.read(self.wav_list[i])[1]).astype(np.float32) / 32768.0 for i in indices]
120
+ if self.features == 'both':
121
+ batch_p = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
122
+ batch_e = [np.array(self.emb_list[i]).astype(np.float32) for i in indices]
123
+ else:
124
+ if self.features == 'phoneme':
125
+ batch_y = [np.array(self.idx_list[i]).astype(np.int32) for i in indices]
126
+ elif self.features == 'g2p_embed':
127
+ batch_y = [np.array(self.emb_list[i]).astype(np.float32) for i in indices]
128
+ # load outputs
129
+ batch_z = [np.array([self.lab_list[i]]).astype(np.float32) for i in indices]
130
+
131
+ # padding and masking
132
+ pad_batch_x = pad_sequences(np.array(batch_x), maxlen=self.maxlen_a, value=0.0, padding='post', dtype=batch_x[0].dtype)
133
+ if self.features == 'both':
134
+ pad_batch_p = pad_sequences(np.array(batch_p), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_p[0].dtype)
135
+ pad_batch_e = pad_sequences(np.array(batch_e), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_e[0].dtype)
136
+ else:
137
+ pad_batch_y = pad_sequences(np.array(batch_y), maxlen=self.maxlen_t, value=0.0, padding='post', dtype=batch_y[0].dtype)
138
+ pad_batch_z = pad_sequences(np.array(batch_z), value=0.0, padding='post', dtype=batch_z[0].dtype)
139
+
140
+ if self.features == 'both':
141
+ return pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z
142
+ else:
143
+ return pad_batch_x, pad_batch_y, pad_batch_z
144
+
145
+ def on_epoch_end(self):
146
+ self.indices = np.arange(self.len)
147
+ if self.shuffle == True:
148
+ np.random.shuffle(self.indices)
149
+
150
+ def convert_sequence_to_dataset(dataloader):
151
+ def data_generator():
152
+ for i in range(dataloader.__len__()):
153
+ if dataloader.features == 'both':
154
+ pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z = dataloader[i]
155
+ yield pad_batch_x, pad_batch_p, pad_batch_e, pad_batch_z
156
+ else:
157
+ pad_batch_x, pad_batch_y, pad_batch_z = dataloader[i]
158
+ yield pad_batch_x, pad_batch_y, pad_batch_z
159
+
160
+ if dataloader.features == 'both':
161
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
162
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
163
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t), dtype=tf.int32),
164
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t, 256), dtype=tf.float32),
165
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
166
+ )
167
+ else:
168
+ data_dataset = tf.data.Dataset.from_generator(data_generator, output_signature=(
169
+ tf.TensorSpec(shape=(None, dataloader.maxlen_a), dtype=tf.float32),
170
+ tf.TensorSpec(shape=(None, dataloader.maxlen_t) if dataloader.features == 'phoneme' else (None, dataloader.maxlen_t, 256),
171
+ dtype=tf.int32 if dataloader.features == 'phoneme' else tf.float32),
172
+ tf.TensorSpec(shape=(None, 1), dtype=tf.float32),)
173
+ )
174
+ # data_dataset = data_dataset.cache()
175
+ data_dataset = data_dataset.prefetch(1)
176
+
177
+ return data_dataset
178
+
179
+ if __name__ == '__main__':
180
+ dataloader = QualcommKeywordSpeechDataloader(2048, pkl='/home/DB/qualcomm_keyword_speech_dataset/qualcomm.pkl', features='g2p_embed')
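+ # Hedged usage sketch (names from this file; not part of the original script):
+ # dataset = convert_sequence_to_dataset(dataloader)
+ # for pad_batch_x, pad_batch_y, pad_batch_z in dataset:
+ #     pass  # pad_batch_x: padded audio, pad_batch_y: g2p embeddings, pad_batch_z: labels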
demo.py ADDED
@@ -0,0 +1,168 @@
1
+ import os, warnings, argparse
2
+ import tensorflow as tf
3
+ import numpy as np
4
+ from model import ukws
5
+ from dataset import dataloader_demo
6
+ import gradio as gr
7
+ # import librosa
8
+
9
+ os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
10
+ tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
11
+ warnings.filterwarnings('ignore')
12
+ warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
13
+ np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
14
+ warnings.simplefilter("ignore")
15
+
16
+ seed = 42
17
+ tf.random.set_seed(seed)
18
+ np.random.seed(seed)
19
+
20
+
21
+ parser = argparse.ArgumentParser()
22
+
23
+ parser.add_argument('--text_input', required=False, type=str, default='g2p_embed')
24
+ parser.add_argument('--audio_input', required=False, type=str, default='both')
25
+ parser.add_argument('--load_checkpoint_path', required=True, type=str)
26
+ parser.add_argument('--keyword_list_length', required=True, type=int)
27
+ parser.add_argument('--stack_extractor', action='store_true')
28
+ parser.add_argument('--comment', required=False, type=str)
29
+ args = parser.parse_args()
30
+
31
+ gpus = tf.config.experimental.list_physical_devices('GPU')
32
+ if gpus:
33
+ try:
34
+ for gpu in gpus:
35
+ tf.config.experimental.set_memory_growth(gpu, True)
36
+ except RuntimeError as e:
37
+ print(e)
38
+
39
+ strategy = tf.distribute.MirroredStrategy()
40
+ batch_size = args.keyword_list_length
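+ # One batch row per candidate keyword, so a single forward pass scores the whole keyword list against one utterance.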
41
+ # Global batch size across all replicas (keyword-list length x number of replicas)
42
+ GLOBAL_BATCH_SIZE = batch_size * strategy.num_replicas_in_sync
43
+ # BATCH_SIZE_PER_REPLICA = GLOBAL_BATCH_SIZE / strategy.num_replicas_in_sync
44
+
45
+ # Make Dataloader
46
+ text_input = args.text_input
47
+ audio_input = args.audio_input
48
+ load_checkpoint_path = args.load_checkpoint_path
49
+
50
+ phonemes = ["<pad>", ] + ['AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', 'AH0', 'AH1', 'AH2', 'AO0',
51
+ 'AO1', 'AO2', 'AW0', 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'B', 'CH',
52
+ 'D', 'DH', 'EH0', 'EH1', 'EH2', 'ER0', 'ER1', 'ER2', 'EY0', 'EY1',
53
+ 'EY2', 'F', 'G', 'HH', 'IH0', 'IH1', 'IH2', 'IY0', 'IY1', 'IY2',
54
+ 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW0', 'OW1', 'OW2', 'OY0',
55
+ 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH0', 'UH1',
56
+ 'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH',
57
+ ' ']
58
+ # Number of phonemes
59
+ vocab = len(phonemes)
60
+
61
+ # Model params.
62
+ kwargs = {
63
+ 'vocab' : vocab,
64
+ 'text_input' : text_input,
65
+ 'audio_input' : audio_input,
66
+ 'frame_length' : 400,
67
+ 'hop_length' : 160,
68
+ 'num_mel' : 40,
69
+ 'sample_rate' : 16000,
70
+ 'log_mel' : False,
71
+ 'stack_extractor' : args.stack_extractor,
72
+ }
73
+
74
+
75
+
76
+ # Make tensorboard dict.
77
+ global keyword
78
+ param = kwargs
79
+ param['comment'] = args.comment
80
+
81
+
82
+ with strategy.scope():
83
+
84
+
85
+ model = ukws.BaseUKWS(**kwargs)
86
+ if args.load_checkpoint_path:
87
+ checkpoint_dir=args.load_checkpoint_path
88
+ checkpoint = tf.train.Checkpoint(model=model)
89
+ checkpoint_manager = tf.train.CheckpointManager(checkpoint, checkpoint_dir, max_to_keep=5)
90
+ latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
91
+ if latest_checkpoint:
92
+ checkpoint.restore(latest_checkpoint)
93
+ print("Checkpoint restored!")
94
+ else:
95
+ print("No checkpoint found.")
96
+
97
+ def inference(audio, keyword):
98
+
99
+ if isinstance(keyword, str):
100
+ keyword = [kw.strip() for kw in keyword.split(',')]
101
+
102
+ test_google_dataset = dataloader_demo.GoogleCommandsDataloader(batch_size=GLOBAL_BATCH_SIZE, features=text_input, wav_path_or_object=audio, keyword=keyword)
103
+
104
+ test_google_dataset = dataloader_demo.convert_sequence_to_dataset(test_google_dataset)
105
+
106
+ test_google_dist_dataset = strategy.experimental_distribute_dataset(test_google_dataset)
107
+
108
+
109
+ # @tf.function
110
+ def test_step_metric_only(inputs, keyword_list):
111
+ clean_speech = inputs[0]
112
+ text = inputs[1]
113
+ labels = inputs[2]
114
+ prob, affinity_matrix = model(clean_speech, text, training=False)[:2]
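+ # Scores are rounded to 3 decimals; the top-scoring keyword is accepted only if its
+ # confidence clears the fixed 0.8 threshold below, otherwise 'no keyword' is returned.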
115
+ prob = tf.round(prob * 1000) / 1000
116
+ prob = prob.numpy().flatten()
117
+ max_indices = np.argmax(prob, axis=0)
118
+ if prob[max_indices] >= 0.8:
119
+ keyword = keyword_list[max_indices]
120
+ else:
121
+ keyword = 'no keyword'
122
+
123
+ print('keyword:', keyword_list)
124
+ print('prob:', prob)
125
+ msg = ''
126
+ for k, p in zip(keyword_list, prob):
127
+ msg += '{} | {:.2f} \n'.format(k, p)
128
+
129
+ return keyword, msg
130
+
131
+ for x in test_google_dist_dataset:
132
+ keyword, prob = test_step_metric_only(x, keyword)
133
+
134
+
135
+ return keyword, prob
136
+
137
+ # keyword = ['realtek go','ok google','vintage','hackney','crocodile','surroundings','oversaw','northwestern']
138
+ # audio = '/share/nas165/yiting/recording/ok_google/Default_20240725-183000.wav'
139
+ # inference(audio,keyword)
140
+
141
+ demo = gr.Interface(
142
+ fn=inference,
143
+ inputs=[gr.Audio(source="upload", label="Sound"),
144
+ gr.Textbox(placeholder="Keyword List Here...", label="keyword_list")],
145
+ examples=[
146
+ ["./recording/ok_google/ok_google-183000.wav", 'realtek go,ok google,vintage,hackney,crocodile,surroundings,oversaw,northwestern'],
147
+ ["./recording/ok_google/ok_google-183005.wav", 'realtek go,ok google,vintage,hackney,crocodile,surroundings,oversaw,northwestern'],
148
+ ["./recording/ok_google/ok_google-183008.wav", 'realtek go,ok google,vintage,hackney,crocodile,surroundings,oversaw,northwestern'],
149
+ ["./recording/ok_google/ok_google-183011.wav", 'realtek go,ok google,vintage,hackney,crocodile,surroundings,oversaw,northwestern'],
150
+ ["./recording/ok_google/ok_google-183015.wav", 'realtek go,ok google,vintage,hackney,crocodile,surroundings,oversaw,northwestern'],
151
+ ["./recording/realtek_go/realtek_go-183029.wav", 'realtek go,ok google,vintage,hackney,crocodile,surroundings,oversaw,northwestern'],
152
+ ["./recording/realtek_go/realtek_go-183033.wav", 'realtek go,ok google,vintage,hackney,crocodile,surroundings,oversaw,northwestern'],
153
+ ["./recording/realtek_go/realtek_go-183036.wav", 'realtek go,ok google,vintage,hackney,crocodile,surroundings,oversaw,northwestern'],
154
+ ["./recording/realtek_go/realtek_go-183039.wav", 'realtek go,ok google,vintage,hackney,crocodile,surroundings,oversaw,northwestern'],
155
+ ["./recording/realtek_go/realtek_go-183043.wav", 'realtek go,ok google,vintage,hackney,crocodile,surroundings,oversaw,northwestern'],
156
+ ],
157
+ outputs=[gr.Textbox(label="keyword"), gr.Textbox(label="Confidence Score of keyword")],
158
+ )
159
+
160
+ demo.launch(server_name='0.0.0.0', server_port=7860, share=True)
161
+
162
+
163
+
164
+
165
+
166
+
167
+
168
+
docker/Dockerfile ADDED
@@ -0,0 +1,25 @@
1
+ FROM tensorflow/tensorflow:2.4.1-gpu
2
+
3
+ # Install dependency
4
+ RUN apt-key adv --keyserver keyserver.ubuntu.com --recv A4B469963BF863CC
5
+ RUN apt-get update -y && apt-get install -y \
6
+ git \
7
+ libsndfile1
8
+
9
+ # Install python packages
10
+ RUN python -m pip install --upgrade pip && pip install \
11
+ levenshtein \
12
+ six \
13
+ audioread \
14
+ librosa \
15
+ PySoundFile \
16
+ scipy \
17
+ tqdm \
18
+ pandas \
19
+ nltk \
20
+ inflect
21
+
22
+ RUN python -m pip uninstall -y numpy
23
+ RUN python -m pip install numpy==1.18.5
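+ # numpy is pinned to 1.18.5 (assumption: to stay compatible with the TF 2.4.1 base image
+ # and the np.warnings / np.VisibleDeprecationWarning usage in the dataloaders)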
24
+
25
+ WORKDIR /home
flagged/Sound/c129aef35ba4cb66620f813cd7268c4be510a66d/ok_google-183000.wav ADDED
Binary file (96.3 kB). View file
 
flagged/Sound/d35a5cf80a9403828bc601a0a761a5f88da06f00/realtek_go-183033.wav ADDED
Binary file (101 kB). View file
 
flagged/log.csv ADDED
@@ -0,0 +1,8 @@
1
+ Sound,keyword_list,keyword,Confidence Score of keyword,flag,username,timestamp
2
+ /share/nas165/yiting/CL-KWS_202408_v1/flagged/Sound/c129aef35ba4cb66620f813cd7268c4be510a66d/ok_google-183000.wav,"realtek go,ok google,vintage,hackney,crocodile,surroundings,oversaw,northwestern",,,,,2024-09-11 09:54:49.824521
3
+ /share/nas165/yiting/CL-KWS_202408_v1/flagged/Sound/d35a5cf80a9403828bc601a0a761a5f88da06f00/realtek_go-183033.wav,"realtek go,ok google,vintage,hackney,crocodile,surroundings,oversaw,northwestern",ok google,"ok cortana | 0.11
4
+ ok google | 0.97
5
+ hey google | 0.46
6
+ oh come google | 0.87
7
+ ok gogo | 0.91
8
+ ",,,2024-09-11 10:23:11.972172
inference.py ADDED
@@ -0,0 +1,141 @@
1
+ import sys, os, datetime, warnings, argparse
2
+ import tensorflow as tf
3
+ import numpy as np
4
+
5
+ from model import ukws
6
+ from dataset import google_infe202405
7
+
8
+
9
+ os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
10
+ tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
11
+ warnings.filterwarnings('ignore')
12
+ warnings.filterwarnings("ignore", category=np.VisibleDeprecationWarning)
13
+ np.warnings.filterwarnings('ignore', category=np.VisibleDeprecationWarning)
14
+ warnings.simplefilter("ignore")
15
+
16
+ seed = 42
17
+ tf.random.set_seed(seed)
18
+ np.random.seed(seed)
19
+
20
+
21
+ parser = argparse.ArgumentParser()
22
+
23
+ parser.add_argument('--text_input', required=False, type=str, default='g2p_embed')
24
+ parser.add_argument('--audio_input', required=False, type=str, default='both')
25
+ parser.add_argument('--load_checkpoint_path', required=True, type=str)
26
+
27
+ parser.add_argument('--google_pkl', required=False, type=str, default='/home/DB/data/google_test_all.pkl')
28
+ parser.add_argument('--stack_extractor', action='store_true')
29
+ args = parser.parse_args()
30
+
31
+ gpus = tf.config.experimental.list_physical_devices('GPU')
32
+ if gpus:
33
+ try:
34
+ for gpu in gpus:
35
+ tf.config.experimental.set_memory_growth(gpu, True)
36
+ except RuntimeError as e:
37
+ print(e)
38
+
39
+ strategy = tf.distribute.MirroredStrategy()
40
+
41
+ # Global batch size across all replicas (1000 per replica x number of replicas)
42
+ GLOBAL_BATCH_SIZE = 1000 * strategy.num_replicas_in_sync
43
+ BATCH_SIZE_PER_REPLICA = GLOBAL_BATCH_SIZE / strategy.num_replicas_in_sync
44
+
45
+ # Make Dataloader
46
+ text_input = args.text_input
47
+ audio_input = args.audio_input
48
+ load_checkpoint_path = args.load_checkpoint_path
49
+
50
+
51
+ test_google_dataset = google_infe202405.GoogleCommandsDataloader(batch_size=GLOBAL_BATCH_SIZE, features=text_input, shuffle=False, pkl=args.google_pkl)
52
+
53
+ test_google_dataset = google_infe202405.convert_sequence_to_dataset(test_google_dataset)
54
+
55
+ test_google_dist_dataset = strategy.experimental_distribute_dataset(test_google_dataset)
56
+
57
+ phonemes = ["<pad>", ] + ['AA0', 'AA1', 'AA2', 'AE0', 'AE1', 'AE2', 'AH0', 'AH1', 'AH2', 'AO0',
58
+ 'AO1', 'AO2', 'AW0', 'AW1', 'AW2', 'AY0', 'AY1', 'AY2', 'B', 'CH',
59
+ 'D', 'DH', 'EH0', 'EH1', 'EH2', 'ER0', 'ER1', 'ER2', 'EY0', 'EY1',
60
+ 'EY2', 'F', 'G', 'HH', 'IH0', 'IH1', 'IH2', 'IY0', 'IY1', 'IY2',
61
+ 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW0', 'OW1', 'OW2', 'OY0',
62
+ 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH0', 'UH1',
63
+ 'UH2', 'UW', 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH',
64
+ ' ']
65
+ # Number of phonemes
66
+ vocab = len(phonemes)
67
+
68
+ # Model params.
69
+ kwargs = {
70
+ 'vocab' : vocab,
71
+ 'text_input' : text_input,
72
+ 'audio_input' : audio_input,
73
+ 'frame_length' : 400,
74
+ 'hop_length' : 160,
75
+ 'num_mel' : 40,
76
+ 'sample_rate' : 16000,
77
+ 'log_mel' : False,
78
+ 'stack_extractor' : args.stack_extractor,
79
+ }
80
+
81
+
82
+ # Make tensorboard dict.
83
+ param = kwargs
84
+
85
+
86
+ with strategy.scope():
87
+
88
+
89
+ model = ukws.BaseUKWS(**kwargs)
90
+
91
+
92
+ if args.load_checkpoint_path:
93
+ checkpoint_dir=args.load_checkpoint_path
94
+ checkpoint = tf.train.Checkpoint(model=model)
95
+ checkpoint_manager = tf.train.CheckpointManager(checkpoint, checkpoint_dir, max_to_keep=5)
96
+ latest_checkpoint = tf.train.latest_checkpoint(checkpoint_dir)
97
+ if latest_checkpoint:
98
+ checkpoint.restore(latest_checkpoint)
99
+ print("Checkpoint restored!")
100
+
101
+
102
+
103
+ # @tf.function
104
+ def test_step_metric_only(inputs):
105
+
106
+ clean_speech = inputs[0]
107
+ text = inputs[1]
108
+ labels = inputs[2]
109
+
110
+ prob = model(clean_speech, text, training=False)[0]
111
+
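+ # The test pickle groups each utterance with 20 candidate keywords (one positive per group,
+ # as implied by the argmax over labels), so scores are reshaped to (num_utterances, 20).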
112
+ dim1 = labels.shape[0] // 20
114
+ prob = tf.reshape(prob, [dim1, 20])
115
+ labels = tf.reshape(labels, [dim1, 20])
115
+ predictions = tf.math.argmax(prob, axis=1)
116
+ actuals = tf.math.argmax(labels, axis=1)
117
+
118
+ true_count = tf.reduce_sum(tf.cast(tf.math.equal(predictions, actuals), tf.float32)).numpy()
119
+ num_testdata = dim1
120
+ return true_count, num_testdata
121
+
122
+
123
+ def distributed_test_step_metric_only(dataset_inputs):
124
+ true_count, num_testdata = strategy.run(test_step_metric_only, args=(dataset_inputs,))
125
+ return true_count, num_testdata
126
+
127
+
128
+ total_true_count = 0
129
+ total_num_testdata = 0
130
+ for x in test_google_dist_dataset:
131
+ true_count, num_testdata = distributed_test_step_metric_only(x)
132
+ total_true_count += true_count
133
+ total_num_testdata += num_testdata
134
+ accuracy = total_true_count / total_num_testdata * 100.0
135
+ print("準確率:", accuracy, "%")
136
+
137
+
138
+
139
+
140
+
141
+
model/__pycache__/discriminator.cpython-37.pyc ADDED
Binary file (2.35 kB). View file
 
model/__pycache__/encoder.cpython-37.pyc ADDED
Binary file (5.6 kB). View file
 
model/__pycache__/extractor.cpython-37.pyc ADDED
Binary file (3.82 kB). View file
 
model/__pycache__/log_melspectrogram.cpython-37.pyc ADDED
Binary file (2.17 kB). View file
 
model/__pycache__/speech_embedding.cpython-37.pyc ADDED
Binary file (1.75 kB). View file