DJQmUKV committed
Commit 8b96836 · 1 Parent(s): a68da1e

init: real initial commit

the "mian" branch... good.

LICENSE ADDED
@@ -0,0 +1,51 @@
+ MIT License
+
+ Copyright (c) 2023 liujing04
+ Copyright (c) 2023 源文雨
+ Copyright (c) 2023 on9.moe Webslaves
+
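+ This software and its related code are open-sourced under the MIT License. The authors have no control over the software; anyone who uses the software, or distributes audio exported by it, bears full responsibility.
+ If you do not accept this clause, you may not use or reference any code or files in the software package.
+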
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+ 特此授予任何获得本软件和相关文档文件(以下简称“软件”)副本的人免费使用、复制、修改、合并、出版、分发、再授权和/或销售本软件的权利,以及授予本软件所提供的人使用本软件的权利,但须符合以下条件:
+ 上述版权声明和本许可声明应包含在软件的所有副本或实质部分中。
+ 软件是“按原样”提供的,没有任何明示或暗示的保证,包括但不限于适销性、适用于特定目的和不侵权的保证。在任何情况下,作者或版权持有人均不承担因软件或软件的使用或其他交易而产生、产生或与之相关的任何索赔、损害赔偿或其他责任,无论是在合同诉讼、侵权诉讼还是其他诉讼中。
+
+ Licenses of the referenced libraries are as follows:
+ #################
+ ContentVec
+ https://github.com/auspicious3000/contentvec/blob/main/LICENSE
+ MIT License
+ #################
+ VITS
+ https://github.com/jaywalnut310/vits/blob/main/LICENSE
+ MIT License
+ #################
+ HIFIGAN
+ https://github.com/jik876/hifi-gan/blob/master/LICENSE
+ MIT License
+ #################
+ gradio
+ https://github.com/gradio-app/gradio/blob/main/LICENSE
+ Apache License 2.0
+ #################
+ ffmpeg
+ https://github.com/FFmpeg/FFmpeg/blob/master/COPYING.LGPLv3
+ https://github.com/BtbN/FFmpeg-Builds/releases/download/autobuild-2021-02-28-12-32/ffmpeg-n4.3.2-160-gfbb9368226-win64-lgpl-4.3.zip
+ LGPLv3 License
+ MIT License
+ #################
+ ultimatevocalremovergui
+ https://github.com/Anjok07/ultimatevocalremovergui/blob/master/LICENSE
+ https://github.com/yang123qwe/vocal_separation_by_uvr5
+ MIT License
+ #################
+ audio-slicer
+ https://github.com/openvpi/audio-slicer/blob/main/LICENSE
+ MIT License
README.md CHANGED
@@ -1,13 +1,12 @@
  ---
- title: Rvc Inference
- emoji: 🌖
- colorFrom: green
- colorTo: purple
+ title: RVC Inference
+ emoji: 🎙
+ colorFrom: pink
+ colorTo: green
  sdk: gradio
  sdk_version: 3.28.3
- app_file: app.py
+ app_file: app_multi.py
  pinned: false
  license: mit
  ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ Great Value RVC models, quality and accuracy not guaranteed.
app_multi.py ADDED
@@ -0,0 +1,415 @@
+ from typing import Union
+
+ from argparse import ArgumentParser
+
+ import asyncio
+ import json
+ from os import path
+
+ import gradio as gr
+
+ import torch
+
+ import numpy as np
+ import librosa
+
+ import edge_tts
+
+ from config import device
+ import util
+ from infer_pack.models import (
+     SynthesizerTrnMs256NSFsid,
+     SynthesizerTrnMs256NSFsid_nono
+ )
+ from vc_infer_pipeline import VC
+
+
+ # Argument parsing
+ arg_parser = ArgumentParser()
+ arg_parser.add_argument(
+     '--hubert',
+     default='hubert_base.pt',
+     help='path to hubert base model (default: hubert_base.pt)'
+ )
+ arg_parser.add_argument(
+     '--config',
+     default='multi_config.json',
+     help='path to config file (default: multi_config.json)'
+ )
+ arg_parser.add_argument(
+     '--bind',
+     default='127.0.0.1',
+     help='gradio server listen address (default: 127.0.0.1)'
+ )
+ arg_parser.add_argument(
+     '--port',
+     default=7860,
+     help='gradio server listen port (default: 7860)'
+ )
+ arg_parser.add_argument(
+     '--share',
+     action='store_true',
+     help='let gradio create a public link for you'
+ )
+ arg_parser.add_argument(
+     '--api',
+     action='store_true',
+     help='enable api endpoint'
+ )
+ arg_parser.add_argument(
+     '--cache-examples',
+     action='store_true',
+     help='enable example caching; remember to delete the gradio_cached_examples folder when the example config has been modified'  # noqa
+ )
+ args = arg_parser.parse_args()
+
+ app_css = '''
+ #model_info img {
+     max-width: 100px;
+     max-height: 100px;
+     float: right;
+ }
+
+ #model_info p {
+     margin: unset;
+ }
+ '''
+
+ app = gr.Blocks(
+     theme=gr.themes.Glass(),
+     css=app_css,
+     analytics_enabled=False
+ )
+
+ # Load hubert model
+ hubert_model = util.load_hubert_model(device, args.hubert)
+ hubert_model.eval()
+
+ # Load models
+ multi_cfg = json.load(open(args.config, 'r'))
+ loaded_models = []
+
+ for model_name in multi_cfg.get('models'):
+     print(f'Loading model: {model_name}')
+
+     # Load model info
+     model_info = json.load(
+         open(path.join('model', model_name, 'config.json'), 'r')
+     )
+
+     # Load RVC checkpoint
+     cpt = torch.load(
+         path.join('model', model_name, model_info['model']),
+         map_location='cpu'
+     )
+     tgt_sr = cpt['config'][-1]
+     cpt['config'][-3] = cpt['weight']['emb_g.weight'].shape[0]  # n_spk
+
+     if_f0 = cpt.get('f0', 1)
+     net_g: Union[SynthesizerTrnMs256NSFsid, SynthesizerTrnMs256NSFsid_nono]
+     if if_f0 == 1:
+         net_g = SynthesizerTrnMs256NSFsid(
+             *cpt['config'],
+             is_half=util.is_half(device)
+         )
+     else:
+         net_g = SynthesizerTrnMs256NSFsid_nono(*cpt['config'])
+
+     del net_g.enc_q
+
+     # According to original code, this thing seems necessary.
+     print(net_g.load_state_dict(cpt['weight'], strict=False))
+
+     net_g.eval().to(device)
+     net_g = net_g.half() if util.is_half(device) else net_g.float()
+
+     vc = VC(tgt_sr, device, util.is_half(device))
+
+     loaded_models.append(dict(
+         name=model_name,
+         metadata=model_info,
+         vc=vc,
+         net_g=net_g,
+         if_f0=if_f0,
+         target_sr=tgt_sr
+     ))
+
+ print(f'Models loaded: {len(loaded_models)}')
+
+ # Edge TTS speakers
+ tts_speakers_list = asyncio.run(edge_tts.list_voices())
+
+
+ # https://github.com/fumiama/Retrieval-based-Voice-Conversion-WebUI/blob/main/infer-web.py#L118  # noqa
+ def vc_func(input_audio, model_index, pitch_adjust, f0_method, feat_ratio):
+     if input_audio is None:
+         return (None, 'Please provide input audio.')
+
+     if model_index is None:
+         return (None, 'Please select a model.')
+
+     model = loaded_models[model_index]
+
+     # Reference: so-vits
+     (audio_samp, audio_npy) = input_audio
+     # Bloody hell: https://stackoverflow.com/questions/26921836/
+     if audio_npy.dtype != np.float32:  # :thonk:
+         audio_npy = (
+             audio_npy / np.iinfo(audio_npy.dtype).max
+         ).astype(np.float32)
+
+     if len(audio_npy.shape) > 1:
+         audio_npy = librosa.to_mono(audio_npy.transpose(1, 0))
+
+     if audio_samp != 16000:
+         audio_npy = librosa.resample(
+             audio_npy,
+             orig_sr=audio_samp,
+             target_sr=16000
+         )
+
+     pitch_int = int(pitch_adjust)
+
+     times = [0, 0, 0]
+     output_audio = model['vc'].pipeline(
+         hubert_model,
+         model['net_g'],
+         model['metadata'].get('speaker_id', 0),
+         audio_npy,
+         times,
+         pitch_int,
+         f0_method,
+         path.join('model', model['name'], model['metadata']['feat_index']),
+         path.join('model', model['name'], model['metadata']['feat_npy']),
+         feat_ratio,
+         model['if_f0']
+     )
+
+     print(f'npy: {times[0]}s, f0: {times[1]}s, infer: {times[2]}s')
+     return ((model['target_sr'], output_audio), 'Success')
+
+
+ async def edge_tts_vc_func(
+     input_text, model_index, tts_speaker, pitch_adjust, f0_method, feat_ratio
+ ):
+     if input_text is None:
+         return (None, 'Please provide TTS text.')
+
+     if tts_speaker is None:
+         return (None, 'Please select TTS speaker.')
+
+     if model_index is None:
+         return (None, 'Please select a model.')
+
+     speaker = tts_speakers_list[tts_speaker]['ShortName']
+     (tts_np, tts_sr) = await util.call_edge_tts(speaker, input_text)
+     return vc_func(
+         (tts_sr, tts_np),
+         model_index,
+         pitch_adjust,
+         f0_method,
+         feat_ratio
+     )
+
+
+ def update_model_info(model_index):
+     if model_index is None:
+         return str(
+             '### Model info\n'
+             'Please select a model from dropdown above.'
+         )
+
+     model = loaded_models[model_index]
+     model_icon = model['metadata'].get('icon', '')
+
+     return str(
+         '### Model info\n'
+         '![model icon]({icon})'
+         '**{name}**\n\n'
+         'Author: {author}\n\n'
+         'Source: {source}\n\n'
+         '{note}'
+     ).format(
+         name=model['metadata'].get('name'),
+         author=model['metadata'].get('author', 'Anonymous'),
+         source=model['metadata'].get('source', 'Unknown'),
+         note=model['metadata'].get('note', ''),
+         icon=(
+             model_icon
+             if model_icon.startswith(('http://', 'https://'))
+             else '/file/model/%s/%s' % (model['name'], model_icon)
+         )
+     )
+
+
+ def _example_vc(input_audio, model_index, pitch_adjust, f0_method, feat_ratio):
+     (audio, message) = vc_func(
+         input_audio, model_index, pitch_adjust, f0_method, feat_ratio
+     )
+     return (
+         audio,
+         message,
+         update_model_info(model_index)
+     )
+
+
+ async def _example_edge_tts(
+     input_text, model_index, tts_speaker, pitch_adjust, f0_method, feat_ratio
+ ):
+     (audio, message) = await edge_tts_vc_func(
+         input_text, model_index, tts_speaker, pitch_adjust, f0_method,
+         feat_ratio
+     )
+     return (
+         audio,
+         message,
+         update_model_info(model_index)
+     )
+
+
+ with app:
+     gr.Markdown(
+         '## Simple, Stupid RVC Inference WebUI\n'
+         'Another RVC inference WebUI based on [RVC-WebUI](https://github.com/fumiama/Retrieval-based-Voice-Conversion-WebUI), '  # noqa
+         'some code and features inspired from so-vits and [zomehwh/rvc-models](https://huggingface.co/spaces/zomehwh/rvc-models).\n'  # noqa
+     )
+
+     with gr.Row():
+         with gr.Column():
+             with gr.Tab('Audio conversion'):
+                 input_audio = gr.Audio(label='Input audio')
+
+                 vc_convert_btn = gr.Button('Convert', variant='primary')
+
+             with gr.Tab('TTS conversion'):
+                 tts_input = gr.TextArea(
+                     label='TTS input text'
+                 )
+                 tts_speaker = gr.Dropdown(
+                     [
+                         '%s (%s)' % (
+                             s['FriendlyName'],
+                             s['Gender']
+                         )
+                         for s in tts_speakers_list
+                     ],
+                     label='TTS speaker',
+                     type='index'
+                 )
+
+                 tts_convert_btn = gr.Button('Convert', variant='primary')
+
+             pitch_adjust = gr.Slider(
+                 label='Pitch',
+                 minimum=-24,
+                 maximum=24,
+                 step=1,
+                 value=0
+             )
+             f0_method = gr.Radio(
+                 label='f0 methods',
+                 choices=['pm', 'harvest'],
+                 value='pm',
+                 interactive=True
+             )
+             feat_ratio = gr.Slider(
+                 label='Feature ratio',
+                 minimum=0,
+                 maximum=1,
+                 step=0.1,
+                 value=0.6
+             )
+
+         with gr.Column():
+             # Model select
+             model_index = gr.Dropdown(
+                 [
+                     '%s - %s' % (
+                         m['metadata'].get('source', 'Unknown'),
+                         m['metadata'].get('name')
+                     )
+                     for m in loaded_models
+                 ],
+                 label='Model',
+                 type='index'
+             )
+
+             # Model info
+             with gr.Box():
+                 model_info = gr.Markdown(
+                     '### Model info\n'
+                     'Please select a model from dropdown above.',
+                     elem_id='model_info'
+                 )
+
+             output_audio = gr.Audio(label='Output audio')
+             output_msg = gr.Textbox(label='Output message')
+
+     multi_examples = multi_cfg.get('examples')
+     if multi_examples:
+         with gr.Accordion('Sweet sweet examples', open=False):
+             with gr.Row():
+                 # VC Example
+                 if multi_examples.get('vc'):
+                     gr.Examples(
+                         label='Audio conversion examples',
+                         examples=multi_examples.get('vc'),
+                         inputs=[
+                             input_audio, model_index, pitch_adjust, f0_method,
+                             feat_ratio
+                         ],
+                         outputs=[output_audio, output_msg, model_info],
+                         fn=_example_vc,
+                         cache_examples=args.cache_examples,
+                         run_on_click=args.cache_examples
+                     )
+
+                 # Edge TTS Example
+                 if multi_examples.get('tts_vc'):
+                     gr.Examples(
+                         label='TTS conversion examples',
+                         examples=multi_examples.get('tts_vc'),
+                         inputs=[
+                             tts_input, model_index, tts_speaker, pitch_adjust,
+                             f0_method, feat_ratio
+                         ],
+                         outputs=[output_audio, output_msg, model_info],
+                         fn=_example_edge_tts,
+                         cache_examples=args.cache_examples,
+                         run_on_click=args.cache_examples
+                     )
+
+     vc_convert_btn.click(
+         vc_func,
+         [input_audio, model_index, pitch_adjust, f0_method, feat_ratio],
+         [output_audio, output_msg],
+         api_name='audio_conversion'
+     )
+
+     tts_convert_btn.click(
+         edge_tts_vc_func,
+         [
+             tts_input, model_index, tts_speaker, pitch_adjust, f0_method,
+             feat_ratio
+         ],
+         [output_audio, output_msg],
+         api_name='tts_conversion'
+     )
+
+     model_index.change(
+         update_model_info,
+         inputs=[model_index],
+         outputs=[model_info],
+         show_progress=False,
+         queue=False
+     )
+
+ app.queue(
+     concurrency_count=1,
+     max_size=20,
+     api_open=args.api
+ ).launch(
+     server_name=args.bind,
+     server_port=args.port,
+     share=args.share
+ )
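The `--port` option in `app_multi.py` is declared without a `type`, so argparse keeps a value given on the command line as a string while the default stays an integer. A minimal sketch, using only the standard library, of one way the option could be declared so that both paths yield an `int`. The standalone `parser` below is illustrative and not part of this commit.

```python
from argparse import ArgumentParser

# Standalone parser for illustration only; not the commit's arg_parser.
parser = ArgumentParser()
parser.add_argument(
    '--port',
    type=int,       # convert the command-line string to an int
    default=7860,   # the default is already an int
    help='gradio server listen port (default: 7860)'
)

default_args = parser.parse_args([])
custom_args = parser.parse_args(['--port', '8080'])

print(default_args.port, custom_args.port)  # prints: 7860 8080
print(type(custom_args.port).__name__)      # prints: int
```

With `type=int`, argparse also rejects a non-numeric port with a usage error before the server is started.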
config.py ADDED
@@ -0,0 +1,17 @@
+ import torch
+
+ import util
+
+
+ device = (
+     'cuda:0' if torch.cuda.is_available()
+     else (
+         'mps' if util.has_mps()
+         else 'cpu'
+     )
+ )
+
+ x_pad = 3 if util.is_half(device) else 1
+ x_query = 10 if util.is_half(device) else 6
+ x_center = 60 if util.is_half(device) else 38
+ x_max = 65 if util.is_half(device) else 41
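The four `x_*` constants in `config.py` all switch on the same `util.is_half(device)` check, so they effectively form two presets. A small sketch that makes the pairing explicit, with the values copied from the diff. The names `WindowPreset`, `HALF_PRESET`, `FULL_PRESET` and `preset_for` are hypothetical and do not appear in this commit.

```python
from typing import NamedTuple


class WindowPreset(NamedTuple):
    """Hypothetical container for the four x_* constants in config.py."""
    x_pad: int
    x_query: int
    x_center: int
    x_max: int


# Values copied from config.py: one preset per precision mode.
HALF_PRESET = WindowPreset(x_pad=3, x_query=10, x_center=60, x_max=65)
FULL_PRESET = WindowPreset(x_pad=1, x_query=6, x_center=38, x_max=41)


def preset_for(is_half: bool) -> WindowPreset:
    """Return the preset matching the result of util.is_half(device)."""
    return HALF_PRESET if is_half else FULL_PRESET


print(preset_for(True).x_pad)   # prints: 3
print(preset_for(False).x_max)  # prints: 41
```

Grouping the values this way evaluates the precision check once instead of four times, at the cost of an extra type.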
hubert_base.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f54b40fd2802423a5643779c4861af1e9ee9c1564dc9d32f54f20b5ffba7db96
+ size 189507909
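The three lines added for `hubert_base.pt` are a Git LFS pointer, not the model weights themselves: each line is a key and a value separated by one space. A minimal sketch of reading such a pointer with the standard library. `parse_lfs_pointer` is a hypothetical helper written for illustration, not part of this repository or of Git LFS.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>"; split on the first space only.
        key, _, value = line.partition(' ')
        fields[key] = value
    # The oid value has the form "<hash algorithm>:<hex digest>".
    hash_algo, _, digest = fields['oid'].partition(':')
    return {
        'version': fields['version'],
        'hash_algo': hash_algo,
        'digest': digest,
        'size': int(fields['size']),
    }


# Pointer contents copied from the diff above.
pointer = (
    'version https://git-lfs.github.com/spec/v1\n'
    'oid sha256:f54b40fd2802423a5643779c4861af1e9ee9c1564dc9d32f54f20b5ffba7db96\n'
    'size 189507909\n'
)
info = parse_lfs_pointer(pointer)
print(info['hash_algo'], info['size'])  # prints: sha256 189507909
print(f"{info['size'] / 1e6:.1f} MB")   # prints: 189.5 MB
```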
infer_pack/attentions.py ADDED
@@ -0,0 +1,417 @@
+ import copy
+ import math
+ import numpy as np
+ import torch
+ from torch import nn
+ from torch.nn import functional as F
+
+ from infer_pack import commons
+ from infer_pack import modules
+ from infer_pack.modules import LayerNorm
+
+
+ class Encoder(nn.Module):
+     def __init__(
+         self,
+         hidden_channels,
+         filter_channels,
+         n_heads,
+         n_layers,
+         kernel_size=1,
+         p_dropout=0.0,
+         window_size=10,
+         **kwargs
+     ):
+         super().__init__()
+         self.hidden_channels = hidden_channels
+         self.filter_channels = filter_channels
+         self.n_heads = n_heads
+         self.n_layers = n_layers
+         self.kernel_size = kernel_size
+         self.p_dropout = p_dropout
+         self.window_size = window_size
+
+         self.drop = nn.Dropout(p_dropout)
+         self.attn_layers = nn.ModuleList()
+         self.norm_layers_1 = nn.ModuleList()
+         self.ffn_layers = nn.ModuleList()
+         self.norm_layers_2 = nn.ModuleList()
+         for i in range(self.n_layers):
+             self.attn_layers.append(
+                 MultiHeadAttention(
+                     hidden_channels,
+                     hidden_channels,
+                     n_heads,
+                     p_dropout=p_dropout,
+                     window_size=window_size,
+                 )
+             )
+             self.norm_layers_1.append(LayerNorm(hidden_channels))
+             self.ffn_layers.append(
+                 FFN(
+                     hidden_channels,
+                     hidden_channels,
+                     filter_channels,
+                     kernel_size,
+                     p_dropout=p_dropout,
+                 )
+             )
+             self.norm_layers_2.append(LayerNorm(hidden_channels))
+
+     def forward(self, x, x_mask):
+         attn_mask = x_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
+         x = x * x_mask
+         for i in range(self.n_layers):
+             y = self.attn_layers[i](x, x, attn_mask)
+             y = self.drop(y)
+             x = self.norm_layers_1[i](x + y)
+
+             y = self.ffn_layers[i](x, x_mask)
+             y = self.drop(y)
+             x = self.norm_layers_2[i](x + y)
+         x = x * x_mask
+         return x
+
+
+ class Decoder(nn.Module):
+     def __init__(
+         self,
+         hidden_channels,
+         filter_channels,
+         n_heads,
+         n_layers,
+         kernel_size=1,
+         p_dropout=0.0,
+         proximal_bias=False,
+         proximal_init=True,
+         **kwargs
+     ):
+         super().__init__()
+         self.hidden_channels = hidden_channels
+         self.filter_channels = filter_channels
+         self.n_heads = n_heads
+         self.n_layers = n_layers
+         self.kernel_size = kernel_size
+         self.p_dropout = p_dropout
+         self.proximal_bias = proximal_bias
+         self.proximal_init = proximal_init
+
+         self.drop = nn.Dropout(p_dropout)
+         self.self_attn_layers = nn.ModuleList()
+         self.norm_layers_0 = nn.ModuleList()
+         self.encdec_attn_layers = nn.ModuleList()
+         self.norm_layers_1 = nn.ModuleList()
+         self.ffn_layers = nn.ModuleList()
+         self.norm_layers_2 = nn.ModuleList()
+         for i in range(self.n_layers):
+             self.self_attn_layers.append(
+                 MultiHeadAttention(
+                     hidden_channels,
+                     hidden_channels,
+                     n_heads,
+                     p_dropout=p_dropout,
+                     proximal_bias=proximal_bias,
+                     proximal_init=proximal_init,
+                 )
+             )
+             self.norm_layers_0.append(LayerNorm(hidden_channels))
+             self.encdec_attn_layers.append(
+                 MultiHeadAttention(
+                     hidden_channels, hidden_channels, n_heads, p_dropout=p_dropout
+                 )
+             )
+             self.norm_layers_1.append(LayerNorm(hidden_channels))
+             self.ffn_layers.append(
+                 FFN(
+                     hidden_channels,
+                     hidden_channels,
+                     filter_channels,
+                     kernel_size,
+                     p_dropout=p_dropout,
+                     causal=True,
+                 )
+             )
+             self.norm_layers_2.append(LayerNorm(hidden_channels))
+
+     def forward(self, x, x_mask, h, h_mask):
+         """
+         x: decoder input
+         h: encoder output
+         """
+         self_attn_mask = commons.subsequent_mask(x_mask.size(2)).to(
+             device=x.device, dtype=x.dtype
+         )
+         encdec_attn_mask = h_mask.unsqueeze(2) * x_mask.unsqueeze(-1)
+         x = x * x_mask
+         for i in range(self.n_layers):
+             y = self.self_attn_layers[i](x, x, self_attn_mask)
+             y = self.drop(y)
+             x = self.norm_layers_0[i](x + y)
+
+             y = self.encdec_attn_layers[i](x, h, encdec_attn_mask)
+             y = self.drop(y)
+             x = self.norm_layers_1[i](x + y)
+
+             y = self.ffn_layers[i](x, x_mask)
+             y = self.drop(y)
+             x = self.norm_layers_2[i](x + y)
+         x = x * x_mask
+         return x
+
+
+ class MultiHeadAttention(nn.Module):
+     def __init__(
+         self,
+         channels,
+         out_channels,
+         n_heads,
+         p_dropout=0.0,
+         window_size=None,
+         heads_share=True,
+         block_length=None,
+         proximal_bias=False,
+         proximal_init=False,
+     ):
+         super().__init__()
+         assert channels % n_heads == 0
+
+         self.channels = channels
+         self.out_channels = out_channels
+         self.n_heads = n_heads
+         self.p_dropout = p_dropout
+         self.window_size = window_size
+         self.heads_share = heads_share
+         self.block_length = block_length
+         self.proximal_bias = proximal_bias
+         self.proximal_init = proximal_init
+         self.attn = None
+
+         self.k_channels = channels // n_heads
+         self.conv_q = nn.Conv1d(channels, channels, 1)
+         self.conv_k = nn.Conv1d(channels, channels, 1)
+         self.conv_v = nn.Conv1d(channels, channels, 1)
+         self.conv_o = nn.Conv1d(channels, out_channels, 1)
+         self.drop = nn.Dropout(p_dropout)
+
+         if window_size is not None:
+             n_heads_rel = 1 if heads_share else n_heads
+             rel_stddev = self.k_channels**-0.5
+             self.emb_rel_k = nn.Parameter(
+                 torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels)
+                 * rel_stddev
+             )
+             self.emb_rel_v = nn.Parameter(
+                 torch.randn(n_heads_rel, window_size * 2 + 1, self.k_channels)
+                 * rel_stddev
+             )
+
+         nn.init.xavier_uniform_(self.conv_q.weight)
+         nn.init.xavier_uniform_(self.conv_k.weight)
+         nn.init.xavier_uniform_(self.conv_v.weight)
+         if proximal_init:
+             with torch.no_grad():
+                 self.conv_k.weight.copy_(self.conv_q.weight)
+                 self.conv_k.bias.copy_(self.conv_q.bias)
+
+     def forward(self, x, c, attn_mask=None):
+         q = self.conv_q(x)
+         k = self.conv_k(c)
+         v = self.conv_v(c)
+
+         x, self.attn = self.attention(q, k, v, mask=attn_mask)
+
+         x = self.conv_o(x)
+         return x
+
+     def attention(self, query, key, value, mask=None):
+         # reshape [b, d, t] -> [b, n_h, t, d_k]
+         b, d, t_s, t_t = (*key.size(), query.size(2))
+         query = query.view(b, self.n_heads, self.k_channels, t_t).transpose(2, 3)
+         key = key.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
+         value = value.view(b, self.n_heads, self.k_channels, t_s).transpose(2, 3)
+
+         scores = torch.matmul(query / math.sqrt(self.k_channels), key.transpose(-2, -1))
+         if self.window_size is not None:
+             assert (
+                 t_s == t_t
+             ), "Relative attention is only available for self-attention."
+             key_relative_embeddings = self._get_relative_embeddings(self.emb_rel_k, t_s)
+             rel_logits = self._matmul_with_relative_keys(
+                 query / math.sqrt(self.k_channels), key_relative_embeddings
+             )
+             scores_local = self._relative_position_to_absolute_position(rel_logits)
+             scores = scores + scores_local
+         if self.proximal_bias:
+             assert t_s == t_t, "Proximal bias is only available for self-attention."
+             scores = scores + self._attention_bias_proximal(t_s).to(
+                 device=scores.device, dtype=scores.dtype
+             )
+         if mask is not None:
+             scores = scores.masked_fill(mask == 0, -1e4)
+             if self.block_length is not None:
+                 assert (
+                     t_s == t_t
+                 ), "Local attention is only available for self-attention."
+                 block_mask = (
+                     torch.ones_like(scores)
+                     .triu(-self.block_length)
+                     .tril(self.block_length)
+                 )
+                 scores = scores.masked_fill(block_mask == 0, -1e4)
+         p_attn = F.softmax(scores, dim=-1)  # [b, n_h, t_t, t_s]
+         p_attn = self.drop(p_attn)
+         output = torch.matmul(p_attn, value)
+         if self.window_size is not None:
+             relative_weights = self._absolute_position_to_relative_position(p_attn)
+             value_relative_embeddings = self._get_relative_embeddings(
+                 self.emb_rel_v, t_s
+             )
+             output = output + self._matmul_with_relative_values(
+                 relative_weights, value_relative_embeddings
+             )
+         output = (
+             output.transpose(2, 3).contiguous().view(b, d, t_t)
+         )  # [b, n_h, t_t, d_k] -> [b, d, t_t]
+         return output, p_attn
+
+     def _matmul_with_relative_values(self, x, y):
+         """
+         x: [b, h, l, m]
+         y: [h or 1, m, d]
+         ret: [b, h, l, d]
+         """
+         ret = torch.matmul(x, y.unsqueeze(0))
+         return ret
+
+     def _matmul_with_relative_keys(self, x, y):
+         """
+         x: [b, h, l, d]
+         y: [h or 1, m, d]
+         ret: [b, h, l, m]
+         """
+         ret = torch.matmul(x, y.unsqueeze(0).transpose(-2, -1))
+         return ret
+
+     def _get_relative_embeddings(self, relative_embeddings, length):
+         max_relative_position = 2 * self.window_size + 1
+         # Pad first before slice to avoid using cond ops.
+         pad_length = max(length - (self.window_size + 1), 0)
+         slice_start_position = max((self.window_size + 1) - length, 0)
+         slice_end_position = slice_start_position + 2 * length - 1
+         if pad_length > 0:
+             padded_relative_embeddings = F.pad(
+                 relative_embeddings,
+                 commons.convert_pad_shape([[0, 0], [pad_length, pad_length], [0, 0]]),
+             )
+         else:
+             padded_relative_embeddings = relative_embeddings
+         used_relative_embeddings = padded_relative_embeddings[
+             :, slice_start_position:slice_end_position
+         ]
+         return used_relative_embeddings
+
+     def _relative_position_to_absolute_position(self, x):
+         """
+         x: [b, h, l, 2*l-1]
+         ret: [b, h, l, l]
+         """
+         batch, heads, length, _ = x.size()
+         # Concat columns of pad to shift from relative to absolute indexing.
+         x = F.pad(x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, 1]]))
+
+         # Concat extra elements so to add up to shape (len+1, 2*len-1).
+         x_flat = x.view([batch, heads, length * 2 * length])
+         x_flat = F.pad(
+             x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [0, length - 1]])
+         )
+
+         # Reshape and slice out the padded elements.
+         x_final = x_flat.view([batch, heads, length + 1, 2 * length - 1])[
+             :, :, :length, length - 1 :
+         ]
+         return x_final
+
+     def _absolute_position_to_relative_position(self, x):
+         """
+         x: [b, h, l, l]
+         ret: [b, h, l, 2*l-1]
+         """
+         batch, heads, length, _ = x.size()
+         # pad along column
+         x = F.pad(
+             x, commons.convert_pad_shape([[0, 0], [0, 0], [0, 0], [0, length - 1]])
+         )
+         x_flat = x.view([batch, heads, length**2 + length * (length - 1)])
+         # add 0's in the beginning that will skew the elements after reshape
+         x_flat = F.pad(x_flat, commons.convert_pad_shape([[0, 0], [0, 0], [length, 0]]))
+         x_final = x_flat.view([batch, heads, length, 2 * length])[:, :, :, 1:]
+         return x_final
+
+     def _attention_bias_proximal(self, length):
+         """Bias for self-attention to encourage attention to close positions.
+         Args:
+           length: an integer scalar.
+         Returns:
+           a Tensor with shape [1, 1, length, length]
+         """
+         r = torch.arange(length, dtype=torch.float32)
+         diff = torch.unsqueeze(r, 0) - torch.unsqueeze(r, 1)
+         return torch.unsqueeze(torch.unsqueeze(-torch.log1p(torch.abs(diff)), 0), 0)
+
+
+ class FFN(nn.Module):
+     def __init__(
+         self,
+         in_channels,
+         out_channels,
+         filter_channels,
+         kernel_size,
+         p_dropout=0.0,
+         activation=None,
+         causal=False,
+     ):
+         super().__init__()
+         self.in_channels = in_channels
+         self.out_channels = out_channels
+         self.filter_channels = filter_channels
+         self.kernel_size = kernel_size
+         self.p_dropout = p_dropout
+         self.activation = activation
+         self.causal = causal
+
+         if causal:
+             self.padding = self._causal_padding
+         else:
+             self.padding = self._same_padding
+
+         self.conv_1 = nn.Conv1d(in_channels, filter_channels, kernel_size)
+         self.conv_2 = nn.Conv1d(filter_channels, out_channels, kernel_size)
+         self.drop = nn.Dropout(p_dropout)
+
+     def forward(self, x, x_mask):
+         x = self.conv_1(self.padding(x * x_mask))
+         if self.activation == "gelu":
+             x = x * torch.sigmoid(1.702 * x)
+         else:
+             x = torch.relu(x)
+         x = self.drop(x)
+         x = self.conv_2(self.padding(x * x_mask))
+         return x * x_mask
+
+     def _causal_padding(self, x):
+         if self.kernel_size == 1:
+             return x
+         pad_l = self.kernel_size - 1
+         pad_r = 0
+         padding = [[0, 0], [0, 0], [pad_l, pad_r]]
+         x = F.pad(x, commons.convert_pad_shape(padding))
+         return x
+
+     def _same_padding(self, x):
+         if self.kernel_size == 1:
+             return x
+         pad_l = (self.kernel_size - 1) // 2
+         pad_r = self.kernel_size // 2
+         padding = [[0, 0], [0, 0], [pad_l, pad_r]]
+         x = F.pad(x, commons.convert_pad_shape(padding))
+         return x
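`_relative_position_to_absolute_position` in `attentions.py` converts logits indexed by relative offset into logits indexed by absolute key position using only pad, flatten, and reshape steps. A minimal pure-Python sketch of the same pad-and-reshape steps for a single head, assuming plain nested lists instead of tensors. `rel_to_abs` and the labelled test matrix are hypothetical illustrations, not code from this commit.

```python
def rel_to_abs(rel):
    """Map l rows of 2*l-1 relative logits to l rows of l absolute logits."""
    length = len(rel)
    # Step 1: pad one zero column, so each row has 2*l entries, then flatten.
    flat = [value for row in rel for value in row + [0]]
    # Step 2: pad l-1 trailing zeros so the buffer reshapes to (l+1, 2*l-1).
    flat += [0] * (length - 1)
    width = 2 * length - 1
    rows = [flat[r * width:(r + 1) * width] for r in range(length + 1)]
    # Step 3: keep the first l rows and the last l columns.
    return [row[length - 1:] for row in rows[:length]]


length = 3
# Label each entry with (query index, relative offset) to make the shift visible.
rel = [
    [(i, k - (length - 1)) for k in range(2 * length - 1)]
    for i in range(length)
]
absolute = rel_to_abs(rel)

# Entry [i][j] now holds the logit for query i and key j, i.e. offset j - i.
for row in absolute:
    print(row)
```

Each reshaped row starts one element later than the row before it, which is what shifts offset `j - i` into column `j` without any per-row indexing.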
infer_pack/commons.py ADDED
@@ -0,0 +1,164 @@
+ import math
+ import numpy as np
+ import torch
+ from torch import nn
+ from torch.nn import functional as F
+
+
+ def init_weights(m, mean=0.0, std=0.01):
+     classname = m.__class__.__name__
+     if classname.find("Conv") != -1:
+         m.weight.data.normal_(mean, std)
+
+
+ def get_padding(kernel_size, dilation=1):
+     return int((kernel_size * dilation - dilation) / 2)
+
+
+ def convert_pad_shape(pad_shape):
+     l = pad_shape[::-1]
+     pad_shape = [item for sublist in l for item in sublist]
+     return pad_shape
+
+
+ def kl_divergence(m_p, logs_p, m_q, logs_q):
+     """KL(P||Q)"""
+     kl = (logs_q - logs_p) - 0.5
+     kl += (
+         0.5 * (torch.exp(2.0 * logs_p) + ((m_p - m_q) ** 2)) * torch.exp(-2.0 * logs_q)
+     )
+     return kl
+
+
+ def rand_gumbel(shape):
+     """Sample from the Gumbel distribution, protect from overflows."""
+     uniform_samples = torch.rand(shape) * 0.99998 + 0.00001
+     return -torch.log(-torch.log(uniform_samples))
+
+
+ def rand_gumbel_like(x):
+     g = rand_gumbel(x.size()).to(dtype=x.dtype, device=x.device)
+     return g
+
+
+ def slice_segments(x, ids_str, segment_size=4):
+     ret = torch.zeros_like(x[:, :, :segment_size])
+     for i in range(x.size(0)):
+         idx_str = ids_str[i]
+         idx_end = idx_str + segment_size
+         ret[i] = x[i, :, idx_str:idx_end]
+     return ret
+ def slice_segments2(x, ids_str, segment_size=4):
+     ret = torch.zeros_like(x[:, :segment_size])
+     for i in range(x.size(0)):
+         idx_str = ids_str[i]
+         idx_end = idx_str + segment_size
+         ret[i] = x[i, idx_str:idx_end]
+     return ret
+
+
+ def rand_slice_segments(x, x_lengths=None, segment_size=4):
+     b, d, t = x.size()
+     if x_lengths is None:
+         x_lengths = t
+     ids_str_max = x_lengths - segment_size + 1
+     ids_str = (torch.rand([b]).to(device=x.device) * ids_str_max).to(dtype=torch.long)
+     ret = slice_segments(x, ids_str, segment_size)
+     return ret, ids_str
+
+
+ def get_timing_signal_1d(length, channels, min_timescale=1.0, max_timescale=1.0e4):
+     position = torch.arange(length, dtype=torch.float)
+     num_timescales = channels // 2
+     log_timescale_increment = math.log(float(max_timescale) / float(min_timescale)) / (
+         num_timescales - 1
+     )
+     inv_timescales = min_timescale * torch.exp(
+         torch.arange(num_timescales, dtype=torch.float) * -log_timescale_increment
+     )
+     scaled_time = position.unsqueeze(0) * inv_timescales.unsqueeze(1)
+     signal = torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], 0)
+     signal = F.pad(signal, [0, 0, 0, channels % 2])
+     signal = signal.view(1, channels, length)
+     return signal
+
+
+ def add_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4):
+     b, channels, length = x.size()
+     signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale)
+     return x + signal.to(dtype=x.dtype, device=x.device)
+
+
+ def cat_timing_signal_1d(x, min_timescale=1.0, max_timescale=1.0e4, axis=1):
+     b, channels, length = x.size()
+     signal = get_timing_signal_1d(length, channels, min_timescale, max_timescale)
+     return torch.cat([x, signal.to(dtype=x.dtype, device=x.device)], axis)
+
+
+ def subsequent_mask(length):
+     mask = torch.tril(torch.ones(length, length)).unsqueeze(0).unsqueeze(0)
+     return mask
+
+
+ @torch.jit.script
+ def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels):
+     n_channels_int = n_channels[0]
+     in_act = input_a + input_b
107
+ t_act = torch.tanh(in_act[:, :n_channels_int, :])
108
+ s_act = torch.sigmoid(in_act[:, n_channels_int:, :])
109
+ acts = t_act * s_act
110
+ return acts
111
+
112
+
113
+ def convert_pad_shape(pad_shape):
114
+ l = pad_shape[::-1]
115
+ pad_shape = [item for sublist in l for item in sublist]
116
+ return pad_shape
117
+
118
+
119
+ def shift_1d(x):
120
+ x = F.pad(x, convert_pad_shape([[0, 0], [0, 0], [1, 0]]))[:, :, :-1]
121
+ return x
122
+
123
+
124
+ def sequence_mask(length, max_length=None):
125
+ if max_length is None:
126
+ max_length = length.max()
127
+ x = torch.arange(max_length, dtype=length.dtype, device=length.device)
128
+ return x.unsqueeze(0) < length.unsqueeze(1)
129
+
130
+
131
+ def generate_path(duration, mask):
132
+ """
133
+ duration: [b, 1, t_x]
134
+ mask: [b, 1, t_y, t_x]
135
+ """
136
+ device = duration.device
137
+
138
+ b, _, t_y, t_x = mask.shape
139
+ cum_duration = torch.cumsum(duration, -1)
140
+
141
+ cum_duration_flat = cum_duration.view(b * t_x)
142
+ path = sequence_mask(cum_duration_flat, t_y).to(mask.dtype)
143
+ path = path.view(b, t_x, t_y)
144
+ path = path - F.pad(path, convert_pad_shape([[0, 0], [1, 0], [0, 0]]))[:, :-1]
145
+ path = path.unsqueeze(1).transpose(2, 3) * mask
146
+ return path
147
+
148
+
149
+ def clip_grad_value_(parameters, clip_value, norm_type=2):
150
+ if isinstance(parameters, torch.Tensor):
151
+ parameters = [parameters]
152
+ parameters = list(filter(lambda p: p.grad is not None, parameters))
153
+ norm_type = float(norm_type)
154
+ if clip_value is not None:
155
+ clip_value = float(clip_value)
156
+
157
+ total_norm = 0
158
+ for p in parameters:
159
+ param_norm = p.grad.data.norm(norm_type)
160
+ total_norm += param_norm.item() ** norm_type
161
+ if clip_value is not None:
162
+ p.grad.data.clamp_(min=-clip_value, max=clip_value)
163
+ total_norm = total_norm ** (1.0 / norm_type)
164
+ return total_norm
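A minimal sketch of what three of the helpers in infer_pack/commons.py compute, written in plain Python so it runs without PyTorch. `get_padding` and `convert_pad_shape` use the same formulas as the file above. `conv1d_out_length` and `kl_gauss` are hypothetical names introduced only for this illustration: the first is the standard 1-D convolution output-length formula, the second is a scalar version of `kl_divergence` that assumes `logs` means the log of the standard deviation.

```python
import math


def get_padding(kernel_size, dilation=1):
    # Same formula as commons.get_padding: the padding that keeps a stride-1
    # dilated convolution length-preserving when kernel_size is odd.
    return int((kernel_size * dilation - dilation) / 2)


def conv1d_out_length(length, kernel_size, dilation=1, padding=0, stride=1):
    # Hypothetical helper: standard output-length formula for a 1-D convolution.
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def convert_pad_shape(pad_shape):
    # [[d0_left, d0_right], ..., [dN_left, dN_right]] -> flat list with the
    # last dimension first, the order torch.nn.functional.pad expects.
    return [item for pair in pad_shape[::-1] for item in pair]


def kl_gauss(m_p, logs_p, m_q, logs_q):
    # Hypothetical scalar version of commons.kl_divergence: KL(P||Q) between
    # two 1-D Gaussians, assuming logs_* is the log standard deviation.
    kl = (logs_q - logs_p) - 0.5
    kl += 0.5 * (math.exp(2.0 * logs_p) + (m_p - m_q) ** 2) * math.exp(-2.0 * logs_q)
    return kl


pad = get_padding(5, dilation=2)
print(pad, conv1d_out_length(100, 5, dilation=2, padding=pad))
print(convert_pad_shape([[0, 0], [0, 0], [1, 0]]))
print(kl_gauss(0.0, 0.0, 0.0, 0.0), kl_gauss(1.0, 0.0, 0.0, 0.0))
```

The pad list `[[0, 0], [0, 0], [1, 0]]` is the one `shift_1d` passes in: it pads one step on the left of the time axis only, which is why the flattened list starts with `1, 0`.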
infer_pack/models.py ADDED
@@ -0,0 +1,892 @@
1
+ import math,pdb,os
2
+ from time import time as ttime
3
+ import torch
4
+ from torch import nn
5
+ from torch.nn import functional as F
6
+ from infer_pack import modules
7
+ from infer_pack import attentions
8
+ from infer_pack import commons
9
+ from infer_pack.commons import init_weights, get_padding
10
+ from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
11
+ from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
12
+ from infer_pack.commons import init_weights
13
+ import numpy as np
14
+ from infer_pack import commons
15
+ class TextEncoder256(nn.Module):
16
+ def __init__(
17
+ self, out_channels, hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout, f0=True ):
18
+ super().__init__()
19
+ self.out_channels = out_channels
20
+ self.hidden_channels = hidden_channels
21
+ self.filter_channels = filter_channels
22
+ self.n_heads = n_heads
23
+ self.n_layers = n_layers
24
+ self.kernel_size = kernel_size
25
+ self.p_dropout = p_dropout
26
+ self.emb_phone = nn.Linear(256, hidden_channels)
27
+ self.lrelu=nn.LeakyReLU(0.1,inplace=True)
28
+ if f0:
29
+ self.emb_pitch = nn.Embedding(256, hidden_channels) # pitch 256
30
+ self.encoder = attentions.Encoder(
31
+ hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout
32
+ )
33
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
34
+
35
+ def forward(self, phone, pitch, lengths):
36
+ if pitch is None:
37
+ x = self.emb_phone(phone)
38
+ else:
39
+ x = self.emb_phone(phone) + self.emb_pitch(pitch)
40
+ x = x * math.sqrt(self.hidden_channels) # [b, t, h]
41
+ x=self.lrelu(x)
42
+ x = torch.transpose(x, 1, -1) # [b, h, t]
43
+ x_mask = torch.unsqueeze(commons.sequence_mask(lengths, x.size(2)), 1).to(
44
+ x.dtype
45
+ )
46
+ x = self.encoder(x * x_mask, x_mask)
47
+ stats = self.proj(x) * x_mask
48
+
49
+ m, logs = torch.split(stats, self.out_channels, dim=1)
50
+ return m, logs, x_mask
51
+ class TextEncoder256Sim(nn.Module):
52
+ def __init__( self, out_channels, hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout, f0=True):
53
+ super().__init__()
54
+ self.out_channels = out_channels
55
+ self.hidden_channels = hidden_channels
56
+ self.filter_channels = filter_channels
57
+ self.n_heads = n_heads
58
+ self.n_layers = n_layers
59
+ self.kernel_size = kernel_size
60
+ self.p_dropout = p_dropout
61
+ self.emb_phone = nn.Linear(256, hidden_channels)
62
+ self.lrelu=nn.LeakyReLU(0.1,inplace=True)
63
+ if f0:
64
+ self.emb_pitch = nn.Embedding(256, hidden_channels) # pitch 256
65
+ self.encoder = attentions.Encoder(
66
+ hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout
67
+ )
68
+ self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
69
+
70
+ def forward(self, phone, pitch, lengths):
71
+ if pitch is None:
72
+ x = self.emb_phone(phone)
73
+ else:
74
+ x = self.emb_phone(phone) + self.emb_pitch(pitch)
75
+ x = x * math.sqrt(self.hidden_channels) # [b, t, h]
76
+ x=self.lrelu(x)
77
+ x = torch.transpose(x, 1, -1) # [b, h, t]
78
+ x_mask = torch.unsqueeze(commons.sequence_mask(lengths, x.size(2)), 1).to(x.dtype)
79
+ x = self.encoder(x * x_mask, x_mask)
80
+ x = self.proj(x) * x_mask
81
+ return x,x_mask
82
+ class ResidualCouplingBlock(nn.Module):
83
+ def __init__(
84
+ self,
85
+ channels,
86
+ hidden_channels,
87
+ kernel_size,
88
+ dilation_rate,
89
+ n_layers,
90
+ n_flows=4,
91
+ gin_channels=0,
92
+ ):
93
+ super().__init__()
94
+ self.channels = channels
95
+ self.hidden_channels = hidden_channels
96
+ self.kernel_size = kernel_size
97
+ self.dilation_rate = dilation_rate
98
+ self.n_layers = n_layers
99
+ self.n_flows = n_flows
100
+ self.gin_channels = gin_channels
101
+
102
+ self.flows = nn.ModuleList()
103
+ for i in range(n_flows):
104
+ self.flows.append(
105
+ modules.ResidualCouplingLayer(
106
+ channels,
107
+ hidden_channels,
108
+ kernel_size,
109
+ dilation_rate,
110
+ n_layers,
111
+ gin_channels=gin_channels,
112
+ mean_only=True,
113
+ )
114
+ )
115
+ self.flows.append(modules.Flip())
116
+
117
+ def forward(self, x, x_mask, g=None, reverse=False):
118
+ if not reverse:
119
+ for flow in self.flows:
120
+ x, _ = flow(x, x_mask, g=g, reverse=reverse)
121
+ else:
122
+ for flow in reversed(self.flows):
123
+ x = flow(x, x_mask, g=g, reverse=reverse)
124
+ return x
125
+
126
+ def remove_weight_norm(self):
127
+ for i in range(self.n_flows):
128
+ self.flows[i * 2].remove_weight_norm()
129
+ class PosteriorEncoder(nn.Module):
130
+ def __init__(
131
+ self,
132
+ in_channels,
133
+ out_channels,
134
+ hidden_channels,
135
+ kernel_size,
136
+ dilation_rate,
137
+ n_layers,
138
+ gin_channels=0,
139
+ ):
140
+ super().__init__()
141
+ self.in_channels = in_channels
142
+ self.out_channels = out_channels
143
+ self.hidden_channels = hidden_channels
144
+ self.kernel_size = kernel_size
145
+ self.dilation_rate = dilation_rate
146
+ self.n_layers = n_layers
147
+ self.gin_channels = gin_channels
148
+
149
+ self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
150
+ self.enc = modules.WN(
151
+ hidden_channels,
152
+ kernel_size,
153
+ dilation_rate,
154
+ n_layers,
155
+ gin_channels=gin_channels,
156
+ )
157
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
158
+
159
+ def forward(self, x, x_lengths, g=None):
160
+ x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(
161
+ x.dtype
162
+ )
163
+ x = self.pre(x) * x_mask
164
+ x = self.enc(x, x_mask, g=g)
165
+ stats = self.proj(x) * x_mask
166
+ m, logs = torch.split(stats, self.out_channels, dim=1)
167
+ z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
168
+ return z, m, logs, x_mask
169
+
170
+ def remove_weight_norm(self):
171
+ self.enc.remove_weight_norm()
172
+ class Generator(torch.nn.Module):
173
+ def __init__(
174
+ self,
175
+ initial_channel,
176
+ resblock,
177
+ resblock_kernel_sizes,
178
+ resblock_dilation_sizes,
179
+ upsample_rates,
180
+ upsample_initial_channel,
181
+ upsample_kernel_sizes,
182
+ gin_channels=0,
183
+ ):
184
+ super(Generator, self).__init__()
185
+ self.num_kernels = len(resblock_kernel_sizes)
186
+ self.num_upsamples = len(upsample_rates)
187
+ self.conv_pre = Conv1d(
188
+ initial_channel, upsample_initial_channel, 7, 1, padding=3
189
+ )
190
+ resblock = modules.ResBlock1 if resblock == "1" else modules.ResBlock2
191
+
192
+ self.ups = nn.ModuleList()
193
+ for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
194
+ self.ups.append(
195
+ weight_norm(
196
+ ConvTranspose1d(
197
+ upsample_initial_channel // (2**i),
198
+ upsample_initial_channel // (2 ** (i + 1)),
199
+ k,
200
+ u,
201
+ padding=(k - u) // 2,
202
+ )
203
+ )
204
+ )
205
+
206
+ self.resblocks = nn.ModuleList()
207
+ for i in range(len(self.ups)):
208
+ ch = upsample_initial_channel // (2 ** (i + 1))
209
+ for j, (k, d) in enumerate(
210
+ zip(resblock_kernel_sizes, resblock_dilation_sizes)
211
+ ):
212
+ self.resblocks.append(resblock(ch, k, d))
213
+
214
+ self.conv_post = Conv1d(ch, 1, 7, 1, padding=3, bias=False)
215
+ self.ups.apply(init_weights)
216
+
217
+ if gin_channels != 0:
218
+ self.cond = nn.Conv1d(gin_channels, upsample_initial_channel, 1)
219
+
220
+ def forward(self, x, g=None):
221
+ x = self.conv_pre(x)
222
+ if g is not None:
223
+ x = x + self.cond(g)
224
+
225
+ for i in range(self.num_upsamples):
226
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
227
+ x = self.ups[i](x)
228
+ xs = None
229
+ for j in range(self.num_kernels):
230
+ if xs is None:
231
+ xs = self.resblocks[i * self.num_kernels + j](x)
232
+ else:
233
+ xs += self.resblocks[i * self.num_kernels + j](x)
234
+ x = xs / self.num_kernels
235
+ x = F.leaky_relu(x)
236
+ x = self.conv_post(x)
237
+ x = torch.tanh(x)
238
+
239
+ return x
240
+
241
+ def remove_weight_norm(self):
242
+ for l in self.ups:
243
+ remove_weight_norm(l)
244
+ for l in self.resblocks:
245
+ l.remove_weight_norm()
246
+ class SineGen(torch.nn.Module):
247
+ """ Definition of sine generator
248
+ SineGen(samp_rate, harmonic_num = 0,
249
+ sine_amp = 0.1, noise_std = 0.003,
250
+ voiced_threshold = 0,
251
+ flag_for_pulse=False)
252
+ samp_rate: sampling rate in Hz
253
+ harmonic_num: number of harmonic overtones (default 0)
254
+ sine_amp: amplitude of sine-waveform (default 0.1)
255
+ noise_std: std of Gaussian noise (default 0.003)
256
+ voiced_threshold: F0 threshold for U/V classification (default 0)
257
+ flag_for_pulse: this SineGen is used inside PulseGen (default False)
258
+ Note: when flag_for_pulse is True, the first time step of a voiced
259
+ segment is always sin(np.pi) or cos(0)
260
+ """
261
+
262
+ def __init__(self, samp_rate, harmonic_num=0,
263
+ sine_amp=0.1, noise_std=0.003,
264
+ voiced_threshold=0,
265
+ flag_for_pulse=False):
266
+ super(SineGen, self).__init__()
267
+ self.sine_amp = sine_amp
268
+ self.noise_std = noise_std
269
+ self.harmonic_num = harmonic_num
270
+ self.dim = self.harmonic_num + 1
271
+ self.sampling_rate = samp_rate
272
+ self.voiced_threshold = voiced_threshold
273
+
274
+ def _f02uv(self, f0):
275
+ # generate uv signal
276
+ uv = torch.ones_like(f0)
277
+ uv = uv * (f0 > self.voiced_threshold)
278
+ return uv
279
+
280
+ def forward(self, f0,upp):
281
+ """ sine_tensor, uv, noise = forward(f0, upp)
282
+ input F0: tensor(batchsize=1, length, dim=1)
283
+ f0 for unvoiced steps should be 0
284
+ output sine_tensor: tensor(batchsize=1, length, dim)
285
+ output uv: tensor(batchsize=1, length, 1)
286
+ """
287
+ with torch.no_grad():
288
+ f0 = f0[:, None].transpose(1, 2)
289
+ f0_buf = torch.zeros(f0.shape[0], f0.shape[1], self.dim,device=f0.device)
290
+ # fundamental component
291
+ f0_buf[:, :, 0] = f0[:, :, 0]
292
+ for idx in np.arange(self.harmonic_num):f0_buf[:, :, idx + 1] = f0_buf[:, :, 0] * (idx + 2)# idx + 2: the (idx+1)-th overtone, (idx+2)-th harmonic
293
+ rad_values = (f0_buf / self.sampling_rate) % 1  # the % 1 means the multiplication by n_har cannot be optimized in post-processing
294
+ rand_ini = torch.rand(f0_buf.shape[0], f0_buf.shape[2], device=f0_buf.device)
295
+ rand_ini[:, 0] = 0
296
+ rad_values[:, 0, :] = rad_values[:, 0, :] + rand_ini
297
+ tmp_over_one = torch.cumsum(rad_values, 1)  # % 1  # applying % 1 here would mean the later cumsum can no longer be optimized
298
+ tmp_over_one*=upp
299
+ tmp_over_one=F.interpolate(tmp_over_one.transpose(2, 1), scale_factor=upp, mode='linear', align_corners=True).transpose(2, 1)
300
+ rad_values=F.interpolate(rad_values.transpose(2, 1), scale_factor=upp, mode='nearest').transpose(2, 1)#######
301
+ tmp_over_one%=1
302
+ tmp_over_one_idx = (tmp_over_one[:, 1:, :] - tmp_over_one[:, :-1, :]) < 0
303
+ cumsum_shift = torch.zeros_like(rad_values)
304
+ cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0
305
+ sine_waves = torch.sin(torch.cumsum(rad_values + cumsum_shift, dim=1) * 2 * np.pi)
306
+ sine_waves = sine_waves * self.sine_amp
307
+ uv = self._f02uv(f0)
308
+ uv = F.interpolate(uv.transpose(2, 1), scale_factor=upp, mode='nearest').transpose(2, 1)
309
+ noise_amp = uv * self.noise_std + (1 - uv) * self.sine_amp / 3
310
+ noise = noise_amp * torch.randn_like(sine_waves)
311
+ sine_waves = sine_waves * uv + noise
312
+ return sine_waves, uv, noise
313
+ class SourceModuleHnNSF(torch.nn.Module):
314
+ """ SourceModule for hn-nsf
315
+ SourceModule(sampling_rate, harmonic_num=0, sine_amp=0.1,
316
+ add_noise_std=0.003, voiced_threshod=0)
317
+ sampling_rate: sampling_rate in Hz
318
+ harmonic_num: number of harmonic above F0 (default: 0)
319
+ sine_amp: amplitude of sine source signal (default: 0.1)
320
+ add_noise_std: std of additive Gaussian noise (default: 0.003)
321
+ note that amplitude of noise in unvoiced is decided
322
+ by sine_amp
323
+ voiced_threshold: threshold to set U/V given F0 (default: 0)
324
+ Sine_source, noise_source = SourceModuleHnNSF(F0_sampled)
325
+ F0_sampled (batchsize, length, 1)
326
+ Sine_source (batchsize, length, 1)
327
+ noise_source (batchsize, length, 1)
328
+ uv (batchsize, length, 1)
329
+ """
330
+
331
+ def __init__(self, sampling_rate, harmonic_num=0, sine_amp=0.1,
332
+ add_noise_std=0.003, voiced_threshod=0,is_half=True):
333
+ super(SourceModuleHnNSF, self).__init__()
334
+
335
+ self.sine_amp = sine_amp
336
+ self.noise_std = add_noise_std
337
+ self.is_half=is_half
338
+ # to produce sine waveforms
339
+ self.l_sin_gen = SineGen(sampling_rate, harmonic_num,
340
+ sine_amp, add_noise_std, voiced_threshod)
341
+
342
+ # to merge source harmonics into a single excitation
343
+ self.l_linear = torch.nn.Linear(harmonic_num + 1, 1)
344
+ self.l_tanh = torch.nn.Tanh()
345
+
346
+ def forward(self, x,upp=None):
347
+ sine_wavs, uv, _ = self.l_sin_gen(x,upp)
348
+ if(self.is_half):sine_wavs=sine_wavs.half()
349
+ sine_merge = self.l_tanh(self.l_linear(sine_wavs))
350
+ return sine_merge,None,None# noise, uv
351
+ class GeneratorNSF(torch.nn.Module):
352
+ def __init__(
353
+ self,
354
+ initial_channel,
355
+ resblock,
356
+ resblock_kernel_sizes,
357
+ resblock_dilation_sizes,
358
+ upsample_rates,
359
+ upsample_initial_channel,
360
+ upsample_kernel_sizes,
361
+ gin_channels,
362
+ sr,
363
+ is_half=False
364
+ ):
365
+ super(GeneratorNSF, self).__init__()
366
+ self.num_kernels = len(resblock_kernel_sizes)
367
+ self.num_upsamples = len(upsample_rates)
368
+
369
+ self.f0_upsamp = torch.nn.Upsample(scale_factor=np.prod(upsample_rates))
370
+ self.m_source = SourceModuleHnNSF(
371
+ sampling_rate=sr,
372
+ harmonic_num=0,
373
+ is_half=is_half
374
+ )
375
+ self.noise_convs = nn.ModuleList()
376
+ self.conv_pre = Conv1d(
377
+ initial_channel, upsample_initial_channel, 7, 1, padding=3
378
+ )
379
+ resblock = modules.ResBlock1 if resblock == "1" else modules.ResBlock2
380
+
381
+ self.ups = nn.ModuleList()
382
+ for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
383
+ c_cur = upsample_initial_channel // (2 ** (i + 1))
384
+ self.ups.append(
385
+ weight_norm(
386
+ ConvTranspose1d(
387
+ upsample_initial_channel // (2**i),
388
+ upsample_initial_channel // (2 ** (i + 1)),
389
+ k,
390
+ u,
391
+ padding=(k - u) // 2,
392
+ )
393
+ )
394
+ )
395
+ if i + 1 < len(upsample_rates):
396
+ stride_f0 = np.prod(upsample_rates[i + 1:])
397
+ self.noise_convs.append(Conv1d(
398
+ 1, c_cur, kernel_size=stride_f0 * 2, stride=stride_f0, padding=stride_f0 // 2))
399
+ else:
400
+ self.noise_convs.append(Conv1d(1, c_cur, kernel_size=1))
401
+
402
+ self.resblocks = nn.ModuleList()
403
+ for i in range(len(self.ups)):
404
+ ch = upsample_initial_channel // (2 ** (i + 1))
405
+ for j, (k, d) in enumerate(
406
+ zip(resblock_kernel_sizes, resblock_dilation_sizes)
407
+ ):
408
+ self.resblocks.append(resblock(ch, k, d))
409
+
410
+ self.conv_post = Conv1d(ch, 1, 7, 1, padding=3, bias=False)
411
+ self.ups.apply(init_weights)
412
+
413
+ if gin_channels != 0:
414
+ self.cond = nn.Conv1d(gin_channels, upsample_initial_channel, 1)
415
+
416
+ self.upp=np.prod(upsample_rates)
417
+
418
+ def forward(self, x, f0,g=None):
419
+ har_source, noi_source, uv = self.m_source(f0,self.upp)
420
+ har_source = har_source.transpose(1, 2)
421
+ x = self.conv_pre(x)
422
+ if g is not None:
423
+ x = x + self.cond(g)
424
+
425
+ for i in range(self.num_upsamples):
426
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
427
+ x = self.ups[i](x)
428
+ x_source = self.noise_convs[i](har_source)
429
+ x = x + x_source
430
+ xs = None
431
+ for j in range(self.num_kernels):
432
+ if xs is None:
433
+ xs = self.resblocks[i * self.num_kernels + j](x)
434
+ else:
435
+ xs += self.resblocks[i * self.num_kernels + j](x)
436
+ x = xs / self.num_kernels
437
+ x = F.leaky_relu(x)
438
+ x = self.conv_post(x)
439
+ x = torch.tanh(x)
440
+ return x
441
+
442
+ def remove_weight_norm(self):
443
+ for l in self.ups:
444
+ remove_weight_norm(l)
445
+ for l in self.resblocks:
446
+ l.remove_weight_norm()
447
+ sr2sr={
448
+ "32k":32000,
449
+ "40k":40000,
450
+ "48k":48000,
451
+ }
452
+ class SynthesizerTrnMs256NSFsid(nn.Module):
453
+ def __init__(
454
+ self,
455
+ spec_channels,
456
+ segment_size,
457
+ inter_channels,
458
+ hidden_channels,
459
+ filter_channels,
460
+ n_heads,
461
+ n_layers,
462
+ kernel_size,
463
+ p_dropout,
464
+ resblock,
465
+ resblock_kernel_sizes,
466
+ resblock_dilation_sizes,
467
+ upsample_rates,
468
+ upsample_initial_channel,
469
+ upsample_kernel_sizes,
470
+ spk_embed_dim,
471
+ gin_channels,
472
+ sr,
473
+ **kwargs
474
+ ):
475
+
476
+ super().__init__()
477
+ if isinstance(sr, str):
478
+ sr=sr2sr[sr]
479
+ self.spec_channels = spec_channels
480
+ self.inter_channels = inter_channels
481
+ self.hidden_channels = hidden_channels
482
+ self.filter_channels = filter_channels
483
+ self.n_heads = n_heads
484
+ self.n_layers = n_layers
485
+ self.kernel_size = kernel_size
486
+ self.p_dropout = p_dropout
487
+ self.resblock = resblock
488
+ self.resblock_kernel_sizes = resblock_kernel_sizes
489
+ self.resblock_dilation_sizes = resblock_dilation_sizes
490
+ self.upsample_rates = upsample_rates
491
+ self.upsample_initial_channel = upsample_initial_channel
492
+ self.upsample_kernel_sizes = upsample_kernel_sizes
493
+ self.segment_size = segment_size
494
+ self.gin_channels = gin_channels
495
+ # self.hop_length = hop_length#
496
+ self.spk_embed_dim=spk_embed_dim
497
+ self.enc_p = TextEncoder256(
498
+ inter_channels,
499
+ hidden_channels,
500
+ filter_channels,
501
+ n_heads,
502
+ n_layers,
503
+ kernel_size,
504
+ p_dropout,
505
+ )
506
+ self.dec = GeneratorNSF(
507
+ inter_channels,
508
+ resblock,
509
+ resblock_kernel_sizes,
510
+ resblock_dilation_sizes,
511
+ upsample_rates,
512
+ upsample_initial_channel,
513
+ upsample_kernel_sizes,
514
+ gin_channels=gin_channels, sr=sr, is_half=kwargs["is_half"]
515
+ )
516
+ self.enc_q = PosteriorEncoder(
517
+ spec_channels,
518
+ inter_channels,
519
+ hidden_channels,
520
+ 5,
521
+ 1,
522
+ 16,
523
+ gin_channels=gin_channels,
524
+ )
525
+ self.flow = ResidualCouplingBlock(
526
+ inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
527
+ )
528
+ self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
529
+ print("gin_channels:",gin_channels,"self.spk_embed_dim:",self.spk_embed_dim)
530
+ def remove_weight_norm(self):
531
+ self.dec.remove_weight_norm()
532
+ self.flow.remove_weight_norm()
533
+ self.enc_q.remove_weight_norm()
534
+
535
+ def forward(self, phone, phone_lengths, pitch, pitchf, y, y_lengths, ds):  # ds is the speaker id, [bs, 1]
536
+ # print(1,pitch.shape)#[bs,t]
537
+ g = self.emb_g(ds).unsqueeze(-1)  # [b, 256, 1]; the trailing 1 is the time axis t, which is broadcast
538
+ m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths)
539
+ z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
540
+ z_p = self.flow(z, y_mask, g=g)
541
+ z_slice, ids_slice = commons.rand_slice_segments(
542
+ z, y_lengths, self.segment_size
543
+ )
544
+ # print(-1,pitchf.shape,ids_slice,self.segment_size,self.hop_length,self.segment_size//self.hop_length)
545
+ pitchf = commons.slice_segments2(
546
+ pitchf, ids_slice, self.segment_size
547
+ )
548
+ # print(-2,pitchf.shape,z_slice.shape)
549
+ o = self.dec(z_slice,pitchf, g=g)
550
+ return o, ids_slice, x_mask, y_mask, (z, z_p, m_p, logs_p, m_q, logs_q)
551
+
552
+ def infer(self, phone, phone_lengths, pitch, nsff0,sid,max_len=None):
553
+ g = self.emb_g(sid).unsqueeze(-1)
554
+ m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths)
555
+ z_p = (m_p + torch.exp(logs_p) * torch.randn_like(m_p) * 0.66666) * x_mask
556
+ z = self.flow(z_p, x_mask, g=g, reverse=True)
557
+ o = self.dec((z * x_mask)[:, :, :max_len], nsff0,g=g)
558
+ return o, x_mask, (z, z_p, m_p, logs_p)
559
+ class SynthesizerTrnMs256NSFsid_nono(nn.Module):
560
+ def __init__(
561
+ self,
562
+ spec_channels,
563
+ segment_size,
564
+ inter_channels,
565
+ hidden_channels,
566
+ filter_channels,
567
+ n_heads,
568
+ n_layers,
569
+ kernel_size,
570
+ p_dropout,
571
+ resblock,
572
+ resblock_kernel_sizes,
573
+ resblock_dilation_sizes,
574
+ upsample_rates,
575
+ upsample_initial_channel,
576
+ upsample_kernel_sizes,
577
+ spk_embed_dim,
578
+ gin_channels,
579
+ sr=None,
580
+ **kwargs
581
+ ):
582
+
583
+ super().__init__()
584
+ self.spec_channels = spec_channels
585
+ self.inter_channels = inter_channels
586
+ self.hidden_channels = hidden_channels
587
+ self.filter_channels = filter_channels
588
+ self.n_heads = n_heads
589
+ self.n_layers = n_layers
590
+ self.kernel_size = kernel_size
591
+ self.p_dropout = p_dropout
592
+ self.resblock = resblock
593
+ self.resblock_kernel_sizes = resblock_kernel_sizes
594
+ self.resblock_dilation_sizes = resblock_dilation_sizes
595
+ self.upsample_rates = upsample_rates
596
+ self.upsample_initial_channel = upsample_initial_channel
597
+ self.upsample_kernel_sizes = upsample_kernel_sizes
598
+ self.segment_size = segment_size
599
+ self.gin_channels = gin_channels
600
+ # self.hop_length = hop_length#
601
+ self.spk_embed_dim=spk_embed_dim
602
+ self.enc_p = TextEncoder256(
603
+ inter_channels,
604
+ hidden_channels,
605
+ filter_channels,
606
+ n_heads,
607
+ n_layers,
608
+ kernel_size,
609
+ p_dropout,f0=False
610
+ )
611
+ self.dec = Generator(
612
+ inter_channels,
613
+ resblock,
614
+ resblock_kernel_sizes,
615
+ resblock_dilation_sizes,
616
+ upsample_rates,
617
+ upsample_initial_channel,
618
+ upsample_kernel_sizes,
619
+ gin_channels=gin_channels
620
+ )
621
+ self.enc_q = PosteriorEncoder(
622
+ spec_channels,
623
+ inter_channels,
624
+ hidden_channels,
625
+ 5,
626
+ 1,
627
+ 16,
628
+ gin_channels=gin_channels,
629
+ )
630
+ self.flow = ResidualCouplingBlock(
631
+ inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
632
+ )
633
+ self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
634
+ print("gin_channels:",gin_channels,"self.spk_embed_dim:",self.spk_embed_dim)
635
+
636
+ def remove_weight_norm(self):
637
+ self.dec.remove_weight_norm()
638
+ self.flow.remove_weight_norm()
639
+ self.enc_q.remove_weight_norm()
640
+
641
+ def forward(self, phone, phone_lengths, y, y_lengths, ds):  # ds is the speaker id, [bs, 1]
642
+ g = self.emb_g(ds).unsqueeze(-1)  # [b, 256, 1]; the trailing 1 is the time axis t, which is broadcast
643
+ m_p, logs_p, x_mask = self.enc_p(phone, None, phone_lengths)
644
+ z, m_q, logs_q, y_mask = self.enc_q(y, y_lengths, g=g)
645
+ z_p = self.flow(z, y_mask, g=g)
646
+ z_slice, ids_slice = commons.rand_slice_segments(
647
+ z, y_lengths, self.segment_size
648
+ )
649
+ o = self.dec(z_slice, g=g)
650
+ return o, ids_slice, x_mask, y_mask, (z, z_p, m_p, logs_p, m_q, logs_q)
651
+
652
+ def infer(self, phone, phone_lengths,sid,max_len=None):
653
+ g = self.emb_g(sid).unsqueeze(-1)
654
+ m_p, logs_p, x_mask = self.enc_p(phone, None, phone_lengths)
655
+ z_p = (m_p + torch.exp(logs_p) * torch.randn_like(m_p) * 0.66666) * x_mask
656
+ z = self.flow(z_p, x_mask, g=g, reverse=True)
657
+ o = self.dec((z * x_mask)[:, :, :max_len],g=g)
658
+ return o, x_mask, (z, z_p, m_p, logs_p)
659
+ class SynthesizerTrnMs256NSFsid_sim(nn.Module):
660
+ """
661
+ Synthesizer for Training
662
+ """
663
+
664
+ def __init__(
665
+ self,
666
+ spec_channels,
667
+ segment_size,
668
+ inter_channels,
669
+ hidden_channels,
670
+ filter_channels,
671
+ n_heads,
672
+ n_layers,
673
+ kernel_size,
674
+ p_dropout,
675
+ resblock,
676
+ resblock_kernel_sizes,
677
+ resblock_dilation_sizes,
678
+ upsample_rates,
679
+ upsample_initial_channel,
680
+ upsample_kernel_sizes,
681
+ spk_embed_dim,
682
+ # hop_length,
683
+ gin_channels=0,
684
+ use_sdp=True,
685
+ **kwargs
686
+ ):
687
+
688
+ super().__init__()
689
+ self.spec_channels = spec_channels
690
+ self.inter_channels = inter_channels
691
+ self.hidden_channels = hidden_channels
692
+ self.filter_channels = filter_channels
693
+ self.n_heads = n_heads
694
+ self.n_layers = n_layers
695
+ self.kernel_size = kernel_size
696
+ self.p_dropout = p_dropout
697
+ self.resblock = resblock
698
+ self.resblock_kernel_sizes = resblock_kernel_sizes
699
+ self.resblock_dilation_sizes = resblock_dilation_sizes
700
+ self.upsample_rates = upsample_rates
701
+ self.upsample_initial_channel = upsample_initial_channel
702
+ self.upsample_kernel_sizes = upsample_kernel_sizes
703
+ self.segment_size = segment_size
704
+ self.gin_channels = gin_channels
705
+ # self.hop_length = hop_length#
706
+ self.spk_embed_dim=spk_embed_dim
707
+ self.enc_p = TextEncoder256Sim(
708
+ inter_channels,
709
+ hidden_channels,
710
+ filter_channels,
711
+ n_heads,
712
+ n_layers,
713
+ kernel_size,
714
+ p_dropout,
715
+ )
716
+ self.dec = GeneratorNSF(
717
+ inter_channels,
718
+ resblock,
719
+ resblock_kernel_sizes,
720
+ resblock_dilation_sizes,
721
+ upsample_rates,
722
+ upsample_initial_channel,
723
+ upsample_kernel_sizes,
724
+ gin_channels=gin_channels,is_half=kwargs["is_half"]
725
+ )
726
+
727
+ self.flow = ResidualCouplingBlock(
728
+ inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
729
+ )
730
+ self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
731
+ print("gin_channels:",gin_channels,"self.spk_embed_dim:",self.spk_embed_dim)
732
+ def remove_weight_norm(self):
733
+ self.dec.remove_weight_norm()
734
+ self.flow.remove_weight_norm()
735
+ # enc_q is never created in this class, so there is no weight norm to remove from it
736
+
737
+ def forward(self, phone, phone_lengths, pitch, pitchf, y_lengths, ds):  # y (the spectrogram) is no longer needed
738
+ g = self.emb_g(ds).unsqueeze(-1)  # [b, 256, 1]; the trailing 1 is the time axis t, which is broadcast
739
+ x, x_mask = self.enc_p(phone, pitch, phone_lengths)
740
+ x = self.flow(x, x_mask, g=g, reverse=True)
741
+ z_slice, ids_slice = commons.rand_slice_segments(
742
+ x, y_lengths, self.segment_size
743
+ )
744
+
745
+ pitchf = commons.slice_segments2(
746
+ pitchf, ids_slice, self.segment_size
747
+ )
748
+ o = self.dec(z_slice, pitchf, g=g)
749
+ return o, ids_slice
750
+ def infer(self, phone, phone_lengths, pitch, pitchf, ds, max_len=None):  # y (the spectrogram) is no longer needed
751
+ g = self.emb_g(ds).unsqueeze(-1)  # [b, 256, 1]; the trailing 1 is the time axis t, which is broadcast
752
+ x, x_mask = self.enc_p(phone, pitch, phone_lengths)
753
+ x = self.flow(x, x_mask, g=g, reverse=True)
754
+ o = self.dec((x*x_mask)[:, :, :max_len], pitchf, g=g)
755
+ return o, o
756
+
757
+ class MultiPeriodDiscriminator(torch.nn.Module):
758
+ def __init__(self, use_spectral_norm=False):
759
+ super(MultiPeriodDiscriminator, self).__init__()
760
+ periods = [2, 3, 5, 7, 11,17]
761
+ # periods = [3, 5, 7, 11, 17, 23, 37]
762
+
763
+ discs = [DiscriminatorS(use_spectral_norm=use_spectral_norm)]
764
+ discs = discs + [
765
+ DiscriminatorP(i, use_spectral_norm=use_spectral_norm) for i in periods
766
+ ]
767
+ self.discriminators = nn.ModuleList(discs)
768
+
769
+ def forward(self, y, y_hat):
770
+ y_d_rs = []#
771
+ y_d_gs = []
772
+ fmap_rs = []
773
+ fmap_gs = []
774
+ for i, d in enumerate(self.discriminators):
775
+ y_d_r, fmap_r = d(y)
776
+ y_d_g, fmap_g = d(y_hat)
777
+ # for j in range(len(fmap_r)):
778
+ # print(i,j,y.shape,y_hat.shape,fmap_r[j].shape,fmap_g[j].shape)
779
+ y_d_rs.append(y_d_r)
780
+ y_d_gs.append(y_d_g)
781
+ fmap_rs.append(fmap_r)
782
+ fmap_gs.append(fmap_g)
783
+
784
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
785
+
786
+ class DiscriminatorS(torch.nn.Module):
787
+ def __init__(self, use_spectral_norm=False):
788
+ super(DiscriminatorS, self).__init__()
789
+ norm_f = weight_norm if use_spectral_norm == False else spectral_norm
790
+ self.convs = nn.ModuleList(
791
+ [
792
+ norm_f(Conv1d(1, 16, 15, 1, padding=7)),
793
+ norm_f(Conv1d(16, 64, 41, 4, groups=4, padding=20)),
794
+ norm_f(Conv1d(64, 256, 41, 4, groups=16, padding=20)),
795
+ norm_f(Conv1d(256, 1024, 41, 4, groups=64, padding=20)),
796
+ norm_f(Conv1d(1024, 1024, 41, 4, groups=256, padding=20)),
797
+ norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
798
+ ]
799
+ )
800
+ self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
801
+
802
+ def forward(self, x):
803
+ fmap = []
804
+
805
+ for l in self.convs:
806
+ x = l(x)
807
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
808
+ fmap.append(x)
809
+ x = self.conv_post(x)
810
+ fmap.append(x)
811
+ x = torch.flatten(x, 1, -1)
812
+
813
+ return x, fmap
814
+
815
+ class DiscriminatorP(torch.nn.Module):
816
+ def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
817
+ super(DiscriminatorP, self).__init__()
818
+ self.period = period
819
+ self.use_spectral_norm = use_spectral_norm
820
+ norm_f = weight_norm if use_spectral_norm == False else spectral_norm
821
+ self.convs = nn.ModuleList(
822
+ [
823
+ norm_f(
824
+ Conv2d(
825
+ 1,
826
+ 32,
827
+ (kernel_size, 1),
828
+ (stride, 1),
829
+ padding=(get_padding(kernel_size, 1), 0),
830
+ )
831
+ ),
832
+ norm_f(
833
+ Conv2d(
834
+ 32,
835
+ 128,
836
+ (kernel_size, 1),
837
+ (stride, 1),
838
+ padding=(get_padding(kernel_size, 1), 0),
839
+ )
840
+ ),
841
+ norm_f(
842
+ Conv2d(
843
+ 128,
844
+ 512,
845
+ (kernel_size, 1),
846
+ (stride, 1),
847
+ padding=(get_padding(kernel_size, 1), 0),
848
+ )
849
+ ),
850
+ norm_f(
851
+ Conv2d(
852
+ 512,
853
+ 1024,
854
+ (kernel_size, 1),
855
+ (stride, 1),
856
+ padding=(get_padding(kernel_size, 1), 0),
857
+ )
858
+ ),
859
+ norm_f(
860
+ Conv2d(
861
+ 1024,
862
+ 1024,
863
+ (kernel_size, 1),
864
+ 1,
865
+ padding=(get_padding(kernel_size, 1), 0),
866
+ )
867
+ ),
868
+ ]
869
+ )
870
+ self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
871
+
872
+ def forward(self, x):
873
+ fmap = []
874
+
875
+ # 1d to 2d
876
+ b, c, t = x.shape
877
+ if t % self.period != 0: # pad first
878
+ n_pad = self.period - (t % self.period)
879
+ x = F.pad(x, (0, n_pad), "reflect")
880
+ t = t + n_pad
881
+ x = x.view(b, c, t // self.period, self.period)
882
+
883
+ for l in self.convs:
884
+ x = l(x)
885
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
886
+ fmap.append(x)
887
+ x = self.conv_post(x)
888
+ fmap.append(x)
889
+ x = torch.flatten(x, 1, -1)
890
+
891
+ return x, fmap
892
+
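The `DiscriminatorP.forward` at the end of the file above right-pads the waveform with reflect padding until its length is a multiple of `period`, then views it as rows of `period` consecutive samples. The sketch below mirrors that folding step in plain Python on a flat list rather than a `[b, c, t]` tensor. The helper names `reflect_pad_right` and `fold_by_period` are hypothetical and are not part of the commit.

```python
def reflect_pad_right(samples, n_pad):
    """Right-pad by mirroring around the last sample (the edge itself is not repeated)."""
    if n_pad == 0:
        return list(samples)
    if n_pad >= len(samples):
        raise ValueError("reflect padding needs n_pad < len(samples)")
    # Walk backwards starting from the second-to-last sample.
    mirrored = samples[-2:-2 - n_pad:-1]
    return list(samples) + list(mirrored)


def fold_by_period(samples, period):
    """Pad to a multiple of `period`, then split into rows of `period` samples."""
    t = len(samples)
    n_pad = (period - t % period) % period  # 0 when t is already a multiple
    padded = reflect_pad_right(samples, n_pad)
    return [padded[i:i + period] for i in range(0, len(padded), period)]


if __name__ == "__main__":
    for row in fold_by_period(list(range(7)), 3):
        print(row)
```

Each row holds `period` consecutive samples, so a column collects samples spaced `period` apart. The `(kernel_size, 1)` convolutions in `DiscriminatorP` slide along the row axis, so they operate on those period-spaced samples.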
infer_pack/models_onnx.py ADDED
@@ -0,0 +1,764 @@
1
+ import math,pdb,os
2
+ from time import time as ttime
3
+ import torch
4
+ from torch import nn
5
+ from torch.nn import functional as F
6
+ from infer_pack import modules
7
+ from infer_pack import attentions
8
+ from infer_pack import commons
9
+ from infer_pack.commons import init_weights, get_padding
10
+ from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
11
+ from torch.nn.utils import weight_norm, remove_weight_norm, spectral_norm
12
+ from infer_pack.commons import init_weights
13
+ import numpy as np
14
+ from infer_pack import commons
15
+ class TextEncoder256(nn.Module):
16
+ def __init__(
17
+ self, out_channels, hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout, f0=True ):
18
+ super().__init__()
19
+ self.out_channels = out_channels
20
+ self.hidden_channels = hidden_channels
21
+ self.filter_channels = filter_channels
22
+ self.n_heads = n_heads
23
+ self.n_layers = n_layers
24
+ self.kernel_size = kernel_size
25
+ self.p_dropout = p_dropout
26
+ self.emb_phone = nn.Linear(256, hidden_channels)
27
+ self.lrelu=nn.LeakyReLU(0.1,inplace=True)
28
+ if(f0==True):
29
+ self.emb_pitch = nn.Embedding(256, hidden_channels) # pitch 256
30
+ self.encoder = attentions.Encoder(
31
+ hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout
32
+ )
33
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
34
+
35
+ def forward(self, phone, pitch, lengths):
36
+ if pitch is None:
37
+ x = self.emb_phone(phone)
38
+ else:
39
+ x = self.emb_phone(phone) + self.emb_pitch(pitch)
40
+ x = x * math.sqrt(self.hidden_channels) # [b, t, h]
41
+ x=self.lrelu(x)
42
+ x = torch.transpose(x, 1, -1) # [b, h, t]
43
+ x_mask = torch.unsqueeze(commons.sequence_mask(lengths, x.size(2)), 1).to(
44
+ x.dtype
45
+ )
46
+ x = self.encoder(x * x_mask, x_mask)
47
+ stats = self.proj(x) * x_mask
48
+
49
+ m, logs = torch.split(stats, self.out_channels, dim=1)
50
+ return m, logs, x_mask
51
+ class TextEncoder256Sim(nn.Module):
52
+ def __init__( self, out_channels, hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout, f0=True):
53
+ super().__init__()
54
+ self.out_channels = out_channels
55
+ self.hidden_channels = hidden_channels
56
+ self.filter_channels = filter_channels
57
+ self.n_heads = n_heads
58
+ self.n_layers = n_layers
59
+ self.kernel_size = kernel_size
60
+ self.p_dropout = p_dropout
61
+ self.emb_phone = nn.Linear(256, hidden_channels)
62
+ self.lrelu=nn.LeakyReLU(0.1,inplace=True)
63
+ if(f0==True):
64
+ self.emb_pitch = nn.Embedding(256, hidden_channels) # pitch 256
65
+ self.encoder = attentions.Encoder(
66
+ hidden_channels, filter_channels, n_heads, n_layers, kernel_size, p_dropout
67
+ )
68
+ self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
69
+
70
+ def forward(self, phone, pitch, lengths):
71
+ if pitch is None:
72
+ x = self.emb_phone(phone)
73
+ else:
74
+ x = self.emb_phone(phone) + self.emb_pitch(pitch)
75
+ x = x * math.sqrt(self.hidden_channels) # [b, t, h]
76
+ x=self.lrelu(x)
77
+ x = torch.transpose(x, 1, -1) # [b, h, t]
78
+ x_mask = torch.unsqueeze(commons.sequence_mask(lengths, x.size(2)), 1).to(x.dtype)
79
+ x = self.encoder(x * x_mask, x_mask)
80
+ x = self.proj(x) * x_mask
81
+ return x,x_mask
82
+ class ResidualCouplingBlock(nn.Module):
83
+ def __init__(
84
+ self,
85
+ channels,
86
+ hidden_channels,
87
+ kernel_size,
88
+ dilation_rate,
89
+ n_layers,
90
+ n_flows=4,
91
+ gin_channels=0,
92
+ ):
93
+ super().__init__()
94
+ self.channels = channels
95
+ self.hidden_channels = hidden_channels
96
+ self.kernel_size = kernel_size
97
+ self.dilation_rate = dilation_rate
98
+ self.n_layers = n_layers
99
+ self.n_flows = n_flows
100
+ self.gin_channels = gin_channels
101
+
102
+ self.flows = nn.ModuleList()
103
+ for i in range(n_flows):
104
+ self.flows.append(
105
+ modules.ResidualCouplingLayer(
106
+ channels,
107
+ hidden_channels,
108
+ kernel_size,
109
+ dilation_rate,
110
+ n_layers,
111
+ gin_channels=gin_channels,
112
+ mean_only=True,
113
+ )
114
+ )
115
+ self.flows.append(modules.Flip())
116
+
117
+ def forward(self, x, x_mask, g=None, reverse=False):
118
+ if not reverse:
119
+ for flow in self.flows:
120
+ x, _ = flow(x, x_mask, g=g, reverse=reverse)
121
+ else:
122
+ for flow in reversed(self.flows):
123
+ x = flow(x, x_mask, g=g, reverse=reverse)
124
+ return x
125
+
126
+ def remove_weight_norm(self):
127
+ for i in range(self.n_flows):
128
+ self.flows[i * 2].remove_weight_norm()
129
+ class PosteriorEncoder(nn.Module):
130
+ def __init__(
131
+ self,
132
+ in_channels,
133
+ out_channels,
134
+ hidden_channels,
135
+ kernel_size,
136
+ dilation_rate,
137
+ n_layers,
138
+ gin_channels=0,
139
+ ):
140
+ super().__init__()
141
+ self.in_channels = in_channels
142
+ self.out_channels = out_channels
143
+ self.hidden_channels = hidden_channels
144
+ self.kernel_size = kernel_size
145
+ self.dilation_rate = dilation_rate
146
+ self.n_layers = n_layers
147
+ self.gin_channels = gin_channels
148
+
149
+ self.pre = nn.Conv1d(in_channels, hidden_channels, 1)
150
+ self.enc = modules.WN(
151
+ hidden_channels,
152
+ kernel_size,
153
+ dilation_rate,
154
+ n_layers,
155
+ gin_channels=gin_channels,
156
+ )
157
+ self.proj = nn.Conv1d(hidden_channels, out_channels * 2, 1)
158
+
159
+ def forward(self, x, x_lengths, g=None):
160
+ x_mask = torch.unsqueeze(commons.sequence_mask(x_lengths, x.size(2)), 1).to(
161
+ x.dtype
162
+ )
163
+ x = self.pre(x) * x_mask
164
+ x = self.enc(x, x_mask, g=g)
165
+ stats = self.proj(x) * x_mask
166
+ m, logs = torch.split(stats, self.out_channels, dim=1)
167
+ z = (m + torch.randn_like(m) * torch.exp(logs)) * x_mask
168
+ return z, m, logs, x_mask
169
+
170
+ def remove_weight_norm(self):
171
+ self.enc.remove_weight_norm()
172
+ class Generator(torch.nn.Module):
173
+ def __init__(
174
+ self,
175
+ initial_channel,
176
+ resblock,
177
+ resblock_kernel_sizes,
178
+ resblock_dilation_sizes,
179
+ upsample_rates,
180
+ upsample_initial_channel,
181
+ upsample_kernel_sizes,
182
+ gin_channels=0,
183
+ ):
184
+ super(Generator, self).__init__()
185
+ self.num_kernels = len(resblock_kernel_sizes)
186
+ self.num_upsamples = len(upsample_rates)
187
+ self.conv_pre = Conv1d(
188
+ initial_channel, upsample_initial_channel, 7, 1, padding=3
189
+ )
190
+ resblock = modules.ResBlock1 if resblock == "1" else modules.ResBlock2
191
+
192
+ self.ups = nn.ModuleList()
193
+ for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
194
+ self.ups.append(
195
+ weight_norm(
196
+ ConvTranspose1d(
197
+ upsample_initial_channel // (2**i),
198
+ upsample_initial_channel // (2 ** (i + 1)),
199
+ k,
200
+ u,
201
+ padding=(k - u) // 2,
202
+ )
203
+ )
204
+ )
205
+
206
+ self.resblocks = nn.ModuleList()
207
+ for i in range(len(self.ups)):
208
+ ch = upsample_initial_channel // (2 ** (i + 1))
209
+ for j, (k, d) in enumerate(
210
+ zip(resblock_kernel_sizes, resblock_dilation_sizes)
211
+ ):
212
+ self.resblocks.append(resblock(ch, k, d))
213
+
214
+ self.conv_post = Conv1d(ch, 1, 7, 1, padding=3, bias=False)
215
+ self.ups.apply(init_weights)
216
+
217
+ if gin_channels != 0:
218
+ self.cond = nn.Conv1d(gin_channels, upsample_initial_channel, 1)
219
+
220
+ def forward(self, x, g=None):
221
+ x = self.conv_pre(x)
222
+ if g is not None:
223
+ x = x + self.cond(g)
224
+
225
+ for i in range(self.num_upsamples):
226
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
227
+ x = self.ups[i](x)
228
+ xs = None
229
+ for j in range(self.num_kernels):
230
+ if xs is None:
231
+ xs = self.resblocks[i * self.num_kernels + j](x)
232
+ else:
233
+ xs += self.resblocks[i * self.num_kernels + j](x)
234
+ x = xs / self.num_kernels
235
+ x = F.leaky_relu(x)
236
+ x = self.conv_post(x)
237
+ x = torch.tanh(x)
238
+
239
+ return x
240
+
241
+ def remove_weight_norm(self):
242
+ for l in self.ups:
243
+ remove_weight_norm(l)
244
+ for l in self.resblocks:
245
+ l.remove_weight_norm()
246
+ class SineGen(torch.nn.Module):
247
+ """ Definition of sine generator
248
+ SineGen(samp_rate, harmonic_num = 0,
249
+ sine_amp = 0.1, noise_std = 0.003,
250
+ voiced_threshold = 0,
251
+ flag_for_pulse=False)
252
+ samp_rate: sampling rate in Hz
253
+ harmonic_num: number of harmonic overtones (default 0)
254
+ sine_amp: amplitude of sine waveform (default 0.1)
255
+ noise_std: std of Gaussian noise (default 0.003)
256
+ voiced_threshold: F0 threshold for U/V classification (default 0)
257
+ flag_for_pulse: this SineGen is used inside PulseGen (default False)
258
+ Note: when flag_for_pulse is True, the first time step of a voiced
259
+ segment is always sin(np.pi) or cos(0)
260
+ """
261
+
262
+ def __init__(self, samp_rate, harmonic_num=0,
263
+ sine_amp=0.1, noise_std=0.003,
264
+ voiced_threshold=0,
265
+ flag_for_pulse=False):
266
+ super(SineGen, self).__init__()
267
+ self.sine_amp = sine_amp
268
+ self.noise_std = noise_std
269
+ self.harmonic_num = harmonic_num
270
+ self.dim = self.harmonic_num + 1
271
+ self.sampling_rate = samp_rate
272
+ self.voiced_threshold = voiced_threshold
273
+
274
+ def _f02uv(self, f0):
275
+ # generate uv signal
276
+ uv = torch.ones_like(f0)
277
+ uv = uv * (f0 > self.voiced_threshold)
278
+ return uv
279
+
280
+ def forward(self, f0,upp):
281
+ """ sine_tensor, uv = forward(f0)
282
+ input F0: tensor(batchsize=1, length, dim=1)
283
+ f0 for unvoiced steps should be 0
284
+ output sine_tensor: tensor(batchsize=1, length, dim)
285
+ output uv: tensor(batchsize=1, length, 1)
286
+ """
287
+ with torch.no_grad():
288
+ f0 = f0[:, None].transpose(1, 2)
289
+ f0_buf = torch.zeros(f0.shape[0], f0.shape[1], self.dim,device=f0.device)
290
+ # fundamental component
291
+ f0_buf[:, :, 0] = f0[:, :, 0]
292
+ for idx in np.arange(self.harmonic_num):f0_buf[:, :, idx + 1] = f0_buf[:, :, 0] * (idx + 2)# idx + 2: the (idx+1)-th overtone, (idx+2)-th harmonic
293
+ rad_values = (f0_buf / self.sampling_rate) % 1 # the % 1 here means the n_har multiplication cannot be optimized in post-processing
294
+ rand_ini = torch.rand(f0_buf.shape[0], f0_buf.shape[2], device=f0_buf.device)
295
+ rand_ini[:, 0] = 0
296
+ rad_values[:, 0, :] = rad_values[:, 0, :] + rand_ini
297
+ tmp_over_one = torch.cumsum(rad_values, 1) # % 1  # applying % 1 here would mean the later cumsum can no longer be optimized
298
+ tmp_over_one*=upp
299
+ tmp_over_one=F.interpolate(tmp_over_one.transpose(2, 1), scale_factor=upp, mode='linear', align_corners=True).transpose(2, 1)
300
+ rad_values=F.interpolate(rad_values.transpose(2, 1), scale_factor=upp, mode='nearest').transpose(2, 1)#######
301
+ tmp_over_one%=1
302
+ tmp_over_one_idx = (tmp_over_one[:, 1:, :] - tmp_over_one[:, :-1, :]) < 0
303
+ cumsum_shift = torch.zeros_like(rad_values)
304
+ cumsum_shift[:, 1:, :] = tmp_over_one_idx * -1.0
305
+ sine_waves = torch.sin(torch.cumsum(rad_values + cumsum_shift, dim=1) * 2 * np.pi)
306
+ sine_waves = sine_waves * self.sine_amp
307
+ uv = self._f02uv(f0)
308
+ uv = F.interpolate(uv.transpose(2, 1), scale_factor=upp, mode='nearest').transpose(2, 1)
309
+ noise_amp = uv * self.noise_std + (1 - uv) * self.sine_amp / 3
310
+ noise = noise_amp * torch.randn_like(sine_waves)
311
+ sine_waves = sine_waves * uv + noise
312
+ return sine_waves, uv, noise
313
+ class SourceModuleHnNSF(torch.nn.Module):
314
+ """ SourceModule for hn-nsf
315
+ SourceModule(sampling_rate, harmonic_num=0, sine_amp=0.1,
316
+ add_noise_std=0.003, voiced_threshod=0)
317
+ sampling_rate: sampling_rate in Hz
318
+ harmonic_num: number of harmonic above F0 (default: 0)
319
+ sine_amp: amplitude of sine source signal (default: 0.1)
320
+ add_noise_std: std of additive Gaussian noise (default: 0.003)
321
+ note that amplitude of noise in unvoiced is decided
322
+ by sine_amp
323
+ voiced_threshold: threshold to set U/V given F0 (default: 0)
324
+ Sine_source, noise_source = SourceModuleHnNSF(F0_sampled)
325
+ F0_sampled (batchsize, length, 1)
326
+ Sine_source (batchsize, length, 1)
327
+ noise_source (batchsize, length, 1)
328
+ uv (batchsize, length, 1)
329
+ """
330
+
331
+ def __init__(self, sampling_rate, harmonic_num=0, sine_amp=0.1,
332
+ add_noise_std=0.003, voiced_threshod=0,is_half=True):
333
+ super(SourceModuleHnNSF, self).__init__()
334
+
335
+ self.sine_amp = sine_amp
336
+ self.noise_std = add_noise_std
337
+ self.is_half=is_half
338
+ # to produce sine waveforms
339
+ self.l_sin_gen = SineGen(sampling_rate, harmonic_num,
340
+ sine_amp, add_noise_std, voiced_threshod)
341
+
342
+ # to merge source harmonics into a single excitation
343
+ self.l_linear = torch.nn.Linear(harmonic_num + 1, 1)
344
+ self.l_tanh = torch.nn.Tanh()
345
+
346
+ def forward(self, x,upp=None):
347
+ sine_wavs, uv, _ = self.l_sin_gen(x,upp)
348
+ if(self.is_half):sine_wavs=sine_wavs.half()
349
+ sine_merge = self.l_tanh(self.l_linear(sine_wavs))
350
+ return sine_merge,None,None# noise, uv
351
+ class GeneratorNSF(torch.nn.Module):
352
+ def __init__(
353
+ self,
354
+ initial_channel,
355
+ resblock,
356
+ resblock_kernel_sizes,
357
+ resblock_dilation_sizes,
358
+ upsample_rates,
359
+ upsample_initial_channel,
360
+ upsample_kernel_sizes,
361
+ gin_channels,
362
+ sr,
363
+ is_half=False
364
+ ):
365
+ super(GeneratorNSF, self).__init__()
366
+ self.num_kernels = len(resblock_kernel_sizes)
367
+ self.num_upsamples = len(upsample_rates)
368
+
369
+ self.f0_upsamp = torch.nn.Upsample(scale_factor=np.prod(upsample_rates))
370
+ self.m_source = SourceModuleHnNSF(
371
+ sampling_rate=sr,
372
+ harmonic_num=0,
373
+ is_half=is_half
374
+ )
375
+ self.noise_convs = nn.ModuleList()
376
+ self.conv_pre = Conv1d(
377
+ initial_channel, upsample_initial_channel, 7, 1, padding=3
378
+ )
379
+ resblock = modules.ResBlock1 if resblock == "1" else modules.ResBlock2
380
+
381
+ self.ups = nn.ModuleList()
382
+ for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
383
+ c_cur = upsample_initial_channel // (2 ** (i + 1))
384
+ self.ups.append(
385
+ weight_norm(
386
+ ConvTranspose1d(
387
+ upsample_initial_channel // (2**i),
388
+ upsample_initial_channel // (2 ** (i + 1)),
389
+ k,
390
+ u,
391
+ padding=(k - u) // 2,
392
+ )
393
+ )
394
+ )
395
+ if i + 1 < len(upsample_rates):
396
+ stride_f0 = np.prod(upsample_rates[i + 1:])
397
+ self.noise_convs.append(Conv1d(
398
+ 1, c_cur, kernel_size=stride_f0 * 2, stride=stride_f0, padding=stride_f0 // 2))
399
+ else:
400
+ self.noise_convs.append(Conv1d(1, c_cur, kernel_size=1))
401
+
402
+ self.resblocks = nn.ModuleList()
403
+ for i in range(len(self.ups)):
404
+ ch = upsample_initial_channel // (2 ** (i + 1))
405
+ for j, (k, d) in enumerate(
406
+ zip(resblock_kernel_sizes, resblock_dilation_sizes)
407
+ ):
408
+ self.resblocks.append(resblock(ch, k, d))
409
+
410
+ self.conv_post = Conv1d(ch, 1, 7, 1, padding=3, bias=False)
411
+ self.ups.apply(init_weights)
412
+
413
+ if gin_channels != 0:
414
+ self.cond = nn.Conv1d(gin_channels, upsample_initial_channel, 1)
415
+
416
+ self.upp=np.prod(upsample_rates)
417
+
418
+ def forward(self, x, f0,g=None):
419
+ har_source, noi_source, uv = self.m_source(f0,self.upp)
420
+ har_source = har_source.transpose(1, 2)
421
+ x = self.conv_pre(x)
422
+ if g is not None:
423
+ x = x + self.cond(g)
424
+
425
+ for i in range(self.num_upsamples):
426
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
427
+ x = self.ups[i](x)
428
+ x_source = self.noise_convs[i](har_source)
429
+ x = x + x_source
430
+ xs = None
431
+ for j in range(self.num_kernels):
432
+ if xs is None:
433
+ xs = self.resblocks[i * self.num_kernels + j](x)
434
+ else:
435
+ xs += self.resblocks[i * self.num_kernels + j](x)
436
+ x = xs / self.num_kernels
437
+ x = F.leaky_relu(x)
438
+ x = self.conv_post(x)
439
+ x = torch.tanh(x)
440
+ return x
441
+
442
+ def remove_weight_norm(self):
443
+ for l in self.ups:
444
+ remove_weight_norm(l)
445
+ for l in self.resblocks:
446
+ l.remove_weight_norm()
447
+ sr2sr={
448
+ "32k":32000,
449
+ "40k":40000,
450
+ "48k":48000,
451
+ }
452
+ class SynthesizerTrnMs256NSFsid(nn.Module):
453
+ def __init__(
454
+ self,
455
+ spec_channels,
456
+ segment_size,
457
+ inter_channels,
458
+ hidden_channels,
459
+ filter_channels,
460
+ n_heads,
461
+ n_layers,
462
+ kernel_size,
463
+ p_dropout,
464
+ resblock,
465
+ resblock_kernel_sizes,
466
+ resblock_dilation_sizes,
467
+ upsample_rates,
468
+ upsample_initial_channel,
469
+ upsample_kernel_sizes,
470
+ spk_embed_dim,
471
+ gin_channels,
472
+ sr,
473
+ **kwargs
474
+ ):
475
+
476
+ super().__init__()
477
+ if isinstance(sr, str):
478
+ sr=sr2sr[sr]
479
+ self.spec_channels = spec_channels
480
+ self.inter_channels = inter_channels
481
+ self.hidden_channels = hidden_channels
482
+ self.filter_channels = filter_channels
483
+ self.n_heads = n_heads
484
+ self.n_layers = n_layers
485
+ self.kernel_size = kernel_size
486
+ self.p_dropout = p_dropout
487
+ self.resblock = resblock
488
+ self.resblock_kernel_sizes = resblock_kernel_sizes
489
+ self.resblock_dilation_sizes = resblock_dilation_sizes
490
+ self.upsample_rates = upsample_rates
491
+ self.upsample_initial_channel = upsample_initial_channel
492
+ self.upsample_kernel_sizes = upsample_kernel_sizes
493
+ self.segment_size = segment_size
494
+ self.gin_channels = gin_channels
495
+ # self.hop_length = hop_length#
496
+ self.spk_embed_dim=spk_embed_dim
497
+ self.enc_p = TextEncoder256(
498
+ inter_channels,
499
+ hidden_channels,
500
+ filter_channels,
501
+ n_heads,
502
+ n_layers,
503
+ kernel_size,
504
+ p_dropout,
505
+ )
506
+ self.dec = GeneratorNSF(
507
+ inter_channels,
508
+ resblock,
509
+ resblock_kernel_sizes,
510
+ resblock_dilation_sizes,
511
+ upsample_rates,
512
+ upsample_initial_channel,
513
+ upsample_kernel_sizes,
514
+ gin_channels=gin_channels, sr=sr, is_half=kwargs["is_half"]
515
+ )
516
+ self.enc_q = PosteriorEncoder(
517
+ spec_channels,
518
+ inter_channels,
519
+ hidden_channels,
520
+ 5,
521
+ 1,
522
+ 16,
523
+ gin_channels=gin_channels,
524
+ )
525
+ self.flow = ResidualCouplingBlock(
526
+ inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
527
+ )
528
+ self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
529
+ print("gin_channels:",gin_channels,"self.spk_embed_dim:",self.spk_embed_dim)
530
+ def remove_weight_norm(self):
531
+ self.dec.remove_weight_norm()
532
+ self.flow.remove_weight_norm()
533
+ self.enc_q.remove_weight_norm()
534
+
535
+ def forward(self, phone, phone_lengths, pitch, nsff0 ,sid, rnd, max_len=None):
536
+
537
+ g = self.emb_g(sid).unsqueeze(-1)
538
+ m_p, logs_p, x_mask = self.enc_p(phone, pitch, phone_lengths)
539
+ z_p = (m_p + torch.exp(logs_p) * rnd) * x_mask
540
+ z = self.flow(z_p, x_mask, g=g, reverse=True)
541
+ o = self.dec((z * x_mask)[:, :, :max_len], nsff0,g=g)
542
+ return o
543
+
544
+ class SynthesizerTrnMs256NSFsid_sim(nn.Module):
545
+ """
546
+ Synthesizer for Training
547
+ """
548
+
549
+ def __init__(
550
+ self,
551
+ spec_channels,
552
+ segment_size,
553
+ inter_channels,
554
+ hidden_channels,
555
+ filter_channels,
556
+ n_heads,
557
+ n_layers,
558
+ kernel_size,
559
+ p_dropout,
560
+ resblock,
561
+ resblock_kernel_sizes,
562
+ resblock_dilation_sizes,
563
+ upsample_rates,
564
+ upsample_initial_channel,
565
+ upsample_kernel_sizes,
566
+ spk_embed_dim,
567
+ # hop_length,
568
+ gin_channels=0,
569
+ use_sdp=True,
570
+ **kwargs
571
+ ):
572
+
573
+ super().__init__()
574
+ self.spec_channels = spec_channels
575
+ self.inter_channels = inter_channels
576
+ self.hidden_channels = hidden_channels
577
+ self.filter_channels = filter_channels
578
+ self.n_heads = n_heads
579
+ self.n_layers = n_layers
580
+ self.kernel_size = kernel_size
581
+ self.p_dropout = p_dropout
582
+ self.resblock = resblock
583
+ self.resblock_kernel_sizes = resblock_kernel_sizes
584
+ self.resblock_dilation_sizes = resblock_dilation_sizes
585
+ self.upsample_rates = upsample_rates
586
+ self.upsample_initial_channel = upsample_initial_channel
587
+ self.upsample_kernel_sizes = upsample_kernel_sizes
588
+ self.segment_size = segment_size
589
+ self.gin_channels = gin_channels
590
+ # self.hop_length = hop_length#
591
+ self.spk_embed_dim=spk_embed_dim
592
+ self.enc_p = TextEncoder256Sim(
593
+ inter_channels,
594
+ hidden_channels,
595
+ filter_channels,
596
+ n_heads,
597
+ n_layers,
598
+ kernel_size,
599
+ p_dropout,
600
+ )
601
+ self.dec = GeneratorNSF(
602
+ inter_channels,
603
+ resblock,
604
+ resblock_kernel_sizes,
605
+ resblock_dilation_sizes,
606
+ upsample_rates,
607
+ upsample_initial_channel,
608
+ upsample_kernel_sizes,
609
+ gin_channels=gin_channels,is_half=kwargs["is_half"]
610
+ )
611
+
612
+ self.flow = ResidualCouplingBlock(
613
+ inter_channels, hidden_channels, 5, 1, 3, gin_channels=gin_channels
614
+ )
615
+ self.emb_g = nn.Embedding(self.spk_embed_dim, gin_channels)
616
+ print("gin_channels:",gin_channels,"self.spk_embed_dim:",self.spk_embed_dim)
617
+ def remove_weight_norm(self):
618
+ self.dec.remove_weight_norm()
619
+ self.flow.remove_weight_norm()
620
+ self.enc_q.remove_weight_norm()
621
+
622
+ def forward(self, phone, phone_lengths, pitch, pitchf, ds,max_len=None): # y (the spectrogram) is no longer needed here
623
+ g = self.emb_g(ds.unsqueeze(0)).unsqueeze(-1) # [b, 256, 1]; the trailing 1 is the time axis t and is broadcast
624
+ x, x_mask = self.enc_p(phone, pitch, phone_lengths)
625
+ x = self.flow(x, x_mask, g=g, reverse=True)
626
+ o = self.dec((x*x_mask)[:, :, :max_len], pitchf, g=g)
627
+ return o
628
+
629
+ class MultiPeriodDiscriminator(torch.nn.Module):
630
+ def __init__(self, use_spectral_norm=False):
631
+ super(MultiPeriodDiscriminator, self).__init__()
632
+ periods = [2, 3, 5, 7, 11,17]
633
+ # periods = [3, 5, 7, 11, 17, 23, 37]
634
+
635
+ discs = [DiscriminatorS(use_spectral_norm=use_spectral_norm)]
636
+ discs = discs + [
637
+ DiscriminatorP(i, use_spectral_norm=use_spectral_norm) for i in periods
638
+ ]
639
+ self.discriminators = nn.ModuleList(discs)
640
+
641
+ def forward(self, y, y_hat):
642
+ y_d_rs = []#
643
+ y_d_gs = []
644
+ fmap_rs = []
645
+ fmap_gs = []
646
+ for i, d in enumerate(self.discriminators):
647
+ y_d_r, fmap_r = d(y)
648
+ y_d_g, fmap_g = d(y_hat)
649
+ # for j in range(len(fmap_r)):
650
+ # print(i,j,y.shape,y_hat.shape,fmap_r[j].shape,fmap_g[j].shape)
651
+ y_d_rs.append(y_d_r)
652
+ y_d_gs.append(y_d_g)
653
+ fmap_rs.append(fmap_r)
654
+ fmap_gs.append(fmap_g)
655
+
656
+ return y_d_rs, y_d_gs, fmap_rs, fmap_gs
657
+
658
+ class DiscriminatorS(torch.nn.Module):
659
+ def __init__(self, use_spectral_norm=False):
660
+ super(DiscriminatorS, self).__init__()
661
+ norm_f = weight_norm if use_spectral_norm == False else spectral_norm
662
+ self.convs = nn.ModuleList(
663
+ [
664
+ norm_f(Conv1d(1, 16, 15, 1, padding=7)),
665
+ norm_f(Conv1d(16, 64, 41, 4, groups=4, padding=20)),
666
+ norm_f(Conv1d(64, 256, 41, 4, groups=16, padding=20)),
667
+ norm_f(Conv1d(256, 1024, 41, 4, groups=64, padding=20)),
668
+ norm_f(Conv1d(1024, 1024, 41, 4, groups=256, padding=20)),
669
+ norm_f(Conv1d(1024, 1024, 5, 1, padding=2)),
670
+ ]
671
+ )
672
+ self.conv_post = norm_f(Conv1d(1024, 1, 3, 1, padding=1))
673
+
674
+ def forward(self, x):
675
+ fmap = []
676
+
677
+ for l in self.convs:
678
+ x = l(x)
679
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
680
+ fmap.append(x)
681
+ x = self.conv_post(x)
682
+ fmap.append(x)
683
+ x = torch.flatten(x, 1, -1)
684
+
685
+ return x, fmap
686
+
687
+ class DiscriminatorP(torch.nn.Module):
688
+ def __init__(self, period, kernel_size=5, stride=3, use_spectral_norm=False):
689
+ super(DiscriminatorP, self).__init__()
690
+ self.period = period
691
+ self.use_spectral_norm = use_spectral_norm
692
+ norm_f = weight_norm if use_spectral_norm == False else spectral_norm
693
+ self.convs = nn.ModuleList(
694
+ [
695
+ norm_f(
696
+ Conv2d(
697
+ 1,
698
+ 32,
699
+ (kernel_size, 1),
700
+ (stride, 1),
701
+ padding=(get_padding(kernel_size, 1), 0),
702
+ )
703
+ ),
704
+ norm_f(
705
+ Conv2d(
706
+ 32,
707
+ 128,
708
+ (kernel_size, 1),
709
+ (stride, 1),
710
+ padding=(get_padding(kernel_size, 1), 0),
711
+ )
712
+ ),
713
+ norm_f(
714
+ Conv2d(
715
+ 128,
716
+ 512,
717
+ (kernel_size, 1),
718
+ (stride, 1),
719
+ padding=(get_padding(kernel_size, 1), 0),
720
+ )
721
+ ),
722
+ norm_f(
723
+ Conv2d(
724
+ 512,
725
+ 1024,
726
+ (kernel_size, 1),
727
+ (stride, 1),
728
+ padding=(get_padding(kernel_size, 1), 0),
729
+ )
730
+ ),
731
+ norm_f(
732
+ Conv2d(
733
+ 1024,
734
+ 1024,
735
+ (kernel_size, 1),
736
+ 1,
737
+ padding=(get_padding(kernel_size, 1), 0),
738
+ )
739
+ ),
740
+ ]
741
+ )
742
+ self.conv_post = norm_f(Conv2d(1024, 1, (3, 1), 1, padding=(1, 0)))
743
+
744
+ def forward(self, x):
745
+ fmap = []
746
+
747
+ # 1d to 2d
748
+ b, c, t = x.shape
749
+ if t % self.period != 0: # pad first
750
+ n_pad = self.period - (t % self.period)
751
+ x = F.pad(x, (0, n_pad), "reflect")
752
+ t = t + n_pad
753
+ x = x.view(b, c, t // self.period, self.period)
754
+
755
+ for l in self.convs:
756
+ x = l(x)
757
+ x = F.leaky_relu(x, modules.LRELU_SLOPE)
758
+ fmap.append(x)
759
+ x = self.conv_post(x)
760
+ fmap.append(x)
761
+ x = torch.flatten(x, 1, -1)
762
+
763
+ return x, fmap
764
+
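In the file above, `SynthesizerTrnMs256NSFsid.forward` takes the noise tensor `rnd` as an explicit input instead of drawing it internally, then forms `z_p = (m_p + torch.exp(logs_p) * rnd) * x_mask`. The sketch below is a plain-Python, per-element version of that line, assuming flat lists in place of `[b, c, t]` tensors. `sample_prior` is a hypothetical name used only for illustration.

```python
import math


def sample_prior(mean, log_std, noise, mask):
    """Per element: (mean + exp(log_std) * noise) * mask."""
    return [
        (m + math.exp(s) * n) * k
        for m, s, n, k in zip(mean, log_std, noise, mask)
    ]


if __name__ == "__main__":
    # With log_std = 0 the scale is exp(0) = 1, so the noise is added unscaled.
    print(sample_prior([1.0, 2.0], [0.0, 0.0], [0.5, -1.0], [1.0, 0.0]))
```

Supplying the noise from outside makes the output a deterministic function of the inputs, which presumably suits the ONNX export this file targets. Masked positions (mask value 0) come out as zero regardless of the noise.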
infer_pack/modules.py ADDED
@@ -0,0 +1,522 @@
1
+ import copy
2
+ import math
3
+ import numpy as np
4
+ import scipy
5
+ import torch
6
+ from torch import nn
7
+ from torch.nn import functional as F
8
+
9
+ from torch.nn import Conv1d, ConvTranspose1d, AvgPool1d, Conv2d
10
+ from torch.nn.utils import weight_norm, remove_weight_norm
11
+
12
+ from infer_pack import commons
13
+ from infer_pack.commons import init_weights, get_padding
14
+ from infer_pack.transforms import piecewise_rational_quadratic_transform
15
+
16
+
17
+ LRELU_SLOPE = 0.1
18
+
19
+
20
+ class LayerNorm(nn.Module):
21
+ def __init__(self, channels, eps=1e-5):
22
+ super().__init__()
23
+ self.channels = channels
24
+ self.eps = eps
25
+
26
+ self.gamma = nn.Parameter(torch.ones(channels))
27
+ self.beta = nn.Parameter(torch.zeros(channels))
28
+
29
+ def forward(self, x):
30
+ x = x.transpose(1, -1)
31
+ x = F.layer_norm(x, (self.channels,), self.gamma, self.beta, self.eps)
32
+ return x.transpose(1, -1)
33
+
34
+
35
+ class ConvReluNorm(nn.Module):
36
+ def __init__(
37
+ self,
38
+ in_channels,
39
+ hidden_channels,
40
+ out_channels,
41
+ kernel_size,
42
+ n_layers,
43
+ p_dropout,
44
+ ):
45
+ super().__init__()
46
+ self.in_channels = in_channels
47
+ self.hidden_channels = hidden_channels
48
+ self.out_channels = out_channels
49
+ self.kernel_size = kernel_size
50
+ self.n_layers = n_layers
51
+ self.p_dropout = p_dropout
52
+ assert n_layers > 1, "Number of layers should be larger than 1."
53
+
54
+ self.conv_layers = nn.ModuleList()
55
+ self.norm_layers = nn.ModuleList()
56
+ self.conv_layers.append(
57
+ nn.Conv1d(
58
+ in_channels, hidden_channels, kernel_size, padding=kernel_size // 2
59
+ )
60
+ )
61
+ self.norm_layers.append(LayerNorm(hidden_channels))
62
+ self.relu_drop = nn.Sequential(nn.ReLU(), nn.Dropout(p_dropout))
63
+ for _ in range(n_layers - 1):
64
+ self.conv_layers.append(
65
+ nn.Conv1d(
66
+ hidden_channels,
67
+ hidden_channels,
68
+ kernel_size,
69
+ padding=kernel_size // 2,
70
+ )
71
+ )
72
+ self.norm_layers.append(LayerNorm(hidden_channels))
73
+ self.proj = nn.Conv1d(hidden_channels, out_channels, 1)
74
+ self.proj.weight.data.zero_()
75
+ self.proj.bias.data.zero_()
76
+
77
+ def forward(self, x, x_mask):
78
+ x_org = x
79
+ for i in range(self.n_layers):
80
+ x = self.conv_layers[i](x * x_mask)
81
+ x = self.norm_layers[i](x)
82
+ x = self.relu_drop(x)
83
+ x = x_org + self.proj(x)
84
+ return x * x_mask
85
+
86
+
87
+ class DDSConv(nn.Module):
88
+ """
89
+ Dilated and Depth-Separable Convolution
90
+ """
91
+
92
+ def __init__(self, channels, kernel_size, n_layers, p_dropout=0.0):
93
+ super().__init__()
94
+ self.channels = channels
95
+ self.kernel_size = kernel_size
96
+ self.n_layers = n_layers
97
+ self.p_dropout = p_dropout
98
+
99
+ self.drop = nn.Dropout(p_dropout)
100
+ self.convs_sep = nn.ModuleList()
101
+ self.convs_1x1 = nn.ModuleList()
102
+ self.norms_1 = nn.ModuleList()
103
+ self.norms_2 = nn.ModuleList()
104
+ for i in range(n_layers):
105
+ dilation = kernel_size**i
106
+ padding = (kernel_size * dilation - dilation) // 2
107
+ self.convs_sep.append(
108
+ nn.Conv1d(
109
+ channels,
110
+ channels,
111
+ kernel_size,
112
+ groups=channels,
113
+ dilation=dilation,
114
+ padding=padding,
115
+ )
116
+ )
117
+ self.convs_1x1.append(nn.Conv1d(channels, channels, 1))
118
+ self.norms_1.append(LayerNorm(channels))
119
+ self.norms_2.append(LayerNorm(channels))
120
+
121
+ def forward(self, x, x_mask, g=None):
122
+ if g is not None:
123
+ x = x + g
124
+ for i in range(self.n_layers):
125
+ y = self.convs_sep[i](x * x_mask)
126
+ y = self.norms_1[i](y)
127
+ y = F.gelu(y)
128
+ y = self.convs_1x1[i](y)
129
+ y = self.norms_2[i](y)
130
+ y = F.gelu(y)
131
+ y = self.drop(y)
132
+ x = x + y
133
+ return x * x_mask
134
+
135
+
136
+ class WN(torch.nn.Module):
137
+ def __init__(
138
+ self,
139
+ hidden_channels,
140
+ kernel_size,
141
+ dilation_rate,
142
+ n_layers,
143
+ gin_channels=0,
144
+ p_dropout=0,
145
+ ):
146
+ super(WN, self).__init__()
147
+ assert kernel_size % 2 == 1
148
+ self.hidden_channels = hidden_channels
149
+ self.kernel_size = (kernel_size,)
150
+ self.dilation_rate = dilation_rate
151
+ self.n_layers = n_layers
152
+ self.gin_channels = gin_channels
153
+ self.p_dropout = p_dropout
154
+
155
+ self.in_layers = torch.nn.ModuleList()
156
+ self.res_skip_layers = torch.nn.ModuleList()
157
+ self.drop = nn.Dropout(p_dropout)
158
+
159
+ if gin_channels != 0:
160
+ cond_layer = torch.nn.Conv1d(
161
+ gin_channels, 2 * hidden_channels * n_layers, 1
162
+ )
163
+ self.cond_layer = torch.nn.utils.weight_norm(cond_layer, name="weight")
164
+
165
+ for i in range(n_layers):
166
+ dilation = dilation_rate**i
167
+ padding = int((kernel_size * dilation - dilation) / 2)
168
+ in_layer = torch.nn.Conv1d(
169
+ hidden_channels,
170
+ 2 * hidden_channels,
171
+ kernel_size,
172
+ dilation=dilation,
173
+ padding=padding,
174
+ )
175
+ in_layer = torch.nn.utils.weight_norm(in_layer, name="weight")
176
+ self.in_layers.append(in_layer)
177
+
178
+ # last one is not necessary
179
+ if i < n_layers - 1:
180
+ res_skip_channels = 2 * hidden_channels
181
+ else:
182
+ res_skip_channels = hidden_channels
183
+
184
+ res_skip_layer = torch.nn.Conv1d(hidden_channels, res_skip_channels, 1)
185
+ res_skip_layer = torch.nn.utils.weight_norm(res_skip_layer, name="weight")
186
+ self.res_skip_layers.append(res_skip_layer)
187
+
188
+ def forward(self, x, x_mask, g=None, **kwargs):
189
+ output = torch.zeros_like(x)
190
+ n_channels_tensor = torch.IntTensor([self.hidden_channels])
191
+
192
+ if g is not None:
193
+ g = self.cond_layer(g)
194
+
195
+ for i in range(self.n_layers):
196
+ x_in = self.in_layers[i](x)
197
+ if g is not None:
198
+ cond_offset = i * 2 * self.hidden_channels
199
+ g_l = g[:, cond_offset : cond_offset + 2 * self.hidden_channels, :]
200
+ else:
201
+ g_l = torch.zeros_like(x_in)
202
+
203
+ acts = commons.fused_add_tanh_sigmoid_multiply(x_in, g_l, n_channels_tensor)
204
+ acts = self.drop(acts)
205
+
206
+ res_skip_acts = self.res_skip_layers[i](acts)
207
+ if i < self.n_layers - 1:
208
+ res_acts = res_skip_acts[:, : self.hidden_channels, :]
209
+ x = (x + res_acts) * x_mask
210
+ output = output + res_skip_acts[:, self.hidden_channels :, :]
211
+ else:
212
+ output = output + res_skip_acts
213
+ return output * x_mask
214
+
215
+ def remove_weight_norm(self):
216
+ if self.gin_channels != 0:
217
+ torch.nn.utils.remove_weight_norm(self.cond_layer)
218
+ for l in self.in_layers:
219
+ torch.nn.utils.remove_weight_norm(l)
220
+ for l in self.res_skip_layers:
221
+ torch.nn.utils.remove_weight_norm(l)
222
+
223
+
224
+ class ResBlock1(torch.nn.Module):
225
+ def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5)):
226
+ super(ResBlock1, self).__init__()
227
+ self.convs1 = nn.ModuleList(
228
+ [
229
+ weight_norm(
230
+ Conv1d(
231
+ channels,
232
+ channels,
233
+ kernel_size,
234
+ 1,
235
+ dilation=dilation[0],
236
+ padding=get_padding(kernel_size, dilation[0]),
237
+ )
238
+ ),
239
+ weight_norm(
240
+ Conv1d(
241
+ channels,
242
+ channels,
243
+ kernel_size,
244
+ 1,
245
+ dilation=dilation[1],
246
+ padding=get_padding(kernel_size, dilation[1]),
247
+ )
248
+ ),
249
+ weight_norm(
250
+ Conv1d(
251
+ channels,
252
+ channels,
253
+ kernel_size,
254
+ 1,
255
+ dilation=dilation[2],
256
+ padding=get_padding(kernel_size, dilation[2]),
257
+ )
258
+ ),
259
+ ]
260
+ )
261
+ self.convs1.apply(init_weights)
262
+
263
+ self.convs2 = nn.ModuleList(
264
+ [
265
+ weight_norm(
266
+ Conv1d(
267
+ channels,
268
+ channels,
269
+ kernel_size,
270
+ 1,
271
+ dilation=1,
272
+ padding=get_padding(kernel_size, 1),
273
+ )
274
+ ),
275
+ weight_norm(
276
+ Conv1d(
277
+ channels,
278
+ channels,
279
+ kernel_size,
280
+ 1,
281
+ dilation=1,
282
+ padding=get_padding(kernel_size, 1),
283
+ )
284
+ ),
285
+ weight_norm(
286
+ Conv1d(
287
+ channels,
288
+ channels,
289
+ kernel_size,
290
+ 1,
291
+ dilation=1,
292
+ padding=get_padding(kernel_size, 1),
293
+ )
294
+ ),
295
+ ]
296
+ )
297
+ self.convs2.apply(init_weights)
298
+
299
+ def forward(self, x, x_mask=None):
300
+ for c1, c2 in zip(self.convs1, self.convs2):
301
+ xt = F.leaky_relu(x, LRELU_SLOPE)
302
+ if x_mask is not None:
303
+ xt = xt * x_mask
304
+ xt = c1(xt)
305
+ xt = F.leaky_relu(xt, LRELU_SLOPE)
306
+ if x_mask is not None:
307
+ xt = xt * x_mask
308
+ xt = c2(xt)
309
+ x = xt + x
310
+ if x_mask is not None:
311
+ x = x * x_mask
312
+ return x
313
+
314
+ def remove_weight_norm(self):
315
+ for l in self.convs1:
316
+ remove_weight_norm(l)
317
+ for l in self.convs2:
318
+ remove_weight_norm(l)
319
+
320
+
321
+ class ResBlock2(torch.nn.Module):
322
+ def __init__(self, channels, kernel_size=3, dilation=(1, 3)):
323
+ super(ResBlock2, self).__init__()
324
+ self.convs = nn.ModuleList(
325
+ [
326
+ weight_norm(
327
+ Conv1d(
328
+ channels,
329
+ channels,
330
+ kernel_size,
331
+ 1,
332
+ dilation=dilation[0],
333
+ padding=get_padding(kernel_size, dilation[0]),
334
+ )
335
+ ),
336
+ weight_norm(
337
+ Conv1d(
338
+ channels,
339
+ channels,
340
+ kernel_size,
341
+ 1,
342
+ dilation=dilation[1],
343
+ padding=get_padding(kernel_size, dilation[1]),
344
+ )
345
+ ),
346
+ ]
347
+ )
348
+ self.convs.apply(init_weights)
349
+
350
+ def forward(self, x, x_mask=None):
351
+ for c in self.convs:
352
+ xt = F.leaky_relu(x, LRELU_SLOPE)
353
+ if x_mask is not None:
354
+ xt = xt * x_mask
355
+ xt = c(xt)
356
+ x = xt + x
357
+ if x_mask is not None:
358
+ x = x * x_mask
359
+ return x
360
+
361
+ def remove_weight_norm(self):
362
+ for l in self.convs:
363
+ remove_weight_norm(l)
364
+
365
+
366
+ class Log(nn.Module):
367
+ def forward(self, x, x_mask, reverse=False, **kwargs):
368
+ if not reverse:
369
+ y = torch.log(torch.clamp_min(x, 1e-5)) * x_mask
370
+ logdet = torch.sum(-y, [1, 2])
371
+ return y, logdet
372
+ else:
373
+ x = torch.exp(x) * x_mask
374
+ return x
375
+
376
+
377
+ class Flip(nn.Module):
378
+ def forward(self, x, *args, reverse=False, **kwargs):
379
+ x = torch.flip(x, [1])
380
+ if not reverse:
381
+ logdet = torch.zeros(x.size(0)).to(dtype=x.dtype, device=x.device)
382
+ return x, logdet
383
+ else:
384
+ return x
385
+
386
+
387
+ class ElementwiseAffine(nn.Module):
388
+ def __init__(self, channels):
389
+ super().__init__()
390
+ self.channels = channels
391
+ self.m = nn.Parameter(torch.zeros(channels, 1))
392
+ self.logs = nn.Parameter(torch.zeros(channels, 1))
393
+
394
+ def forward(self, x, x_mask, reverse=False, **kwargs):
395
+ if not reverse:
396
+ y = self.m + torch.exp(self.logs) * x
397
+ y = y * x_mask
398
+ logdet = torch.sum(self.logs * x_mask, [1, 2])
399
+ return y, logdet
400
+ else:
401
+ x = (x - self.m) * torch.exp(-self.logs) * x_mask
402
+ return x
403
+
404
+
405
+ class ResidualCouplingLayer(nn.Module):
406
+ def __init__(
407
+ self,
408
+ channels,
409
+ hidden_channels,
410
+ kernel_size,
411
+ dilation_rate,
412
+ n_layers,
413
+ p_dropout=0,
414
+ gin_channels=0,
415
+ mean_only=False,
416
+ ):
417
+ assert channels % 2 == 0, "channels should be divisible by 2"
418
+ super().__init__()
419
+ self.channels = channels
420
+ self.hidden_channels = hidden_channels
421
+ self.kernel_size = kernel_size
422
+ self.dilation_rate = dilation_rate
423
+ self.n_layers = n_layers
424
+ self.half_channels = channels // 2
425
+ self.mean_only = mean_only
426
+
427
+ self.pre = nn.Conv1d(self.half_channels, hidden_channels, 1)
428
+ self.enc = WN(
429
+ hidden_channels,
430
+ kernel_size,
431
+ dilation_rate,
432
+ n_layers,
433
+ p_dropout=p_dropout,
434
+ gin_channels=gin_channels,
435
+ )
436
+ self.post = nn.Conv1d(hidden_channels, self.half_channels * (2 - mean_only), 1)
437
+ self.post.weight.data.zero_()
438
+ self.post.bias.data.zero_()
439
+
440
+ def forward(self, x, x_mask, g=None, reverse=False):
441
+ x0, x1 = torch.split(x, [self.half_channels] * 2, 1)
442
+ h = self.pre(x0) * x_mask
443
+ h = self.enc(h, x_mask, g=g)
444
+ stats = self.post(h) * x_mask
445
+ if not self.mean_only:
446
+ m, logs = torch.split(stats, [self.half_channels] * 2, 1)
447
+ else:
448
+ m = stats
449
+ logs = torch.zeros_like(m)
450
+
451
+ if not reverse:
452
+ x1 = m + x1 * torch.exp(logs) * x_mask
453
+ x = torch.cat([x0, x1], 1)
454
+ logdet = torch.sum(logs, [1, 2])
455
+ return x, logdet
456
+ else:
457
+ x1 = (x1 - m) * torch.exp(-logs) * x_mask
458
+ x = torch.cat([x0, x1], 1)
459
+ return x
460
+
461
+ def remove_weight_norm(self):
462
+ self.enc.remove_weight_norm()
463
+
464
+
465
+ class ConvFlow(nn.Module):
466
+ def __init__(
467
+ self,
468
+ in_channels,
469
+ filter_channels,
470
+ kernel_size,
471
+ n_layers,
472
+ num_bins=10,
473
+ tail_bound=5.0,
474
+ ):
475
+ super().__init__()
476
+ self.in_channels = in_channels
477
+ self.filter_channels = filter_channels
478
+ self.kernel_size = kernel_size
479
+ self.n_layers = n_layers
480
+ self.num_bins = num_bins
481
+ self.tail_bound = tail_bound
482
+ self.half_channels = in_channels // 2
483
+
484
+ self.pre = nn.Conv1d(self.half_channels, filter_channels, 1)
485
+ self.convs = DDSConv(filter_channels, kernel_size, n_layers, p_dropout=0.0)
486
+ self.proj = nn.Conv1d(
487
+ filter_channels, self.half_channels * (num_bins * 3 - 1), 1
488
+ )
489
+ self.proj.weight.data.zero_()
490
+ self.proj.bias.data.zero_()
491
+
492
+ def forward(self, x, x_mask, g=None, reverse=False):
493
+ x0, x1 = torch.split(x, [self.half_channels] * 2, 1)
494
+ h = self.pre(x0)
495
+ h = self.convs(h, x_mask, g=g)
496
+ h = self.proj(h) * x_mask
497
+
498
+ b, c, t = x0.shape
499
+ h = h.reshape(b, c, -1, t).permute(0, 1, 3, 2) # [b, cx?, t] -> [b, c, t, ?]
500
+
501
+ unnormalized_widths = h[..., : self.num_bins] / math.sqrt(self.filter_channels)
502
+ unnormalized_heights = h[..., self.num_bins : 2 * self.num_bins] / math.sqrt(
503
+ self.filter_channels
504
+ )
505
+ unnormalized_derivatives = h[..., 2 * self.num_bins :]
506
+
507
+ x1, logabsdet = piecewise_rational_quadratic_transform(
508
+ x1,
509
+ unnormalized_widths,
510
+ unnormalized_heights,
511
+ unnormalized_derivatives,
512
+ inverse=reverse,
513
+ tails="linear",
514
+ tail_bound=self.tail_bound,
515
+ )
516
+
517
+ x = torch.cat([x0, x1], 1) * x_mask
518
+ logdet = torch.sum(logabsdet * x_mask, [1, 2])
519
+ if not reverse:
520
+ return x, logdet
521
+ else:
522
+ return x
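The coupling layers above (`ResidualCouplingLayer`, and `ElementwiseAffine` in a simpler form) are invertible because only half of the channels are transformed, and the scale and shift are computed from the untouched half. The following is a minimal pure-Python sketch of that idea, not the module itself: `toy_stats` is a hypothetical stand-in for the `pre -> WN -> post` stack, and lists of floats replace tensors.

```python
import math


def toy_stats(x0):
    """Hypothetical stand-in for pre -> WN -> post: maps x0 to (m, logs)."""
    m = [0.5 * v for v in x0]
    logs = [math.tanh(v) for v in x0]
    return m, logs


def coupling_forward(x):
    """Forward pass: x0 passes through unchanged, x1 gets an affine map."""
    half = len(x) // 2
    x0, x1 = x[:half], x[half:]
    m, logs = toy_stats(x0)
    y1 = [mi + xi * math.exp(li) for mi, xi, li in zip(m, x1, logs)]
    logdet = sum(logs)  # log|det J| of an elementwise affine map
    return x0 + y1, logdet


def coupling_reverse(y):
    """Reverse pass: recompute (m, logs) from the unchanged half, then invert."""
    half = len(y) // 2
    y0, y1 = y[:half], y[half:]
    m, logs = toy_stats(y0)
    x1 = [(yi - mi) * math.exp(-li) for mi, yi, li in zip(m, y1, logs)]
    return y0 + x1


x = [0.3, -1.2, 0.8, 2.0]
y, logdet = coupling_forward(x)
x_rec = coupling_reverse(y)
```

Because `x0` is copied through unchanged, the reverse pass can rebuild exactly the same `m` and `logs`. This is also why `Flip` is typically interleaved between coupling layers, so that both halves eventually get transformed.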
infer_pack/transforms.py ADDED
@@ -0,0 +1,193 @@
1
+ import torch
2
+ from torch.nn import functional as F
3
+
4
+ import numpy as np
5
+
6
+
7
+ DEFAULT_MIN_BIN_WIDTH = 1e-3
8
+ DEFAULT_MIN_BIN_HEIGHT = 1e-3
9
+ DEFAULT_MIN_DERIVATIVE = 1e-3
10
+
11
+
12
+ def piecewise_rational_quadratic_transform(inputs,
13
+ unnormalized_widths,
14
+ unnormalized_heights,
15
+ unnormalized_derivatives,
16
+ inverse=False,
17
+ tails=None,
18
+ tail_bound=1.,
19
+ min_bin_width=DEFAULT_MIN_BIN_WIDTH,
20
+ min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
21
+ min_derivative=DEFAULT_MIN_DERIVATIVE):
22
+
23
+ if tails is None:
24
+ spline_fn = rational_quadratic_spline
25
+ spline_kwargs = {}
26
+ else:
27
+ spline_fn = unconstrained_rational_quadratic_spline
28
+ spline_kwargs = {
29
+ 'tails': tails,
30
+ 'tail_bound': tail_bound
31
+ }
32
+
33
+ outputs, logabsdet = spline_fn(
34
+ inputs=inputs,
35
+ unnormalized_widths=unnormalized_widths,
36
+ unnormalized_heights=unnormalized_heights,
37
+ unnormalized_derivatives=unnormalized_derivatives,
38
+ inverse=inverse,
39
+ min_bin_width=min_bin_width,
40
+ min_bin_height=min_bin_height,
41
+ min_derivative=min_derivative,
42
+ **spline_kwargs
43
+ )
44
+ return outputs, logabsdet
45
+
46
+
47
+ def searchsorted(bin_locations, inputs, eps=1e-6):
48
+ bin_locations[..., -1] += eps
49
+ return torch.sum(
50
+ inputs[..., None] >= bin_locations,
51
+ dim=-1
52
+ ) - 1
53
+
54
+
55
+ def unconstrained_rational_quadratic_spline(inputs,
56
+ unnormalized_widths,
57
+ unnormalized_heights,
58
+ unnormalized_derivatives,
59
+ inverse=False,
60
+ tails='linear',
61
+ tail_bound=1.,
62
+ min_bin_width=DEFAULT_MIN_BIN_WIDTH,
63
+ min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
64
+ min_derivative=DEFAULT_MIN_DERIVATIVE):
65
+ inside_interval_mask = (inputs >= -tail_bound) & (inputs <= tail_bound)
66
+ outside_interval_mask = ~inside_interval_mask
67
+
68
+ outputs = torch.zeros_like(inputs)
69
+ logabsdet = torch.zeros_like(inputs)
70
+
71
+ if tails == 'linear':
72
+ unnormalized_derivatives = F.pad(unnormalized_derivatives, pad=(1, 1))
73
+ constant = np.log(np.exp(1 - min_derivative) - 1)
74
+ unnormalized_derivatives[..., 0] = constant
75
+ unnormalized_derivatives[..., -1] = constant
76
+
77
+ outputs[outside_interval_mask] = inputs[outside_interval_mask]
78
+ logabsdet[outside_interval_mask] = 0
79
+ else:
80
+ raise RuntimeError('{} tails are not implemented.'.format(tails))
81
+
82
+ outputs[inside_interval_mask], logabsdet[inside_interval_mask] = rational_quadratic_spline(
83
+ inputs=inputs[inside_interval_mask],
84
+ unnormalized_widths=unnormalized_widths[inside_interval_mask, :],
85
+ unnormalized_heights=unnormalized_heights[inside_interval_mask, :],
86
+ unnormalized_derivatives=unnormalized_derivatives[inside_interval_mask, :],
87
+ inverse=inverse,
88
+ left=-tail_bound, right=tail_bound, bottom=-tail_bound, top=tail_bound,
89
+ min_bin_width=min_bin_width,
90
+ min_bin_height=min_bin_height,
91
+ min_derivative=min_derivative
92
+ )
93
+
94
+ return outputs, logabsdet
95
+
96
+ def rational_quadratic_spline(inputs,
97
+ unnormalized_widths,
98
+ unnormalized_heights,
99
+ unnormalized_derivatives,
100
+ inverse=False,
101
+ left=0., right=1., bottom=0., top=1.,
102
+ min_bin_width=DEFAULT_MIN_BIN_WIDTH,
103
+ min_bin_height=DEFAULT_MIN_BIN_HEIGHT,
104
+ min_derivative=DEFAULT_MIN_DERIVATIVE):
105
+ if torch.min(inputs) < left or torch.max(inputs) > right:
106
+ raise ValueError('Input to a transform is not within its domain')
107
+
108
+ num_bins = unnormalized_widths.shape[-1]
109
+
110
+ if min_bin_width * num_bins > 1.0:
111
+ raise ValueError('Minimal bin width too large for the number of bins')
112
+ if min_bin_height * num_bins > 1.0:
113
+ raise ValueError('Minimal bin height too large for the number of bins')
114
+
115
+ widths = F.softmax(unnormalized_widths, dim=-1)
116
+ widths = min_bin_width + (1 - min_bin_width * num_bins) * widths
117
+ cumwidths = torch.cumsum(widths, dim=-1)
118
+ cumwidths = F.pad(cumwidths, pad=(1, 0), mode='constant', value=0.0)
119
+ cumwidths = (right - left) * cumwidths + left
120
+ cumwidths[..., 0] = left
121
+ cumwidths[..., -1] = right
122
+ widths = cumwidths[..., 1:] - cumwidths[..., :-1]
123
+
124
+ derivatives = min_derivative + F.softplus(unnormalized_derivatives)
125
+
126
+ heights = F.softmax(unnormalized_heights, dim=-1)
127
+ heights = min_bin_height + (1 - min_bin_height * num_bins) * heights
128
+ cumheights = torch.cumsum(heights, dim=-1)
129
+ cumheights = F.pad(cumheights, pad=(1, 0), mode='constant', value=0.0)
130
+ cumheights = (top - bottom) * cumheights + bottom
131
+ cumheights[..., 0] = bottom
132
+ cumheights[..., -1] = top
133
+ heights = cumheights[..., 1:] - cumheights[..., :-1]
134
+
135
+ if inverse:
136
+ bin_idx = searchsorted(cumheights, inputs)[..., None]
137
+ else:
138
+ bin_idx = searchsorted(cumwidths, inputs)[..., None]
139
+
140
+ input_cumwidths = cumwidths.gather(-1, bin_idx)[..., 0]
141
+ input_bin_widths = widths.gather(-1, bin_idx)[..., 0]
142
+
143
+ input_cumheights = cumheights.gather(-1, bin_idx)[..., 0]
144
+ delta = heights / widths
145
+ input_delta = delta.gather(-1, bin_idx)[..., 0]
146
+
147
+ input_derivatives = derivatives.gather(-1, bin_idx)[..., 0]
148
+ input_derivatives_plus_one = derivatives[..., 1:].gather(-1, bin_idx)[..., 0]
149
+
150
+ input_heights = heights.gather(-1, bin_idx)[..., 0]
151
+
152
+ if inverse:
153
+ a = (((inputs - input_cumheights) * (input_derivatives
154
+ + input_derivatives_plus_one
155
+ - 2 * input_delta)
156
+ + input_heights * (input_delta - input_derivatives)))
157
+ b = (input_heights * input_derivatives
158
+ - (inputs - input_cumheights) * (input_derivatives
159
+ + input_derivatives_plus_one
160
+ - 2 * input_delta))
161
+ c = - input_delta * (inputs - input_cumheights)
162
+
163
+ discriminant = b.pow(2) - 4 * a * c
164
+ assert (discriminant >= 0).all()
165
+
166
+ root = (2 * c) / (-b - torch.sqrt(discriminant))
167
+ outputs = root * input_bin_widths + input_cumwidths
168
+
169
+ theta_one_minus_theta = root * (1 - root)
170
+ denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta)
171
+ * theta_one_minus_theta)
172
+ derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * root.pow(2)
173
+ + 2 * input_delta * theta_one_minus_theta
174
+ + input_derivatives * (1 - root).pow(2))
175
+ logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator)
176
+
177
+ return outputs, -logabsdet
178
+ else:
179
+ theta = (inputs - input_cumwidths) / input_bin_widths
180
+ theta_one_minus_theta = theta * (1 - theta)
181
+
182
+ numerator = input_heights * (input_delta * theta.pow(2)
183
+ + input_derivatives * theta_one_minus_theta)
184
+ denominator = input_delta + ((input_derivatives + input_derivatives_plus_one - 2 * input_delta)
185
+ * theta_one_minus_theta)
186
+ outputs = input_cumheights + numerator / denominator
187
+
188
+ derivative_numerator = input_delta.pow(2) * (input_derivatives_plus_one * theta.pow(2)
189
+ + 2 * input_delta * theta_one_minus_theta
190
+ + input_derivatives * (1 - theta).pow(2))
191
+ logabsdet = torch.log(derivative_numerator) - 2 * torch.log(denominator)
192
+
193
+ return outputs, logabsdet
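The forward and inverse branches of `rational_quadratic_spline` can be hard to follow in batched tensor form. Below is a scalar sketch of the same formulas for a single bin, written under the assumption that the bin has already been located. The argument names (`x_k`, `w`, `y_k`, `h`, `d_k`, `d_k1`) are illustrative and correspond to `input_cumwidths`, `input_bin_widths`, `input_cumheights`, `input_heights`, `input_derivatives` and `input_derivatives_plus_one`.

```python
import math


def rq_forward(x, x_k, w, y_k, h, d_k, d_k1):
    """Map x in [x_k, x_k + w] to y in [y_k, y_k + h]; also return log|dy/dx|."""
    delta = h / w
    theta = (x - x_k) / w
    t1mt = theta * (1 - theta)
    denom = delta + (d_k + d_k1 - 2 * delta) * t1mt
    y = y_k + h * (delta * theta ** 2 + d_k * t1mt) / denom
    deriv_num = delta ** 2 * (
        d_k1 * theta ** 2 + 2 * delta * t1mt + d_k * (1 - theta) ** 2
    )
    return y, math.log(deriv_num) - 2 * math.log(denom)


def rq_inverse(y, x_k, w, y_k, h, d_k, d_k1):
    """Invert rq_forward by solving a quadratic in theta; return x and log|dx/dy|."""
    delta = h / w
    s = d_k + d_k1 - 2 * delta
    a = (y - y_k) * s + h * (delta - d_k)
    b = h * d_k - (y - y_k) * s
    c = -delta * (y - y_k)
    root = 2 * c / (-b - math.sqrt(b * b - 4 * a * c))  # theta in [0, 1]
    t1mt = root * (1 - root)
    denom = delta + s * t1mt
    deriv_num = delta ** 2 * (
        d_k1 * root ** 2 + 2 * delta * t1mt + d_k * (1 - root) ** 2
    )
    return root * w + x_k, -(math.log(deriv_num) - 2 * math.log(denom))


# One bin with illustrative values; derivatives at the knots must be positive.
BIN = dict(x_k=-1.0, w=2.0, y_k=-0.5, h=1.5, d_k=0.7, d_k1=1.8)

round_trip = []
for x in (-1.0, -0.4, 0.3, 1.0):
    y, logdet_fwd = rq_forward(x, **BIN)
    x_rec, logdet_inv = rq_inverse(y, **BIN)
    round_trip.append((x, y, x_rec, logdet_fwd, logdet_inv))
```

The inverse uses the numerically stable form `2c / (-b - sqrt(b^2 - 4ac))` rather than the textbook quadratic formula, matching the `root` expression in the diff. The two log-determinants are negatives of each other, which is why the inverse branch returns `-logabsdet`.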
requirements.txt ADDED
@@ -0,0 +1,11 @@
1
+ gradio
2
+ torch
3
+ torchaudio
4
+ fairseq==0.12.2
5
+ scipy==1.9.3
6
+ pyworld>=0.3.2
7
+ faiss-cpu==1.7.2 ; python_version < "3.11"
8
+ faiss-cpu==1.7.3 ; python_version > "3.10"
9
+ praat-parselmouth>=0.4.3
10
+ librosa==0.9.2
11
+ edge-tts
util.py ADDED
@@ -0,0 +1,75 @@
1
+ import sys
2
+ import asyncio
3
+ from io import BytesIO
4
+
5
+ from fairseq import checkpoint_utils
6
+
7
+ import torch
8
+
9
+ import edge_tts
10
+ import librosa
11
+
12
+
13
+ # https://github.com/fumiama/Retrieval-based-Voice-Conversion-WebUI/blob/main/config.py#L43-L55 # noqa
14
+ def has_mps() -> bool:
15
+ if sys.platform != "darwin":
16
+ return False
17
+ else:
18
+ if not getattr(torch, 'has_mps', False):
19
+ return False
20
+
21
+ try:
22
+ torch.zeros(1).to(torch.device("mps"))
23
+ return True
24
+ except Exception:
25
+ return False
26
+
27
+
28
+ # https://github.com/fumiama/Retrieval-based-Voice-Conversion-WebUI/blob/main/config.py#L58-L71 # noqa
29
+ def is_half(device: str) -> bool:
30
+ if device == 'cpu':
31
+ return False
32
+ else:
33
+ if has_mps():
34
+ return True
35
+
36
+ gpu_name = torch.cuda.get_device_name(int(device.split(':')[-1]))
37
+ if '16' in gpu_name or 'MX' in gpu_name:
38
+ return False
39
+
40
+ return True
41
+
42
+
43
+ def load_hubert_model(device: str, model_path: str = 'hubert_base.pt'):
44
+ model = checkpoint_utils.load_model_ensemble_and_task(
45
+ [model_path]
46
+ )[0][0].to(device)
47
+
48
+ if is_half(device):
49
+ return model.half()
50
+ else:
51
+ return model.float()
52
+
53
+
54
+ async def call_edge_tts(speaker_name: str, text: str):
55
+ tts_com = edge_tts.Communicate(text, speaker_name)
56
+ tts_raw = b''
57
+
58
+ # Stream TTS audio to bytes
59
+ async for chunk in tts_com.stream():
60
+ if chunk['type'] == 'audio':
61
+ tts_raw += chunk['data']
62
+
63
+ # Convert mp3 stream to wav
64
+ ffmpeg_proc = await asyncio.create_subprocess_exec(
65
+ 'ffmpeg',
66
+ '-f', 'mp3',
67
+ '-i', '-',
68
+ '-f', 'wav',
69
+ '-',
70
+ stdin=asyncio.subprocess.PIPE,
71
+ stdout=asyncio.subprocess.PIPE
72
+ )
73
+ (tts_wav, _) = await ffmpeg_proc.communicate(tts_raw)
74
+
75
+ return librosa.load(BytesIO(tts_wav))
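The half-precision decision in `is_half` depends on `torch` and on the hardware present, which makes it awkward to test. The sketch below restates the same rules as a pure function so they can be checked in isolation. `should_use_half` and `cuda_index` are hypothetical helpers, not part of this commit, and treating a bare `"cuda"` as device 0 is an assumption. The original `int(device.split(':')[-1])` would raise `ValueError` on that input.

```python
from typing import Optional


def cuda_index(device: str) -> int:
    """Parse 'cuda:1' -> 1. A bare 'cuda' is treated as device 0 (assumption)."""
    _, sep, idx = device.partition(":")
    return int(idx) if sep else 0


def should_use_half(
    device: str,
    gpu_name: Optional[str] = None,
    mps_available: bool = False,
) -> bool:
    """Mirror the rules in is_half() without touching torch.

    - CPU never uses fp16.
    - If MPS is available, fp16 is used.
    - GPU names containing '16' or 'MX' fall back to fp32.
    """
    if device == "cpu":
        return False
    if mps_available:
        return True
    if gpu_name is not None and ("16" in gpu_name or "MX" in gpu_name):
        return False
    return True


examples = {
    "NVIDIA GeForce GTX 1660": should_use_half("cuda:0", "NVIDIA GeForce GTX 1660"),
    "NVIDIA GeForce RTX 3090": should_use_half("cuda:0", "NVIDIA GeForce RTX 3090"),
    "Tesla V100-SXM2-16GB": should_use_half("cuda:0", "Tesla V100-SXM2-16GB"),
}
```

The last entry shows a limitation of the substring test: any device name containing `16`, including a memory size such as `16GB`, is routed to fp32.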
vc_infer_pipeline.py ADDED
@@ -0,0 +1,165 @@
1
+ import numpy as np,parselmouth,torch
2
+ from time import time as ttime
3
+ import torch.nn.functional as F
4
+ from config import x_pad,x_query,x_center,x_max
5
+ import scipy.signal as signal
6
+ import pyworld,os,traceback,faiss
7
+ class VC(object):
8
+ def __init__(self,tgt_sr,device,is_half):
9
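+ self.sr=16000  # HuBERT input sample rate
10
+ self.window=160  # samples per frame
11
+ self.t_pad=self.sr*x_pad  # padding added before and after each segment
12
+ self.t_pad_tgt=tgt_sr*x_pad
13
+ self.t_pad2=self.t_pad*2
14
+ self.t_query=self.sr*x_query  # search range before and after each cut point
15
+ self.t_center=self.sr*x_center  # spacing of candidate cut points
16
+ self.t_max=self.sr*x_max  # length threshold below which no cut-point search is done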
17
+ self.device=device
18
+ self.is_half=is_half
19
+
20
+ def get_f0(self,x, p_len,f0_up_key,f0_method,inp_f0=None):
21
+ time_step = self.window / self.sr * 1000
22
+ f0_min = 50
23
+ f0_max = 1100
24
+ f0_mel_min = 1127 * np.log(1 + f0_min / 700)
25
+ f0_mel_max = 1127 * np.log(1 + f0_max / 700)
26
+ if(f0_method=="pm"):
27
+ f0 = parselmouth.Sound(x, self.sr).to_pitch_ac(
28
+ time_step=time_step / 1000, voicing_threshold=0.6,
29
+ pitch_floor=f0_min, pitch_ceiling=f0_max).selected_array['frequency']
30
+ pad_size=(p_len - len(f0) + 1) // 2
31
+ if(pad_size>0 or p_len - len(f0) - pad_size>0):
32
+ f0 = np.pad(f0,[[pad_size,p_len - len(f0) - pad_size]], mode='constant')
33
+ elif(f0_method=="harvest"):
34
+ f0, t = pyworld.harvest(
35
+ x.astype(np.double),
36
+ fs=self.sr,
37
+ f0_ceil=f0_max,
38
+ f0_floor=f0_min,
39
+ frame_period=10,
40
+ )
41
+ f0 = pyworld.stonemask(x.astype(np.double), f0, t, self.sr)
42
+ f0 = signal.medfilt(f0, 3)
43
+ f0 *= pow(2, f0_up_key / 12)
44
+ # with open("test.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
45
+ tf0=self.sr//self.window  # f0 points per second
46
+ if (inp_f0 is not None):
47
+ delta_t=np.round((inp_f0[:,0].max()-inp_f0[:,0].min())*tf0+1).astype("int16")
48
+ replace_f0=np.interp(list(range(delta_t)), inp_f0[:, 0]*100, inp_f0[:, 1])
49
+ shape=f0[x_pad*tf0:x_pad*tf0+len(replace_f0)].shape[0]
50
+ f0[x_pad*tf0:x_pad*tf0+len(replace_f0)]=replace_f0[:shape]
51
+ # with open("test_opt.txt","w")as f:f.write("\n".join([str(i)for i in f0.tolist()]))
52
+ f0bak = f0.copy()
53
+ f0_mel = 1127 * np.log(1 + f0 / 700)
54
+ f0_mel[f0_mel > 0] = (f0_mel[f0_mel > 0] - f0_mel_min) * 254 / (f0_mel_max - f0_mel_min) + 1
55
+ f0_mel[f0_mel <= 1] = 1
56
+ f0_mel[f0_mel > 255] = 255
57
+ f0_coarse = np.rint(f0_mel).astype(int)
58
+ return f0_coarse, f0bak#1-0
59
+
60
+ def vc(self,model,net_g,sid,audio0,pitch,pitchf,times,index,big_npy,index_rate):#,file_index,file_big_npy
61
+ feats = torch.from_numpy(audio0)
62
+ if(self.is_half):feats=feats.half()
63
+ else:feats=feats.float()
64
+ if feats.dim() == 2: # double channels
65
+ feats = feats.mean(-1)
66
+ assert feats.dim() == 1, feats.dim()
67
+ feats = feats.view(1, -1)
68
+ padding_mask = torch.BoolTensor(feats.shape).to(self.device).fill_(False)
69
+
70
+ inputs = {
71
+ "source": feats.to(self.device),
72
+ "padding_mask": padding_mask,
73
+ "output_layer": 9, # layer 9
74
+ }
75
+ t0 = ttime()
76
+ with torch.no_grad():
77
+ logits = model.extract_features(**inputs)
78
+ feats = model.final_proj(logits[0])
79
+
80
+ if index is not None and big_npy is not None and index_rate != 0:
81
+ npy = feats[0].cpu().numpy()
82
+ if(self.is_half):npy=npy.astype("float32")
83
+ _, I = index.search(npy, 1)
84
+ npy=big_npy[I.squeeze()]
85
+ if(self.is_half):npy=npy.astype("float16")
86
+ feats = torch.from_numpy(npy).unsqueeze(0).to(self.device)*index_rate + (1-index_rate)*feats
87
+
88
+ feats = F.interpolate(feats.permute(0, 2, 1), scale_factor=2).permute(0, 2, 1)
89
+ t1 = ttime()
90
+ p_len = audio0.shape[0]//self.window
91
+ if(feats.shape[1]<p_len):
92
+ p_len=feats.shape[1]
93
+ if pitch is not None and pitchf is not None:
94
+ pitch=pitch[:,:p_len]
95
+ pitchf=pitchf[:,:p_len]
96
+ p_len=torch.tensor([p_len],device=self.device).long()
97
+ with torch.no_grad():
98
+ if pitch is not None and pitchf is not None:
99
+ audio1 = (net_g.infer(feats, p_len, pitch, pitchf, sid)[0][0, 0] * 32768).data.cpu().float().numpy().astype(np.int16)
100
+ else:
101
+ audio1 = (net_g.infer(feats, p_len, sid)[0][0, 0] * 32768).data.cpu().float().numpy().astype(np.int16)
102
+ del feats,p_len,padding_mask
103
+ if torch.cuda.is_available(): torch.cuda.empty_cache()
104
+ t2 = ttime()
105
+ times[0] += (t1 - t0)
106
+ times[2] += (t2 - t1)
107
+ return audio1
108
+
109
+ def pipeline(self,model,net_g,sid,audio,times,f0_up_key,f0_method,file_index,file_big_npy,index_rate,if_f0,f0_file=None):
110
+ if file_big_npy != "" and file_index != "" and os.path.exists(file_big_npy) and os.path.exists(file_index) and index_rate != 0:
111
+ try:
112
+ index = faiss.read_index(file_index)
113
+ big_npy = np.load(file_big_npy)
114
+ except Exception:
115
+ traceback.print_exc()
116
+ index=big_npy=None
117
+ else:
118
+ index=big_npy=None
119
+ audio_pad = np.pad(audio, (self.window // 2, self.window // 2), mode='reflect')
120
+ opt_ts = []
121
+ if(audio_pad.shape[0]>self.t_max):
122
+ audio_sum = np.zeros_like(audio)
123
+ for i in range(self.window): audio_sum += audio_pad[i:i - self.window]
124
+ for t in range(self.t_center, audio.shape[0],self.t_center):opt_ts.append(t - self.t_query + np.where(np.abs(audio_sum[t - self.t_query:t + self.t_query]) == np.abs(audio_sum[t - self.t_query:t + self.t_query]).min())[0][0])
125
+ s = 0
126
+ audio_opt=[]
127
+ t=None
128
+ t1=ttime()
129
+ audio_pad = np.pad(audio, (self.t_pad, self.t_pad), mode='reflect')
130
+ p_len=audio_pad.shape[0]//self.window
131
+ inp_f0=None
132
+ if hasattr(f0_file, "name"):
133
+ try:
134
+ with open(f0_file.name,"r")as f:
135
+ lines=f.read().strip("\n").split("\n")
136
+ inp_f0=[]
137
+ for line in lines:inp_f0.append([float(i)for i in line.split(",")])
138
+ inp_f0=np.array(inp_f0,dtype="float32")
139
+ except Exception:
140
+ traceback.print_exc()
141
+ sid=torch.tensor(sid,device=self.device).unsqueeze(0).long()
142
+ pitch, pitchf=None,None
143
+ if(if_f0==1):
144
+ pitch, pitchf = self.get_f0(audio_pad, p_len, f0_up_key,f0_method,inp_f0)
145
+ pitch = pitch[:p_len]
146
+ pitchf = pitchf[:p_len]
147
+ pitch = torch.tensor(pitch,device=self.device).unsqueeze(0).long()
148
+ pitchf = torch.tensor(pitchf,device=self.device).unsqueeze(0).float()
149
+ t2=ttime()
150
+ times[1] += (t2 - t1)
151
+ for t in opt_ts:
152
+ t=t//self.window*self.window
153
+ if (if_f0 == 1):
154
+ audio_opt.append(self.vc(model,net_g,sid,audio_pad[s:t+self.t_pad2+self.window],pitch[:,s//self.window:(t+self.t_pad2)//self.window],pitchf[:,s//self.window:(t+self.t_pad2)//self.window],times,index,big_npy,index_rate)[self.t_pad_tgt:-self.t_pad_tgt])
155
+ else:
156
+ audio_opt.append(self.vc(model,net_g,sid,audio_pad[s:t+self.t_pad2+self.window],None,None,times,index,big_npy,index_rate)[self.t_pad_tgt:-self.t_pad_tgt])
157
+ s = t
158
+ if (if_f0 == 1):
159
+ audio_opt.append(self.vc(model,net_g,sid,audio_pad[t:],pitch[:,t//self.window:]if t is not None else pitch,pitchf[:,t//self.window:]if t is not None else pitchf,times,index,big_npy,index_rate)[self.t_pad_tgt:-self.t_pad_tgt])
160
+ else:
161
+ audio_opt.append(self.vc(model,net_g,sid,audio_pad[t:],None,None,times,index,big_npy,index_rate)[self.t_pad_tgt:-self.t_pad_tgt])
162
+ audio_opt=np.concatenate(audio_opt)
163
+ del pitch,pitchf,sid
164
+ if torch.cuda.is_available(): torch.cuda.empty_cache()
165
+ return audio_opt
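The tail of `get_f0` converts a pitch contour in Hz into integer "coarse" pitch bins on a mel scale, after applying the semitone shift `f0_up_key`. A scalar sketch of those two steps follows, using only the standard library. The function names `shift_semitones` and `f0_to_coarse` are illustrative; the constants are the ones in the diff.

```python
import math

F0_MIN, F0_MAX = 50.0, 1100.0
MEL_MIN = 1127 * math.log(1 + F0_MIN / 700)
MEL_MAX = 1127 * math.log(1 + F0_MAX / 700)


def shift_semitones(f0_hz: float, semitones: float) -> float:
    """Transpose a frequency by a number of semitones (12 semitones = 1 octave)."""
    return f0_hz * 2 ** (semitones / 12)


def f0_to_coarse(f0_hz: float) -> int:
    """Quantize f0 in Hz to an integer bin in [1, 255] on a mel scale.

    Unvoiced frames (f0 == 0) have mel == 0, skip the rescaling step,
    and are clamped to bin 1.
    """
    mel = 1127 * math.log(1 + f0_hz / 700)
    if mel > 0:
        mel = (mel - MEL_MIN) * 254 / (MEL_MAX - MEL_MIN) + 1
    mel = min(max(mel, 1.0), 255.0)
    return round(mel)


sweep = [f0_to_coarse(float(hz)) for hz in range(0, 1300, 25)]
octave_up = shift_semitones(220.0, 12)
```

Bin 1 is shared by unvoiced frames and by anything at or below `F0_MIN`, while everything at or above `F0_MAX` saturates at 255. The model receives both this coarse index (`pitch`) and the unquantized contour (`pitchf`).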