Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation
Abstract
Multi-codebook speech codecs enable the application of large language models (LLMs) to TTS but bottleneck efficiency and robustness due to multi-sequence prediction. To avoid this obstacle, we propose Single-Codec, a single-codebook, single-sequence codec, which employs a disentangled VQ-VAE to decouple speech into a time-invariant embedding and a phonetically rich discrete sequence. Furthermore, the encoder is enhanced with 1) contextual modeling with a BLSTM module to exploit temporal information, 2) a hybrid sampling module to alleviate distortion from upsampling and downsampling, and 3) a resampling module to encourage discrete units to carry more phonetic information. Compared with multi-codebook codecs, e.g., EnCodec and TiCodec, Single-Codec demonstrates higher reconstruction quality at a lower bandwidth of only 304 bps. The effectiveness of Single-Codec is further validated by LLM-TTS experiments, showing improved naturalness and intelligibility.
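The single-codebook bottleneck described above rests on standard vector quantization: each encoder frame is mapped to its nearest codebook entry, producing one discrete token per frame rather than the parallel token streams of a multi-codebook codec. A minimal sketch of that quantization step, with illustrative codebook size and dimensions not taken from the paper:

```python
import numpy as np

# Hypothetical codebook: 1024 codes of 128 dimensions (sizes are
# illustrative only, not the paper's configuration).
rng = np.random.default_rng(0)
codebook = rng.normal(size=(1024, 128))

def quantize(frames):
    """Map each encoder frame to its nearest codebook entry,
    yielding a single discrete sequence (one token per frame)."""
    # Squared Euclidean distance from every frame to every code.
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d.argmin(axis=1)          # one token id per frame
    return indices, codebook[indices]   # ids + quantized vectors

frames = rng.normal(size=(50, 128))     # 50 encoder frames
ids, quantized = quantize(frames)
print(ids.shape, quantized.shape)       # (50,) (50, 128)
```

Because the output is a single token sequence, a downstream LLM-TTS model predicts one stream instead of several, which is the efficiency argument the abstract makes.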