Paper

LLaSA: Scaling Train Time and Test Time Compute for LLaMA based Speech Synthesis (Coming soon)

Getting Started with XCodec2 on Hugging Face

XCodec2 is a speech tokenizer that offers the following key features:

  1. Single Vector Quantization
  2. 50 Tokens per Second
  3. Multilingual Speech Semantic Support and High-Quality Speech Reconstruction
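As a rough illustration of the 50 tokens per second figure: assuming the 16 kHz input rate that the model requires, each token covers 16000 / 50 = 320 audio samples. The sketch below estimates how many tokens a clip produces. The helper name `expected_num_tokens` is hypothetical and not part of the xcodec2 package, and the real encoder may pad its input, so the actual count could differ by a frame.

```python
import math

SAMPLE_RATE = 16_000      # input rate the model expects, in Hz
TOKENS_PER_SECOND = 50    # token rate listed in the features above
SAMPLES_PER_TOKEN = SAMPLE_RATE // TOKENS_PER_SECOND  # 320 samples per token


def expected_num_tokens(num_samples: int) -> int:
    """Estimate the token count for a mono 16 kHz clip (hypothetical helper).

    Partial frames are rounded up, on the assumption that the encoder pads
    the final frame rather than dropping it.
    """
    return math.ceil(num_samples / SAMPLES_PER_TOKEN)


if __name__ == "__main__":
    three_seconds = 3 * SAMPLE_RATE
    print(expected_num_tokens(three_seconds))  # 150
```

Because there is a single quantizer, this is also the length of the whole code sequence: one integer per 20 ms frame.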

To use xcodec2, install it first with the following commands:

conda create -n xcodec2 python=3.9
conda activate xcodec2
pip install xcodec2==0.1.3  # fixes a bug in the previous version, giving better sound quality

Then encode a waveform into codes and reconstruct it:

import torch
import soundfile as sf

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/xcodec2"

# Load the pretrained codec, switch to inference mode, and move it to the GPU
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

# The model only supports 16 kHz speech
wav, sr = sf.read("test.wav")
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # Shape: (1, T)

with torch.no_grad():
    # Only a single input is supported here. For batch inference, please refer to the link below.
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)

    recon_wav = model.decode_code(vq_code).cpu()  # Shape: (1, 1, T')

# Save the reconstruction at the input sample rate
sf.write("reconstructed.wav", recon_wav[0, 0, :].numpy(), sr)
print("Done! Check reconstructed.wav")
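The example above assumes that test.wav is already mono and sampled at 16 kHz. The following is a minimal sketch of an input check that could run before encoding. The helper `prepare_waveform` is hypothetical and not part of xcodec2. It relies only on NumPy and on the fact that soundfile returns multi-channel audio as an array of shape (frames, channels). It does not resample, so audio at other rates has to be converted beforehand with a tool of your choice.

```python
import numpy as np

EXPECTED_SR = 16_000  # the model only accepts 16 kHz speech


def prepare_waveform(wav: np.ndarray, sr: int) -> np.ndarray:
    """Return a mono float32 array of shape (1, T); raise if sr is not 16 kHz.

    Hypothetical helper for illustration, not part of the xcodec2 package.
    """
    if sr != EXPECTED_SR:
        raise ValueError(f"expected {EXPECTED_SR} Hz audio, got {sr} Hz; resample first")
    wav = np.asarray(wav, dtype=np.float32)
    if wav.ndim == 2:
        # soundfile layout is (frames, channels): average the channels to get mono
        wav = wav.mean(axis=1)
    elif wav.ndim != 1:
        raise ValueError(f"unexpected waveform shape {wav.shape}")
    return wav[np.newaxis, :]  # add the batch dimension: (1, T)
```

With this in place, the tensor could be built as `torch.from_numpy(prepare_waveform(*sf.read("test.wav")))`, which already has the (1, T) shape, so the `unsqueeze(0)` call is no longer needed.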

If you want to train your own xcodec2, run batch inference, or extract codes at large scale, the code is released here.
