---
license: cc-by-4.0
tags:
- audio-to-audio
pipeline_tag: audio-to-audio
---

## Paper

LLaSA: Scaling Train Time and Test Time Compute for LLaMA based Speech Synthesis (coming soon)

# Getting Started with XCodec2 on Hugging Face

XCodec2 is a speech tokenizer that offers the following key features:

1. **Single Vector Quantization**
2. **50 Tokens per Second**
3. **Multilingual Speech Semantic Support and High-Quality Speech Reconstruction**

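
As a rough illustration of the 50 tokens-per-second rate: assuming 16 kHz input (the usage example in this card notes that only 16 kHz speech is supported), each token covers 16000 / 50 = 320 samples. The helper below is hypothetical and not part of the `xcodec2` package. It only estimates the length of the code sequence, and the rounding-up of a partial final frame is an assumption, so the model's exact output length may differ slightly.

```python
SAMPLE_RATE_HZ = 16_000   # assumption: 16 kHz input, as the usage example requires
TOKENS_PER_SECOND = 50    # rate stated in the feature list


def estimate_num_tokens(num_samples: int) -> int:
    """Estimate how many codec tokens a clip of `num_samples` samples yields."""
    samples_per_token = SAMPLE_RATE_HZ // TOKENS_PER_SECOND  # 320 samples
    # Ceiling division: assume a partial final frame still produces one token.
    return -(-num_samples // samples_per_token)


# A 10-second clip at 16 kHz has 160,000 samples.
print(estimate_num_tokens(16_000 * 10))  # 500
```
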
To use `xcodec2`, install it first, for example in a fresh conda environment:

```bash
conda create -n xcodec2 python=3.9
conda activate xcodec2
# Version 0.1.3 fixes a bug in the previous version and gives better sound quality.
pip install xcodec2==0.1.3
```
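
Because the pinned version carries a bug fix, it can be worth confirming which version actually ended up in the environment. The sketch below uses only the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical, and it assumes the distribution is published under the name `xcodec2`, as in the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


found = installed_version("xcodec2")
if found is None:
    print("xcodec2 is not installed")
elif found != "0.1.3":
    print(f"xcodec2 {found} is installed; this card pins 0.1.3")
else:
    print("xcodec2 0.1.3 is installed")
```
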

Then load the model and run an encode/decode round trip:

```python
import torch
import soundfile as sf

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUST-Audio/xcodec2"

# Load the pretrained codec and put it in inference mode on the GPU.
model = XCodec2Model.from_pretrained(model_path)
model.eval().cuda()

# Read the input audio and add a batch dimension.
wav, sr = sf.read("test.wav")
wav_tensor = torch.from_numpy(wav).float().unsqueeze(0)  # Shape: (1, T)

with torch.no_grad():
    # Only 16 kHz speech is supported.
    # Only a single input is supported. For batch inference, please refer to the link below.
    vq_code = model.encode_code(input_waveform=wav_tensor)
    print("Code:", vq_code)

    # Decode the codes back to a waveform.
    recon_wav = model.decode_code(vq_code).cpu()  # Shape: (1, 1, T')

sf.write("reconstructed.wav", recon_wav[0, 0, :].numpy(), sr)
print("Done! Check reconstructed.wav")
```
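
Since the example accepts only 16 kHz speech, and `sf.read` returns a 2-D `(frames, channels)` array for multi-channel files (so `unsqueeze(0)` would no longer give shape `(1, T)`), a quick check of the input file before encoding can catch both problems. The following is a minimal sketch using the standard library's `wave` module, which reads only uncompressed PCM WAV files. The helper `check_wav` is hypothetical and not part of `xcodec2`.

```python
import os
import wave

EXPECTED_SAMPLE_RATE = 16_000  # per the "only 16 kHz speech" note in the example


def check_wav(path: str) -> list:
    """Return a list of problems that make `path` unsuitable as input."""
    problems = []
    with wave.open(path, "rb") as wav_file:
        sample_rate = wav_file.getframerate()
        channels = wav_file.getnchannels()
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(
            f"sample rate is {sample_rate} Hz, expected {EXPECTED_SAMPLE_RATE} Hz"
        )
    if channels != 1:
        problems.append(f"file has {channels} channels, expected mono")
    return problems


# Only runs the check if the example input file is present.
if os.path.exists("test.wav"):
    for issue in check_wav("test.wav"):
        print("Warning:", issue)
```

If the check reports a problem, resample or downmix the file with an audio tool of your choice before passing it to `encode_code`.
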

To train your own XCodec2 model, run batch inference, or extract codes at scale, see the code released [here](https://github.com/zhenye234/X-Codec-2.0).