
Open-Source End-to-End Speech Interaction Foundation Model

Baichuan-Audio 🤗 | Baichuan-Audio-Base 🤗 | Technical Report 📖

OpenAudioBench 🤗 | Training Data 🤗 (Coming Soon)

Model Architecture


Baichuan-Audio mainly consists of the Baichuan-Audio Tokenizer, an Audio LLM, and a flow-matching based Audio Decoder. First, speech is converted into discrete audio tokens by the Baichuan-Audio Tokenizer. The Audio LLM then generates aligned text and audio tokens in an interleaved manner, switching seamlessly between the text and audio modalities through special tokens. The audio tokens are processed by an independent audio head and reconstructed into high-quality Mel spectrograms by the flow-matching based Audio Decoder, which are finally converted into audio waveforms via a vocoder.
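A schematic of this pipeline, with stub components, is shown below; the function names, token rates, and tensor shapes are illustrative assumptions, not the released implementation.

```python
import torch

def audio_tokenizer(waveform):
    # Baichuan-Audio Tokenizer: 24 kHz speech -> discrete tokens at ~12.5 Hz with 8 RVQ layers
    num_frames = waveform.shape[-1] // 1920           # 24000 / 12.5 = 1920 samples per frame
    return torch.randint(0, 1024, (num_frames, 8))

def audio_llm(audio_tokens):
    # Audio LLM: consumes audio tokens and emits interleaved text and audio tokens (stubbed)
    return torch.randint(0, 32000, (20,)), torch.randint(0, 1024, (25, 8))

def flow_matching_decoder(audio_tokens):
    # Flow-matching Audio Decoder: audio tokens -> Mel spectrogram frames (stubbed, 80 Mel bins assumed)
    return torch.randn(audio_tokens.shape[0] * 8, 80)

def vocoder(mel):
    # Vocoder: Mel spectrogram -> 24 kHz waveform (stubbed, hop size of 240 samples assumed)
    return torch.randn(mel.shape[0] * 240)

speech_in = torch.randn(24000)                        # 1 s of input speech at 24 kHz
text_tokens, audio_tokens = audio_llm(audio_tokenizer(speech_in))
speech_out = vocoder(flow_matching_decoder(audio_tokens))
```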

  • Baichuan-Audio-Tokenizer uses a 12.5 Hz frame rate. It employs the Whisper Large encoder to extract high-level audio features from Mel spectrograms, then applies an 8-layer RVQ to minimize information loss during quantization. To capture both acoustic and semantic information, we use Mel spectrogram reconstruction for acoustic supervision and a pre-trained LLM for semantic supervision (a minimal RVQ sketch follows this list).

  • Audio LLM generates aligned text and audio tokens in an interleaved manner, achieving seamless switching between text and audio modalities through special tokens. Audio tokens are processed by an independent audio head.

  • Flow-matching based Audio Decoder reconstructs high-quality Mel spectrograms. The decoder is trained on 24 kHz audio to generate target Mel spectrograms, which are then converted into audio waveforms via a vocoder.
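To make the 8-layer RVQ above concrete, below is a minimal residual vector quantization sketch in PyTorch; the codebook size, feature dimension, and nearest-neighbour lookup are simplified assumptions rather than the released tokenizer.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Minimal residual vector quantizer: each layer quantizes the residual left by
    the previous layer, so 8 layers progressively reduce the quantization error."""

    def __init__(self, num_layers=8, codebook_size=1024, dim=1280):
        super().__init__()
        self.codebooks = nn.ModuleList(
            [nn.Embedding(codebook_size, dim) for _ in range(num_layers)]
        )

    def forward(self, x):                      # x: (batch, frames, dim) at ~12.5 frames/s
        residual = x
        quantized = torch.zeros_like(x)
        codes = []
        for codebook in self.codebooks:
            # pick the nearest codebook entry for the current residual
            dists = torch.cdist(residual, codebook.weight.unsqueeze(0))
            idx = dists.argmin(dim=-1)                    # (batch, frames)
            chosen = codebook(idx)                        # (batch, frames, dim)
            quantized = quantized + chosen
            residual = residual - chosen
            codes.append(idx)
        return quantized, torch.stack(codes, dim=-1)      # codes: (batch, frames, num_layers)

features = torch.randn(1, 25, 1280)            # ~2 s of encoder-like features at 12.5 Hz
quantized, codes = ResidualVQ()(features)
```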

Pre-training details

  • Pre-training data

Audio training data can be broadly divided into two main types: audio understanding data and audio generation data.

Audio-text paired data (e.g., ASR and TTS data) improves performance on basic speech tasks, while pure audio data enhances the model's ability to handle the audio modality on its own. Audio-Text Interleaved data consists of alternating text and audio segments, split at punctuation, to facilitate cross-modal knowledge transfer. Interleaved Text-to-Speech data consists of fully aligned text and audio content, aimed at enhancing the model's ability to generate audio tokens under text supervision.

Interleaved data is collected through both crawling and synthesis, yielding a total of 142k hours of ITTS data and 393k hours of INTLV data.
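For illustration, a sketch of how an audio-text interleaved (INTLV) training sample could be assembled is shown below; the special-token names, audio-token format, and punctuation-based splitting rule are assumptions, not the released data pipeline.

```python
import re

# Hypothetical special tokens marking modality switches; the actual vocabulary may differ.
AUDIO_START, AUDIO_END = "<audio_start>", "<audio_end>"

def build_interleaved_sample(text, audio_tokens_per_segment):
    """Alternate text segments (split at punctuation) with their audio-token spans."""
    segments = [s for s in re.split(r"(?<=[.!?,;])\s*", text) if s]
    parts = []
    for segment, audio_tokens in zip(segments, audio_tokens_per_segment):
        parts.append(segment)
        parts.append(AUDIO_START + "".join(f"<a{t}>" for t in audio_tokens) + AUDIO_END)
    return " ".join(parts)

print(build_interleaved_sample(
    "Hello there. How are you?",
    [[12, 87, 430], [5, 99]],
))
# Hello there. <audio_start><a12><a87><a430><audio_end> How are you? <audio_start><a5><a99><audio_end>
```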


  • Two stage training strategy

Conflicts between the speech and text modalities can interfere with the textual knowledge already encoded in the pre-trained LLM, degrading the model's intelligence. To mitigate this, we adopt a two-stage training strategy. In the first stage, the LLM parameters remain frozen and only the audio embedding layer and audio head are updated. In the second stage, all parameters except the LM embedding layer and LM head are trained.
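Below is a minimal sketch of this parameter-freezing scheme in PyTorch; the name substrings used to identify the audio and LM modules are illustrative assumptions.

```python
def set_trainable(model, stage):
    """Stage 1: train only the audio embedding layer and audio head.
    Stage 2: train everything except the LM embedding layer and LM head."""
    audio_parts = ("audio_embed", "audio_head")        # assumed parameter-name substrings
    frozen_lm_parts = ("embed_tokens", "lm_head")      # assumed parameter-name substrings
    for name, param in model.named_parameters():
        if stage == 1:
            param.requires_grad = any(key in name for key in audio_parts)
        else:
            param.requires_grad = not any(key in name for key in frozen_lm_parts)
```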

Open-Source Evaluation Set

OpenAudioBench

To evaluate the model's "intelligence" more efficiently, we constructed OpenAudioBench, which includes 5 sub-evaluation sets for end-to-end audio understanding: 4 public evaluation sets (Llama Questions, Web Questions, TriviaQA, AlpacaEval) and a speech logical reasoning evaluation set built by the Baichuan team, totaling 2701 data points. Together they reflect the model's "intelligence" level.

Model performance

Acknowledgments

License

The use of the Baichuan-Audio-Base/Baichuan-Audio model weights must comply with the Apache 2.0 license.

