C2SER-LLM

For more information, please refer to the C2SER GitHub repository.

Introduction

As presented in our paper "Steering Language Model to Stable Speech Emotion Recognition via Contextual Perception and Chain of Thought", C2SER employs a chain-of-thought (CoT) training approach to incentivize reasoning capability. This approach decomposes the SER task into sequential steps: first perceiving the speech content and speaking style, then inferring the emotion, with the assistance of prior context. This structured method imitates human thinking and reduces the likelihood of hallucinations. To further enhance stability and prevent error propagation, especially in longer thought chains, C2SER introduces self-distillation, transferring knowledge from explicit CoT to implicit CoT.

Installation

To install the project dependencies, use the following command:

cd C2SER-llm
pip install -r requirements.txt

Pretrained Model

To run the code, you need to download two files. The first is the Qwen-7B model. After downloading, replace llm_path in ./C2SER-llm/config.yaml with your local download path. The second is the pretrained checkpoint [C2SER_model.pt]. After downloading, replace checkpoint_path in ./C2SER-llm/infer_runtime.py with the path to the downloaded file.
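For reference, the edited entry in ./C2SER-llm/config.yaml might look as follows; this is a sketch, where the key name llm_path is taken from this README and the path is a placeholder for your local download:

```yaml
# ./C2SER-llm/config.yaml (excerpt)
llm_path: /path/to/Qwen-7B  # replace with your local Qwen-7B directory
```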

Inference

We provide three input parameters in ./C2SER-llm/infer_runtime.py:

  • --input_wav_path: Path to the test WAV file.
  • --ssl_vector_path: Path to the utterance-level feature.
  • --input_prompt: Prompt for Stage 1 or Stage 2.

After extracting the utterance-level features of the audio file using Emotion2Vec-S, replace input_wav_path and ssl_vector_path in ./C2SER-llm/infer_runtime.py with the paths to your test audio file and the extracted utterance-level features, respectively. You can also control the Stage 1 and Stage 2 output by adjusting input_prompt; the available prompts are listed in ./C2SER-llm/prompt_config.yaml. Then, run inference directly with the following command.

python C2SER-llm/infer_runtime.py
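If you prefer to launch the script programmatically, the three parameters can be passed on the command line. A minimal sketch, assuming the example WAV shipped with the repository; the feature-file path (including its .npy extension) is a placeholder, so substitute whatever Emotion2Vec-S actually produces:

```python
import subprocess

# Build the inference command. The flag names come from this README;
# the ssl_vector_path value is a hypothetical placeholder.
cmd = [
    "python", "C2SER-llm/infer_runtime.py",
    "--input_wav_path", "Emotion2Vec-S/test_wav/vo_EQAST002_1_paimon_07.wav",
    "--ssl_vector_path", "/path/to/utterance_level_feature.npy",
    "--input_prompt", "Please describe the speaking style, content, "
                      "and the speaker's emotional state of this speech.",
]
print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to actually run inference
```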

Results

We provide an example result for the file ./Emotion2Vec-S/test_wav/vo_EQAST002_1_paimon_07.wav.

If you use the Stage 1 prompt "Please describe the speaking style, content, and the speaker's emotional state of this speech.", the output will be:

说话者以缓慢的速度、高昂的语调和中等音量的声音说道:“不知道艾德林小姐有没有给我们准备好吃的点心呢。”通过分析语音特征,推测情绪为快乐,透露出一种期待和兴奋的喜悦。

(Translation: The speaker says, at a slow pace, with a high pitch and medium volume: "I wonder whether Miss Adelinde has prepared any tasty snacks for us." Analysis of the speech features suggests the emotion is happiness, revealing a joy of anticipation and excitement.)

If you use the Stage 2 prompt "Please consider the speaking style, content, and directly provide the speaker's emotion in this speech.", the output will be:

这条语音的的情感为高兴

(Translation: The emotion of this speech is happy.)